The idea
You deploy a cron job. It runs every hour, backs up a database, sends a report, cleans up old files. One day it stops — a permission error, a changed path, a hung process — and you find out three weeks later when someone asks where the reports went. A heartbeat monitor fixes this by inverting the check: your script pings a URL when it finishes successfully, and the service alerts you if the ping stops arriving.
This is a small, focused API you self-host, either as a single binary or as a single container. Each registered job gets a unique URL. Hit that URL at the end of your cron script. If the service doesn't see a hit within the expected window, it fires an alert.
Why build this
Cron jobs are the dark matter of production infrastructure — invisible, assumed-working, rarely watched. Existing solutions (Dead Man's Snitch, Cronitor, Healthchecks.io) are good but charge per job or per seat, which feels wrong for something this simple. A solo developer running 10–20 personal or client cron jobs doesn't want a $30/month subscription; they want a Docker image they can docker run in five minutes.
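The technology is completely standard (HTTP endpoints, a scheduler loop, SMTP), so the build risk is low. The market gap is the self-hosted tier. Healthchecks.io is open source and can be self-hosted, but it runs as a Django application rather than a single binary; a clean, maintained single-binary version that is easy to operate is still hard to find.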
Stack sketch
- Backend: Go — a single binary, low memory footprint, trivial to cross-compile for common VPS architectures.
- Storage: SQLite via modernc.org/sqlite (no CGo) for job registry and ping history. Upgrade path to Postgres via a DATABASE_URL env var.
- HTTP: net/http standard library; no framework needed at this scope.
- Alert channels: SMTP (direct send or relay), Slack incoming webhooks, and a generic webhook POST for everything else.
- Frontend: A small HTMX + Tailwind admin page served from the same binary — no separate frontend build step.
- Packaging: Single Docker image (scratch base), published to GHCR. One docker run command with env vars to configure.
Scope for v1
- Job registration via web UI and API (name, expected interval, grace period).
- Unique ping URL per job (GET /ping/<token>).
- Background checker loop that runs every minute and compares last-ping timestamp against the expected window.
- Alert on first miss; re-alert if still down after 24 h; auto-resolve when ping resumes.
- Notification channels: SMTP and Slack webhook (others can be added via the generic webhook).
- Simple dashboard showing each job's status (healthy / late / down), last ping time, and a 30-day ping history sparkline.
- Bearer-token auth on the admin API.
Out of scope for v1: multi-user/team support, SSO, mobile push, on-call rotations, complex schedules (cron expressions). Those belong in a paid layer or a follow-up.
Where it could go
The obvious expansion is richer schedule awareness — let the user specify a cron expression and have the checker compute the exact next-expected window rather than a rolling interval. This catches jobs that run at irregular hours (e.g., "every weekday at 9 am") without false alerts during legitimate gaps.
A second path is turning the service into a lightweight observability layer: accept a duration alongside the ping (?duration=42s) to track how long each run takes, surface p95 runtimes, and alert when a job that normally takes 30 seconds starts taking 10 minutes. That edges into APM territory but stays within the "one binary, one config file" philosophy.
Watch out for
The grace-period UX is easy to get wrong — too tight and you flood on-call channels with noise for jobs that occasionally run 90 seconds late; too loose and the monitor isn't useful. Defaulting to a grace period of 10% of the interval and making it prominently configurable per job is probably the right call.