Cron-job heartbeat monitor

A lightweight API service developers ping from scheduled jobs so that missed or silently-failing crons trigger prompt email or Slack alerts.

The idea

You deploy a cron job. It runs every hour, backs up a database, sends a report, cleans up old files. One day it stops — a permission error, a changed path, a hung process — and you find out three weeks later when someone asks where the reports went. A heartbeat monitor fixes this by inverting the check: your script pings a URL when it finishes successfully, and the service alerts you if the ping stops arriving.

This is a small, focused API you self-host (or deploy as a single container). Each registered job gets a unique URL. Hit that URL at the end of your cron script. If the service doesn't see a hit within the expected window, it fires an alert.

Why build this

Cron jobs are the dark matter of production infrastructure — invisible, assumed-working, rarely watched. Existing solutions (Dead Man's Snitch, Cronitor, Healthchecks.io) are good but charge per job or per seat, which feels wrong for something this simple. A solo developer running 10–20 personal or client cron jobs doesn't want a $30/month subscription; they want a Docker image they can docker run in five minutes.

The technology is completely standard (HTTP endpoints, a scheduler loop, SMTP), so the build risk is low. The market gap is the self-hosted tier: Healthchecks.io is open source and can be self-hosted, but it runs as a full Django application, and a single-binary alternative that takes minutes to operate is harder to find.

Stack sketch

Backend: Go — a single binary, low memory footprint, trivial to cross-compile for common VPS architectures.
Storage: SQLite via modernc.org/sqlite (pure Go, no cgo) for job registry and ping history. Upgrade path to Postgres via a DATABASE_URL env var.
HTTP: net/http standard library; no framework needed at this scope.
Alert channels: SMTP (direct send or relay), Slack incoming webhooks, and a generic webhook POST for everything else.
Frontend: A small HTMX + Tailwind admin page served from the same binary — no separate frontend build step.
Packaging: Single Docker image (scratch base, with CA certificates copied in so HTTPS webhooks and SMTP over TLS work), published to GHCR. One docker run command with env vars to configure.
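
As an illustration of the alert-channel piece, here is a rough sketch of a Slack sender using only the standard library. Slack incoming webhooks accept a JSON body with a "text" field; everything else here (SendSlackAlert, slackPayload, alertClient, the message wording, the 10-second timeout) is an assumption made for the example.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// alertClient is a shared HTTP client with a timeout so that a slow
// webhook cannot stall the checker. The 10 s value is an assumption.
var alertClient = &http.Client{Timeout: 10 * time.Second}

// slackPayload builds the JSON body for a Slack incoming webhook.
// The message wording is hypothetical.
func slackPayload(jobName, status string) ([]byte, error) {
	return json.Marshal(map[string]string{
		"text": fmt.Sprintf("Job %q is %s", jobName, status),
	})
}

// SendSlackAlert POSTs the alert to webhookURL and treats any
// non-2xx response as an error.
func SendSlackAlert(ctx context.Context, client *http.Client, webhookURL, jobName, status string) error {
	body, err := slackPayload(jobName, status)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, webhookURL, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}
```

The generic webhook channel could reuse the same function with a different payload builder, which is one reason to keep the payload separate from the transport.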

Scope for v1

Job registration via web UI and API (name, expected interval, grace period).
Unique ping URL per job (GET /ping/<token>).
Background checker loop that runs every minute and compares last-ping timestamp against the expected window.
Alert on first miss; re-alert if still down after 24 h; auto-resolve when ping resumes.
Notification channels: SMTP and Slack webhook (others can be added via the generic webhook).
Simple dashboard showing each job's status (healthy / late / down), last ping time, and a 30-day ping history sparkline.
Bearer-token auth on the admin API.

Out of scope for v1: multi-user/team support, SSO, mobile push, on-call rotations, complex schedules (cron expressions). Those belong in a paid layer or a follow-up.

Where it could go

The obvious expansion is richer schedule awareness: let the user specify a cron expression and have the checker compute the exact next-expected window rather than a rolling interval. This handles jobs that run on irregular schedules (e.g., "every weekday at 9 am") without raising false alerts during legitimate gaps such as weekends.

A second path is turning the service into a lightweight observability layer: accept a duration alongside the ping (?duration=42s) to track how long each run takes, surface p95 runtimes, and alert when a job that normally takes 30 seconds starts taking 10 minutes. That edges into APM territory but stays within the "one binary, one config file" philosophy.

Watch out for

The grace-period UX is easy to get wrong: too tight and you flood on-call channels with noise for jobs that occasionally run 90 seconds late; too loose and real failures go unnoticed for too long. Defaulting to a grace period of 10% of the interval and making it prominently configurable per job is probably the right call.