The idea
A small web app where you define your team, set a rotation interval, and get an auto-generated schedule for the next 90 days. The incoming on-call engineer gets a notification 24 hours before their shift starts. When an incident fires (from a monitoring webhook, a manual trigger in the UI, or an email to a dedicated address), the app logs it and alerts the on-call person via SMS or push notification. Everything runs in a single Docker container on your own VPS, with no per-seat pricing and no on-call vendor account required (Twilio is only needed if you enable SMS).
Why build this
PagerDuty and Opsgenie start at roughly $15–20 per user per month, which is real money for a three-person startup or a small internal tools team inside a larger company. The full feature set (ML-based noise reduction, ServiceNow integrations, enterprise SSO) is overkill for teams that need exactly two things: "who's on call this week" and "how do we wake them up when something breaks." Cheaper alternatives like Squadcast and Signl4 still require vendor-managed alerting infrastructure. A self-hosted option running on a $5 VPS closes that gap with no per-seat fees; the only recurring costs are the server and any SMS usage.
The technical surface is small: a schedule generator, a job that fires reminders, a webhook endpoint that triggers notifications, and a basic incident log. Nothing exotic.
Stack sketch
- Backend: Python + FastAPI; APScheduler for the 24-hour shift reminder job (and, after v1, escalation timeout jobs)
- Storage: SQLite via SQLAlchemy — one file, one volume mount, trivial to back up
- Frontend: HTMX + Jinja2 templates; a calendar grid rendered server-side, no JS build step
- Notifications: Twilio for SMS; Ntfy (self-hosted or free cloud tier) for push; smtplib for email — team members configure which channels they want
- Inbound webhooks: a single POST endpoint that accepts arbitrary JSON; ships with named parsers for Grafana, Uptime Kuma, and Prometheus Alertmanager payloads
- Auth: single shared team token in a signed cookie; optional per-user bcrypt passwords for teams that need accountability in the incident log
- Deploy: single Docker image, SQLite in a named volume, Traefik label pattern for TLS
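To make the "named parsers" idea concrete, here is a minimal sketch of a parser registry in plain Python. The `normalize` function, the `PARSERS` registry, and the normalized incident shape are all hypothetical names chosen for illustration. Only the Alertmanager parser is shown; it relies on the top-level `status` and `alerts` fields (with per-alert `labels` and `annotations`) that Alertmanager's webhook body carries. A FastAPI route would presumably pass the decoded JSON body into `normalize`.

```python
from typing import Any, Callable

# Hypothetical normalized shape the rest of the app would consume.
Incident = dict[str, str]


def parse_alertmanager(payload: dict[str, Any]) -> Incident:
    """Summarize a Prometheus Alertmanager webhook body.

    Reads the top-level "status" and the first entry of "alerts",
    whose items carry "labels" and "annotations".
    """
    alerts = payload.get("alerts") or [{}]
    first = alerts[0]
    labels = first.get("labels", {})
    annotations = first.get("annotations", {})
    return {
        "source": "alertmanager",
        "status": str(payload.get("status", "unknown")),
        "title": str(labels.get("alertname", "unnamed alert")),
        "detail": str(annotations.get("summary", "")),
    }


def parse_generic(payload: dict[str, Any]) -> Incident:
    """Fallback for arbitrary JSON: still an incident, but guess nothing."""
    return {
        "source": "generic",
        "status": "firing",
        "title": "Incoming webhook",
        "detail": "",
    }


# Registry keyed by parser name, e.g. taken from the URL path (an assumption).
PARSERS: dict[str, Callable[[dict[str, Any]], Incident]] = {
    "alertmanager": parse_alertmanager,
}


def normalize(parser_name: str, payload: dict[str, Any]) -> Incident:
    """Pick a named parser, falling back to the generic one."""
    return PARSERS.get(parser_name, parse_generic)(payload)
```

Because unknown parser names fall through to `parse_generic`, any POST still produces an incident, which matches the "accepts arbitrary JSON" goal. The raw payload would be logged separately.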
Scope for v1
- Add team members: name, phone number, email address, Ntfy topic — each member enables only the channels they want
- Define one rotation: weekly or biweekly interval, start date, ordered list of members; app generates the schedule forward from that
- Calendar grid view showing who is on call each day for the next 90 days, with a clear "on call now" banner on the home page
- Inbound webhook endpoint: any POST triggers an incident, logs the raw payload, and notifies the active on-call person
- Manual "test alert" button so teams can verify delivery before going live
- Incident log: timestamp, trigger source (webhook name or manual), optional resolution note added by the on-call engineer
- 24-hour shift reminder sent automatically before each new shift starts
Deliberately out of scope for v1: escalation policies, multiple concurrent rotations, recurring overrides, SLA tracking, mobile app, Slack integration.
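The schedule generator at the heart of v1 is mostly date arithmetic. The sketch below is one possible shape, not a settled design: `Shift`, `generate_schedule`, `on_call`, and `reminder_time` are hypothetical names, and the 09:00 shift start is an assumption, since the scope above does not say what time of day a shift begins.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import Optional


@dataclass(frozen=True)
class Shift:
    member: str
    start: date  # first day of the shift (inclusive)
    end: date    # day the next shift begins (exclusive)


def generate_schedule(members: list[str], start: date, interval_days: int,
                      today: date, horizon_days: int = 90) -> list[Shift]:
    """Return every shift that overlaps [today, today + horizon_days)."""
    if not members or interval_days <= 0:
        raise ValueError("need at least one member and a positive interval")
    horizon_end = today + timedelta(days=horizon_days)
    # Index of the shift containing `today` (0 if the rotation starts later).
    index = max(0, (today - start).days // interval_days)
    shifts = []
    while True:
        shift_start = start + timedelta(days=index * interval_days)
        if shift_start >= horizon_end:
            break
        shifts.append(Shift(
            member=members[index % len(members)],  # cycle through the ordered list
            start=shift_start,
            end=shift_start + timedelta(days=interval_days),
        ))
        index += 1
    return shifts


def on_call(shifts: list[Shift], day: date) -> Optional[str]:
    """Name of the member covering `day`, or None outside the schedule."""
    for shift in shifts:
        if shift.start <= day < shift.end:
            return shift.member
    return None


def reminder_time(shift: Shift, shift_hour: int = 9) -> datetime:
    """When to send the reminder: 24 hours before the shift starts.

    `shift_hour` (assumed 09:00 local time) is an illustrative default.
    """
    return datetime.combine(shift.start, time(shift_hour)) - timedelta(hours=24)
```

A weekly rotation is `interval_days=7` and a biweekly one is `interval_days=14`. Computing the schedule from the start date on demand, rather than storing every shift, means a change to the member list only needs a recompute.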
Where it could go
The most-requested follow-on for any on-call tool is escalation: if the on-call person does not acknowledge within N minutes, notify a secondary. Adding this requires a state machine on the incident — triggered → acknowledged → resolved — and a timed job that fires the escalation on timeout. It is maybe two weekends of work on top of v1, and it closes the gap with PagerDuty's core feature for small teams.
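As a rough illustration of that state machine, the sketch below models the three states and the timeout check a scheduled job might run. All names here (`State`, `Incident`, `check_escalation`, the `notify_secondary` callback) are hypothetical, and the transitions are kept strictly linear to match the triggered → acknowledged → resolved flow described above. Whether to also allow resolving without acknowledging is a design choice left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable


class State(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Allowed moves; anything not listed is rejected.
TRANSITIONS = {
    State.TRIGGERED: {State.ACKNOWLEDGED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
}


@dataclass
class Incident:
    triggered_at: datetime
    state: State = State.TRIGGERED
    escalated: bool = False

    def move_to(self, new_state: State) -> None:
        """Apply a transition, raising ValueError if it is not allowed."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state

    def should_escalate(self, now: datetime, timeout: timedelta) -> bool:
        """True if still unacknowledged after `timeout` and not yet escalated."""
        return (
            self.state is State.TRIGGERED
            and not self.escalated
            and now - self.triggered_at >= timeout
        )


def check_escalation(incident: Incident, now: datetime, timeout: timedelta,
                     notify_secondary: Callable[[], None]) -> bool:
    """What the timed job would run: notify the secondary at most once."""
    if incident.should_escalate(now, timeout):
        notify_secondary()
        incident.escalated = True
        return True
    return False
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the timeout logic easy to test without waiting for real time to pass.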
A second direction is richer monitoring integrations. Extending the bundled parsers beyond Grafana, Alertmanager, and Uptime Kuma to other common webhook shapes (Checkly, for example) means more teams can drop in the URL and have it just work, without hand-editing a JSON mapping. Pair that with a "silence" endpoint that mutes a specific alert rule during a maintenance window and the tool starts to feel like a real replacement for the lower tiers of commercial products.
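The silence feature could be as small as a map from alert rule name to an expiry time, checked before any notification goes out. The following is a minimal in-memory sketch under that assumption. `SilenceBook` and its methods are hypothetical names, and a real version would presumably persist the entries in SQLite.

```python
from datetime import datetime, timedelta


class SilenceBook:
    """In-memory map of alert rule name to the time its silence expires."""

    def __init__(self) -> None:
        self._until: dict[str, datetime] = {}

    def silence(self, rule: str, start: datetime, duration: timedelta) -> datetime:
        """Mute `rule` until start + duration and return the expiry time."""
        expires = start + duration
        self._until[rule] = expires
        return expires

    def is_silenced(self, rule: str, now: datetime) -> bool:
        """True while `now` is before the rule's expiry; drops expired entries."""
        expires = self._until.get(rule)
        if expires is None:
            return False
        if now >= expires:
            del self._until[rule]
            return False
        return True
```

The webhook handler would still log a silenced incident and only skip the notification step, so the maintenance window leaves an audit trail.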
Watch out for
SMS delivery through Twilio can be unreliable in some regions, and short messages that look like automated alerts may be filtered by carriers. Test with real devices on real carriers before treating SMS as the primary channel. Build Ntfy push as the default and present SMS as a fallback, so teams in areas with patchy SMS delivery are not silently uncontactable during an incident.
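One way to wire up "push first, SMS as fallback" is an ordered list of channel senders tried in turn. In the sketch below, `notify_with_fallback`, `make_ntfy_sender`, and the `Sender` type are hypothetical names. The Ntfy sender uses Ntfy's plain HTTP publish (a POST of the message body to the topic URL), and the SMS sender is left as something the caller supplies, since the Twilio wiring is not specified here.

```python
import urllib.request
from typing import Callable, Iterable

# A sender takes the message text and raises on failure.
Sender = Callable[[str], None]


def make_ntfy_sender(topic: str, server: str = "https://ntfy.sh") -> Sender:
    """Build a sender that publishes to an Ntfy topic via HTTP POST."""
    def send(message: str) -> None:
        request = urllib.request.Request(
            f"{server}/{topic}",
            data=message.encode("utf-8"),
            method="POST",
            headers={"Title": "On-call alert", "Priority": "high"},
        )
        # urlopen raises on network errors and on 4xx/5xx responses.
        with urllib.request.urlopen(request, timeout=10):
            pass
    return send


def notify_with_fallback(message: str,
                         channels: Iterable[tuple[str, Sender]]) -> str:
    """Try channels in order; return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any delivery failure moves to the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))
```

A call might look like `notify_with_fallback("db down", [("ntfy", make_ntfy_sender("my-team-topic")), ("sms", send_sms)])`, where `send_sms` is whatever wraps the SMS provider. Note that a successful publish only means the server accepted the message, not that a phone buzzed, which is why the manual test alert matters.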