How a 60-Second Script Saved Us From Countless Hours of Downtime with OpenClaw
You know that feeling when you wake up to a flood of support tickets? Or worse—you don't wake up, and your customers do the waking for you?
That was us. Repeatedly.
The Problem Nobody Warns You About
When you're running a SaaS business, there’s infrastructure stuff that “should just work.” For us, it was OpenClaw’s gateway process—the coordinator that keeps our AI agent ecosystem running.
Should work. But didn’t always.
The gateway would crash. Sometimes at 3am. Sometimes during a demo. Always at the worst time, because that’s how these things operate.
We’d get alerted (eventually). Someone would SSH in. Run some commands. Restart things. Cross fingers.
The problem wasn’t complexity. It was reliability.
Nothing about the gateway itself was fundamentally broken. But any long-running process can hit memory issues, network hiccups, or just… quit. And when it quit, everything downstream stopped.
What We Built (And Why It’s Simpler Than You Think)
We built a watchdog. A tiny script that runs every 60 seconds and asks one question:
“Is the gateway alive?”
If yes: great, do nothing.
If no: recover it automatically.
Recovery ladder:
- First try: openclaw gateway restart
- If that fails: full stop/start sequence
- If that fails: send Telegram alert with failure context
That’s it.
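The whole ladder fits in a handful of shell lines. Here's a minimal sketch, not the actual script: it assumes a `gateway status` subcommand for the health check (only `restart`, `stop`, and `start` are named above, so `status` is a guess) and placeholder Telegram credentials in `TELEGRAM_TOKEN` and `TELEGRAM_CHAT_ID`:

```shell
#!/bin/sh
# Watchdog sketch: one health check, then the recovery ladder.
# Subcommand names mirror the post; your OpenClaw CLI may differ.

gateway_alive() {
  # Hypothetical health probe: zero exit status means "alive".
  openclaw gateway status >/dev/null 2>&1
}

send_alert() {
  # Telegram Bot API alert; token and chat id are placeholders.
  curl -s "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=Gateway watchdog: $1" >/dev/null
}

watchdog() {
  if gateway_alive; then
    return 0                        # Healthy: do nothing.
  fi
  openclaw gateway restart          # Rung 1: plain restart.
  gateway_alive && return 0
  openclaw gateway stop             # Rung 2: full stop/start sequence.
  openclaw gateway start
  gateway_alive && return 0
  send_alert "auto-recovery failed" # Rung 3: escalate to a human.
  return 1
}

# launchd invokes this script every 60 seconds; each run is one check.
watchdog
```

Note the ordering: the script only reaches a human when both automated rungs have failed, which is what keeps the pager quiet.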
The script runs via launchd, so it’s always watching. No heavy monitoring stack, no extra services to babysit. Just one script with one job.
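The launchd side is just a small property list. A sketch of what that job definition could look like, with a hypothetical label and script path (`com.example.gateway-watchdog`, `/usr/local/bin/gateway-watchdog.sh`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Hypothetical label and script path; adjust to your setup. -->
  <key>Label</key>
  <string>com.example.gateway-watchdog</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/gateway-watchdog.sh</string>
  </array>
  <!-- Fire the job every 60 seconds. -->
  <key>StartInterval</key>
  <integer>60</integer>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```

Drop it in `~/Library/LaunchAgents` and load it with `launchctl`; `StartInterval` re-runs the script every 60 seconds, so each invocation is one check.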
The Moments That Proved It Mattered
- The watchdog detected unhealthy gateway states and recovered them automatically with a restart.
- When restart didn’t work, it escalated to stop/start.
- When auto-recovery failed, it alerted us immediately via Telegram so we could intervene.
In practice: fewer surprises, faster recovery, less firefighting.
What This Means for Your Business
If you’re a founder or operator, your highest-leverage work is strategy, customers, and execution—not manual infra babysitting.
Every minute spent firefighting avoidable downtime is a minute not spent moving the business.
The watchdog doesn’t make systems perfect. It makes them resilient.
And resilience is what lets you scale without constantly looking over your shoulder.
Three Takeaways
1) Automate recovery, not just detection
Detection without action still wakes someone up. Build auto-recovery first, alert second.
2) Simple beats clever
Our watchdog is small and understandable. That’s why it works reliably.
3) Reliability is a growth lever
Customers don’t care how elegant your architecture is. They care that things work.
Where to Start
You don’t need this exact script. You need the principle:
- Identify your single points of failure
- Add lightweight health checks
- Add an automatic recovery ladder
- Alert only when automation can’t recover
Start with one critical process and one script.
Then ship it and move on.