How a 60-Second Script Saved Us From Countless Hours of Downtime with OpenClaw
You know that feeling when you wake up to a flood of support tickets? Or worse—you don't wake up, and your customers do the waking for you?
That was us. Repeatedly.
The Problem Nobody Warns You About
When you're running a SaaS business, there’s infrastructure stuff that “should just work.” For us, it was OpenClaw’s gateway process—the coordinator that keeps our AI agent ecosystem running.
Should work. But didn’t always.
The gateway would crash. Sometimes at 3am. Sometimes during a demo. Always at the worst time, because that’s how these things operate.
We’d get alerted (eventually). Someone would SSH in. Run some commands. Restart things. Cross fingers.
The problem wasn’t complexity. It was reliability.
Nothing about the gateway itself was fundamentally broken. But any long-running process can hit memory issues, network hiccups, or just… quit. And when it quit, everything downstream stopped.
What We Built (And Why It’s Simpler Than You Think)
We built a watchdog. A tiny script that runs every 60 seconds and asks one question:
“Is the gateway alive?”
If yes: great, do nothing.
If no: recover it automatically.
Recovery ladder:
- First try: openclaw gateway restart
- If that fails: full stop/start sequence
- If that fails: send Telegram alert with failure context
That’s it.
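The whole ladder fits in a handful of shell lines. Here's a minimal sketch, not the actual script: it assumes a `gateway status` subcommand for the health check (only `restart`, `stop`, and `start` are named above, so `status` is a guess) and placeholder Telegram credentials in `TELEGRAM_TOKEN` and `TELEGRAM_CHAT_ID`:

```shell
#!/bin/sh
# Watchdog sketch: one health check, then the recovery ladder.
# Subcommand names mirror the post; your OpenClaw CLI may differ.

gateway_alive() {
  # Hypothetical health probe: zero exit status means "alive".
  openclaw gateway status >/dev/null 2>&1
}

send_alert() {
  # Telegram Bot API alert; token and chat id are placeholders.
  curl -s "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=Gateway watchdog: $1" >/dev/null
}

watchdog() {
  if gateway_alive; then
    return 0                        # Healthy: do nothing.
  fi
  openclaw gateway restart          # Rung 1: plain restart.
  gateway_alive && return 0
  openclaw gateway stop             # Rung 2: full stop/start sequence.
  openclaw gateway start
  gateway_alive && return 0
  send_alert "auto-recovery failed" # Rung 3: escalate to a human.
  return 1
}

# launchd invokes this script every 60 seconds; each run is one check.
watchdog
```

Note the ordering: the script only reaches a human when both automated rungs have failed, which is what keeps the pager quiet.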
The script runs via launchd, so it’s always watching. No heavy monitoring stack, no extra services to babysit. Just one script with one job.
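The launchd side is just a small property list. A sketch of what that job definition could look like, with a hypothetical label and script path (`com.example.gateway-watchdog`, `/usr/local/bin/gateway-watchdog.sh`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Hypothetical label and script path; adjust to your setup. -->
  <key>Label</key>
  <string>com.example.gateway-watchdog</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/gateway-watchdog.sh</string>
  </array>
  <!-- Fire the job every 60 seconds. -->
  <key>StartInterval</key>
  <integer>60</integer>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```

Drop it in `~/Library/LaunchAgents` and load it with `launchctl`; `StartInterval` re-runs the script every 60 seconds, so each invocation is one check.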
The Moments That Proved It Mattered
- The watchdog detected unhealthy gateway states and recovered them automatically with a restart.
- When restart didn’t work, it escalated to stop/start.
- When auto-recovery failed, it alerted us immediately via Telegram so we could intervene.
In practice: fewer surprises, faster recovery, less firefighting.
What This Means for Your Business
If you’re a founder or operator, your highest-leverage work is strategy, customers, and execution—not manual infra babysitting.
Every minute spent firefighting avoidable downtime is a minute not spent moving the business.
The watchdog doesn’t make systems perfect. It makes them resilient.
And resilience is what lets you scale without constantly looking over your shoulder.
Three Takeaways
1) Automate recovery, not just detection
Detection without action still wakes someone up. Build auto-recovery first, alert second.
2) Simple beats clever
Our watchdog is small and understandable. That’s why it works reliably.
3) Reliability is a growth lever
Customers don’t care how elegant your architecture is. They care that things work.
Where to Start
You don’t need this exact script. You need the principle:
- Identify your single points of failure
- Add lightweight health checks
- Add an automatic recovery ladder
- Alert only when automation can’t recover
Start with one critical process and one script.
Then ship it and move on.