Stalled AI agent / loop
The most expensive failure in AI code isn’t a crash but silence. The agent thinks, calls tools, and spends tokens, but never finishes on its own:
- it keeps repeating the same tool calls with different arguments;
- it waits for a response from an external service that quietly died (and the LLM has no timeout);
- it is stuck in “let me check one more time”, and the planner stack has grown to 30 levels.
Notifly heartbeat notifications protect against this.
Template 1: heartbeat «agent alive, step N»
Create a heartbeat in the admin dashboard with an interval 1.5–2× the maximum expected duration of a single agent step. Put the resulting URL into the loop:

import os, requests

PING_URL = os.environ["AGENT_PING_URL"]  # https://.../heartbeat/ping/H...

def step(state):
    # one step of the agent loop: tool / model / tool / ...
    ...
    try:
        requests.get(PING_URL, timeout=3)  # "I'm alive" marker
    except requests.RequestException:
        pass  # a failed ping must not crash the agent; the missed ping triggers the alert anyway
    return new_state

If a step doesn’t fit within intervalSec + graceSec, Notifly will send an alert.
You should also enable a recovery message: when the agent “comes back to life” (sends the next ping), you’ll receive a confirmation.
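The sizing rule for this template can be written out as a short calculation. This is only a sketch: heartbeat_window is a hypothetical helper, the 30-second grace period is an assumed example value, and intervalSec / graceSec are the settings named above.

```python
def heartbeat_window(max_step_sec, factor=1.5, grace_sec=30):
    """Return (interval_sec, alert_after_sec) for a worst-case step duration.

    interval_sec    -> value to configure as intervalSec (1.5-2x the slowest step)
    alert_after_sec -> intervalSec + graceSec, the silence after which an alert fires
    """
    if not 1.5 <= factor <= 2.0:
        raise ValueError("factor should stay within the 1.5-2x range")
    interval_sec = max_step_sec * factor
    return interval_sec, interval_sec + grace_sec


# Slowest expected step: 40 s. Interval 60 s, alert after 90 s without a ping.
interval_sec, alert_after_sec = heartbeat_window(40)
```

A larger factor tolerates slow steps better but delays the alert by the same amount.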
Template 2: heartbeat «total duration»
If the steps are short and the problem is that the agent is going in circles, the per-step heartbeat won’t catch it, because every step still pings on time. Here you need a top-level timer:

import os, time, threading, requests

DEADLINE = time.time() + 30 * 60  # 30 minutes: the cap

def watchdog():
    # sleep until the deadline has passed, then send a high-priority push
    while time.time() < DEADLINE:
        time.sleep(30)
    requests.post(
        f"{os.environ['NOTIFLY_URL']}/message",
        params={"token": os.environ["NOTIFLY_TOKEN"]},
        json={
            "title": "🤖⏳ Agent exceeded 30 minutes",
            "message": "Looks like a loop. Check the logs / kill the process.",
            "priority": 9,
        },
        timeout=5,
    )

threading.Thread(target=watchdog, daemon=True).start()
agent.run(...)  # if it finishes in time, the thread dies with the process

Template 3: catching trivial loops in tool calls
Hash each of the last N tool calls: if the same hash keeps recurring, it is almost certainly a loop. This gives you early detection and a push notification, so you can stop the agent immediately.
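The guard in this template reports through a notify(title, message, prio) helper. Below is a minimal sketch of such a helper, assuming it posts to the same /message endpoint with the same token, title, message and priority fields as the Template 2 request. The build_push name, the default priority and the boolean return value are illustrative choices, not a documented Notifly API. It uses only the standard library so it runs without extra dependencies.

```python
import json
import os
import urllib.parse
import urllib.request


def build_push(title, message, prio=5):
    """JSON body with the same fields as the Template 2 request."""
    return {"title": title, "message": message, "priority": prio}


def notify(title, message, prio=5, timeout=5):
    """Send a push; return True on a 2xx reply, False on any failure."""
    try:
        query = urllib.parse.urlencode({"token": os.environ["NOTIFLY_TOKEN"]})
        request = urllib.request.Request(
            f"{os.environ['NOTIFLY_URL']}/message?{query}",
            data=json.dumps(build_push(title, message, prio)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (KeyError, ValueError, OSError):
        # Missing configuration or a network error must not crash the agent.
        return False
```

Swallowing errors here is deliberate: the loop guard still raises its own exception after calling notify, so a failed push does not hide the loop.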

import collections, hashlib, json

class LoopGuard:
    def __init__(self, window=6, repeats=3):
        self.window = window
        self.repeats = repeats
        self.hist = collections.deque(maxlen=window)  # hashes of the most recent calls

    def step(self, tool_name, args):
        # fingerprint the call: the same tool with the same arguments gives the same hash
        h = hashlib.sha1(json.dumps([tool_name, args], sort_keys=True).encode()).hexdigest()
        self.hist.append(h)
        # once the window is full, flag a call that occupies `repeats` or more of its slots
        if len(self.hist) == self.window and self.hist.count(h) >= self.repeats:
            # notify(title, text, prio) is a helper that sends a Notifly push
            notify("🔁 Agent in a loop",
                   f"{tool_name} repeated with the same arguments {self.repeats}+ times",
                   prio=9)
            raise RuntimeError("agent loop detected")

guard = LoopGuard()
for tool_call in agent.iter():
    guard.step(tool_call.name, tool_call.arguments)
    tool_call.execute()

Template 4: external watchdog via Notifly heartbeat
If the agent doesn’t have a reliable place to “wrap the loop”, run it as a systemd service that pings the heartbeat:
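
[Service]
ExecStart=/usr/local/bin/run-agent.sh
ExecStartPost=/bin/sh -c 'while kill -0 $MAINPID; do curl -fsS $AGENT_PING_URL; sleep 60; done'

Here the heartbeat is pinged from outside the agent, so an alert arrives when the main process exits or the host goes down. Note that kill -0 only checks that the process exists: a process that is hung or stopped with kill -STOP still passes the check and keeps being pinged, so pair this with Template 1 or Template 2 to catch hangs.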
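Liveness can also be tied to progress rather than to the process merely existing. The sketch below assumes the agent touches a progress file on every step. The file, the is_fresh and ping_while_fresh names and the thresholds are all hypothetical, and ping is any callable that requests the heartbeat URL.

```python
import os
import time


def is_fresh(path, max_age_sec, now=None):
    """True if `path` was modified within the last `max_age_sec` seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) <= max_age_sec
    except OSError:
        return False  # a missing file counts as "no progress"


def ping_while_fresh(path, ping, max_age_sec=120, period_sec=60, sleep=time.sleep):
    """Call `ping()` every `period_sec` seconds while the progress file stays fresh.

    Returns the number of pings sent once the file goes stale or disappears.
    """
    pings = 0
    while is_fresh(path, max_age_sec):
        ping()
        pings += 1
        sleep(period_sec)
    return pings
```

Once the agent stops updating the file, the pings stop too, and the heartbeat alert fires after intervalSec + graceSec.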
What to send in the push
Include enough context to decide, straight from the Telegram chat, whether to kill the agent or let it think longer:
- how many seconds / iterations have passed;
- the last 3 tool calls (name, length of arguments);
- the current “plan” (if the agent publishes it);
- how many tokens and how much money have already been spent on this run.
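The items in this list can be assembled into a message body with a small formatter. This is a sketch under assumptions: format_status and its parameters are hypothetical names, tool calls are taken to be (name, arguments) pairs, and the cost is shown in US dollars only as an example.

```python
import time


def format_status(started_at, iterations, tool_calls, plan=None,
                  tokens=0, cost_usd=0.0, now=None):
    """Build a short, multi-line status text for a push notification."""
    now = time.time() if now is None else now
    lines = [f"elapsed: {int(now - started_at)}s, iterations: {iterations}"]
    # Only the last 3 calls, and only the size of the arguments, not their content.
    for name, args in tool_calls[-3:]:
        lines.append(f"tool: {name} (args: {len(str(args))} chars)")
    if plan:
        lines.append(f"plan: {plan}")
    lines.append(f"spent: {tokens} tokens, ${cost_usd:.2f}")
    return "\n".join(lines)
```

Sending the length of the arguments instead of the arguments themselves keeps the push short and avoids leaking payloads into a chat.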
Related recipes
- Finishing a long task: the opposite case, “finished in time”.
- LLM API costs: sometimes the first signal of a loop is the bill.
- Heartbeat (dead-man-switch): the general concept.