Stalled AI agent / loop

The most expensive failure in AI code isn’t a crash, it’s silence. The agent thinks, calls tools, spends tokens, but never finishes on its own:

  • it got stuck repeating the same tool calls with different arguments;
  • it is waiting for a response from an external service that quietly died (and the LLM call has no timeout);
  • it is stuck in a “let me check one more time” cycle while the planner stack has grown to 30 levels.

Notifly heartbeat notifications protect against this.

Template 1: heartbeat «agent alive, step N»

Create a heartbeat in the admin dashboard with an interval 1.5–2× longer than the maximum expected time of a single agent step. Put the resulting URL into the loop:

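import os
import requests

PING_URL = os.environ["AGENT_PING_URL"]  # https://.../heartbeat/ping/H...

def step(state):
    # one step of the agent loop: tool / model / tool / ...
    ...
    try:
        requests.get(PING_URL, timeout=3)  # "I'm alive" marker
    except requests.RequestException:
        pass  # a failed ping must not crash the agent; a missed ping raises the alert anyway
    return new_state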

If a step takes longer than intervalSec + graceSec, Notifly sends an alert. Enable the recovery message as well: when the agent “comes back to life” (sends the next ping), you receive a confirmation.
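
To choose the interval, it can help to see the deadline arithmetic spelled out. The following is a minimal sketch, assuming a heartbeat counts as missed once more than intervalSec + graceSec have passed since the last ping. The function name and the sample numbers are illustrative and are not part of the Notifly API.

```python
def is_overdue(last_ping_ts: float, now_ts: float,
               interval_sec: float, grace_sec: float) -> bool:
    """Return True once more than interval_sec + grace_sec have passed since the last ping."""
    return now_ts - last_ping_ts > interval_sec + grace_sec


# Hypothetical numbers: the slowest step takes about 60 s, so the interval is 2 x 60 = 120 s.
INTERVAL_SEC = 120
GRACE_SEC = 30

print(is_overdue(0, 100, INTERVAL_SEC, GRACE_SEC))  # False: the step pinged in time
print(is_overdue(0, 151, INTERVAL_SEC, GRACE_SEC))  # True: the step overran the deadline
```

With these numbers a single step may take up to 150 seconds before an alert would fire, which leaves headroom over the expected 60-second worst case.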

If the steps are short and the problem is that the agent is going in circles, the heartbeat won’t catch it, because every step still pings on time. Here you need a top-level timer:

import os
import threading
import time

import requests

DEADLINE = time.time() + 30 * 60  # 30 minutes: the cap

def watchdog():
    # sleep until the deadline has passed, then send the alert once
    while time.time() < DEADLINE:
        time.sleep(30)
    requests.post(
        f"{os.environ['NOTIFLY_URL']}/message",
        params={"token": os.environ["NOTIFLY_TOKEN"]},
        json={"title": "🤖⏳ Agent exceeded 30 minutes",
              "message": "Looks like a loop. Check the logs / kill the process.",
              "priority": 9},
        timeout=5,
    )

threading.Thread(target=watchdog, daemon=True).start()
agent.run(...)  # if it finishes in time, the thread dies with the process
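
The thread above only goes away when the process exits. In a long-lived worker that runs several agents in one process, a cancellable variant may be more convenient. This is a minimal sketch under that assumption: start_watchdog and on_timeout are illustrative names, and in practice on_timeout would send the Notifly message shown above.

```python
import threading


def start_watchdog(deadline_sec: float, on_timeout) -> threading.Event:
    """Call on_timeout() once if the returned event is not set within deadline_sec."""
    done = threading.Event()

    def watch() -> None:
        # Event.wait returns False when the timeout elapses before done.set() is called.
        if not done.wait(timeout=deadline_sec):
            on_timeout()

    threading.Thread(target=watch, daemon=True).start()
    return done


alerts = []
done = start_watchdog(30 * 60, lambda: alerts.append("agent exceeded 30 minutes"))
# ... the agent run would go here ...
done.set()  # the run finished in time, so the watchdog exits without alerting
```

Setting the event wakes the watchdog thread immediately, so no stale alert fires after a run that finished normally.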

Template 3: catching trivial loops in tool calls

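Hash each tool call (name plus arguments) and keep the last N hashes: if the same hash shows up several times within that window, it’s almost certainly a loop. This gives you early detection and a push notification, and you can stop the agent immediately.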

import collections
import hashlib
import json

class LoopGuard:
    def __init__(self, window=6, repeats=3):
        self.window = window
        self.repeats = repeats
        self.hist = collections.deque(maxlen=window)

    def step(self, tool_name, args):
        # hash of the call: tool name + arguments (args must be JSON-serializable)
        h = hashlib.sha1(json.dumps([tool_name, args], sort_keys=True).encode()).hexdigest()
        self.hist.append(h)
        if len(self.hist) == self.window and self.hist.count(h) >= self.repeats:
            # notify() is not defined here: it stands for your own helper that sends a Notifly message
            notify("🔁 Agent is in a loop",
                   f"{tool_name} repeated with the same arguments {self.repeats}+ times",
                   prio=9)
            raise RuntimeError("agent loop detected")

guard = LoopGuard()
for tool_call in agent.iter():
    guard.step(tool_call.name, tool_call.arguments)
    tool_call.execute()
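
The guard calls notify(), which this page does not define. Below is a minimal sketch of such a helper, assuming the same /message endpoint, token query parameter, and title/message/priority JSON fields as the watchdog request earlier on this page. The build_request function, the default priority of 5, and the use of the standard-library urllib instead of requests are choices made for this example only.

```python
import json
import os
import urllib.parse
import urllib.request


def build_request(base_url: str, token: str, title: str, message: str,
                  prio: int = 5) -> urllib.request.Request:
    """Build the POST request; kept separate so it can be inspected without network access."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{base_url.rstrip('/')}/message?{query}"
    body = json.dumps({"title": title, "message": message, "priority": prio}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def notify(title: str, message: str, prio: int = 5) -> bool:
    """Send a notification; return False instead of raising if delivery fails."""
    req = build_request(os.environ["NOTIFLY_URL"], os.environ["NOTIFLY_TOKEN"],
                        title, message, prio)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False
```

Returning False rather than raising keeps a notification outage from masking the RuntimeError that actually stops the looping agent.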

Template 4: external watchdog via Notifly heartbeat

If the agent doesn’t have a reliable place to “wrap the loop”, run it as a systemd service that pings the heartbeat:

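[Service]
ExecStart=/usr/local/bin/run-agent.sh
# systemd applies the start timeout to ExecStartPost as well; the ping loop runs
# for the lifetime of the agent, so the timeout is disabled here.
TimeoutStartSec=infinity
ExecStartPost=/bin/sh -c 'while kill -0 "$MAINPID" 2>/dev/null; do curl -fsS "$AGENT_PING_URL" >/dev/null; sleep 60; done'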

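Here the heartbeat is pinged from outside the agent, so an alert arrives whenever the main process disappears (crash, OOM kill, unexpected exit). Note that kill -0 only checks that the PID still exists: a process that is frozen (for example by kill -STOP) or blocked in a hung call still passes the check, so pair this template with the in-loop heartbeat from Template 1 if you also need to catch hangs.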

Put enough context into the alert that you can decide right from the Telegram chat whether to kill the agent or let it think more:

  • how many seconds / iterations have passed;
  • the last 3 tool calls (name, length of arguments);
  • the current “plan” (if the agent publishes it);
  • how many tokens and how much money this run has already spent.