Stalled AI agent / loop

The most expensive failure in AI agent code isn’t a crash, it’s silence. The agent thinks, calls tools, and spends tokens, but never finishes on its own. Typically it:

keeps repeating the same tool calls with different arguments;
waits for a response from an external service that quietly died (and the LLM has no timeout);
is stuck in “let me check one more time” while the planner stack grows to 30 levels.

Notifly heartbeat notifications protect against this.

Template 1: heartbeat “agent alive, step N”

Create a heartbeat in the admin dashboard with an interval of 1.5–2× the maximum expected duration of a single agent step. Put the resulting URL into the loop:

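import os, requests

PING_URL = os.environ["AGENT_PING_URL"]   # https://.../heartbeat/ping/H...

def step(state):
    # one step of the agent loop: tool / model / tool / ...
    new_state = ...                        # placeholder: compute the next state here
    try:
        requests.get(PING_URL, timeout=3)  # "I'm alive" marker
    except requests.RequestException:
        pass                               # a failed ping must not crash the agent
    return new_state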

If a step doesn’t fit within intervalSec + graceSec, Notifly will send an alert. Enable the recovery message as well: when the agent “comes back to life” (sends the next ping), you’ll receive a confirmation.
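
The sizing rule from this template (an interval of 1.5–2× the longest expected step, with the alert firing after intervalSec + graceSec) can be turned into a small calculation. The sketch below is illustrative only: heartbeat_settings and grace_ratio are hypothetical names, and the 25% grace period is an assumption, not a Notifly default.

```python
import math

def heartbeat_settings(step_durations_sec, factor=1.5, grace_ratio=0.25):
    """Suggest heartbeat timing from observed step durations (all in seconds)."""
    if not step_durations_sec:
        raise ValueError("need at least one observed step duration")
    if not 1.5 <= factor <= 2.0:
        raise ValueError("factor should stay within the 1.5-2x rule of thumb")
    longest = max(step_durations_sec)
    interval = math.ceil(longest * factor)       # value for the dashboard interval
    grace = math.ceil(interval * grace_ratio)    # assumed: grace is 25% of the interval
    return {"intervalSec": interval,
            "graceSec": grace,
            "alertAfterSec": interval + grace}   # silence longer than this triggers the alert

if __name__ == "__main__":
    # Step durations measured on a few real runs of your agent.
    print(heartbeat_settings([12.0, 40.0, 95.5], factor=2.0))
```

Feed it durations measured on real runs rather than guesses: an interval that is too tight produces false alerts on every slow tool call.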

Template 2: heartbeat “total duration”

If steps are short and the problem is that the agent is going in circles, the heartbeat won’t catch it, because every step still pings on time. Here you need a top-level timer:

import os, time, threading, requests

DEADLINE = time.time() + 30 * 60   # 30 minutes is the cap

def watchdog():
    while time.time() < DEADLINE:
        time.sleep(30)
    requests.post(f"{os.environ['NOTIFLY_URL']}/message",
                  params={"token": os.environ["NOTIFLY_TOKEN"]},
                  json={"title":   "🤖⏳ Agent exceeded 30 minutes",
                        "message": "Looks like a loop. Check the logs / kill the process.",
                        "priority": 9},
                  timeout=5)

threading.Thread(target=watchdog, daemon=True).start()
agent.run(...)   # if it finishes in time, the daemon thread dies with the process
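
If the process keeps running after the agent returns (for example, a worker that handles many runs), the thread above would still fire at the 30-minute mark. One possible variation, sketched here with the standard library's threading.Timer, disarms the deadline as soon as the run ends. The names start_deadline and run_with_deadline are hypothetical, and on_timeout stands for whatever sends your push.

```python
import threading

def start_deadline(seconds, on_timeout):
    """Call on_timeout() once if cancel() is not called within `seconds`."""
    timer = threading.Timer(seconds, on_timeout)
    timer.daemon = True   # never keep the process alive just for the alert
    timer.start()
    return timer

def run_with_deadline(run, seconds, on_timeout):
    """Run `run()` and disarm the deadline as soon as it returns or raises."""
    timer = start_deadline(seconds, on_timeout)
    try:
        return run()
    finally:
        timer.cancel()    # no-op if the timer has already fired
```

A call could look like run_with_deadline(lambda: agent.run(...), 30 * 60, send_alert), where send_alert is your own function. Note that the timer only alerts; it does not stop the agent.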

Template 3: catching trivial loops in tool calls

Hash each tool call (name plus arguments) and keep the last N hashes: if the same hash shows up several times in that window, it’s almost certainly a loop. Detection is early, you get a push notification, and you can stop the agent immediately.

import collections, hashlib, json

class LoopGuard:
    def __init__(self, window=6, repeats=3):
        self.window  = window
        self.repeats = repeats
        self.hist    = collections.deque(maxlen=window)

    def step(self, tool_name, args):
        # default=str keeps hashing from failing on arguments that are not JSON-serializable
        h = hashlib.sha1(json.dumps([tool_name, args], sort_keys=True, default=str).encode()).hexdigest()
        self.hist.append(h)
        if len(self.hist) == self.window and self.hist.count(h) >= self.repeats:
            # notify() is assumed to be your own helper that sends a Notifly push
            notify("🔁 Agent in a loop",
                   f"{tool_name} repeated with the same arguments {self.repeats}+ times",
                   prio=9)
            raise RuntimeError("agent loop detected")

guard = LoopGuard()
for tool_call in agent.iter():
    guard.step(tool_call.name, tool_call.arguments)
    tool_call.execute()
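
The guard above calls notify(), which is not defined in this recipe. A minimal sketch of such a helper follows, assuming the same /message endpoint, token parameter, and JSON fields as in Template 2. The function build_request and the default priority are hypothetical, and the standard library's urllib is used here in place of requests only to keep the example dependency-free.

```python
import json, os, urllib.parse, urllib.request

def build_request(title, message, prio, base_url, token):
    """Build the POST request for the /message endpoint used in Template 2."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{base_url.rstrip('/')}/message?{query}"
    body = json.dumps({"title": title, "message": message, "priority": prio}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST",
                                  headers={"Content-Type": "application/json"})

def notify(title, message, prio=5):
    """Send a push; swallow network errors so alerting never crashes the agent."""
    req = build_request(title, message, prio,
                        os.environ["NOTIFLY_URL"], os.environ["NOTIFLY_TOKEN"])
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:   # URLError, HTTPError and socket timeouts are all OSError subclasses
        return False
```

Returning False instead of raising is a deliberate choice in this sketch: the guard still raises RuntimeError afterwards, so the loop stops even when the push could not be delivered.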

Template 4: external watchdog via Notifly heartbeat

If the agent doesn’t have a reliable place to “wrap the loop”, run it as a systemd service that pings the heartbeat:

[Service]
ExecStart=/usr/local/bin/run-agent.sh
ExecStartPost=/bin/sh -c 'while kill -0 $MAINPID; do curl -fsS $AGENT_PING_URL; sleep 60; done'

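Here the heartbeat is pinged from outside the agent’s code, and an alert arrives when the main process exits or the whole host goes down. Note that kill -0 only checks that the process exists, so a process that is alive but frozen (including after kill -STOP) keeps the pings flowing; to catch such hangs, combine this with the in-loop ping from Template 1. Also, because ExecStartPost commands are expected to exit, a loop that never returns can keep the unit in the “activating” state until the start timeout expires; if that happens, move the loop into a wrapper script or a separate unit.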
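
The same outside-the-agent idea can also be sketched as a small Python supervisor that starts the agent as a child process and pings while the child is alive. This is an illustration under assumptions, not part of Notifly: supervise and its parameters are hypothetical names, and ping is any callable you supply (for example, a GET to the heartbeat URL).

```python
import subprocess, time

def supervise(cmd, ping, interval_sec=60.0, poll_sec=0.5):
    """Run `cmd`, call ping() every interval_sec while it is alive, return its exit code."""
    proc = subprocess.Popen(cmd)
    next_ping = time.monotonic()            # first ping goes out immediately
    while proc.poll() is None:              # None means the child is still running
        if time.monotonic() >= next_ping:
            try:
                ping()
            except Exception:
                pass                        # a failed ping should not take the agent down
            next_ping = time.monotonic() + interval_sec
        time.sleep(poll_sec)
    return proc.returncode
```

For example, supervise(["/usr/local/bin/run-agent.sh"], lambda: requests.get(PING_URL, timeout=3)). Like the shell loop, this only notices that the child has exited; it does not detect a child that is alive but stuck.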

What to send in the push

Include enough context to decide right from the Telegram window whether to kill the agent or let it think more:

how many seconds / iterations have passed;
the last 3 tool calls (name, length of arguments);
the current “plan” (if the agent publishes it);
how many tokens / how much money has already been spent on this run.
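
The fields listed above can be assembled into a compact message body. The sketch below is one possible layout, not a format Notifly requires: format_status and its parameter names are hypothetical, and last_calls is assumed to be a list of (tool name, serialized arguments) pairs collected by your own loop.

```python
def format_status(elapsed_sec, iterations, last_calls, plan=None, tokens=0, cost_usd=0.0):
    """Build a push body; last_calls is a list of (tool_name, arguments_json) pairs."""
    lines = [f"Elapsed: {int(elapsed_sec)}s, iterations: {iterations}"]
    for name, args in last_calls[-3:]:                  # only the last 3 tool calls
        lines.append(f"- {name} (args: {len(args)} chars)")
    if plan:                                            # the plan is optional
        lines.append(f"Plan: {plan}")
    lines.append(f"Spent: {tokens} tokens, ${cost_usd:.2f}")
    return "\n".join(lines)

if __name__ == "__main__":
    calls = [("search", '{"q": "a"}'), ("fetch", '{"url": "x"}'),
             ("search", '{"q": "a"}'), ("search", '{"q": "b"}')]
    print(format_status(754.2, 41, calls, plan="verify the result once more",
                        tokens=182000, cost_usd=3.4))
```

Sending only the length of the arguments, as the list suggests, keeps the push short and avoids leaking prompt contents or secrets into a chat window.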

Finishing a long task — the opposite: “finished in time”.
LLM API costs — sometimes the first signal of a loop is the bill.
Heartbeat (dead-man-switch) — the general concept.