LLM latency degradation

LLM providers rarely go down completely. Far more often you still get 200 OK, but time-to-first-token climbs from the usual 300 ms to 5 seconds, and your agent starts timing out and piling up retries. Standard uptime monitors miss this: they see 200 OK and report green.

We use an active monitor with our own cloud function: hit a cheap endpoint and alert when a latency threshold is exceeded.

Put this code into a Yandex Cloud Function with a timer-trigger * * * * ? * (once per minute):

import os, time, statistics, requests
import anthropic, openai

NOTIFLY_URL = os.environ["NOTIFLY_URL"]
NOTIFLY_TOKEN = os.environ["NOTIFLY_TOKEN"]
THRESHOLD_MS = 3000  # latency threshold, ms
WINDOW = 5           # window size: alert when the median of the last WINDOW requests exceeds the threshold

def measure_anthropic():
    # Cheapest possible request: one output token.
    c = anthropic.Anthropic()
    t0 = time.time()
    c.messages.create(model="claude-haiku-4-5", max_tokens=1,
                      messages=[{"role": "user", "content": "ping"}])
    return (time.time() - t0) * 1000

def measure_openai():
    c = openai.OpenAI()
    t0 = time.time()
    c.chat.completions.create(model="gpt-4o-mini", max_tokens=1,
                              messages=[{"role": "user", "content": "ping"}])
    return (time.time() - t0) * 1000

PROVIDERS = {"anthropic": measure_anthropic, "openai": measure_openai}

# A simple window in /tmp is good enough, because a YC instance
# usually lives for several minutes between cold starts.
def state_path(name):
    return f"/tmp/latency-{name}.txt"

def push_window(name, value):
    # Append the new measurement and keep only the last WINDOW values.
    p = state_path(name)
    arr = []
    if os.path.exists(p):
        with open(p) as f:
            arr = [float(x) for x in f.read().split() if x]
    arr.append(value)
    arr = arr[-WINDOW:]
    with open(p, "w") as f:
        f.write(" ".join(map(str, arr)))
    return arr

def notify(title, msg, prio):
    requests.post(f"{NOTIFLY_URL}/message",
                  params={"token": NOTIFLY_TOKEN},
                  json={"title": title, "message": msg, "priority": prio},
                  timeout=5)

def handler(event, context):
    for name, fn in PROVIDERS.items():
        try:
            ms = fn()
        except Exception as e:
            # A hard failure is reported immediately, without waiting for the window.
            notify(f"❌ {name} error", str(e), 9)
            continue
        arr = push_window(name, ms)
        if len(arr) >= WINDOW and statistics.median(arr) > THRESHOLD_MS:
            notify(
                f"⏱️ {name} latency degradation",
                f"Median over {WINDOW} requests: {int(statistics.median(arr))} ms "
                f"(threshold {THRESHOLD_MS}). Last: {int(ms)} ms.",
                7,  # passed positionally: notify() has no `priority` keyword
            )
    return {"statusCode": 200}

Instead of /tmp you can store the window in YDB, which makes it independent of cold starts; see architecture.
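To see why the handler alerts on the median of the window rather than on a single slow request, the decision can be isolated into a pure function. This is only a minimal sketch: `should_alert` is a hypothetical helper that is not part of the function above, and it simply reuses the same `THRESHOLD_MS` and `WINDOW` values.

```python
import statistics

THRESHOLD_MS = 3000  # same threshold as in the monitor above
WINDOW = 5           # same window size as in the monitor above

def should_alert(window, threshold_ms=THRESHOLD_MS, size=WINDOW):
    """Return True when the window is full and its median exceeds the threshold."""
    return len(window) >= size and statistics.median(window) > threshold_ms

# One outlier in an otherwise healthy window: the median is 310 ms, so no alert.
spike = [300, 320, 9000, 310, 305]
# Three of five requests above the threshold: the median is 3500 ms, so alert.
degraded = [3500, 4000, 3200, 300, 5000]
# A window that is not full yet never alerts, e.g. right after a cold start.
warming_up = [9000, 9000]

print(should_alert(spike), should_alert(degraded), should_alert(warming_up))
```

With five samples, at least three must be above the threshold before the median crosses it, so one or two isolated spikes per window are ignored.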

Time-to-first-token (TTFT) matters more than total

For streaming agents the important metric is not the total time but the time to the first token. If you work with SSE, measure exactly that:

t0 = time.time()
ttft = None
with client.messages.stream(...) as stream:
    for ev in stream.text_stream:
        if ttft is None:
            ttft = (time.time() - t0) * 1000
        break  # the first text chunk is all we need

Alerting on ttft > N ms is much more sensitive than using “full” latency, because the total response time is dominated by the response size.
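The difference between the two metrics can be seen by timing one stream both ways. The sketch below is not tied to any SDK: `time_stream` is a hypothetical helper that accepts any iterable of text chunks (for example `stream.text_stream`), and `fake_stream` only simulates a provider with `time.sleep`, so the delays in it are made-up numbers.

```python
import time

def time_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of text chunks; return (ttft_ms, total_ms).

    ttft_ms is None if the stream produced no chunks.
    """
    t0 = clock()
    ttft_ms = None
    for _ in chunks:
        if ttft_ms is None:
            ttft_ms = (clock() - t0) * 1000
    total_ms = (clock() - t0) * 1000
    return ttft_ms, total_ms

def fake_stream(first_delay_s, n_chunks, gap_s):
    """Simulated provider: waits before the first chunk, then emits n_chunks."""
    time.sleep(first_delay_s)
    for i in range(n_chunks):
        if i:
            time.sleep(gap_s)
        yield "tok"

# Same delay before the first token, very different response sizes.
short_ttft, short_total = time_stream(fake_stream(0.05, 2, 0.01))
long_ttft, long_total = time_stream(fake_stream(0.05, 20, 0.01))

print(f"short: ttft={short_ttft:.0f} ms, total={short_total:.0f} ms")
print(f"long:  ttft={long_ttft:.0f} ms, total={long_total:.0f} ms")
```

Both runs should report a similar TTFT while the total time grows with the number of chunks. Unlike the snippet above, this helper reads the stream to the end, so in a once-per-minute probe you would normally stop after the first chunk.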

Embeddings endpoints fail separately from chat (often a different deployment). If you have RAG, monitor /v1/embeddings with a separate check; otherwise indexing and search will hang silently.
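A separate embeddings check can follow the same pattern as the chat probes. The sketch below is an assumption rather than part of the original monitor: `measure_embeddings` is a hypothetical helper, the client is passed in as an argument so the function can be exercised without network access, and `text-embedding-3-small` is only an example model name; substitute the one your RAG pipeline actually uses.

```python
import time

def measure_embeddings(client, model="text-embedding-3-small"):
    """Time one minimal embeddings request; returns latency in ms.

    `client` is expected to expose `embeddings.create(model=..., input=...)`,
    as the OpenAI Python SDK client does.
    """
    t0 = time.time()
    client.embeddings.create(model=model, input="ping")
    return (time.time() - t0) * 1000

# Wiring it into the monitor (assumes the PROVIDERS dict from the code above):
#   PROVIDERS["openai-embeddings"] = lambda: measure_embeddings(openai.OpenAI())
```

Registering it under its own key gives the embeddings endpoint its own window and its own alerts, independent of the chat checks.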