LLM latency degradation
LLM providers rarely go down completely. Far more often you still get 200 OK, but
time-to-first-token climbs from the usual 300 ms to 5 seconds, and your
agent starts timing out and spinning on retries. Standard uptime monitors
don’t notice this: the response is 200 OK, so the status stays green.
We use an active monitor with our own cloud function: hit a cheap endpoint and alert when a latency threshold is exceeded.
Skeleton: scheduled function on YC
Put this code into a Yandex Cloud Function with a timer trigger
* * * * ? * (once per minute):
    import os
    import statistics
    import time

    import requests
    import anthropic
    import openai

    NOTIFLY_URL = os.environ["NOTIFLY_URL"]
    NOTIFLY_TOKEN = os.environ["NOTIFLY_TOKEN"]
    THRESHOLD_MS = 3000  # latency threshold
    WINDOW = 5           # window size: alert when the median of the last WINDOW requests exceeds the threshold

    def measure_anthropic():
        c = anthropic.Anthropic()
        t0 = time.time()
        c.messages.create(model="claude-haiku-4-5", max_tokens=1,
                          messages=[{"role": "user", "content": "ping"}])
        return (time.time() - t0) * 1000

    def measure_openai():
        c = openai.OpenAI()
        t0 = time.time()
        c.chat.completions.create(model="gpt-4o-mini", max_tokens=1,
                                  messages=[{"role": "user", "content": "ping"}])
        return (time.time() - t0) * 1000

    PROVIDERS = {"anthropic": measure_anthropic, "openai": measure_openai}

    # A simple window in /tmp is good enough, because a YC instance
    # usually lives for several minutes between cold starts.
    def state_path(name):
        return f"/tmp/latency-{name}.txt"

    def push_window(name, value):
        p = state_path(name)
        arr = []
        if os.path.exists(p):
            with open(p) as f:
                arr = [float(x) for x in f.read().split() if x]
        arr.append(value)
        arr = arr[-WINDOW:]  # keep only the last WINDOW samples
        with open(p, "w") as f:
            f.write(" ".join(map(str, arr)))
        return arr

    def notify(title, msg, prio):
        requests.post(f"{NOTIFLY_URL}/message",
                      params={"token": NOTIFLY_TOKEN},
                      json={"title": title, "message": msg, "priority": prio},
                      timeout=5)

    def handler(event, context):
        for name, fn in PROVIDERS.items():
            try:
                ms = fn()
            except Exception as e:
                notify(f"❌ {name} error", str(e), 9)
                continue
            arr = push_window(name, ms)
            if len(arr) >= WINDOW and statistics.median(arr) > THRESHOLD_MS:
                notify(
                    f"⏱️ {name} latency degradation",
                    f"Median over {WINDOW} requests: {int(statistics.median(arr))} ms "
                    f"(threshold {THRESHOLD_MS}). Last: {int(ms)} ms.",
                    prio=7,  # notify() takes `prio`, not `priority`
                )
        return {"statusCode": 200}

Instead of /tmp you can store the window in YDB, which is guaranteed to be independent of cold
starts; see architecture.
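To see why the handler alerts on the window median rather than on a single slow request, here is a minimal, dependency-free sketch of the same decision rule. The helper name `should_alert` and the sample values are hypothetical; the defaults mirror THRESHOLD_MS and WINDOW from the skeleton.

```python
import statistics

THRESHOLD_MS = 3000  # same values as in the skeleton
WINDOW = 5

def should_alert(samples, threshold_ms=THRESHOLD_MS, window=WINDOW):
    """Return True when the window is full and its median exceeds the threshold."""
    recent = list(samples)[-window:]
    return len(recent) >= window and statistics.median(recent) > threshold_ms

# One outlier among fast responses does not move the median (310 ms).
spike = [300, 320, 9000, 310, 290]
# A sustained slowdown: most of the window is slow (median 4900 ms).
degraded = [300, 4800, 5200, 5100, 4900]

print(should_alert(spike))     # False
print(should_alert(degraded))  # True
```

With a one-minute schedule and WINDOW = 5, this rule fires only after at least three of the last five probes were slow, so a single hiccup does not page anyone.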
Time-to-first-token (TTFT) matters more than total
For streaming agents the metric that matters is not the total time but the time to the first token. If you work with SSE, measure exactly that:
    t0 = time.time()
    ttft = None
    with client.messages.stream(...) as stream:
        for ev in stream.text_stream:
            if ttft is None:
                ttft = (time.time() - t0) * 1000
            break  # the first chunk is all we need

Alerting on ttft > N ms is much more sensitive than alerting on “full” latency,
because the total response time is dominated by the response size.
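The difference between the two metrics can be reproduced without any provider. The sketch below is an illustration under assumptions: `measure_ttft` and `fake_stream` are hypothetical names, and the generator only simulates a stream with a slow first token followed by steady chunks.

```python
import time

def measure_ttft(chunks):
    """Consume an iterable of streamed chunks; return (ttft_ms, total_ms).

    ttft_ms is None if the stream yields nothing.
    """
    t0 = time.perf_counter()
    ttft = None
    for _ in chunks:
        if ttft is None:
            ttft = (time.perf_counter() - t0) * 1000
    total = (time.perf_counter() - t0) * 1000
    return ttft, total

def fake_stream(first_delay_s, n_chunks, gap_s):
    """Stand-in for a provider stream (simulation only, not a real SDK)."""
    time.sleep(first_delay_s)      # time until the first token
    for i in range(n_chunks):
        if i:
            time.sleep(gap_s)      # pause between subsequent chunks
        yield f"tok{i}"

short_resp = measure_ttft(fake_stream(0.05, 2, 0.01))
long_resp = measure_ttft(fake_stream(0.05, 20, 0.01))
print(f"short: ttft={short_resp[0]:.0f} ms, total={short_resp[1]:.0f} ms")
print(f"long:  ttft={long_resp[0]:.0f} ms, total={long_resp[1]:.0f} ms")
```

Both runs have the same simulated first-token delay, but the longer response takes several times longer in total. A threshold on total time would therefore have to be loose enough to allow long answers, while a threshold on TTFT does not. With a real SDK, the iterable would be something like `stream.text_stream` from the snippet above.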
Don’t forget about embeddings
Embeddings endpoints fail separately from chat (often a different deployment).
If you have RAG, monitor /v1/embeddings with a separate check; otherwise
indexing and search will hang silently.
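A separate embeddings probe might look like the sketch below. It assumes the OpenAI Python SDK and the `text-embedding-3-small` model; both are assumptions, so substitute the provider and model that your RAG pipeline actually calls. `timed_ms` and `measure_openai_embeddings` are hypothetical helper names.

```python
import time

def timed_ms(fn):
    """Run fn() and return the elapsed wall-clock time in milliseconds."""
    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1000

def measure_openai_embeddings():
    """Time one tiny embeddings request (requires OPENAI_API_KEY in the environment)."""
    import openai  # imported lazily so timed_ms stays dependency-free

    c = openai.OpenAI()
    return timed_ms(lambda: c.embeddings.create(
        model="text-embedding-3-small",  # assumption: use your pipeline's model
        input="ping",
    ))
```

In the skeleton above this could be registered as one more entry, for example `PROVIDERS["openai-embeddings"] = measure_openai_embeddings`, so that it gets its own window file and its own alerts. The skeleton applies a single THRESHOLD_MS to every entry, and embeddings calls are usually much faster than chat completions, so a separate, lower threshold is probably worth adding.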
Related recipes
- LLM provider availability: simple uptime + 5xx;
- Vector DB / RAG: the other half of the latency chain;
- Your own integrity-check cloud function: the general skeleton.