LLM latency degradation

LLM providers rarely fail outright. Far more often they keep returning 200 OK while time-to-first-token climbs from the usual 300 ms to 5 seconds, and your agent starts timing out and flooding the provider with retries. Standard uptime monitors miss this: the response is 200 OK, so the status stays green.

Use an active monitor with a custom cloud function: hit a cheap endpoint and alert when latency exceeds a threshold.

Skeleton: scheduled function on YC

Place this code in a Yandex Cloud Function with a timer trigger using the cron expression * * * * ? * (once per minute):

import os, time, statistics, requests
import anthropic, openai

NOTIFLY_URL   = os.environ["NOTIFLY_URL"]
NOTIFLY_TOKEN = os.environ["NOTIFLY_TOKEN"]
THRESHOLD_MS  = 3000           # latency threshold
WINDOW        = 5              # window size: alert when the median of the last WINDOW samples is slow

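def measure_anthropic():
    # Disable SDK retries and set an explicit timeout so a slow call is measured
    # as slow instead of being hidden. The 15 s value is an assumption; tune it.
    c = anthropic.Anthropic(max_retries=0, timeout=15.0)
    t0 = time.perf_counter()  # monotonic clock, unaffected by wall-clock changes
    c.messages.create(model="claude-haiku-4-5", max_tokens=1,
                      messages=[{"role": "user", "content": "ping"}])
    return (time.perf_counter() - t0) * 1000

def measure_openai():
    # Same settings as above: no retries, explicit timeout (assumed value).
    c = openai.OpenAI(max_retries=0, timeout=15.0)
    t0 = time.perf_counter()
    c.chat.completions.create(model="gpt-4o-mini", max_tokens=1,
                              messages=[{"role": "user", "content": "ping"}])
    return (time.perf_counter() - t0) * 1000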

PROVIDERS = {"anthropic": measure_anthropic, "openai": measure_openai}

# Simple window in /tmp. This works because a warm YC instance is usually
# reused for a few minutes between cold starts; a cold start resets the window.
def state_path(name):
    return f"/tmp/latency-{name}.txt"

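def push_window(name, value):
    """Append a sample to the provider's window file and return the last WINDOW samples."""
    p = state_path(name)
    arr = []
    if os.path.exists(p):
        with open(p) as f:
            arr = [float(x) for x in f.read().split() if x]
    arr.append(value)
    arr = arr[-WINDOW:]
    with open(p, "w") as f:
        f.write(" ".join(map(str, arr)))
    return arr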

def notify(title, msg, prio):
    requests.post(f"{NOTIFLY_URL}/message",
                  params={"token": NOTIFLY_TOKEN},
                  json={"title": title, "message": msg, "priority": prio},
                  timeout=5)

def handler(event, context):
    for name, fn in PROVIDERS.items():
        try:
            ms = fn()
        except Exception as e:
            notify(f"❌ {name} error", str(e), 9)
            continue
        arr = push_window(name, ms)
        if len(arr) >= WINDOW and statistics.median(arr) > THRESHOLD_MS:
            notify(
                f"⏱️ {name} latency degradation",
                f"Median over {WINDOW} requests: {int(statistics.median(arr))} ms "
                f"(threshold {THRESHOLD_MS}). Last: {int(ms)} ms.",
                7,  # passed positionally: notify() names this parameter prio, not priority
            )
    return {"statusCode": 200}

Instead of /tmp, you can store the window in YDB, which keeps it persistent across cold starts; see the architecture.
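
The alert condition in the handler can also be pulled out into a pure function, which makes the windowing behaviour easy to check without calling a provider. This is a minimal sketch, assuming the same WINDOW and THRESHOLD_MS values as the skeleton; the name should_alert is hypothetical and not part of the original code:

```python
import statistics

THRESHOLD_MS = 3000  # same threshold as in the skeleton
WINDOW = 5           # same window size as in the skeleton

def should_alert(window, threshold_ms=THRESHOLD_MS, size=WINDOW):
    """Return True when the window is full and its median exceeds the threshold."""
    if len(window) < size:
        return False  # not enough samples yet, e.g. right after a cold start
    return statistics.median(window[-size:]) > threshold_ms

# One slow outlier does not move the median above the threshold.
spike = [300, 320, 9000, 310, 290]
# Three slow samples out of five do.
degraded = [300, 4000, 5200, 4800, 310]
print(should_alert(spike), should_alert(degraded))
```

Because the check uses the median, a single spike is ignored, while a majority of slow samples in the window triggers the alert.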

Time-to-first-token (TTFT) is more important than total

For streaming agents, the time to the first token matters more than the total time. If you stream responses over SSE, measure TTFT specifically:

t0 = time.time()
ttft = None
with client.messages.stream(...) as stream:
    for ev in stream.text_stream:
        if ttft is None:
            ttft = (time.time() - t0) * 1000
            break

Alerting on ttft > N ms is much more sensitive than using “full” latency, because the full response time is dominated by the response size.
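
The same measurement can be wrapped in a small helper that works with any iterable of chunks, so it can be tried without a provider. This is a sketch under assumptions: measure_ttft_ms and fake_stream are hypothetical names, and fake_stream only simulates a slow first token.

```python
import time

def measure_ttft_ms(chunks, t0=None):
    """Return milliseconds until the first item arrives from `chunks`.

    `t0` is an optional time.perf_counter() reading taken before the request
    was sent; if omitted, timing starts when this function is called.
    Returns None if the stream ends without yielding anything.
    """
    if t0 is None:
        t0 = time.perf_counter()
    for _ in chunks:
        return (time.perf_counter() - t0) * 1000  # stop at the first chunk
    return None

def fake_stream(delay_s, n):
    """Hypothetical stand-in for a streaming response: waits, then yields n chunks."""
    time.sleep(delay_s)
    for i in range(n):
        yield f"tok{i}"

ttft = measure_ttft_ms(fake_stream(0.05, 3))
print(f"TTFT: {ttft:.0f} ms")
```

With a real SDK, take t0 before opening the stream and pass it in, for example measure_ttft_ms(stream.text_stream, t0) inside the with block, so that the time spent setting up the request is included.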

Don’t forget about embeddings

Embeddings endpoints break separately from chat (they are often a different deployment). If you use RAG, monitor /v1/embeddings with a separate check; otherwise indexing and search will silently "hang".
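
A separate embeddings check could follow the same shape as the chat measurements. This is a minimal sketch, assuming the OpenAI Python SDK's embeddings.create call; the function name measure_embeddings_ms and the model text-embedding-3-small are assumptions chosen for illustration, so substitute whatever your RAG pipeline actually uses.

```python
import time

def measure_embeddings_ms(client, model="text-embedding-3-small"):
    """Time one tiny embeddings request and return the latency in milliseconds.

    `client` is expected to expose `embeddings.create(model=..., input=...)`,
    as an openai.OpenAI() instance does. The default model is an assumption.
    """
    t0 = time.perf_counter()
    client.embeddings.create(model=model, input="ping")
    return (time.perf_counter() - t0) * 1000
```

To plug it into the skeleton, one option is to register it as another provider, for example PROVIDERS["openai-embeddings"] = lambda: measure_embeddings_ms(openai.OpenAI()), so it gets its own window and its own alert.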

Related recipes

LLM provider availability: basic uptime + 5xx;
Vector DB / RAG: the other half of the latency chain;
Custom cloud function for integrity checks: general skeleton.
