Quality drop (eval regression)

In solo AI development there’s no QA team and no metrics dashboard — but you do have an eval set of 20–100 examples that can check “the model works the way we want.” Run it automatically and alert on drops — this will detect:

a new model version release that broke your prompt;
a provider that quietly swapped the endpoint to a cheaper model;
your own commit that touched the format instruction;
RAG quality degradation (see vector-db).

Minimal eval-runner

import os, json, time, statistics, requests, anthropic

EVAL_FILE = "evals/qa.jsonl"            # {"input":..., "expected":...}
THRESHOLD_DROP = 0.10                   # 10% drop → alert

def grade(actual, expected):
    # The cheapest "scoring" — an LLM judge.
    judge = anthropic.Anthropic().messages.create(
        model="claude-haiku-4-5",
        max_tokens=4,
        messages=[{"role": "user", "content":
            f"Ответ модели: {actual}\nЭталон: {expected}\n\n"
            "Ответ корректен по смыслу? Ответь '1' или '0'."}],
    )
    return 1.0 if judge.content[0].text.strip().startswith("1") else 0.0

def run_eval():
    cases = [json.loads(l) for l in open(EVAL_FILE)]
    scores = []
    for c in cases:
        out = my_app.answer(c["input"])    # your actual pipeline
        scores.append(grade(out, c["expected"]))
    return statistics.mean(scores)

STATE = "/tmp/last_eval.json"

def main():
    score = run_eval()
    prev  = (json.load(open(STATE)) if os.path.exists(STATE) else {}).get("score", score)

    msg_lines = [f"Текущий score: {score:.3f}",
                 f"Прошлый score: {prev:.3f}"]

    if prev - score >= THRESHOLD_DROP:
        notify("📉 Eval-регрессия", "\n".join(msg_lines + [
            "Возможные причины: смена модели, новый промпт, RAG-индекс.",
        ]), priority=9)
    elif score - prev >= THRESHOLD_DROP:
        notify("📈 Eval улучшение", "\n".join(msg_lines), priority=4)

    json.dump({"score": score, "ts": time.time()}, open(STATE, "w"))

def notify(t, m, prio):
    requests.post(f"{os.environ['NOTIFLY_URL']}/message",
                  params={"token": os.environ["NOTIFLY_TOKEN"]},
                  json={"title": t, "message": m, "priority": prio},
                  timeout=5)

if __name__ == "__main__":
    main()

Run via cron / systemd-timer / GitHub Action once a day, or via a scheduled cloud function on YC.

Compare models and prompt variants

An eval-runner is also useful as a “cheap A/B before rollout”: run the same set on two configurations and send a push with the difference.

score_a = run_eval(prompt=PROMPT_OLD, model="claude-sonnet")
score_b = run_eval(prompt=PROMPT_NEW, model="claude-sonnet")

notify(
    "🧪 A/B prompts",
    f"OLD: {score_a:.3f}\nNEW: {score_b:.3f}\nΔ: {score_b - score_a:+.3f}",
    priority=5,
)

This is especially valuable when working with an agent — you asked Claude to “simplify the prompt”, got a commit, and want to see before merging that quality didn’t drop.

Alert on a new model version

Anthropic / OpenAI release on Wednesdays — and that’s a common time for silent breakages. A simple trap:

import requests
prev = open("/tmp/anthropic-models.txt").read() if os.path.exists("...") else ""
cur  = requests.get("https://api.anthropic.com/v1/models",
                    headers={"x-api-key": KEY, "anthropic-version": "2023-06-01"}).text
if cur != prev:
    notify("🆕 Anthropic models changed",
           "Список моделей изменился. Если используете latest-alias, прогоните eval.",
           priority=6)
    open("/tmp/anthropic-models.txt", "w").write(cur)

Same for OpenAI (/v1/models) and any provider with a /models API.

What to put in the alert text

old/new score;
how many cases dropped (top-3 with input + expected + actual in a gist link);
model / prompt version / commit — so you can roll back later.