Drop in prompt-cache hit-rate

Prompt caching works only when the prefix is bit-for-bit identical. One extra space, a new date in the system prompt, or a new tool version in the MCP tools, and the entire prefix is recomputed from scratch. We track cached_tokens / total_tokens from the usage data in each response:

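import json, os, statistics
import requests

W = "/tmp/cache-hit.json"  # rolling window of recent hit ratios

def observe(usage):
    cached = usage.cache_read_input_tokens or 0
    # Assumption: Anthropic-style usage, where input_tokens excludes cached tokens,
    # so the total is cache reads + cache writes + uncached input.
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total = cached + created + (usage.input_tokens or 0)
    ratio = cached / max(total, 1)
    s = {"r": []}
    if os.path.exists(W):
        with open(W) as f:
            s = json.load(f)
    s["r"] = (s["r"] + [ratio])[-300:]  # keep the last 300 observations
    with open(W, "w") as f:
        json.dump(s, f)
    if len(s["r"]) >= 100:
        # Compare the oldest 50 ratios in the window with the newest 50.
        old, new = statistics.mean(s["r"][:50]), statistics.mean(s["r"][-50:])
        if old > 0.3 and new < old / 2:
            push("❄️ prompt-cache hit rate dropped",
                 f"Was: {int(old*100)}% → now: {int(new*100)}%\n"
                 "Check whether a variable (date, request-id) was inserted into the system prompt.",
                 7)

def push(t, m, p):
    # t: title, m: message body, p: priority
    requests.post(f"{os.environ['NOTIFLY_URL']}/message",
                  params={"token": os.environ["NOTIFLY_TOKEN"]},
                  json={"title": t, "message": m, "priority": p}, timeout=5)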
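
To tune the thresholds without touching the state file or sending notifications, the window comparison from observe can be pulled into a pure function. This is only a sketch: the name hit_rate_dropped and its parameters are hypothetical, and the defaults simply mirror the constants used above.

```python
import statistics

def hit_rate_dropped(ratios, window=50, min_samples=100, floor=0.3):
    """Return (dropped, old_mean, new_mean) for a list of per-request hit ratios.

    Hypothetical helper: compares the oldest `window` ratios with the newest
    `window` ratios, the same rule as in observe().
    """
    if len(ratios) < min_samples:
        return False, None, None  # not enough data to judge
    old = statistics.mean(ratios[:window])
    new = statistics.mean(ratios[-window:])
    return (old > floor and new < old / 2), old, new

# Synthetic series: a stable cache, and one that collapses halfway through.
healthy = [0.8] * 120
broken = [0.8] * 60 + [0.1] * 60

print(hit_rate_dropped(healthy)[0])  # False
print(hit_rate_dropped(broken)[0])   # True
```

Because the function has no side effects, it can be replayed over historical ratios to see how often a given threshold would have fired.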

Top 3 reasons for a hit-rate drop (from experience):

  1. the system prompt started including datetime.now().isoformat() or a request-id;
  2. the MCP server returned tools in a different order;
  3. the chat history began including a timestamp for every message.
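
One way to guard against these causes is to keep the cached part of the request deterministic and log a fingerprint of it, so that a change shows up in the logs before it shows up in the bill. The sketch below is an illustration under assumptions, not a provider API: SYSTEM_PROMPT, stable_tools, prefix_fingerprint and volatile_context are hypothetical names, tool definitions are assumed to be dicts with a "name" key, and the fingerprint hashes a canonical JSON form rather than the exact bytes the provider sees.

```python
import hashlib
import json

# Static system prompt: no dates, no request IDs (hypothetical example text).
SYSTEM_PROMPT = "You are a support assistant."

def stable_tools(tools):
    """Sort tool definitions by name so server-side ordering cannot change the prefix."""
    return sorted(tools, key=lambda t: t["name"])

def prefix_fingerprint(system, tools):
    """Short SHA-256 digest over a canonical serialization of the cacheable prefix."""
    payload = json.dumps({"system": system, "tools": stable_tools(tools)},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def volatile_context(now_iso, request_id):
    """Per-request values belong in the newest turn, after the cached prefix.

    The stored turn must be replayed byte-for-byte later, not re-rendered.
    """
    return f"[context] time={now_iso} request_id={request_id}"

tools_a = [{"name": "search"}, {"name": "calendar"}]
tools_b = [{"name": "calendar"}, {"name": "search"}]  # same tools, different order

print(prefix_fingerprint(SYSTEM_PROMPT, tools_a) == prefix_fingerprint(SYSTEM_PROMPT, tools_b))  # True
print(prefix_fingerprint(SYSTEM_PROMPT + " ", tools_a) == prefix_fingerprint(SYSTEM_PROMPT, tools_a))  # False
```

Logging the fingerprint next to the hit ratio makes it easier to tell a prefix change from an ordinary cache expiry: if the ratio drops while the fingerprint is unchanged, the cause is probably elsewhere, for example in the chat history.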