Safety / prompt injection triggered
In an app that takes user input, safety incidents are the quietest class of failure. The model answers “I cannot help with that”, the Azure content filter returns 400 content_filter, or a jailbreak exposes the system prompt, and you usually find out from a user complaint a week later, when it is already too late.
Sending a push for every incident is noisy. A push for the first incident in an hour or a day, or for a sharp spike in frequency, is the right level.
1. Alert on model refusal
    REFUSAL_MARKERS = (
        "i cannot", "i can't", "i'm not able to",
        "я не могу", "я не имею",  # Russian: "I cannot", "I do not have"
        "as an ai", "this request violates",
    )

    def looks_like_refusal(text: str) -> bool:
        # Only the first 300 characters of the reply are checked.
        t = text.lower()[:300]
        return any(m in t for m in REFUSAL_MARKERS)

    # `client`, `user_input`, and `notify` are defined elsewhere in the app.
    resp = client.messages.create(...)
    text = resp.content[0].text
    if looks_like_refusal(text):
        notify("🛑 Safety: model refused",
               f"Request:\n{user_input[:600]}\n\nResponse:\n{text[:400]}",
               priority=7)

For production, replace “send every time” with a sliding window: if there are more than N refusals in 10 minutes, send one aggregated push.
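The sliding window mentioned above can be sketched roughly as follows. This is a minimal in-memory version, not a definitive implementation: the class name `RefusalWindow`, its parameters, and the choice to clear the window after firing are assumptions made for illustration, and a multi-process deployment would need shared storage instead of a local deque.

```python
import time
from collections import deque
from typing import Optional


class RefusalWindow:
    """Counts refusals inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_sec: float = 600, threshold: int = 5):
        self.window_sec = window_sec   # 10 minutes, as in the text
        self.threshold = threshold     # the "N" from the text; 5 is an arbitrary example
        self._times = deque()

    def record(self, now: Optional[float] = None) -> Optional[int]:
        """Record one refusal.

        Returns the number of refusals in the window when it exceeds the
        threshold (time to send an aggregated push), otherwise None.
        """
        now = time.time() if now is None else now
        self._times.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._times and now - self._times[0] > self.window_sec:
            self._times.popleft()
        if len(self._times) > self.threshold:
            count = len(self._times)
            self._times.clear()  # start over so one burst produces one push
            return count
        return None
```

In the refusal branch, the result of `record()` would decide whether to call `notify`: when it returns a number, send one aggregated push with that count; when it returns None, stay silent.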
2. Provider content-filter
OpenAI, Azure, and Vertex return explicit errors that are easy to catch:

    import openai

    try:
        resp = client.chat.completions.create(...)
    except openai.BadRequestError as e:
        # The error code may sit at the top level of the body or under "error".
        body = getattr(e, "body", None)
        if not isinstance(body, dict):
            body = {}
        code = body.get("code") or (body.get("error") or {}).get("code")
        if code in ("content_filter", "responsible_ai_policy_violation"):
            notify("🛑 Content filter triggered",
                   f"Provider: openai\nCode: {code}\n\nInput:\n{user_input[:800]}",
                   priority=7)
        raise

For Anthropic, check the stop_reason == "refusal" field in the response and/or HTTP 400 with error.type == "invalid_request_error".
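The two Anthropic signals can be folded into one small check. The sketch below is duck-typed and does not import the SDK: the function name `anthropic_safety_signal` and its parameters are hypothetical, and the error body is assumed to be a parsed JSON dict. Note that `invalid_request_error` covers more than safety blocks, so it is the weaker of the two signals.

```python
from types import SimpleNamespace
from typing import Any, Optional


def anthropic_safety_signal(resp: Any = None,
                            status_code: Optional[int] = None,
                            error_body: Optional[dict] = None) -> Optional[str]:
    """Return "refusal", "invalid_request", or None (hypothetical helper)."""
    # Signal 1: a successful response whose stop_reason is "refusal".
    if resp is not None and getattr(resp, "stop_reason", None) == "refusal":
        return "refusal"
    # Signal 2: HTTP 400 with error.type == "invalid_request_error".
    if status_code == 400 and isinstance(error_body, dict):
        # Accept both {"error": {"type": ...}} and a bare {"type": ...} object.
        err = error_body.get("error")
        if not isinstance(err, dict):
            err = error_body
        if err.get("type") == "invalid_request_error":
            return "invalid_request"
    return None


# Stand-in object for illustration; real code would pass the SDK response.
demo = SimpleNamespace(stop_reason="refusal")
print(anthropic_safety_signal(resp=demo))
```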
3. Heuristics for prompt injection
A simple set of signal strings that are uncommon in normal requests but frequent in jailbreak attempts:

    import re

    INJECTION_PATTERNS = (
        "ignore (all )?previous instructions",
        "you are now", "act as", "pretend to be",
        "system prompt", "show your prompt",
        "забудь все инструкции",  # Russian: "forget all instructions"
        "представь, что ты",      # Russian: "imagine that you are"
        "jailbreak", "DAN mode",
    )

    # \b keeps short patterns such as "act as" from matching inside other words ("react as").
    RX = re.compile(r"\b(?:" + "|".join(INJECTION_PATTERNS) + r")\b", re.I)

    def check_injection(text: str) -> bool:
        m = RX.search(text)
        if m:
            notify("🚨 Prompt-injection attempt",
                   f"Match: {m.group(0)!r}\n\nInput:\n{text[:1000]}",
                   priority=8)
            # next step: refuse, or pass the input through a guard model
        return m is not None

Heuristics are only a first approximation. A serious production setup should also run input through a separate guard model (Llama Guard, Prompt Guard, or your own fine-tune). Notifly remains the alerting layer on top of it.
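One way to layer the two checks is to run the cheap heuristics first and call the guard model only when they find nothing. The sketch below assumes the guard is any callable that takes the input text and returns True for unsafe input. `screen_input`, `fake_guard`, and the shortened pattern list are hypothetical stand-ins, not the API of any particular guard model.

```python
import re
from typing import Callable

# Shortened pattern list, for illustration only.
HEURISTIC_RX = re.compile(
    r"\b(?:ignore (all )?previous instructions|system prompt)\b", re.I
)


def screen_input(text: str, guard: Callable[[str], bool]) -> str:
    """Return which layer flagged the input: "heuristic", "guard", or "clean"."""
    if HEURISTIC_RX.search(text):
        return "heuristic"   # cheap check fired, no model call needed
    if guard(text):
        return "guard"       # the guard model flagged what the regex missed
    return "clean"


def fake_guard(text: str) -> bool:
    """Stand-in for a real guard-model call (assumption for this example)."""
    return "reveal your secrets" in text.lower()
```

The returned label can go straight into the alert title, so the push shows whether the regex or the guard model caught the attempt.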
4. Alert on “user sees system prompt”
If fragments of your system prompt appear in the model’s reply, the jailbreak has almost certainly succeeded. Compare on the fly:

    SYSTEM_FRAGMENTS = [
        "You are a helpful assistant for FooCorp",  # unique phrase
        "Internal tool: search_internal",
    ]

    if any(f in answer for f in SYSTEM_FRAGMENTS):
        notify("🚨 LEAK: system prompt in response",
               f"Request:\n{user_input[:600]}\n\nResponse:\n{answer[:1000]}",
               priority=10)

priority=10 is the loudest push, and a leak is usually the case that warrants it.
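An exact substring comparison misses leaks where the model changes letter case or line breaks. A slightly more tolerant variant, given here as a sketch, normalizes whitespace and case on both sides before comparing. The helper names `normalize` and `find_leak` are assumptions for this example, and paraphrased leaks would still slip through.

```python
import re
from typing import Iterable, Optional


def normalize(s: str) -> str:
    """Collapse runs of whitespace and ignore letter case."""
    return re.sub(r"\s+", " ", s).strip().casefold()


def find_leak(answer: str, fragments: Iterable[str]) -> Optional[str]:
    """Return the first system-prompt fragment found in the answer, or None."""
    norm_answer = normalize(answer)
    for fragment in fragments:
        if normalize(fragment) in norm_answer:
            return fragment
    return None
```

Returning the matched fragment instead of a boolean lets the alert text say which part of the prompt leaked.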
5. Deduplication: don’t drown the channel in noise
A useful pattern is to send the first case in the window and then periodic aggregates:

    import time
    from collections import defaultdict

    WINDOW_SEC = 600
    # One window per alert kind, so a burst of one kind does not mask another.
    state = defaultdict(lambda: {"first_ts": 0.0, "count": 0})

    def maybe_alert(kind, payload):
        s = state[kind]
        now = time.time()
        if now - s["first_ts"] > WINDOW_SEC:
            # A new window starts: reset the counter and send the first case.
            s["first_ts"] = now
            s["count"] = 0
            notify(f"🛑 {kind} (first case)", payload, priority=7)
        s["count"] += 1
        if s["count"] in (10, 100):  # stepwise aggregation
            notify(f"🛑 {kind}: {s['count']} in the window",
                   "Something is happening, check the logs.",
                   priority=9)

What to put in the alert text
- classification (refusal / content_filter / injection / leak);
- truncated user input (without PII, if possible);
- a link to the full log in S3/Sentry/CloudWatch;
- session/chat identifier so it can be quickly found.
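The checklist above could be assembled by a small helper like the one below. It is a sketch under stated assumptions: the function name, the field layout, and the email-only redaction are illustrative choices. Real PII scrubbing needs far more than one regex.

```python
import re

# Very rough email matcher, for illustration only; not a complete PII filter.
EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def build_alert_text(kind: str, user_input: str, log_url: str,
                     session_id: str, max_input: int = 600) -> str:
    """Assemble the alert body: classification, session, log link, truncated input."""
    redacted = EMAIL_RX.sub("[email]", user_input)  # redact before truncating
    return "\n".join([
        f"Kind: {kind}",
        f"Session: {session_id}",
        f"Log: {log_url}",
        "",
        f"Input:\n{redacted[:max_input]}",
    ])
```

Redaction runs before truncation so that an address cut in half at the length limit cannot slip past the pattern.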