Safety / prompt injection triggered

In an app that accepts user input, safety incidents are the quietest failure class. The model replied "I cannot help with that", the Azure content filter returned HTTP 400 with code content_filter, or a jailbreak exposed the system prompt: you usually learn about it from a user complaint a week later, when it is already too late.

Sending a push for every incident is noisy. Alerting on the first incident in an hour or a day, or on a sharp spike in frequency, works much better.

1. Alert on model refusal

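REFUSAL_MARKERS = (
    "i cannot", "i can't", "i'm not able to",
    "я не могу", "я не имею",  # Russian-language refusals
    "as an ai", "this request violates",
)

def looks_like_refusal(text: str) -> bool:
    # Refusals open the reply, so only the first 300 characters are checked.
    # Typographic apostrophes are normalized so "I can’t" also matches.
    t = text[:300].lower().replace("’", "'")
    return any(m in t for m in REFUSAL_MARKERS)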

resp = client.messages.create(...)
text = resp.content[0].text
if looks_like_refusal(text):
    notify("🛑 Safety: model refused",
           f"Input:\n{user_input[:600]}\n\nReply:\n{text[:400]}",
           priority=7)

In production, instead of sending every time, use a sliding window: if there are more than N refusals in 10 minutes, send one aggregated push.
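A minimal sketch of such a sliding window, assuming a single process and in-memory state. The class name `RefusalWindow`, the default threshold of 5, and the `now` parameter (there only to make the logic easy to exercise) are illustrative choices, not part of Notifly.

```python
import time
from collections import deque
from typing import Optional


class RefusalWindow:
    """Counts events in a sliding time window (hypothetical helper)."""

    def __init__(self, window_sec: float = 600, threshold: int = 5):
        self.window_sec = window_sec
        self.threshold = threshold
        self.events = deque()   # timestamps of recent refusals, oldest first
        self.alerting = False   # True while the count stays above the threshold

    def record(self, now: Optional[float] = None) -> bool:
        """Record one refusal; return True only when the threshold is newly exceeded."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Drop timestamps that have fallen out of the window.
        while self.events and now - self.events[0] > self.window_sec:
            self.events.popleft()
        over = len(self.events) > self.threshold
        fire = over and not self.alerting
        self.alerting = over
        return fire
```

In the request path this could be used as `if looks_like_refusal(text) and window.record(): notify(...)`, so a burst of refusals produces one push until the rate drops back under the threshold.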

2. Provider content-filter

OpenAI, Azure, and Vertex report filter hits explicitly, so they are easy to catch. The example below uses the openai SDK:

import openai

try:
    resp = client.chat.completions.create(...)
except openai.BadRequestError as e:
    # e.body holds the parsed error payload; it can be None or a non-dict.
    body = getattr(e, "body", None)
    body = body if isinstance(body, dict) else {}
    code = body.get("code") or (body.get("error") or {}).get("code")
    if code in ("content_filter", "responsible_ai_policy_violation"):
        notify("🛑 Content filter triggered",
               f"Provider: openai\nCode: {code}\n\nInput:\n{user_input[:800]}",
               priority=7)
    raise  # alerting must not swallow the error

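For Anthropic, check for stop_reason == "refusal" in the response and/or an HTTP 400 with error.type == "invalid_request_error". That error type is also used for ordinary malformed requests, so inspect the message before treating it as a safety event.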
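As a sketch of the response-side check, the helper below only reads `stop_reason` from a response object. The name `is_refusal_response` is hypothetical, and the code assumes the SDK exposes `stop_reason` as an attribute on the returned message.

```python
def is_refusal_response(resp) -> bool:
    """True when the response carries stop_reason == "refusal".

    Works with any object exposing a `stop_reason` attribute; a missing
    attribute is treated as "not a refusal".
    """
    return getattr(resp, "stop_reason", None) == "refusal"
```

Combining it with the marker check, for example `if is_refusal_response(resp) or looks_like_refusal(text): ...`, covers both explicit and free-text refusals.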

3. Heuristics for prompt injection

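A simple set of signal strings that are uncommon in normal requests but frequent in jailbreak attempts. Broad phrases such as "act as" will produce some false positives, so treat a match as a signal rather than a verdict: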

import re

INJECTION_PATTERNS = (
    r"ignore (?:all )?previous instructions",
    r"you are now", r"act as", r"pretend to be",
    r"system prompt", r"show your prompt",
    r"забудь все инструкции", r"представь, что ты",  # Russian equivalents
    r"jailbreak", r"DAN mode",
)

# Word boundaries keep "act as" from matching inside "contact asap".
RX = re.compile(r"\b(?:" + "|".join(INJECTION_PATTERNS) + r")\b", re.I)

def check_injection(text: str):
    m = RX.search(text)
    if m:
        notify("🚨 Prompt injection attempt",
               f"Match: «{m.group(0)}»\n\nInput:\n{text[:1000]}",
               priority=8)
        # then: refuse, or pass the input through a guard model

Heuristics are a first approximation. A serious prod setup should run input through a separate guard model (Llama Guard, Prompt Guard, your fine-tune). Notifly remains the alerting layer on top of it.
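One way to combine the two layers, as a rough sketch: the guard model makes the blocking decision, and the regex only adds an alert-worthy flag. `guard_model` is a hypothetical callable that returns a risk score between 0 and 1; it stands in for whichever classifier you deploy and is not a real Llama Guard or Prompt Guard API. The 0.8 threshold and the function name `screen_input` are likewise assumptions.

```python
import re
from typing import Callable

# Cheap heuristic patterns (a subset, for illustration only).
QUICK_RX = re.compile(
    r"\b(?:ignore (?:all )?previous instructions|jailbreak)\b", re.I
)


def screen_input(text: str,
                 guard_model: Callable[[str], float],
                 threshold: float = 0.8) -> str:
    """Return "block", "flag", or "allow" for a user message."""
    score = guard_model(text)      # hypothetical classifier, 0.0..1.0
    if score >= threshold:
        return "block"             # guard model is confident: refuse
    if QUICK_RX.search(text):
        return "flag"              # heuristic hit only: alert, but let it through
    return "allow"
```

A "block" or "flag" result is where the notification call would go, with a higher priority for "block".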

4. Alert on “user sees system prompt”

If fragments of your system prompt appear in the model's reply, a jailbreak has almost certainly succeeded. Check every reply as it is generated:

SYSTEM_FRAGMENTS = [
    "You are a helpful assistant for FooCorp",  # unique phrase
    "Internal tool: search_internal",
]

if any(f in answer for f in SYSTEM_FRAGMENTS):
    notify("🚨 LEAK: system prompt in reply",
           f"Input:\n{user_input[:600]}\n\nReply:\n{answer[:1000]}",
           priority=10)

Priority 10 is the loudest push level, and a confirmed system prompt leak usually belongs there.
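Exact substring checks miss replies where the model changes case or line breaks. A slightly more tolerant comparison is sketched below, under the assumption that lowercasing and collapsing whitespace is enough for your prompts; `normalize` and `leaked_fragments` are illustrative names.

```python
import re

SYSTEM_FRAGMENTS = [
    "You are a helpful assistant for FooCorp",
    "Internal tool: search_internal",
]


def normalize(s: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", s).strip().lower()


def leaked_fragments(answer: str, fragments=SYSTEM_FRAGMENTS) -> list:
    """Return the fragments that appear in the answer after normalization."""
    norm = normalize(answer)
    return [f for f in fragments if normalize(f) in norm]
```

Returning the matched fragments rather than a boolean lets the alert text say which part of the prompt leaked. Paraphrased leaks still slip through; that case needs a semantic comparison.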

5. Deduplication: don’t drown the channel in noise

A useful pattern is to send the first case in the window and then periodic aggregates:

import time
from collections import defaultdict

WINDOW_SEC = 600
# One window per alert kind, so a burst of refusals cannot suppress a leak alert.
state = defaultdict(lambda: {"first_ts": 0.0, "count": 0})

def maybe_alert(kind, payload):
    s = state[kind]
    now = time.time()
    if now - s["first_ts"] > WINDOW_SEC:
        # A new window starts: reset the counter and send the first case.
        s["first_ts"] = now
        s["count"] = 0
        notify(f"🛑 {kind} (first occurrence)", payload, priority=7)
    s["count"] += 1
    if s["count"] in (10, 100):  # stepwise aggregation
        notify(f"🛑 {kind}: {s['count']} in this window",
               "Something is going on, check the logs.", priority=9)

What to put in the alert text

classification (refusal / content_filter / injection / leak);
truncated user input (without PII, if possible);
a link to the full log in S3/Sentry/CloudWatch;
session/chat identifier so it can be quickly found.
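Put together, the fields above might be assembled like this. It is a sketch only: `build_alert_body`, the email-only redaction regex, and the `log_url` parameter are assumptions for illustration, and real PII scrubbing needs more than one pattern.

```python
import re

# Deliberately simple email matcher, used only to illustrate redaction.
EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def build_alert_body(kind: str, user_input: str, session_id: str,
                     log_url: str, max_len: int = 600) -> str:
    """Format an alert body: classification, session, log link, input."""
    # Redact before truncating so a cut-off address cannot slip through.
    redacted = EMAIL_RX.sub("[email]", user_input)
    if len(redacted) > max_len:
        redacted = redacted[:max_len] + "…"
    return (
        f"Kind: {kind}\n"
        f"Session: {session_id}\n"
        f"Log: {log_url}\n\n"
        f"Input:\n{redacted}"
    )
```

The result can be passed as the second argument to `notify`, with the classification repeated in the title so alerts can be filtered at a glance.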