Safety / prompt injection triggered

In an app with user input, safety incidents are the quietest class of failure. The model responds “I cannot help with that”, the Azure content filter returns 400 content_filter, or a jailbreak exposes the system prompt, and you usually learn about it from a user complaint a week later, when it’s already too late.

A push for every incident is noisy, but a push for the first incident in an hour or a day, or for a sharp spike in frequency, is ideal.

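# Phrases that typically open a refusal (English and Russian).
REFUSAL_MARKERS = (
    "i cannot", "i can't", "i'm not able to",
    "я не могу", "я не имею",
    "as an ai", "this request violates",
)

def looks_like_refusal(text: str) -> bool:
    # Refusals usually open the reply, so only the first 300 characters are checked.
    # Curly apostrophes are normalized so that "can’t" also matches "can't".
    t = text.lower()[:300].replace("’", "'")
    return any(m in t for m in REFUSAL_MARKERS)

resp = client.messages.create(...)
text = resp.content[0].text
if looks_like_refusal(text):
    notify("🛑 Safety: model refused",
           f"Request:\n{user_input[:600]}\n\nResponse:\n{text[:400]}",
           priority=7)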

In production, replace “send every time” with a sliding window: if there are more than N refusals in 10 minutes, send one aggregated push.
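
The sliding-window rule can be sketched as below. This is a minimal illustration, not part of any library: the class name SlidingWindowAlert, its threshold and window_sec parameters, and the injectable clock are all assumptions made for the example.

```python
import time
from collections import deque

class SlidingWindowAlert:
    """Counts events in a sliding window and signals when the threshold is exceeded."""

    def __init__(self, threshold=5, window_sec=600.0, clock=time.monotonic):
        self.threshold = threshold    # N: fire when the window holds more than N events
        self.window_sec = window_sec  # window length in seconds (10 minutes by default)
        self.clock = clock            # injectable for testing
        self.events = deque()         # timestamps of recent events, oldest first

    def record(self) -> int:
        """Register one event. Returns the event count if an aggregated
        alert should be sent now, otherwise 0."""
        now = self.clock()
        self.events.append(now)
        # Drop timestamps that have fallen out of the window.
        while self.events and now - self.events[0] > self.window_sec:
            self.events.popleft()
        if len(self.events) > self.threshold:
            count = len(self.events)
            self.events.clear()  # start over so the next alert needs a fresh burst
            return count
        return 0
```

A caller would run `count = window.record()` on every refusal and call notify only when count is non-zero, putting the count into the message.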

OpenAI / Azure / Vertex return explicit errors that are easy to catch:

import openai

try:
    resp = client.chat.completions.create(...)
except openai.BadRequestError as e:
    # The error code may sit at the top level of the body or under "error".
    body = getattr(e, "body", None)
    body = body if isinstance(body, dict) else {}
    code = body.get("code") or (body.get("error") or {}).get("code")
    if code in ("content_filter", "responsible_ai_policy_violation"):
        notify("🛑 Content filter triggered",
               f"Provider: openai\nCode: {code}\n\nInput:\n{user_input[:800]}",
               priority=7)
    raise  # re-raise so the caller still handles the failed request

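For Anthropic, check for stop_reason == "refusal" in the response and/or an HTTP 400 with error.type == "invalid_request_error". That error type also covers ordinary malformed requests, so inspect the error message before treating a 400 as a safety event.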
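
A minimal sketch of the stop_reason check follows. The helper name refusal_alert_fields and the shape of the dict it returns are assumptions for illustration. It only reads the stop_reason attribute, so the example uses a SimpleNamespace stand-in instead of a real API response.

```python
from types import SimpleNamespace

def refusal_alert_fields(resp, user_input: str):
    """Return notify() arguments if the response reports a refusal, else None."""
    if getattr(resp, "stop_reason", None) != "refusal":
        return None
    return {
        "title": "🛑 Safety: model refused",
        "body": f"Provider: anthropic\nstop_reason: refusal\n\nInput:\n{user_input[:800]}",
        "priority": 7,
    }

# Stand-in for a real response object; only stop_reason matters here.
fake_resp = SimpleNamespace(stop_reason="refusal")
fields = refusal_alert_fields(fake_resp, "some user input")
if fields:
    print(fields["title"])
```

In real code the returned fields would be passed on to notify(fields["title"], fields["body"], priority=fields["priority"]).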

A simple set of signal patterns that are rare in normal requests but common in jailbreak attempts:

import re

# Each entry is a regular expression, matched case-insensitively.
INJECTION_PATTERNS = (
    r"ignore (all )?previous instructions",
    r"you are now", r"act as", r"pretend to be",
    r"system prompt", r"show your prompt",
    r"забудь все инструкции", r"представь, что ты",
    r"jailbreak", r"DAN mode",
)
RX = re.compile("|".join(INJECTION_PATTERNS), re.I)

def check_injection(text: str):
    m = RX.search(text)
    if m:
        notify("🚨 Prompt-injection attempt",
               f"Match: «{m.group(0)}»\n\nInput:\n{text[:1000]}",
               priority=8)
        # next: refuse, or pass the input through a guard model

Heuristics are a first approximation. A serious prod setup should run input through a separate guard model (Llama Guard, Prompt Guard, your fine-tune). Notifly remains the alerting layer on top of it.
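
One way to layer the two checks is sketched below, under stated assumptions: screen_input is a hypothetical helper, and guard_flags stands for whatever callable wraps your guard model and returns True for unsafe input. No real guard-model API is shown. The regex is a shortened version of the heuristic list.

```python
import re
from typing import Callable, Optional

# Shortened heuristic list, for illustration only.
HEURISTIC_RX = re.compile(
    r"ignore (all )?previous instructions|show your prompt", re.I
)

def screen_input(text: str, guard_flags: Callable[[str], bool]) -> Optional[str]:
    """Return the name of the layer that flagged the input, or None if it looks clean."""
    # Cheap regex check first; a hit skips the guard-model call entirely.
    if HEURISTIC_RX.search(text):
        return "heuristic"
    # guard_flags is a placeholder for a call into a guard model.
    if guard_flags(text):
        return "guard"
    return None
```

The returned layer name can go into the alert body, which makes it easy to see how often the guard model catches what the heuristics miss.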

If fragments of your system prompt appear in the model’s reply, the jailbreak has almost certainly succeeded. Check each reply before returning it:

# Phrases that occur only in your system prompt, never in a normal answer.
SYSTEM_FRAGMENTS = [
    "You are a helpful assistant for FooCorp",  # unique phrase
    "Internal tool: search_internal",
]

if any(f in answer for f in SYSTEM_FRAGMENTS):
    notify("🚨 LEAK: system prompt in the response",
           f"Request:\n{user_input[:600]}\n\nResponse:\n{answer[:1000]}",
           priority=10)

priority: 10 is the loudest push level, and a system prompt leak usually warrants it.
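
An exact substring comparison misses leaks in which the model changed letter case or line breaks. A minimal sketch of a more tolerant check follows. The helper names _normalize and find_leaked_fragments are assumptions for illustration.

```python
import re

def _normalize(s: str) -> str:
    """Lowercase and collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", s).strip().lower()

def find_leaked_fragments(answer: str, fragments) -> list:
    """Return the fragments that appear in the answer, ignoring case and whitespace."""
    norm_answer = _normalize(answer)
    return [f for f in fragments if _normalize(f) in norm_answer]

leaked = find_leaked_fragments(
    "Sure! you are a  helpful\nassistant for foocorp.",
    ["You are a helpful assistant for FooCorp", "Internal tool: search_internal"],
)
print(leaked)
```

Paraphrased or translated leaks still get past this check, so it complements a guard model and does not replace one.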

5. Deduplication: don’t drown the channel in noise

A useful pattern is to send the first case in the window and then periodic aggregates:

import time
from collections import defaultdict

WINDOW_SEC = 600
# Separate counters per alert kind, so one noisy kind does not mask another.
state = defaultdict(lambda: {"first_ts": 0.0, "count": 0})

def maybe_alert(kind, payload):
    s = state[kind]
    now = time.time()
    if now - s["first_ts"] > WINDOW_SEC:
        # A new window starts: reset the counter and send the first case right away.
        s["first_ts"] = now
        s["count"] = 0
        notify(f"🛑 {kind} (first case)", payload, priority=7)
    s["count"] += 1
    if s["count"] in (10, 100):  # stepwise aggregation
        notify(f"🛑 {kind}: {s['count']} in this window",
               "Something is going on; check the logs.", priority=9)

What to include in each alert:

  • classification (refusal / content_filter / injection / leak);
  • truncated user input (without PII, if possible);
  • a link to the full log in S3/Sentry/CloudWatch;
  • session/chat identifier so it can be quickly found.
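
The fields listed above can be assembled into one alert body, as in this sketch. The helper names redact and build_alert_body, the two redaction regexes and the example URL are assumptions for illustration. The patterns catch only obvious emails and long digit runs, so this is not complete PII scrubbing.

```python
import re

# Rough patterns for obvious PII; real scrubbing needs more than this.
EMAIL_RX = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS_RX = re.compile(r"\d{7,}")  # phone-like or card-like digit runs

def redact(text: str) -> str:
    """Replace emails and long digit runs with placeholders."""
    text = EMAIL_RX.sub("[email]", text)
    return LONG_DIGITS_RX.sub("[number]", text)

def build_alert_body(kind: str, user_input: str, session_id: str,
                     log_url: str, max_input: int = 600) -> str:
    """Combine classification, session id, log link and a redacted, truncated input."""
    snippet = redact(user_input)[:max_input]
    return (
        f"Kind: {kind}\n"
        f"Session: {session_id}\n"
        f"Log: {log_url}\n\n"
        f"Input:\n{snippet}"
    )

body = build_alert_body(
    "injection",
    "mail me at bob@example.com, card 4111111111111111",
    "sess-42",
    "https://logs.example.invalid/sess-42",  # hypothetical log link
)
print(body)
```

Redaction runs before truncation, so an email or number cut in half at the length limit cannot slip through.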