ShadowShield — a security shield for agentic AI that shows its homework

Prompt-injection defense · in depth

A security shield for agentic AI that shows its homework.

Layered, defense-in-depth protection for LLM apps and agents — signatures, multilingual coverage, vector similarity, an ML classifier, canary tokens and agent-trace alignment auditing — fused into one engine with a single API. Honest, reproducible numbers, not marketing.

$ pip install shadowshield

★ Star on GitHub View on PyPI ↗

§1

The honest number

We publish results on a public dataset, not our own. On the deepset/prompt-injections test split, each layer adds recall — and every layer holds 0% false positives and 100% precision. Over-defense is the field's failure mode; we measure against it on purpose.

Fig. 1 — Detection recall by layerdeepset/prompt-injections · test · n=116

Regex tier English signatures

18.3%

+ Multilingual de · es · fr · it · pt

23.3%

+ Vector similarity self-hardening

25.0%

+ DeBERTa classifier opt-in ML

48.3%

0% FPR100% precision Each tier is additive and composable to your latency budget — sub-millisecond regex through ~130 ms classifier.

*0% false-positive rate on the deepset test split, including NotInject-style hard negatives. ^† A bundled offline benchmark scores 100% — but that's an in-distribution regression baseline, not a SOTA claim. The external number above is the one that counts. Full methodology & reproduction in docs/BENCHMARKS.md.

§2

Defense in depth

Detect · input

Signatures & obfuscation

Instruction-override, jailbreak, delimiter and exfiltration signatures — matched through zero-width, homoglyph, bidi and base64 normalization so evasions don't slip past.

regex · multilingual · encoding-aware

Detect · semantic

Vectors & classifier

Embedding similarity to a self-hardening attack corpus catches paraphrases & translations; an opt-in DeBERTa classifier recovers real-world recall.

cross-lingual · opt-in ML

Detect · agentic

Canaries & alignment

Canary tokens prove a successful leak; the alignment auditor flags when an action drifts from the user's objective — goal-hijack detection, not just text.

tool-call guarding · trace audit

Detect · output

Secrets & PII

Two-way scanning stops API keys, private keys and PII leaving in model output — Luhn-validated cards, optional Presidio backend. A jailbroken model is still caught at the exit.

redacted in logs · zero echo

Respond

Sanitize · block · isolate

Active defense, not just detection: redact the dangerous span, block with a safe fallback, throttle abusers, or spotlight untrusted text so the model can't be steered.

fail-closed · fail-soft modes

Operate

One API, everywhere

Drop-in for OpenAI-compatible clients & LangChain, an async API, an HTTP server, and an AgentDojo defense adapter. Three modes; YAML config; a plugin system.

strict · balanced · permissive

§3

Five lines to safe

quickstart.pypython ≥ 3.10

import shadowshield as ss

shield = ss.Shield.for_mode("strict")

# fail-closed: raises on a block
clean = shield.guard(user_prompt)
reply = my_llm(clean)

# two-way: catch leaks on the way out
safe = shield.guard(reply, direction="output")

agentic.pytool-call + alignment

# guard untrusted tool output (indirect injection)
shield.scan_tool_result("fetch_url", page_html)

# canary: detect a *successful* exfiltration
mark = shield.issue_canary()
if shield.scan_output(reply).blocked:
    handle_breach()

# goal-hijack auditing across the trace
with shield.session(objective=task) as s:
    s.scan_output(model_action)

§4

What it catches

Direct prompt injection

"Ignore all previous instructions", new-instruction injection, authority spoofing — in 5 languages at the signature tier.

Jailbreaks & role-play

DAN-style personas, "developer mode", restriction-removal and fiction-wrapper laundering.

Indirect / tool-output injection

Poisoned web pages and tool results that try to steer the agent — scanned as untrusted input.

Goal hijacking

Actions that drift from the user's stated objective, audited across the execution trace.

Encoding & obfuscation

Zero-width splits, homoglyphs, bidi overrides and base64/hex payloads — decoded and judged on meaning.

Secret & PII leakage

API/private keys, tokens, emails, SSNs and Luhn-valid cards leaving in model output — never echoed to logs.

Data exfiltration

System-prompt extraction, markdown-image beacons, pipe-to-shell and canary-token leaks.

Abuse & flooding

Adaptive per-identity rate limiting and oversized-input guards built into the request path.

§5

Against the field

Capability	LLM Guard	LlamaFirewall	Rebuff	ShadowShield
Input + output scanning	✓	✓	partial	✓
Multilingual signatures	—	—	—	✓
Canary tokens	—	—	✓*	✓
Agent-trace alignment audit	—	✓	—	✓
Spotlighting as a response	—	—	—	✓
Self-hardening vector tier	—	—	✓*	✓
Published external benchmark	partial	✓	—	✓
License	MIT	Meta	archived	MIT

*Rebuff pioneered canary tokens & the self-hardening loop, but was archived in 2025 — ShadowShield carries those ideas forward, maintained. Full matrix in docs/COMPARISON.md.

Ship agents that don't get talked into doing the wrong thing.

MIT-licensed. No telemetry. Lightweight by default — the heavy ML is opt-in.

★ Star on GitHub Read the benchmarks ↗