The Defense Dispatch Open-Source Edition

ShadowShield

Vol. 0.5.0 · MIT License Unified guard for agentic AI Inspired by Sentinel & ShadowClaw
Prompt-injection defense · in depth

A security shield for agentic AI that shows its homework.

Layered, defense-in-depth protection for LLM apps and agents — signatures, multilingual coverage, vector similarity, an ML classifier, canary tokens and agent-trace alignment auditing — fused into one engine with a single API. Honest, reproducible numbers, not marketing.

$ pip install shadowshield
§1

The honest number

We publish results on a public dataset, not our own. On the deepset/prompt-injections test split, each layer adds recall — and every layer holds 0% false positives and 100% precision. Over-defense is the field's failure mode; we measure against it on purpose.

Fig. 1 — Detection recall by layerdeepset/prompt-injections · test · n=116
Regex tier English signatures
18.3%
+ Multilingual de · es · fr · it · pt
23.3%
+ Vector similarity self-hardening
25.0%
+ DeBERTa classifier opt-in ML
48.3%
0% FPR100% precision Each tier is additive and composable to your latency budget — sub-millisecond regex through ~130 ms classifier.

*0% false-positive rate on the deepset test split, including NotInject-style hard negatives. A bundled offline benchmark scores 100% — but that's an in-distribution regression baseline, not a SOTA claim. The external number above is the one that counts. Full methodology & reproduction in docs/BENCHMARKS.md.

§2

Defense in depth

Detect · input

Signatures & obfuscation

Instruction-override, jailbreak, delimiter and exfiltration signatures — matched through zero-width, homoglyph, bidi and base64 normalization so evasions don't slip past.

regex · multilingual · encoding-aware
Detect · semantic

Vectors & classifier

Embedding similarity to a self-hardening attack corpus catches paraphrases & translations; an opt-in DeBERTa classifier recovers real-world recall.

cross-lingual · opt-in ML
Detect · agentic

Canaries & alignment

Canary tokens prove a successful leak; the alignment auditor flags when an action drifts from the user's objective — goal-hijack detection, not just text.

tool-call guarding · trace audit
Detect · output

Secrets & PII

Two-way scanning stops API keys, private keys and PII leaving in model output — Luhn-validated cards, optional Presidio backend. A jailbroken model is still caught at the exit.

redacted in logs · zero echo
Respond

Sanitize · block · isolate

Active defense, not just detection: redact the dangerous span, block with a safe fallback, throttle abusers, or spotlight untrusted text so the model can't be steered.

fail-closed · fail-soft modes
Operate

One API, everywhere

Drop-in for OpenAI-compatible clients & LangChain, an async API, an HTTP server, and an AgentDojo defense adapter. Three modes; YAML config; a plugin system.

strict · balanced · permissive
§3

Five lines to safe

quickstart.pypython ≥ 3.10
import shadowshield as ss

shield = ss.Shield.for_mode("strict")

# fail-closed: raises on a block
clean = shield.guard(user_prompt)
reply = my_llm(clean)

# two-way: catch leaks on the way out
safe = shield.guard(reply, direction="output")
agentic.pytool-call + alignment
# guard untrusted tool output (indirect injection)
shield.scan_tool_result("fetch_url", page_html)

# canary: detect a *successful* exfiltration
mark = shield.issue_canary()
if shield.scan_output(reply).blocked:
    handle_breach()

# goal-hijack auditing across the trace
with shield.session(objective=task) as s:
    s.scan_output(model_action)
§4

What it catches

Direct prompt injection

"Ignore all previous instructions", new-instruction injection, authority spoofing — in 5 languages at the signature tier.

Jailbreaks & role-play

DAN-style personas, "developer mode", restriction-removal and fiction-wrapper laundering.

Indirect / tool-output injection

Poisoned web pages and tool results that try to steer the agent — scanned as untrusted input.

Goal hijacking

Actions that drift from the user's stated objective, audited across the execution trace.

Encoding & obfuscation

Zero-width splits, homoglyphs, bidi overrides and base64/hex payloads — decoded and judged on meaning.

Secret & PII leakage

API/private keys, tokens, emails, SSNs and Luhn-valid cards leaving in model output — never echoed to logs.

Data exfiltration

System-prompt extraction, markdown-image beacons, pipe-to-shell and canary-token leaks.

Abuse & flooding

Adaptive per-identity rate limiting and oversized-input guards built into the request path.

§5

Against the field

CapabilityLLM GuardLlamaFirewallRebuffShadowShield
Input + output scanningpartial
Multilingual signatures
Canary tokens✓*
Agent-trace alignment audit
Spotlighting as a response
Self-hardening vector tier✓*
Published external benchmarkpartial
LicenseMITMetaarchivedMIT

*Rebuff pioneered canary tokens & the self-hardening loop, but was archived in 2025 — ShadowShield carries those ideas forward, maintained. Full matrix in docs/COMPARISON.md.

Ship agents that don't get talked into doing the wrong thing.

MIT-licensed. No telemetry. Lightweight by default — the heavy ML is opt-in.