Engineering · May 22, 2026 · 7 min read · by BlackUnicorn

Nine layers, one verdict

How BonkLM's nine guardrail modules compose: the GuardrailResult contract, the four validators, the five guards, and the rules that bind them.

BonkLM ships nine named layers. They are not coupled: each is a small module with a single public function. The interesting part is how they compose, and this post walks through that architecture.

The contract

Every layer returns a GuardrailResult. Same shape, regardless of whether the layer is a validator (inbound text) or a guard (outbound text or tool args):

enum Severity   { INFO, WARNING, BLOCKED, CRITICAL }
enum RiskLevel  { LOW, MEDIUM, HIGH }

interface GuardrailResult {
  allowed: boolean
  blocked: boolean
  severity: Severity    // 4 levels
  risk_level: RiskLevel // 3 levels — derived from cumulative risk_score
  risk_score: number    // cumulative — sum of finding weights (0–100+)
  reason?: string
  findings: Finding[]   // raw matches with offsets
  timestamp: number
}

That uniformity is the whole trick. The engine doesn’t care which layer ran — it accumulates findings and decides the action.
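To make the contract concrete, here is a minimal sketch of how a layer might assemble a GuardrailResult from its findings. The Finding fields, the `toRiskLevel` cutoffs (30 and 70), and the `buildResult` helper are assumptions for illustration; the post only states that risk_score is a sum of finding weights and that risk_level is derived from it.

```typescript
enum Severity { INFO, WARNING, BLOCKED, CRITICAL }
enum RiskLevel { LOW, MEDIUM, HIGH }

// Hypothetical Finding shape: the post only says findings are raw matches with offsets.
interface Finding {
  pattern: string
  start: number
  end: number
  weight: number      // assumed per-finding contribution to risk_score
  severity: Severity  // assumed per-finding severity
}

interface GuardrailResult {
  allowed: boolean
  blocked: boolean
  severity: Severity
  risk_level: RiskLevel
  risk_score: number
  reason?: string
  findings: Finding[]
  timestamp: number
}

// Assumed cutoffs, chosen only for this example.
function toRiskLevel(score: number): RiskLevel {
  if (score >= 70) return RiskLevel.HIGH
  if (score >= 30) return RiskLevel.MEDIUM
  return RiskLevel.LOW
}

function buildResult(findings: Finding[]): GuardrailResult {
  let score = 0
  let severity = Severity.INFO
  for (const f of findings) {
    score += f.weight                               // cumulative score: sum of weights
    if (f.severity > severity) severity = f.severity // keep the highest severity seen
  }
  const blocked = severity >= Severity.BLOCKED
  const result: GuardrailResult = {
    allowed: !blocked,
    blocked: blocked,
    severity: severity,
    risk_level: toRiskLevel(score),
    risk_score: score,
    findings: findings,
    timestamp: Date.now(),
  }
  if (blocked) {
    result.reason = findings.length + " finding(s), highest severity " + Severity[severity]
  }
  return result
}
```

Because every layer would return this same shape, the caller can sum, compare, and merge results without knowing which layer produced them.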

The nine

Four validators inspect inbound text:

Prompt Injection — instruction-override, system-prompt extraction, persona pivots. 35 patterns across 6 categories, plus 52 multilingual patterns across 16 language codes.
Jailbreak — DAN-class roleplay, hypothetical framings, social-engineering ladders. 46 patterns across 10 categories.
Reformulation — decodes base64, hex, leetspeak, zero-width characters, HTML-comment smuggling. Runs before the other validators so they see the decoded payload.
Boundary — delimiter abuse, fake-section markers, context-overflow patterns.
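The reason Reformulation runs first is easiest to see in code. The sketch below covers only two of the listed encodings, zero-width characters and hex, and the function names, the 8-byte minimum run length, and the printable-ASCII check are assumptions, not BonkLM's actual implementation.

```typescript
// Zero-width characters that can be used to split trigger words.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g

// Hypothetical helper: decode runs of at least 8 hex byte pairs,
// but only when every byte is printable ASCII.
function decodeHexRuns(text: string): string {
  return text.replace(/\b(?:[0-9a-fA-F]{2}){8,}\b/g, function (run) {
    let decoded = ""
    for (let i = 0; i < run.length; i += 2) {
      const code = parseInt(run.slice(i, i + 2), 16)
      if (code < 0x20 || code > 0x7e) return run // not printable: leave the run untouched
      decoded += String.fromCharCode(code)
    }
    return decoded
  })
}

// Strip zero-width characters first, then decode hex, so later
// validators match against the plain payload.
function normalize(text: string): string {
  return decodeHexRuns(text.replace(ZERO_WIDTH, ""))
}
```

A pattern such as "ignore previous" would never match the hex-encoded input directly; after `normalize` it does, which is why ordering matters for this one layer.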

Five guards inspect outbound text, tool args, and deployment-environment hazards:

Secret — 37 credential types (38 patterns, 27 critical / 9 high / 2 medium): API keys, AWS keys, GitHub tokens, JWT, OpenAI sk-proj-* (post-2024), Anthropic sk-ant-*, Stripe live/test/restricted, Slack, Twilio, SendGrid, Mailgun, private SSH/PGP.
PII — 30 patterns spanning US, EU, and international identifiers (7 US · 18 EU · 5 common); Luhn-validated card numbers; IBAN/NIF/PESEL/DNI/NIE/personnummer.
Production — blocks commands targeting production environments (force-push, deploy, kubectl) with edge-runtime env-binding support.
XSS — reflected payloads, on*= handlers, javascript: URIs, SVG/MathML smuggling.
Bash safety — destructive commands, directory escape, dangerous chains via ; / && / backticks.
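The PII guard's Luhn validation is what keeps an arbitrary 16-digit number from being flagged as a card. The checksum itself is standard; the function name, the separator handling, and the 13 to 19 digit length bound below are assumptions for this sketch.

```typescript
// Luhn checksum: from the rightmost digit, double every second digit,
// subtract 9 from any result above 9, and require the total to be divisible by 10.
function luhnValid(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "") // tolerate spaces and dashes
  if (!/^\d{13,19}$/.test(digits)) return false  // assumed card-number length range
  let sum = 0
  let doubleNext = false
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48 // "0" is char code 48
    if (doubleNext) {
      d *= 2
      if (d > 9) d -= 9
    }
    sum += d
    doubleNext = !doubleNext
  }
  return sum % 10 === 0
}
```

Running the checksum after the regex match is a cheap way to cut false positives: a random digit string of the right length passes Luhn only about one time in ten.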

One more surface, streaming, is engine-level rather than a discrete guard: chunk-level inspection of LLM token streams that cuts the stream if any layer trips mid-flight.
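Chunk-level inspection has one subtlety: a match can straddle two chunks. A minimal sketch of one way to handle that, assuming a single regex stands in for the full layer stack, is to keep a rolling tail of recent text. The `StreamInspector` class, its `overlap` parameter, and the example pattern are hypothetical.

```typescript
// Hypothetical chunk inspector: keeps a rolling tail so a match that
// spans a chunk boundary is still caught.
class StreamInspector {
  private pattern: RegExp // must not use the g flag: test() would become stateful
  private overlap: number // how many trailing characters to carry between chunks
  private tail = ""
  private cut = false

  constructor(pattern: RegExp, overlap: number) {
    this.pattern = pattern
    this.overlap = overlap
  }

  // Returns the chunk if it may be forwarded, or null once the stream is cut.
  push(chunk: string): string | null {
    if (this.cut) return null
    const windowText = this.tail + chunk
    if (this.pattern.test(windowText)) {
      this.cut = true
      return null
    }
    this.tail = this.overlap > 0 ? windowText.slice(-this.overlap) : ""
    return chunk
  }
}
```

In this sketch the start of a straddling match has already been forwarded by the time the stream is cut. A stricter variant would hold back the tail until the next chunk clears it, trading a little latency for no partial leak.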

Composition rules

The GuardrailEngine takes an array of layers and two top-level dials:

shortCircuit — stop at first blocking verdict. Default true. Set false to collect all findings before deciding.
parallel — run all layers concurrently. Default false. Worth it when shortCircuit is false and you want to bound latency.

The engine merges findings, picks the highest risk_level across layers, and applies the configured action (block / sanitize / log / allow).
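The merge and short-circuit behaviour can be sketched in a few lines. This is not the GuardrailEngine API: `runLayers`, `LayerResult`, and `Layer` are hypothetical names, the result shape is reduced to three fields, the sketch is sequential only (the `parallel` dial is left out), and the configured action is not modeled.

```typescript
enum RiskLevel { LOW, MEDIUM, HIGH }

// Reduced result shape for the sketch; the full contract has more fields.
interface LayerResult {
  blocked: boolean
  risk_level: RiskLevel
  findings: string[]
}

type Layer = (text: string) => LayerResult

interface EngineOptions {
  shortCircuit: boolean
}

function runLayers(layers: Layer[], text: string, options: EngineOptions): LayerResult {
  const merged: LayerResult = { blocked: false, risk_level: RiskLevel.LOW, findings: [] }
  for (const layer of layers) {
    const result = layer(text)
    // Merge findings and keep the highest risk level seen so far.
    merged.findings = merged.findings.concat(result.findings)
    if (result.risk_level > merged.risk_level) merged.risk_level = result.risk_level
    if (result.blocked) {
      merged.blocked = true
      if (options.shortCircuit) break // stop at the first blocking verdict
    }
  }
  return merged
}
```

With shortCircuit true, layers after the first blocking one never run, so their findings are absent from the merged result. That is the trade-off the dial controls: lower latency against a complete audit trail.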

Why nine and not three

Some libraries ship a monolithic “guardrail” that does everything. We split because the layers have different lifecycles. The pattern catalogues evolve at different cadences: the Secret catalogue churns monthly as new key formats appear, while XSS is much more stable. The layers have different failure modes: a Secret false positive is annoying, but a PII false positive is a compliance incident. And the layers belong to different reviewers: the security team owns Prompt Injection, the privacy team owns PII, and platform engineering owns Bash safety.

Splitting them lets each team move at its own pace without breaking the others. The engine’s job is just to compose them.

Where to go next

Each layer has its own reference in the Validators and Guards docs pages. The playground runs eight of them against your own prompts.
