Original Reddit post

One thing that has always bothered me about games like Gandalf is that they’re mostly black boxes. You either get the password or you don’t, but you rarely learn:

- what defense fired
- why it fired
- where the defense failed
- how those failures improve the system

So for the Hugging Face Build Small Hackathon I built Whisperkey. On the surface it’s a jailbreak game: convince a small AI guardian to reveal a secret key. Under the hood it’s really an experiment in open-source LLM security.

The guardian is protected by multiple layers:

- Regex-based injection detection
- Prompt hardening
- Output redaction
- unplug-tiny, a fine-tuned DeBERTa-v3-xsmall classifier (~22M parameters)

Unlike most guardrail systems, when a defense triggers it exposes its reasoning:

- which stage fired
- attack category
- evidence string
- detection trajectory

The more interesting part is the feedback loop. All attack attempts are logged to a public dataset with secrets and PII removed. The highest-value examples are the false negatives: attacks that successfully bypass the firewall. Those examples represent the model’s exact blind spots and become new training data and detection patterns. In other words, successful jailbreaks improve the firewall.

Current benchmark (18 attacks, 12 benign prompts):

- Regex only: 39% attack detection
- Regex + unplug-tiny: 83% attack detection
- 0% false positives on benign inputs

The remaining failures are mostly novel or disguised attacks, which is exactly what the project is trying to surface.

Everything is open source:

- Play: https://build-small-hackathon-whisperkey.hf.space/
- Code: https://github.com/chiruu12/jailbreak-dojo
- Model: https://huggingface.co/Unplug-AI/unplug-tiny-v1

I’m particularly interested in feedback from people working on agent security, prompt injection defenses, guardrails, and adversarial evaluation.
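
For anyone who wants a concrete picture of the layering, here is a minimal Python sketch of a regex stage followed by a classifier stage, each returning a structured verdict (stage, category, evidence). It is an illustration under stated assumptions, not the actual Whisperkey code: the `Verdict` fields, the two regex patterns, the `score_fn` callable (a stand-in for whatever scores a prompt, such as the unplug-tiny classifier), the 0.5 threshold, and the helper names are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Explainable result of a defense check (field names are assumptions)."""
    blocked: bool
    stage: Optional[str] = None      # which stage fired
    category: Optional[str] = None   # attack category
    evidence: Optional[str] = None   # matched text or score summary


# Hypothetical patterns for illustration only; not the project's real rule set.
INJECTION_PATTERNS = {
    "instruction_override": re.compile(
        r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE
    ),
    "secret_extraction": re.compile(
        r"(reveal|print|tell me) (the |your )?(secret|password|key)", re.IGNORECASE
    ),
}


def regex_stage(prompt: str) -> Verdict:
    """First layer: cheap pattern matching that reports what it matched."""
    for category, pattern in INJECTION_PATTERNS.items():
        match = pattern.search(prompt)
        if match:
            return Verdict(True, "regex", category, match.group(0))
    return Verdict(False)


def classifier_stage(
    prompt: str, score_fn: Callable[[str], float], threshold: float = 0.5
) -> Verdict:
    """Second layer: score_fn stands in for a learned injection classifier."""
    score = score_fn(prompt)
    if score >= threshold:
        return Verdict(True, "classifier", "ml_detected", f"score={score:.2f}")
    return Verdict(False)


def check(prompt: str, score_fn: Callable[[str], float]) -> Verdict:
    """Run the layers in order and return the first verdict that blocks."""
    verdict = regex_stage(prompt)
    if verdict.blocked:
        return verdict
    return classifier_stage(prompt, score_fn)


def redact(output: str, secret: str) -> str:
    """Output redaction: mask the secret if the guardian leaks it verbatim."""
    return output.replace(secret, "[REDACTED]")


def is_false_negative(verdict: Verdict, raw_output: str, secret: str) -> bool:
    """An attack that no stage blocked but that still leaked the secret."""
    return (not verdict.blocked) and secret in raw_output
```

In a setup like this, prompts for which `is_false_negative` returns true are the ones worth logging (after redaction) as candidates for new training data and new patterns, which is the feedback loop described above.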

Originally posted by u/Junior_Bake5120 on r/ArtificialInteligence