PromptGuard

December 19, 2025 · Allan Karanja

Twitter/X is full of leaked systems prompts, I wanted to explore to guard against these attacks. My first intuition was to use a classifier model. Gather data from popular jail break websites and use those to train a simple model. The issue is that they do not generalize enough and seem to overfit on specific words like ignore or override. These could be legitimate instructions but will be flagged and lead to a poor user experience. They also perform poorly with 60% accuracy on edge cases and poor generalization.

LLMs cannot be reliably used to detect prompt injection attacks because as they’re susceptible to the same blind spots they’re trying to evaluate. I choose Deberta, a model from the BERT class, which reads text bidirectionally rather than left-to-right like LLMs do. These are well-suited for text understanding tasks like classification. They understand patterns rather than keywords and have better context understanding.

Running on the CPU takes under 5ms per inference, with GPU doing <1 ms per inference. This is viable in a production setting. But the issue is the short context length of 512 tokens. An attacker could simply post a longer, safe looking prompt for the first 512 tokens and thereafter their jailbreak prompt. This attack would succeed and not be caught as the model(prompt guard) truncates input. The solution could be to switch to a model with a longer context length, or feed the prompt as chunks with overlap into the model. Feeding 10 chunks could only add ~50ms of latency on CPU!

After 3 epochs (4,689 steps), the model converged to a final training loss of 0.0499. While overall accuracy reached 98.3%, the model achieved 98.5% precision, meaning false positives on legitimate prompts were rare and a recall of 97.8% showing a strong ability to catch jailbreak attacks.

These evaluations primarily cover known jailbreak patterns. While the model generalizes well, novel attack strategies may emerge over time, requiring periodic retraining or dataset updates to mitigate distribution shift and model drift.

Link to model