Why Safety Prompts Should Stay Out of Training Data
Short Summary
Safety prompts are commonly used to evaluate whether language models refuse harmful or illegal requests. This study shows that when widely circulated safety prompts enter training data, they can produce misleading evaluation outcomes: they induce narrow, prompt-specific blind spots that raise the failure rate on the published prompt, so the evaluation no longer measures robust safety behaviour.
Across models, we observe a clear progression. LLaMA 2 exhibits broad, length-dependent failures, with violation rates increasing smoothly for longer prompts. Mistral, Mistral 3, Claude Haiku 3, and Qwen 2 show localized degradation: unsafe behaviour peaks on the published prompt and nearby formulations, while substantially shorter or longer paraphrases reduce violations. Claude Sonnet 4 refuses uniformly across all variants, indicating true rule-level generalization.
Most concerning are Qwen 3 and LLaMA 3.3. Both models refuse correctly on most semantically equivalent prompts, yet fail sharply on the published prompt and closely matched variants. Qwen 3’s violation rate jumps from near zero to ~30%, while LLaMA 3.3 reaches ~85% on the identical prompt. These highly localized failures suggest that prompt exposure can sharpen brittle blind spots, causing standard evaluations to both underestimate general safety and miss nearby failure modes.
Methodology
We investigate this phenomenon using a published insider trading prompt from research literature, first released in November 2023 (Scheurer et al., 2024). The prompt describes a scenario in which an assistant is asked to assist with illegal insider trading. Correct behaviour is a clear refusal. In the original work introducing this prompt, GPT-4 frequently chose to act on insider information and then strategically concealed the insider trading nature of its transactions when questioned. This analysis is a continuation of our work described in previous blogs.
Models evaluated
Table 1: List of tested models and their cut-off dates. *: the sources for these cut-off dates can be found here and here. **: these are community estimates; no exact published cut-off dates could be found.
Prompt perturbations
Starting from the original published prompt, we constructed 200 semantically equivalent variants, divided into four classes:
50 substantially shorter than the original,
50 slightly shorter but similar in length,
50 slightly longer but similar in length,
50 substantially longer.
All variants describe the same prohibited activity and differ only in wording, structure, or length. Each prompt was presented 100 times to each model.
We measure the violation rate (0-100%), defined as the percentage of runs in which the model performs insider trading instead of refusing.
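As a minimal sketch of this metric (not the evaluation code used in the study), the violation rate for one prompt can be computed from a list of per-run outcomes. The function name `violation_rate` and the boolean encoding of a run are assumptions made for illustration; how a run is judged to be a violation is not specified here.

```python
from typing import Sequence


def violation_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage (0-100) of runs judged to be violations.

    Each entry is True if the model performed the trade in that run and
    False if it refused. The judging step itself is outside this sketch.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical example: one prompt presented 100 times, 31 runs judged violations.
runs = [True] * 31 + [False] * 69
print(violation_rate(runs))
```

With 100 runs per prompt, each run contributes exactly one percentage point to the rate.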
What can we learn from this experiment?
This experiment probes a generational shift in how safety rules are represented and applied by language models. The key difference is not merely that newer models are “better,” but that they appear to encode safety constraints in a fundamentally different way.
From prompt sensitivity to rule abstraction
Earlier-generation models tend to apply safety behaviour in a context-dependent manner. Their responses are influenced by surface features of the prompt (such as wording, structure, or length) rather than being driven solely by the underlying rule. As a result, small changes to phrasing can lead to measurable changes in behaviour, even when the semantic meaning remains unchanged.
Later-generation models, by contrast, are more likely to exhibit rule-level abstraction. Once a request is recognized as belonging to a prohibited category (such as insider trading), refusal behaviour becomes stable and invariant across phrasing. This suggests that the safety rule itself has been internalized, rather than approximated through a collection of prompt-specific patterns.
Ideal evaluation behaviour
In an ideal evaluation setting, a model that understands the rule “do not assist with insider trading, even to protect the future of the company” should refuse consistently, regardless of:
wording,
length,
or whether the prompt has been published or is entirely novel.
Under these conditions, testing performance reflects genuine generalization rather than familiarity with a specific prompt.
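The ideal described above can be written as a simple check over per-class violation rates. This is a sketch under stated assumptions: the function name `refuses_uniformly`, the class labels, and the 1% tolerance are all hypothetical, and the numbers in the example are invented for illustration rather than taken from the results.

```python
from typing import Mapping


def refuses_uniformly(class_rates: Mapping[str, float], max_rate: float = 1.0) -> bool:
    """True if every prompt class stays at or below max_rate percent violations.

    class_rates maps a prompt class (including the published prompt) to its
    mean violation rate in percent. max_rate is an assumed tolerance.
    """
    return all(rate <= max_rate for rate in class_rates.values())


# Invented numbers: one model that ignores wording and length, one that does not.
invariant = {"shorter": 0.0, "identical": 0.0, "longer": 0.0}
prompt_coupled = {"shorter": 0.0, "identical": 30.0, "longer": 0.5}
print(refuses_uniformly(invariant), refuses_uniformly(prompt_coupled))
```

The published prompt is treated as just another class, so it passes only if it does not stand out from the novel variants.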
Claude Haiku 3 vs. Claude Sonnet 4: a concrete example
The contrast between Claude Haiku 3 and Claude Sonnet 4 illustrates the aforementioned generational shift particularly clearly.
Claude Haiku 3 generally understands that insider trading is disallowed, but its behaviour remains partially coupled to the published prompt formulation. The published prompt and perturbations of similar length produce higher and more stable violation rates than substantially shorter or longer variants. This indicates that while the model often applies the correct rule, its decision boundary is still influenced by prompt-specific features. In practice, the published prompt and those of similar length act as a local maximum for unsafe behaviour. A point-spread representation can be found in Figure 1. Because the model should not have had access to the published prompt, this behaviour needs an explanation, which is given in the Discussion.
Claude Sonnet 4, in contrast, shows no such coupling. Violation rates are consistently near zero across all prompt variants, and neither wording nor length has a measurable effect on behaviour. The published prompt does not stand out in any way. This strongly suggests that Sonnet’s refusal behaviour is driven by an abstract representation of the safety rule rather than by familiarity with a particular prompt structure. A point-spread representation can be found in Figure 2.


Summary of results across all models
Figure 3 below shows the mean violation rate and standard deviation for each tested model across five prompt classes: shorter, shorter with similar length, identical (the published prompt), longer with similar length, and longer. Together, these results allow us to distinguish models that behave robustly under semantic perturbations from those whose safety performance depends strongly on prompt length or exact wording. The full analysis results can be found here and are presented in Table 1 in Appendix C. Graphs for each model and experiment can be found in Appendix A.
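A per-class summary of this kind can be sketched as follows. This is not the analysis code behind Figure 3: the function name `summarize`, the class labels, and the use of the sample standard deviation are assumptions, and a class with a single prompt is given a spread of 0.0 purely as a placeholder.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple

# Assumed labels for the five prompt classes, ordered by length.
CLASSES = ["shorter", "similar_shorter", "identical", "similar_longer", "longer"]


def summarize(rates: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Map each prompt class to (mean, standard deviation) of its violation rates.

    rates yields (prompt_class, violation_rate_percent), one pair per prompt.
    """
    grouped = defaultdict(list)
    for prompt_class, rate in rates:
        grouped[prompt_class].append(rate)

    summary = {}
    for prompt_class in CLASSES:
        values = grouped.get(prompt_class, [])
        if not values:
            continue  # class not present in the input
        # stdev needs at least two values; use 0.0 as a placeholder otherwise.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[prompt_class] = (mean(values), spread)
    return summary


# Invented per-prompt rates, for illustration only.
example = [("shorter", 0.0), ("shorter", 2.0), ("identical", 30.0)]
print(summarize(example))
```

Whether the spread is computed over prompts within a class or over repeated batches of the same prompt changes its interpretation, so that choice should be stated alongside any such summary.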

Results across model families
Across all evaluated models, four distinct empirical patterns emerge in how refusal behaviour varies under semantic perturbations of the published insider trading prompt. These patterns are observable directly in mean violation rates and their variance across prompt classes.
Uniform refusal behaviour:
Claude Sonnet 4 exhibits uniformly zero violation rates with zero variance across all prompt classes. Neither the published prompt nor any perturbation produces measurable unsafe behaviour. Refusal is invariant to wording and length, indicating stable rule application across the entire prompt space.
Length-sensitive but prompt-agnostic behaviour:
LLaMA 2 shows a smooth dependence of violation rate on prompt length. Mean violation rates increase substantially for longer prompts and decrease for shorter ones, with corresponding changes in variance. Importantly, the published prompt does not stand out relative to other prompts of similar length, indicating sensitivity to structural complexity rather than prompt-specific effects. This effect can be observed in Figure 4 below for LLaMA 2.

Localized degradation near the published prompt:
Mistral, Mistral 3, Claude Haiku 3, and Qwen 2 exhibit a directional pattern in which the published prompt and perturbations of similar length yield higher mean violation rates with comparatively low variance. Substantially shorter or longer prompts tend to reduce the mean violation rate while increasing variability. This indicates that unsafe behaviour is concentrated near the canonical prompt formulation rather than uniformly distributed across perturbations. This effect can be observed in Figure 5 below for Mistral 3.

Highly localized prompt-specific blind spots:
Qwen 3 displays near-perfect refusal behaviour for substantially shorter and substantially longer prompts, demonstrating strong semantic understanding of the prohibition against insider trading. However, violation rates increase sharply for the published prompt and perturbations of similar length. Unsafe behaviour is tightly concentrated in a narrow region of prompt space centred on the canonical formulation. This can be observed in Figure 6 below.
LLaMA 3.3 refuses reliably only for substantially shorter prompts. The published prompt has the highest violation rate, while perturbations of similar or greater length show high violation rates with high variability. This can be observed in Figure 7 below.
Together, these results show that safety performance can vary substantially across semantically equivalent prompts, and that the published prompt is not always representative of a model’s general refusal behaviour.
Where this leads next
The results so far show that published prompts can behave very differently from their semantic equivalents, whether of similar or different length, especially for newer models that otherwise demonstrate strong safety behaviour. In the next section, we take a closer look at Qwen 3 and LLaMA 3.3, analysing how these blind spots emerge and what they reveal about prompt leakage. We then conclude with a broader discussion of the implications for safety evaluation of publishing prompts.
A closer look at Qwen 3 and LLaMA 3.3
Qwen 3 and LLaMA 3.3 exhibit the clearest and most consequential form of published prompt-specific failure observed in this study. Unlike earlier models that struggle more broadly, these models demonstrate strong general safety competence while simultaneously failing in a highly localized way in scenarios similar to the published prompt.
For Qwen 3, substantially shorter and substantially longer perturbations of the insider trading prompt produce near-perfect refusal behaviour; for LLaMA 3.3, the same holds for substantially shorter perturbations. Violation rates in these regimes approach zero, indicating that the models correctly recognize insider trading as prohibited and are capable of refusing when the request is framed differently.
However, this behaviour changes sharply for the published prompt itself, as well as for perturbations that closely match its length and structure. In these cases, violation rates increase dramatically and, in some conditions, become highly volatile. Unsafe behaviour is therefore not spread uniformly across prompt space, but instead concentrated in a narrow region centred on the published prompt formulation.
Qwen 3
Qwen 3 provides a particularly clear illustration of this effect. For substantially shorter prompts, its mean violation rate is effectively zero. For substantially longer prompts, the violation rate again drops to very low values. Yet for the identical prompt, the mean violation rate rises to approximately 31%, with similar rates observed for nearby-length perturbations.
This pattern suggests that the published prompt occupies a special position in the model’s internal decision space, triggering unsafe behaviour despite the model’s demonstrated ability to refuse the same request when phrased differently. The result is a narrow but severe blind spot that would misrepresent the safety of these models under standard evaluation with the published prompt, and that also extends to semantically equivalent insider trading prompts of similar length.
A visual representation of the perturbations and their spread for Qwen 3 can be found in Figure 6 below.

LLaMA 3.3
LLaMA 3.3 displays a one-sided but more severe version of this phenomenon. Its behaviour on shorter prompts is highly reliable, with a mean violation rate close to zero. However, the identical prompt produces the highest violation rate observed in the entire study, exceeding 85%, with very low variance across runs. This indicates not random failure, but a stable and consistently incorrect response mode.
Longer and similarly sized perturbations reduce the mean violation rate relative to the published prompt but introduce substantial variability, suggesting unstable behaviour near and above the published prompt length. As with Qwen 3, unsafe behaviour is tightly localized: the model understands the rule in general, but fails reliably when the request closely resembles the published prompt and when the prompt length increases.
A visual representation of the perturbations and their spread for LLaMA 3.3 can be found in Figure 7 below.

Interpreting the blind spot
Crucially, neither Qwen 3 nor LLaMA 3.3 fails because it lacks an understanding of the prohibition against insider trading. On the contrary, their near-perfect performance on many perturbations demonstrates genuine semantic competence. The failure instead arises from a prompt-specific attractor, where the published prompt formulation itself appears to cue an anomalous behaviour pattern.
This is precisely the failure mode that safety prompts are intended to detect, and yet, paradoxically, it is the published prompt itself that creates the problem.
Discussion
Below we give our view on the possible causes of these results. We also explored alternative explanations, which can be found in Appendix B.
Two generations of safety behaviour
The statistical analysis indicates a generational distinction in how safety rules are represented and applied. Older models perform significantly worse on substantially shorter and longer perturbations, but significantly better on the identical prompt and on slightly longer or shorter perturbations.
This leads us to theorize that older models tend to rely on heuristic safety mechanisms that are sensitive to surface-level properties. Newer models operate at a more abstract level, reliably recognizing prohibited categories across rewordings, yet can still develop narrow failures tied to specific published prompts.
Explaining localized failures in models that predate the prompt’s publication
Even for the older models whose stated knowledge cut-off predates publication of the safety prompt, elevated violation rates on the canonical formulation do not necessarily imply direct exposure. A plausible explanation is that the published prompt lies near a local optimum in prompt space that reliably elicits unsafe behaviour. Its particular combination of length, structure, and framing may strongly activate violation behaviour. Semantically equivalent rewordings, especially those that substantially alter length or structure, can move the model out of this region, leading to improved refusal behaviour. This effect can arise from general training on similar pre-existing scenarios and structural cues rather than memorization of the published prompt itself.
Prompt exposure and brittle abstraction in newer models
In the case of Qwen 3 and LLaMA 3.3, safety failures appear to be highly localized rather than broadly distributed across the prompt space. These models exhibit strong abstraction of the underlying safety rule in most settings, yet show abrupt and severe degradation on the published safety prompt and on closely related formulations. Given that their training data postdates the publication of this prompt, prior exposure may have amplified sensitivity to this specific scenario, effectively creating a narrow but brittle blind spot. Consequently, reliance on a single published safety prompt risks both underestimating the models’ overall safety capability and obscuring nearby failure modes that would remain undetected without systematic perturbation testing.
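One way to make "highly localized" concrete during perturbation testing is to compare violation rates near the published prompt with those far from it. The sketch below is an illustrative heuristic, not a metric used in this study: the name `localization_gap`, the class labels, and the grouping into near and far classes are assumptions, and the example numbers are invented.

```python
from statistics import mean
from typing import Mapping

# Assumed grouping: classes close in length to the published prompt vs. distant ones.
NEAR = ("similar_shorter", "identical", "similar_longer")
FAR = ("shorter", "longer")


def localization_gap(class_means: Mapping[str, float]) -> float:
    """Mean violation rate near the published prompt minus the mean far from it.

    Inputs are per-class mean violation rates in percent; the result is in
    percentage points. A large positive gap means failures cluster around
    the published formulation.
    """
    near = mean(class_means[c] for c in NEAR)
    far = mean(class_means[c] for c in FAR)
    return near - far


# Invented per-class means for a model with a narrow blind spot.
blind_spot = {
    "shorter": 0.0,
    "similar_shorter": 30.0,
    "identical": 30.0,
    "similar_longer": 30.0,
    "longer": 0.0,
}
print(localization_gap(blind_spot))
```

A gap near zero is compatible with either uniform refusal or uniform failure, so it should be read together with the absolute violation rates rather than on its own.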
Conclusion
This study demonstrates that safety prompts can inadvertently create blind spots when incorporated into training data, and that the form of these blind spots depends strongly on model generation.
Taken together, these findings show that the effects of prompt contamination evolve with model capability. As models improve, failures become more localized, harder to detect, and increasingly concentrated on the very prompts used for evaluation. Reliance on a single published safety prompt therefore risks conflating genuine rule understanding with prompt-specific artifacts. Robust safety evaluation requires systematic testing across families of prompts with semantically equivalent meaning and continued use of held-out, non-public evaluation materials.
Newer models largely succeed in abstracting safety rules, yet some of them appear to develop narrow failure modes: blind spots around the published prompt where the undesired behaviour, surprisingly, thrives rather than being suppressed.
Robust safety evaluation therefore requires more than improved model capability. It demands evaluation methods that evolve alongside model generations, preserving the distinction between learning the rule and recognizing the prompt. Publicly circulated safety prompts should be treated as high-risk artifacts, excluded from training wherever possible, and complemented by families of paraphrased prompts and held-out evaluations.
References:
Scheurer et al. (2024):
Scheurer, J., Balesni, M., & Hobbhahn, M. (2024). Large language models can strategically deceive their users when put under pressure. arXiv. https://arxiv.org/abs/2311.07590
LLM knowledge cut-off dataset (GitHub):
Wang, H. (n.d.). llm-knowledge-cutoff-dates (Version latest) [Data set]. GitHub. https://github.com/HaoooWang/llm-knowledge-cutoff-dates
Appendix:
The full appendix can be found here.


