Single-Turn Benchmarks Underestimate Real-World Risk
Attackers probing large language models rarely stop after a single refusal. They reframe questions, build context across multiple turns, adopt personas, and gradually escalate their requests. A new study from a major cybersecurity firm's AI threat intelligence team reveals that the safety benchmarks used across the industry miss nearly all of this iterative behavior. The gap between published scores and observed resilience is so wide that it can misrank leading models, giving buyers and regulators a false sense of security.
The report evaluates 15 closed flagship models from OpenAI, Anthropic, Google, Amazon, and xAI using both single-turn and multi-turn methodologies. The testing encompassed roughly 30,000 single-turn prompts and nearly 7,000 multi-turn attacks spread across more than 1,400 conversations. Across the cohort, multi-turn attack success rates (ASR) climbed as high as 88%, an order of magnitude above the lowest result in the group. Single-turn and multi-turn testing produced completely different rankings, failure maps, and tail-risk profiles.
Key Findings Across Models
Every model in the study failed a meaningful share of multi-turn attacks. OpenAI's GPT-5.4 jumped roughly ninefold under iterative pressure, moving from a single-turn ASR in the low single digits to nearly 25%. Google's Gemini 3 Pro climbed from about 18% to 73%. xAI's Grok 4.1 Fast in its non-reasoning configuration topped the cohort at 88%. Anthropic's Claude family posted the strongest single-turn refusal performance, with single-turn ASRs in the low single digits, yet still landed in the 11% to 16% range once attackers were allowed to adapt.
Cross-regime gaps ran in both directions. Gemini 3 Pro rose by more than 55 points under iterative testing. All three Amazon Nova variants moved the opposite way: Nova 2 Lite recorded a relatively high single-turn ASR and the lowest multi-turn ASR in the entire cohort at about 8%. More than half of the models tested showed an absolute gap of at least 15 points between the two regimes. This stark divergence underscores that single-turn scores are not predictive of multi-turn performance.
Configuration Changes Dramatically Alter Risk
The same Grok 4.1 Fast model with reasoning mode enabled saw its multi-turn ASR cut roughly in half, a swing of more than 40 points tied to a single capability flag. This configuration-driven safety variation does not appear on any public benchmark or model card the researchers reviewed. Users running the model in its default non-reasoning configuration encounter a substantially different threat profile from users who turn reasoning on. Such findings highlight the importance of testing models under realistic deployment conditions.
The work extends an earlier study of eight open-weight models, where multi-turn ASR ran two to ten times higher than single-turn baselines and reached more than 90% against Mistral Large-2. Multi-turn vulnerability appears to be a structural property of the current frontier, present in both open and proprietary weights.
Five Strategy Families Drive Failures
The researchers identified five strategy families that caused most of the multi-turn breakdowns: role-play and persona adoption, contextual ambiguity, refusal reframing, information decomposition, and crescendo-style escalation. Within each family, the spread between the most and least exposed model was large, often approaching the full range of the chart. This pattern suggests that strategy labels mostly sort which models pull apart from one another, even where average difficulty looks similar.
On the single-turn side, three procedures dominated the rankings: Imposter AI, Soft Paraphrase, and System Prompts. By content type, hate speech, profanity, and specialized advice led. Imposter AI alone outpaced the tenth-ranked procedure by a wide margin, indicating that targeted fixes to a handful of attack surfaces could move the aggregate numbers for most models in the cohort.
The Role of Guardrails
Production deployments typically wrap base models in additional safety layers. The researchers note that those layers help but have limits. Guardrails attenuate risk but do not eliminate it. The base model sets the floor on what any production system can achieve. Just as traditional software development decisions involve risk tolerance and acceptance for the code itself and all its dependencies, the same approach applies to AI development and deployment. However, the blast radius for a rogue or misaligned AI agent has the potential to be more damaging than a software flaw, especially in the emerging agentic AI space.
Recommendations for Buyers and Deployers
The team proposes three operational steps for organizations buying or deploying AI: publish ASR by strategy family on every model release, gate deployments on regressions in the top three procedures and content types using a 3-point threshold, and flag any model with a cross-regime gap above 15 points for manual review. Applied to this cohort, the third rule alone surfaces more than half the tested models for closer examination.
Regulatory frameworks point in the same direction. The NIST AI Risk Management Framework, the forthcoming NIST Cyber AI Profile (IR 8596), and Article 15 of the EU AI Act all call for adversarial robustness testing. None currently specify the interaction regime, strategy decomposition, or slice-support labeling that this research argues is needed for decision-grade assessment.
Source: Help Net Security News