AI tools drop their safety measures as conversations go on, increasing the risk of harmful responses, a new report has found. A few quick prompts can dismantle most of the protections built into artificial intelligence systems, according to the findings.
Cisco Exposes Weaknesses Across Major AI Models
Cisco tested large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft to see how easily they could be pushed into releasing unsafe or illegal content. Researchers ran 499 “multi-turn attacks”, in which users ask a series of linked questions to trick the safety filters. Each conversation included five to ten exchanges.
The team measured how often the chatbots produced damaging information, such as stolen data or misinformation. When asked a series of questions, the chatbots gave malicious answers in 64 per cent of cases, compared with 13 per cent when asked only once. Success rates varied widely, from 26 per cent for Google’s Gemma to 93 per cent for Mistral’s Large Instruct.
Cisco said the results show that multi-turn attacks could be used to spread harmful content or to give hackers access to confidential business data. AI models often lose track of their safety training over longer conversations, the company said, allowing attackers to gradually refine their prompts and slip past safeguards.
Open Models Shift Safety Burden to Users
Mistral, Meta, Google, OpenAI, and Microsoft all offer open-weight language models, meaning their underlying parameters are publicly available. Cisco reported that these open systems usually ship with lighter built-in safety layers so that users can modify them freely. That approach shifts responsibility for maintaining safety onto whoever customizes the models.
Cisco added that Google, OpenAI, Meta, and Microsoft have tried to reduce malicious fine-tuning of their systems. Still, critics continue to blame AI firms for weak safety barriers that criminals can easily exploit.
In August, US company Anthropic admitted that criminals used its Claude model to steal personal data and demand ransoms exceeding $500,000 (€433,000).
