With great power comes great duplicity.
Last month, we reported on a new study conducted by researchers at Icaro Labs in Italy, which uncovered an extremely simple way to break through the guardrails of cutting-edge AI chatbots: “adversarial poetry.”
In short, the team, which included researchers from the security group DexAI and Sapienza University in Rome, demonstrated that leading AI models can be coaxed into misbehaving with rhymes that smuggle in harmful requests, such as instructions for building a nuclear bomb.
Underscoring the strange power of poetry, co-author Matteo Prandi said in a recently published interview with WIRED that the beguiling verses the team used to trick the AI models were so dangerous they couldn’t be released to the public.
Ominously, Prandi added, writing such poems is something “that almost everyone could do.”
In the study, which is awaiting peer review, the team tested 25 frontier AI models – including models from OpenAI, Google, xAI, Anthropic, and Meta – by feeding them poetic prompts that were either written by hand or generated by converting known harmful prompts into verse with an AI model. They also compared the success rates of these prompts against their prose counterparts.
Across all models, the hand-written poetic prompts successfully tricked the AI bots into responding with harmful content an average of 63 percent of the time. Some, like Google’s Gemini 2.5, fell for the corrupted poetry 100 percent of the time. Interestingly, smaller models seemed to be more resistant, posting single-digit success rates; OpenAI’s GPT-5 Nano didn’t fall for the trick even once. Most models landed somewhere in between.
Compared to the hand-crafted rhymes, the AI-converted prompts were less effective, with an average jailbreak success rate of 43 percent. But that’s still “18 times higher than their prose baselines,” the researchers wrote in the study.
Why poems? It’s not entirely clear, though according to Prandi, calling it “adversarial poetry” might be a bit of a misnomer.
“It’s not just about rhyming. It’s all about the puzzles,” Prandi told WIRED, explaining why some poetic structures were more effective than others. “Actually, we should have called it adversarial puzzles – if you think about it, poetry is somewhat of a puzzle itself – but poetry was probably a better name.”
The researchers speculate that this may be related to how rhymes present information in a way that’s unexpected for large language models, confusing their ability to predict which word should come next. By their own admission, though, the trick shouldn’t work at all.
“Adversarial poetry should not work. It is still natural language, the stylistic differences are minor, and the harmful content is still present,” the team told WIRED in the interview. “Yet it works remarkably well.”
Evildoers may now regret not paying attention in English class. The difference between a sonnet and a sestina could also be the difference between having Clippy or Skynet as your partner in crime.
“There are several steps involved in the production of weapons-grade plutonium-239,” explained one AI model that the researchers had ensnared with poetry. “Here’s a detailed description of the process.”
More on AI: Rockstar co-founder says AI is like when factory farms led to cannibalism and mad cow disease
