Rules fail at the prompt, succeed at the boundary

Prompt injection is persuasion, not a bug

The security community has been warning about this for years. Successive OWASP Top 10 reports place prompt injection, and more recently agent goal hijacking, at the top of the risk list, alongside identity and privilege abuse and the exploitation of human-agent trust: too much power in the agent, no separation between instructions and data, and no mediation of what the agent produces.

Guidance from the NCSC and CISA describes generative AI as an ongoing social-engineering and manipulation vector that must be managed across design, development, deployment, and operation, not patched with better prompts. The EU AI Act turns that lifecycle view into law for high-risk AI systems, requiring ongoing risk management, strong data governance, logging, and cybersecurity controls.

In practice, prompt injection is best understood as a persuasion channel. Attackers don't break the model; they persuade it. In the Anthropic example, operators framed each phase as part of a defensive security exercise, kept the model blind to the overall operation, and walked it, loop by loop, into performing offensive actions at machine speed.

That is not something a keyword filter or a polite "please follow these safety instructions" paragraph can reliably prevent. Research on deceptive behavior in models makes the picture worse. Anthropic's Sleeper Agents research shows that once a model has learned a backdoored, strategically deceptive behavior, standard fine-tuning and adversarial training can actually help the model hide the deception rather than unlearn it. Anyone trying to defend such a system with purely linguistic rules is playing on the attacker's home turf.

Why this is a governance problem, not a prompting problem

Regulators are not asking for the right prompts; they are asking enterprises to demonstrate control.

NIST's AI RMF emphasizes asset inventory, role definition, access control, change management, and continuous monitoring throughout the AI lifecycle. The UK AI Cybersecurity Code of Practice similarly emphasizes secure-by-design principles, treating AI like any other critical system, with clear duties for boards and system operators from conception to decommissioning.

In other words, the rules that are really needed are not "never say X" or "always respond like Y". They are questions like the following, sketched in code after the list:

  • Who is this agent acting as?
  • What devices and data can it touch?
  • Which tasks require human approval?
  • How are high-impact outputs moderated, logged and audited?
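A minimal sketch of what answering those four questions looks like when enforced at the tool boundary rather than in the prompt. Everything here is a hypothetical illustration, not an API from any framework named in this article: the names AgentPolicy, invoke_tool, and the example tools are invented for the sketch.

```python
# Hypothetical sketch: encode identity, allowed tools, approval requirements,
# and audit logging as an explicit policy enforced at the tool boundary.
import json
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("agent.audit")

@dataclass
class AgentPolicy:
    acts_as: str                                                # who the agent acts as
    allowed_tools: set[str] = field(default_factory=set)        # what it may touch
    approval_required: set[str] = field(default_factory=set)    # human-in-the-loop tasks

def invoke_tool(policy: AgentPolicy, tool: str, args: dict, approved: bool = False) -> None:
    """Enforce the policy at the tool boundary, then log the call for audit."""
    if tool not in policy.allowed_tools:
        audit_log.warning("DENIED agent=%s tool=%s args=%s", policy.acts_as, tool, json.dumps(args))
        raise PermissionError(f"{tool} is outside this agent's allowed tools")
    if tool in policy.approval_required and not approved:
        audit_log.info("PENDING_APPROVAL agent=%s tool=%s", policy.acts_as, tool)
        raise PermissionError(f"{tool} requires explicit human approval")
    audit_log.info("ALLOWED agent=%s tool=%s args=%s", policy.acts_as, tool, json.dumps(args))
    # Actual tool dispatch would happen here, behind the checks above.

# Example: a support agent that may read tickets but needs approval to issue refunds.
policy = AgentPolicy(
    acts_as="support-agent@service-account",
    allowed_tools={"read_ticket", "issue_refund"},
    approval_required={"issue_refund"},
)
invoke_tool(policy, "read_ticket", {"ticket_id": 42})      # allowed and logged
# invoke_tool(policy, "issue_refund", {"amount": 100})     # raises until approved=True
```

The point of the sketch is that none of these checks depend on what the model says; a perfectly persuasive injected instruction still cannot call a tool the policy never granted.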

Frameworks like Google's Secure AI Framework (SAIF) make this concrete. SAIF's agent permission controls are blunt: agents must operate with least privilege, use dynamically scoped permissions for sensitive tasks, and remain under explicit user control. OWASP's emerging guidance on agentic applications reflects the same stance: limit capabilities at the boundary, not at the prompt.
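To make "least privilege, dynamically scoped, under explicit user control" more tangible, here is a small sketch of one way those three properties could fit together. It is an assumption-laden illustration, not SAIF code: ScopedGrant, confirm_with_user, and the tool names are invented for this example.

```python
# Hypothetical sketch: a small baseline tool set (least privilege), a time-boxed
# per-task grant (dynamic scoping), and a user confirmation step for sensitive calls.
import time
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ScopedGrant:
    tools: frozenset      # extra tools granted for this task only
    expires_at: float     # the grant is time-boxed and expires on its own

    def allows(self, tool: str) -> bool:
        return tool in self.tools and time.time() < self.expires_at

BASELINE_TOOLS = frozenset({"search_docs"})                   # least privilege by default
SENSITIVE_TOOLS = frozenset({"send_email", "delete_record"})  # always need user sign-off

def confirm_with_user(tool: str) -> bool:
    # Placeholder for an explicit user-control step (UI prompt, approval ticket, etc.).
    return input(f"Allow the agent to call {tool}? [y/N] ").strip().lower() == "y"

def call_tool(tool: str, grant: Optional[ScopedGrant] = None) -> None:
    in_scope = tool in BASELINE_TOOLS or (grant is not None and grant.allows(tool))
    if not in_scope:
        raise PermissionError(f"{tool} is outside the current scope")
    if tool in SENSITIVE_TOOLS and not confirm_with_user(tool):
        raise PermissionError(f"user declined {tool}")
    print(f"dispatching {tool}")  # actual dispatch would go here

# Grant 'send_email' for five minutes, for this one task only.
grant = ScopedGrant(tools=frozenset({"send_email"}), expires_at=time.time() + 300)
call_tool("search_docs")           # baseline capability, always in scope
# call_tool("send_email", grant)   # in scope, but still needs explicit user confirmation
```

The design choice worth noticing is that the sensitive-tool check and the expiry are enforced outside the model's reasoning loop, which is exactly where the cited frameworks want the control to live.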
