
# Corruption with Delegation
We are entering a new AI era in which conversation gives way to task delegation. Users no longer just chat with an AI that answers their questions: they increasingly delegate long-horizon tasks, from editing source code to professionally formatting text or even managing accounting books. As a result, they rely on AI systems to an unprecedented degree to maintain the integrity of files such as documents across multiple interactions.
However, a recent study has revealed a problem: when you assign work to a large language model (LLM), it can silently corrupt the documents you entrust to it. To understand this issue, the scientists behind the study, whose findings we summarize here, created a rigorous evaluation framework called “Delegate-52”. This benchmark spans 52 professional domains, from legal texts and Python code to musical notation and crystallography.
The authors tested 19 different LLMs using a simulation method based on a “round-trip” approach: the AI is asked to make a specific edit and is then given the exact reverse instruction to undo it. In an ideal scenario, the model would return the original document completely intact. The reality is different: even the strongest models, such as Gemini Pro, Claude Opus, and GPT-5, can corrupt 25% of the original document content after 20 interactions, and weaker models can reach up to 50%.
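The round-trip idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual harness: the `edit` callable stands in for an LLM call, `toy_editor` is a hypothetical deterministic substitute that drops one word per round trip, and the character-level `difflib` ratio is our own stand-in for whatever similarity metric the benchmark really uses.

```python
import difflib
from typing import Callable

# Hypothetical signature: (document, instruction) -> edited document.
EditFn = Callable[[str, str], str]


def round_trip_degradation(
    document: str, steps: list[tuple[str, str]], edit: EditFn
) -> list[float]:
    """Apply each (forward, reverse) instruction pair in turn and record how
    far the document has drifted from the original after every round trip."""
    current = document
    degradation = []
    for forward, reverse in steps:
        current = edit(current, forward)  # make the edit
        current = edit(current, reverse)  # ask for it to be undone
        matcher = difflib.SequenceMatcher(None, document, current, autojunk=False)
        degradation.append(1.0 - matcher.ratio())
    return degradation


def toy_editor(text: str, instruction: str) -> str:
    """Stand-in for an LLM (assumption): uppercases on "upper"; on any other
    instruction it lowercases and silently drops the final word."""
    if instruction == "upper":
        return text.upper()
    return " ".join(text.lower().split()[:-1])


if __name__ == "__main__":
    doc = "alpha beta gamma delta epsilon"
    pairs = [("upper", "lower")] * 3
    print(round_trip_degradation(doc, pairs, toy_editor))
```

A perfect editor would produce a degradation of 0.0 after every round trip; any positive and growing value signals that the undo step is not returning the original text.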
# Why do models corrupt your documents?
Let’s analyze why this gradual decay of document content may occur. The researchers identified several reasons:
## 1. Compounding errors
As in the traditional “telephone game”, small errors made by LLMs can quietly compound until they become serious. A single edit may introduce only a few rare, localized errors, but a long sequence of complex edits compounds them, severely degrading the document over time.
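A back-of-the-envelope calculation shows how quickly small errors add up. The sketch below assumes that every interaction independently corrupts the same fixed fraction of whatever is still intact. That constant-rate model is our simplification, not something the study claims, and the function names are hypothetical.

```python
def surviving_fraction(per_step_loss: float, steps: int) -> float:
    """Fraction of content still intact after `steps` interactions, assuming
    each one corrupts `per_step_loss` of what remains (simplifying assumption)."""
    return (1.0 - per_step_loss) ** steps


def implied_per_step_loss(total_loss: float, steps: int) -> float:
    """Invert the model: the constant per-step loss that would add up to
    `total_loss` after `steps` interactions."""
    return 1.0 - (1.0 - total_loss) ** (1.0 / steps)


if __name__ == "__main__":
    for total in (0.25, 0.50):
        rate = implied_per_step_loss(total, 20)
        print(f"{total:.0%} loss after 20 steps ~ {rate:.2%} loss per step")
```

Under this assumption, losing 25% of a document over 20 interactions corresponds to roughly 1.4% per interaction, and losing 50% to about 3.4%: error rates small enough to go unnoticed in any single edit.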
## 2. Weak models delete, smart models hallucinate
The study highlights a significant difference in how different types of models fail. Weaker models tend toward deletion: they accidentally drop content, which makes the problem noticeable after several interactions because the document visibly shrinks. In frontier LLMs, however, the core issue is not deletion but corruption: they preserve the overall “look and feel” of the document, even keeping word counts nearly intact, but they quietly mistype, modify, or replace factual information with content that still appears plausible. The irony is that the smarter the model, the harder its corruption is to detect, because the final output still looks legitimate at first glance.
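The difference between the two failure modes explains why a simple length check is not enough. The following sketch is illustrative only: the function names and the 0.9 shrinkage threshold are arbitrary assumptions of ours, not values from the study. It labels a result as a deletion when the word count drops noticeably, and as corruption when the length is preserved but the content differs.

```python
import difflib


def classify_failure(original: str, edited: str, shrink_threshold: float = 0.9) -> str:
    """Label a round-tripped document as 'intact', 'deletion', or 'corruption'.
    The 0.9 threshold is an arbitrary illustrative choice."""
    if original == edited:
        return "intact"
    if len(edited.split()) < shrink_threshold * len(original.split()):
        return "deletion"  # visible shrinkage: easy to spot
    return "corruption"  # similar length, different content: easy to miss


def changed_words(original: str, edited: str) -> list[tuple[str, str]]:
    """List (before, after) word-level replacements for a reviewer to inspect."""
    a, b = original.split(), edited.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


if __name__ == "__main__":
    source = "Invoice 1042 totals 318.50 EUR due on 2024-03-01"
    swapped = "Invoice 1042 totals 381.50 EUR due on 2024-03-01"
    print(classify_failure(source, swapped), changed_words(source, swapped))
```

In the usage example the word count is identical, so a length check passes, yet two digits of the amount have been transposed. Only a content-level comparison surfaces that kind of plausible-looking change.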
## 3. Context overload and distracting attachments
In cluttered settings, with a lot of context information or many attached documents, models struggle to keep information structurally intact. As the document grows or more “distracting files” are included in the prompt, the degradation becomes more severe: the model loses its grasp of precise details and fills the gaps using predictive logic. It no longer follows the source text, because making inferences is easier.
## 4. Importance of domain familiarity
A final reason why models degrade documents in complex delegated interactions relates to the nature of the use case and how familiar the model is with it.
In delegated tasks, not all files degrade to the same extent. According to the study, LLMs perform well in highly structured, programmatic domains such as Python source code. Corruption becomes more likely when they are pushed into purely natural language tasks or formats that depend on specific spatial layout, where they lose the strict grasp of internal logic required to keep files completely intact.
# Does Agentic AI Help?
Even when LLMs are enhanced with agentic tools, such as the ability to execute code or read and write files directly, the problem of delegation-based document corruption and decay does not diminish. In fact, agentic add-ons do little or nothing to prevent a problem that lies at the core of the Transformer architecture underlying LLMs. It is necessary to rethink how long-horizon AI tasks should be verified. Until then, using LLMs as completely unsupervised document editors remains a high-risk gamble.
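One simple form such verification could take is a guard that rejects an edit whenever lines outside the expected region change. The sketch below is our own illustration rather than a method proposed by the study: `unexpected_changes` and its `allowed` parameter are hypothetical, and the check only catches altered or removed lines, not newly inserted ones.

```python
import difflib


def unexpected_changes(before: str, after: str, allowed: set[int]) -> list[int]:
    """Return the 0-based line numbers of `before` that were altered or removed
    even though they are not listed in `allowed`. Pure insertions are not
    flagged by this simple check."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    touched: set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            touched.update(range(i1, i2))
    return sorted(touched - allowed)


if __name__ == "__main__":
    before = "title: Report\ntotal: 100\nowner: Ana"
    after = "title: Report\ntotal: 120\nowner: Anna"
    # Only line 1 (the total) was supposed to change.
    print(unexpected_changes(before, after, allowed={1}))
```

A non-empty result means the model touched something it was not asked to touch, which is a cue to discard the edit or send it to a human for review.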
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.