Microsoft Research introduces CoreGen to manage multi-horizon tasks for autonomous AI agents using hierarchical planning and memory


Microsoft researchers have introduced CoreGen, an architecture-agnostic framework designed to manage the complexities of realistic organizational work through autonomous digital workers. While existing benchmarks evaluate AI agents on isolated single tasks, real-world corporate environments require managing dozens of concurrent, interconnected tasks with complex dependencies. The research team identifies this problem class as the Multi-Horizon Task Environment (MHTE).

Performance gap in MHTE

Empirical testing shows that baseline computer-using agents (CUAs) suffer significant performance degradation when moving from single-task scenarios to MHTEs. Across three independent CUA implementations, completion rates dropped from 16.7% at 25% load to 8.7% at 100% load.

The research team identified four fundamental failure modes behind this decline:

  • Context Saturation: Reference requirements grow O(n) with task count rather than O(1), rapidly exceeding token window capacity.
  • Memory Interference: When multiple tasks share the same context window, information from one task often corrupts reasoning about another.
  • Dependency Graph Complexity: Corporate workflows form directed acyclic graphs (DAGs) rather than linear chains, which requires complex topological reasoning.
  • Reprioritization Overhead: Decision complexity grows O(n) per cycle because agents must constantly re-evaluate priorities across all active tasks.
https://arxiv.org/pdf/2602.14229
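The dependency-complexity failure mode can be illustrated with a minimal sketch. The task names below are hypothetical, not from the paper; the point is that a valid execution order over a DAG must be computed topologically rather than followed as a linear chain.

```python
from graphlib import TopologicalSorter

# Hypothetical corporate task dependencies: each task maps to the set of
# tasks it depends on, forming a DAG rather than a linear chain.
deps = {
    "send_report": {"write_report"},
    "write_report": {"collect_data", "get_template"},
    "collect_data": set(),
    "get_template": set(),
}

# A valid execution order must respect every edge in the graph.
order = list(TopologicalSorter(deps).static_order())
print(order)  # e.g. ['collect_data', 'get_template', 'write_report', 'send_report']
```

With dozens of interleaved tasks, the agent must redo this kind of ordering whenever new tasks arrive or dependencies resolve, which is where the reprioritization overhead compounds.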

CoreGen architecture

To overcome these failures, CoreGen implements Multi-Objective Multi-Horizon Agent (MOMA) capabilities through four primary architectural mechanisms.

(a) hierarchical planning

Strategic coherence is maintained through goal decomposition at three temporal scales:

  • Strategic Objectives (Monthly): High-level goals and milestones based on the agent’s identity and role.
  • Tactical Plans (Daily): Actionable tasks for specific applications with priority ranking.
  • Operational Actions (Per-Cycle): Individual tool calls selected based on the current situation and retrieved memory.
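The three temporal scales above can be sketched as nested data structures. This is a minimal illustration under assumed names (`Objective`, `PlanItem`, `next_action` are not from the paper), not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Objective:          # strategic scale: monthly goals and milestones
    goal: str
    milestones: list[str]

@dataclass
class PlanItem:           # tactical scale: daily tasks per application
    task: str
    app: str
    priority: int         # lower value = higher priority

@dataclass
class Agent:
    objectives: list[Objective] = field(default_factory=list)
    daily_plan: list[PlanItem] = field(default_factory=list)

    def next_action(self) -> str:
        # operational scale: one tool call per cycle, chosen from the
        # highest-priority item in the current daily plan
        top = min(self.daily_plan, key=lambda p: p.priority)
        return f"open({top.app}) -> work_on({top.task})"

agent = Agent(
    objectives=[Objective("close Q3 reporting", ["draft", "review", "file"])],
    daily_plan=[PlanItem("draft revenue summary", "Excel", 1),
                PlanItem("answer audit email", "Outlook", 2)],
)
print(agent.next_action())  # open(Excel) -> work_on(draft revenue summary)
```

The key design idea is that each layer only reads the layer above it, so per-cycle decisions stay cheap even as the monthly objective set grows.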

(b) sub-agent separation

Complex operations like GUI automation or research are separated into modular sub-agents. These autonomous agents operate within their own context scope and return only structured results to the host agent, preventing cross-task memory contamination.
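The isolation boundary can be sketched as follows; the function and field names are illustrative assumptions. The essential property is that the sub-agent's verbose intermediate work stays in its own scope, and only a small structured result reaches the host:

```python
# Sketch of sub-agent separation: the sub-agent keeps a private scratch
# context and returns only a structured summary, so its intermediate
# steps never enter (or contaminate) the host agent's context window.

def research_subagent(query: str) -> dict:
    context = []                              # private to the sub-agent
    context.append(f"searching: {query}")
    context.append("reading and summarizing pages ...")  # verbose work
    # Only the distilled, structured result crosses the boundary:
    return {"query": query, "answer": "summary of findings", "sources": 12}

host_context = []
result = research_subagent("Q3 revenue by region")
host_context.append(result)                   # one structured entry

print(len(host_context))  # 1, regardless of sub-agent verbosity
```

This is the same pattern regardless of the sub-agent's task (GUI automation, research, etc.): context growth in the host is bounded by the number of results, not by the work done to produce them.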

(c) tiered memory architecture

The system uses a three-layer memory structure to manage state:

  • Working Memory: Holds the immediate reasoning context; this layer resets each cycle.
  • Structured Long-Term Memory (LTM): Stores typed artifacts such as plans, summaries, and reflections.
  • Semantic Memory: Uses mem0 with embeddings to support similarity-based retrieval over unstructured past context.
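A minimal sketch of the three-tier layout is below. The class and method names are assumptions, and a plain list stands in for the mem0-backed semantic tier:

```python
# Illustrative three-tier memory: working (per-cycle), structured LTM
# (typed artifacts), and semantic (embedding/text pairs for retrieval).

class TieredMemory:
    def __init__(self):
        self.working = []        # cleared at the end of every cycle
        self.ltm = {}            # typed artifacts: plans, summaries, reflections
        self.semantic = []       # (embedding, text) pairs for similarity search

    def end_cycle(self):
        self.working.clear()     # working memory resets each cycle

    def store_artifact(self, kind: str, artifact: str):
        self.ltm.setdefault(kind, []).append(artifact)

    def remember(self, embedding, text: str):
        self.semantic.append((embedding, text))

mem = TieredMemory()
mem.working.append("current tool output")
mem.store_artifact("plan", "daily plan v1")
mem.end_cycle()
print(len(mem.working), len(mem.ltm["plan"]))  # 0 1
```

The division of labor matters: transient state never survives a cycle, typed artifacts persist by kind, and only the semantic tier pays the cost of embedding-based retrieval.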

(d) adaptive summarization

To limit context growth, CoreGen employs rule-based compression. When the context exceeds 4,000 tokens, ‘important content’ (such as tool calls and state changes) is preserved verbatim, while ‘routine content’ (intermediate reasoning) is compressed into a structured summary.
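The compression rule can be sketched as a simple filter. This is an assumed implementation: a whitespace split stands in for a real tokenizer, and the event schema is invented for illustration:

```python
# Rule-based compression sketch: over a token budget, keep 'important'
# events verbatim and collapse 'routine' events into one summary entry.

IMPORTANT = {"tool_call", "state_change"}
LIMIT = 4000  # token budget from the article

def compress(events: list[dict]) -> list[dict]:
    total = sum(len(e["text"].split()) for e in events)  # crude token count
    if total <= LIMIT:
        return events                                    # under budget
    kept = [e for e in events if e["kind"] in IMPORTANT]           # verbatim
    routine = [e for e in events if e["kind"] not in IMPORTANT]
    summary = {"kind": "summary",
               "text": f"{len(routine)} routine steps compressed"}
    return kept + [summary]

events = ([{"kind": "tool_call", "text": "click submit"}] +
          [{"kind": "reasoning", "text": "step " * 500} for _ in range(20)])
out = compress(events)
print(out[0]["kind"], out[-1]["kind"])  # tool_call summary
```

Because tool calls and state changes are exactly what later steps must reference, this rule trades away only the intermediate reasoning, which is cheap to regenerate.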

Experimental results and learning

Across the three CUA backends (UFO2, OpenAI CUA, and Hierarchical), CoreGen achieved a 3.5x improvement over the baseline, reaching a 15.2% completion rate compared to 4.3% for standalone UFO2 at 100% load.

Ablation studies indicate that experiential learning provides the greatest performance gains. This mechanism distills successful task executions into canonical trajectories, which are then indexed in a FAISS database. At execution time, similar trajectories are retrieved as few-shot examples to bias action selection toward validated patterns.
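The retrieval step can be sketched as follows. Here a brute-force cosine-similarity search over NumPy arrays stands in for the FAISS index, and the trajectories and embeddings are placeholders:

```python
import numpy as np

# Stand-in for a FAISS index: stored unit-norm embeddings of canonical
# trajectories, queried by cosine similarity at execution time.

rng = np.random.default_rng(0)
trajectories = ["open Excel -> enter data -> save",
                "open Outlook -> draft reply -> send",
                "open browser -> search docs -> summarize"]
index = rng.normal(size=(len(trajectories), 8))        # placeholder embeddings
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalize rows

def retrieve(query_vec: np.ndarray, k: int = 1) -> list[str]:
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                                  # cosine similarity
    top = np.argsort(scores)[::-1][:k]                  # best matches first
    return [trajectories[i] for i in top]

# A query near the second stored embedding retrieves its trajectory,
# which would then be injected as a few-shot example for the agent.
few_shot = retrieve(index[1] + 0.01 * rng.normal(size=8))
print(few_shot)
```

The design choice is that retrieval biases rather than dictates action selection: the retrieved trajectory is presented as an example, leaving the agent free to deviate when the current situation differs.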

The research team also observed a significant discrepancy between evaluation methods. Artifact-based judging (inspecting generated files and outputs) achieved a 90% agreement rate with human labels. In contrast, trace-based LLM judging (relying on screenshots and execution logs) achieved only 40% agreement. This suggests that current benchmarks may systematically underestimate agent performance by relying on limited visual traces rather than the actual artifacts produced.

key takeaways

  • Identification of Multi-Horizon Task Environments (MHTEs): The research team defines a new class of problems called MHTEs, where agents must manage dozens of interleaved, long-horizon tasks (45+ tasks, 500–1500+ steps) in the same persistent context. This differs from traditional benchmarks that evaluate single tasks in isolation.
  • Discovery of catastrophic performance degradation: Standard computer-using agents (CUAs) experience a ‘catastrophic’ drop in performance as workload increases, with completion rates falling from 16.7% at 25% load to 8.7% at 100% load.
  • Four fundamental failure modes: Researchers identified why current agents fail under load: context saturation (O(n) context growth), memory interference (cross-task contamination), dependency complexity (managing directed acyclic graphs), and reprioritization overhead (O(n) decision complexity).
  • Architectural mitigation through CoreGen: The CoreGen framework addresses these failures through four main mechanisms: hierarchical planning for goal alignment, sub-agent separation to prevent memory contamination, tiered memory (working, structured, and semantic), and adaptive summarization to manage token limits.
  • Significant performance gains through experiential learning: Evaluation across multiple backends showed that CoreGen improves performance by up to 3.5x over the baseline. Ablation studies revealed that experiential learning (reusing verified successful trajectories) provides the largest boost among all architectural components.



Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
