Google Introduces Simula: A Reasoning-First Framework for Generating Controllable, Scalable Synthetic Datasets in Specific AI Domains

by ai-intensify

Training powerful AI models depends on a resource that is quietly running out: specialized data. While the Internet provides an effectively unlimited supply of text and images for training today’s generalist models, the next wave of AI breakthroughs – in cybersecurity, legal reasoning, healthcare, and other specialized domains – requires data that does not exist in sufficient quantities or cannot be accessed due to privacy concerns.

A team of researchers from Google and EPFL introduces Simula, a logic-driven framework for synthetic data generation and evaluation that prioritizes transparency, fine-grained control, and scalability. Unlike traditional approaches, Simula does not rely on seed data from the target distribution, hand-crafted signals, or evolutionary algorithms – it builds each dataset from first principles, treating data generation as a mechanism-design problem.

Why is synthetic data generation harder than it looks?

If you’ve worked with fine-tuning pipelines or domain-specific model training, you’ve probably run into the ‘not enough data’ wall. Manually collecting and annotating specialized datasets is costly, time-consuming, and error-prone. But the obvious solution – prompting a large language model (LLM) to generate training data – runs into its own set of problems.

Most existing synthetic data methods optimize for only a subset of the three axes the researchers use to define ‘good’ data: quality, diversity, and complexity. Quality refers to whether a data point meets specific semantic and syntactic requirements. Diversity includes both global coverage (do you have examples from the entire concept space?) and local variation (do you have many different views on each concept?). Complexity refers to how confusing, unusual, or detailed a given example is. Controlling all three simultaneously, at scale and with interpretability, is an unsolved challenge that Simula directly targets.

How Simula Works: Taxonomies, Meta-Prompts, and Dual Critics

Simula divides the generation process into four distinct, controllable steps, each targeting a specific data property.

The first step addresses global diversity using hierarchical taxonomies. Given a dataset description – say, ‘a dataset of cybersecurity threat intelligence queries’ – a model (referred to as M3) is prompted to identify the key factors of variation for that domain (for example, attack type, threat actor, vulnerability class). Each factor is then expanded into a hierarchical taxonomy tree. To reduce the risk of missing important subcategories, the system uses a best-of-n proposal strategy combined with a critic-refinement phase: the model proposes n candidate child nodes and then critiques them for completeness, soundness, and uniqueness. The resulting taxonomies act as structured sampling scaffolding – ensuring that when you generate 512,000 training examples, they actually cover the long tail of the domain rather than clustering around the most common modes.
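The best-of-n expansion with critic refinement can be sketched as follows. This is a minimal illustration, not the paper’s implementation: `llm` is a hypothetical helper that sends a prompt to a model and returns its text completion, and the prompt wording is invented for the example.

```python
from typing import Callable, List

def expand_node(llm: Callable[[str], str], domain: str, node: str, n: int = 3) -> List[str]:
    """Propose n candidate child-node lists for `node`, then ask a critic pass
    to merge them into one list that is complete, sound, and duplicate-free."""
    proposals = []
    for _ in range(n):
        out = llm(
            f"Domain: {domain}\n"
            f"Taxonomy node: {node}\n"
            "List the most important subcategories of this node, one per line."
        )
        proposals.append([line.strip() for line in out.splitlines() if line.strip()])

    # Critic refinement: one more call that reviews all proposals at once.
    critique = llm(
        f"Domain: {domain}\nNode: {node}\n"
        + "\n\n".join(f"Proposal {i}:\n" + "\n".join(p) for i, p in enumerate(proposals))
        + "\n\nMerge these proposals into one list of child nodes that is "
          "complete, sound, and free of duplicates. One per line."
    )
    return [line.strip() for line in critique.splitlines() if line.strip()]
```

Applied recursively to each returned child, this yields the per-factor taxonomy tree described above.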

The second step handles local diversity. Sampled combinations of taxonomy nodes – called ‘mixes’ – are passed to M3 to generate ‘meta prompts’. For example, the mix {house cat, poem, travel enthusiast} might yield the meta prompt ‘Write an exciting haiku about a house cat who goes on an adventure.’ To prevent mode collapse when multiple meta prompts are drawn from the same node set, Simula generates several meta prompts in a single call and sub-samples the required fraction, producing distinct instantiations rather than near-identical repeats.
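A sketch of mix sampling and meta-prompt sub-sampling, under the same caveats: the helper names and prompt text are assumptions for illustration, and `llm` stands in for a call to the generator model.

```python
import random

def sample_mix(taxonomies: dict, rng: random.Random) -> dict:
    """Pick one leaf node from each factor's taxonomy to form a 'mix'."""
    return {factor: rng.choice(leaves) for factor, leaves in taxonomies.items()}

def meta_prompts_for_mix(llm, mix: dict, n_generate: int, n_keep: int,
                         rng: random.Random) -> list:
    """Generate several meta prompts in one call, then sub-sample, so repeated
    draws from the same node set stay distinct (avoiding mode collapse)."""
    out = llm(
        "Write {} distinct instruction prompts that combine: {}. One per line."
        .format(n_generate, ", ".join(f"{k}={v}" for k, v in mix.items()))
    )
    candidates = [line.strip() for line in out.splitlines() if line.strip()]
    return rng.sample(candidates, min(n_keep, len(candidates)))
```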

The third step addresses complexity. A user-configurable fraction, c, of the meta prompts is passed through a complexification phase, in which M3 increases the complexity of the generated meta prompt and output while maintaining all other requirements. This decouples complexity control from coverage control – you can raise the difficulty range without sacrificing breadth.
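The fraction-c mechanism is straightforward to sketch (again with a hypothetical `llm` helper and invented prompt wording):

```python
import random

def complexify_fraction(llm, meta_prompts: list, c: float,
                        rng: random.Random) -> list:
    """Rewrite a random fraction `c` of meta prompts to be harder, leaving the
    rest untouched, so coverage (which prompts exist) is unaffected."""
    k = round(c * len(meta_prompts))
    chosen = set(rng.sample(range(len(meta_prompts)), k))
    return [
        llm("Rewrite this task to be substantially harder while keeping all "
            f"other requirements intact:\n{p}") if i in chosen else p
        for i, p in enumerate(meta_prompts)
    ]
```

Setting c independently of the taxonomy sampling is what lets difficulty scale without narrowing the covered concept space.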

The fourth step enforces quality through a ‘double-critique’ approach. Instead of asking the model once whether a generated answer is correct, Simula independently asks whether the answer is correct and, separately, whether it is incorrect. This dual-verification design reduces sycophancy bias – the tendency of LLMs to agree with plausible-sounding outputs – and is especially important for tasks with well-defined notions of correctness, such as multiple-choice questions or math problems.
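The double-critique check can be sketched as follows – a simplified illustration with invented prompt text; the paper’s actual critique prompts may differ:

```python
def double_critique(llm, question: str, answer: str) -> bool:
    """Keep an answer only if the model affirms correctness AND denies
    incorrectness, in two independent queries."""
    yes = llm(f"Question: {question}\nAnswer: {answer}\n"
              "Is this answer correct? Reply YES or NO.").strip().upper()
    no = llm(f"Question: {question}\nAnswer: {answer}\n"
             "Is this answer incorrect? Reply YES or NO.").strip().upper()
    # Asking in both directions counteracts the model's bias toward agreement:
    # a sycophantic model that says YES to everything fails the second check.
    return yes.startswith("YES") and no.startswith("NO")
```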

What do the experiments show?

The research team tested Simula using Gemini 2.5 Flash (non-thinking) as the teacher model and Gemma 3 4B as the student model, running 10 iterations of LoRA fine-tuning with different seeds per configuration and reporting average accuracy with 95% confidence intervals. They generated datasets of up to 512K data points across five domains: CTI-MCQ, a multiple-choice question dataset assessing understanding of CTI standards, threats, and mitigations; CTI-RCM, an open-ended generation task that requires the model to predict the Common Weakness Enumeration (CWE) category from Common Vulnerabilities and Exposures (CVE) descriptions; LEXam, covering Swiss, EU, and international law examinations in English and German; GSM8k (grade-school mathematics); and Global MMLU (mathematics, computer science, and physics in English, Korean, and Nepali).

Across all datasets and data sizes, the full Simula system – combining global diversification, local diversification, complexification, and critiquing – consistently outperforms the simple baseline configuration. In particular, the combination of global and local diversification matters: either one in isolation produced sub-optimal results, depending on the dataset and scale.

The complexity results were particularly instructive. On GSM8k, the high-complexity split achieved a 10% accuracy gain over the low-complexity split at 64K data points. But on LEXam, where the teacher model achieved only 57% accuracy, higher-complexity data actually hurt performance – showing that complex data is only beneficial if the teacher model is strong enough to generate reliable labels for it. The critic rejection rate reached 61% on LEXam, compared to only 2% for CTI-MCQ, 9% for CTI-RCM, and 9% for GSM8k, directly reflecting the teacher model’s weakness in that domain.

A separate and practically important finding is what the research team describes as the effect of the student-teacher gap on scaling laws. On CTI-RCM, the student model’s performance saturated at approximately 128K data points, after closing approximately 83% of the gap between its initial accuracy (40%) and the teacher model’s performance (70%). In contrast, GSM8k showed no such saturation: the student model’s peak performance (75%) remained far below the teacher’s (88%).
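For intuition, the reported CTI-RCM saturation level follows directly from the gap-closure arithmetic:

```python
# Where the CTI-RCM student plateaus, given the figures reported above.
student_init, teacher = 0.40, 0.70   # initial student accuracy, teacher accuracy
gap_closed = 0.83                    # fraction of the gap the student closes
saturated = student_init + gap_closed * (teacher - student_init)
print(f"student saturates near {saturated:.1%}")  # ≈ 64.9%, reached at ~128K points
```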

Evaluation, reconsidered

Beyond generation, the research team introduces two new evaluation approaches. Taxonomic coverage measures what fraction of taxonomy nodes is represented at each level of the dataset – a structured alternative to coarse embedding-based cosine-distance metrics, which fail to provide actionable insight. Calibrated complexity scoring assigns Elo ratings to individual data points by running batch-wise pairwise comparisons – a method the research team calls ‘calibrated attribute scoring’ – and aligns well with human-annotated complexity labels on the MATH dataset.
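The Elo mechanics behind the complexity scoring can be sketched with a generic Elo update – this is standard rating arithmetic, not the paper’s exact procedure, and the judge call that decides which of two data points is more complex is omitted:

```python
def elo_update(ratings: dict, a: str, b: str, a_wins: int, k: float = 32.0) -> None:
    """Standard Elo update after a pairwise comparison of data points a and b.
    `a_wins` is 1 if `a` was judged more complex than `b`, else 0."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    delta = k * (a_wins - expected_a)
    ratings[a] += delta
    ratings[b] -= delta

# Start every data point at the same rating, then run batch-wise comparisons.
ratings = {"p1": 1000.0, "p2": 1000.0}
elo_update(ratings, "p1", "p2", a_wins=1)  # p1 judged more complex than p2
```

After enough comparisons, the ratings induce a calibrated complexity ordering over the whole dataset rather than a per-example absolute score.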

One conclusion stands out: measured by taxonomic coverage, real-world reference datasets almost always cover less of the target domain than Simula-generated variants, even when embedding-based diversity metrics tell the opposite story. This highlights the limits of relying on cosine distance alone as a proxy for dataset quality.

Key takeaways

  • Simula’s reasoning-first, seed-free framework treats quality, diversity, and complexity as independent axes – enabling fine-grained synthetic dataset design without relying on manual signals, evolutionary algorithms, or seed data from the target distribution.
  • The combination of global and local diversification is critical: either component in isolation produces sub-optimal results, but together they consistently improve downstream model performance across all tested datasets and data sizes.
  • Data complexity helps model performance in most domains, but can hurt when the teacher model is weak – on LEXam, where Gemini 2.5 Flash (non-thinking) achieved only 57% accuracy, the low-complexity split outperformed the high-complexity split.
  • Measured by taxonomic coverage, real-world reference datasets almost always cover less of the target domain than Simula-generated variants, even when standard embedding-based cosine-distance metrics suggest otherwise.
  • Data scaling laws are driven by data properties, not size alone – the full Simula system reached higher downstream performance with fewer samples than the baseline approach, making it more cost-effective across the full data lifecycle despite requiring roughly 5x more inference calls per data point.

Check out the paper and technical details. Also, feel free to follow us on Twitter, and don’t forget to join our 130k+ ML subreddit and subscribe to our newsletter. You can also connect with us on Telegram.


