Liquid biopsy unlocks non-invasive cancer screening and monitoring by analyzing cancer biomarkers in the blood, but the signals can be sparse and noisy. Exai Bio has pioneered AI-powered liquid biopsy using novel small RNA biomarkers. In recent work, Exai-1 and Orion, two new generative AI models for cell-free RNA, make breakthroughs in signal denoising and early cancer detection. These advancements were made possible by Databricks’ Lakehouse architecture and cloud AI infrastructure. By integrating large genomic datasets and providing managed ML tools (MLflow, Workflows, scalable clusters), Databricks enables Exai researchers to train large multimodal models on thousands of patient samples. In this joint post, we highlight Exai Bio’s technological breakthroughs and show how Databricks’ Lakehouse and MLOps ecosystem accelerates cutting-edge biomedical AI.
Multimodal Foundation Model for Liquid Biopsy
Exai Bio’s latest research introduces large generative models tailored to liquid biopsy data. These models integrate sequence information, molecular abundance, and rich metadata to learn high-quality representations of cancer-associated RNAs.
- Exai-1 (cfRNA Foundation Model): A Transformer-based variational autoencoder that combines RNA sequence embeddings with cell-free RNA (cfRNA) abundance profiles. Exai-1 is pre-trained on a large-scale dataset (over 306 billion sequence tokens from 13,014 blood samples), learning the biologically meaningful latent structure of cfRNA expression. By leveraging both sequences (via embeddings from the RNA-FM language model) and expression data, Exai-1 “enhances signal fidelity, reduces technical noise, and improves disease detection by generating synthetic cfRNA profiles”. In practice, Exai-1 can denoise sparse cfRNA measurements and even augment datasets: classifiers trained on Exai-1’s reconstructed profiles consistently outperform classifiers trained on raw data. This generative transfer-learning approach effectively creates a foundation model for any cfRNA-based diagnostic task, for example using similar pre-trained embeddings to detect other cancers or new biomarkers.
- Orion (oncRNA Generative Classifier): A specialized variational autoencoder (VAE) for circulating orphan non-coding RNAs (oncRNAs), which are small RNAs secreted by tumors. Orion has a twin-VAE architecture: it takes as input a count vector of cancer-associated oncRNAs and a vector of control RNAs (such as endogenous housekeeping RNAs). Each input feeds a separate encoder; their outputs allow a robust classifier to be trained and the underlying oncRNA distribution to be reconstructed. Importantly, Orion’s training incorporates adversarial and classification losses: a triplet margin loss pulls together samples with the same phenotype (cancer vs. control) and pushes apart samples with different phenotypes, removing batch effects and technical variation. The learned embeddings are used by downstream classifiers to predict the presence of cancer. On a cohort of 1,050 lung cancer patients and controls, Orion achieved 94% sensitivity at 87% specificity for detecting NSCLC across all stages, and performed ~30% better than standard methods on pooled data. This generative, semi-supervised model automatically maps cfRNA signals and generates a compact cancer-specific fingerprint, enabling more accurate early detection than previous tests.
Figure 1: Architecture of Exai Bio’s Orion model for liquid biopsy. Image from Karimzadeh et al., Nat Commun.
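To make the margin-based loss described for Orion more concrete, here is a minimal sketch of the standard triplet margin formulation, max(d(anchor, positive) - d(anchor, negative) + margin, 0), in plain Python. It is not Orion’s actual training code: the two-dimensional embeddings, the sample names, and the margin of 1.0 are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss for a single (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it grows with the violation.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


# Hypothetical 2-D embeddings (assumed values, for illustration only).
cancer_batch_1 = [1.0, 0.0]
cancer_batch_2 = [1.0, 1.0]   # same phenotype, different batch
control_sample = [4.0, 0.0]   # different phenotype

# Same-phenotype samples are already closer than the control: no penalty.
well_separated = triplet_margin_loss(cancer_batch_1, cancer_batch_2, control_sample)

# If the roles are swapped, the triplet violates the margin and is penalized.
violating = triplet_margin_loss(cancer_batch_1, control_sample, cancer_batch_2)

print(well_separated, violating)
```

Minimizing this quantity over many triplets encourages embeddings in which phenotype, rather than batch, determines which samples sit close together.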
Together, these models form a scalable AI framework for liquid biopsy. Exai-1 provides a general-purpose cfRNA “language model” that can generate realistic RNA profiles and improve downstream classifiers. Orion optimizes this approach for the specific problem of lung cancer screening. In both cases, the models generalize across different conditions: Exai-1 “facilitates cross-biofluid translation and assay compatibility” by separating genuine biological signals from confounders. The result is a new generation of AI tools that can mine subtle cfRNA biomarker patterns for early cancer detection and biomarker discovery.
Databricks Data Intelligence and AI Platform: Enabling Infrastructure
These AI breakthroughs are powered by Databricks’ integrated data analytics platform. Key capabilities include:
- Integrated Lakehouse (Delta) Storage: We store all metadata (sample information, lab and experiment data) in Databricks Delta tables. This single Lakehouse prevents data silos and enables real-time analysis. As Databricks Healthcare Solutions notes, the Lakehouse “brings together patient, research and operational data at scale” and eliminates legacy silos, making genomic and clinical data instantly queryable. For example, Exai’s 13,000+ blood samples (in serum and plasma) and over 10,000 prior small-RNA-seq datasets are all registered in Delta tables, which can be rapidly filtered and combined for model training.
- Scalable Compute and Clusters: Databricks’ cloud-native clusters let researchers spin up GPU or high-memory instances without intensive DevOps effort. Databricks allows us to move faster. Cluster management is intuitive, and features like auto-termination and cost dashboards keep budgets under control. This on-demand scaling enabled the optimization and training of Exai-1 and Orion on hundreds of CPU cores/GPUs. Databricks Workflows (formerly Jobs) orchestrates compute: researchers can launch multi-stage ETL and training pipelines with defined dependencies, parallelizing tasks without writing complex orchestration code.
- MLflow for MLOps: Every experiment run (hyperparameters, datasets, metrics, artifacts) is tracked in MLflow, which is tightly integrated into Databricks. Databricks provides the full MLflow environment, including the tracking server, with no setup required. MLflow’s experiment tracking and model registry ensure reproducibility and collaboration. With managed MLflow, logging metrics and artifacts from tens of models made it practical to run ablation studies and to tune the features that improve different aspects of model performance.
- Reproducible Environments: Databricks Container Services and Git-based repos (with CI/CD) lock down software dependencies for each pipeline. This has been critical to Exai Bio’s research stack (including custom bioinformatics tools), ensuring that every member of the team runs models in the same environment. In short, Databricks provides a turnkey MLOps platform: data ingestion with Spark, experiment tracking with MLflow, orchestration with Jobs/Workflows, and elastic compute with auto-scaling.
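As a rough illustration of what “pipelines with defined dependencies” means for parallelism, the sketch below groups tasks into stages using Python’s standard-library `graphlib`. It does not use the Databricks Workflows API; the task names and the `execution_stages` helper are hypothetical and exist only for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_samples": set(),
    "ingest_metadata": set(),
    "build_features": {"ingest_samples", "ingest_metadata"},
    "train_model": {"build_features"},
    "evaluate_model": {"train_model"},
}


def execution_stages(dependencies):
    """Group tasks into stages; tasks in the same stage can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(pipeline)
for number, tasks in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(tasks)}")
```

Here the two ingestion tasks land in the same stage because neither depends on the other, which is the kind of parallelism an orchestrator can exploit once dependencies are declared explicitly.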
Impact on cancer detection and biomarker discovery
The combined scientific and engineering progress has several major implications:
- Better Early Detection – By amplifying the cfRNA cancer signal against the background of blood RNA molecules, our AI models can detect cancer at an early stage. Exai-1’s denoising yields clear signals even in small-volume blood samples, while Orion’s generative embeddings achieve high sensitivity (94%) for early-stage lung cancer. Such improvements could translate into more reliable screening tests (e.g. annual blood tests) that catch tumors at treatable stages.
- New Biomarker Insights – The models learn from raw RNA data, reducing the bias of targeted panels. For example, Orion identified hundreds of novel oncRNAs from TCGA and tissue data, then validated their significance in blood. The latent space of Exai-1 combines RNA sequence, structure, and abundance information that may uncover previously overlooked biomarkers. Importantly, the transfer-learning paradigm enables us to quickly incorporate new discoveries (for example, swapping in new sequence tokens) and fine-tune on the unified platform.
- Generative Data Augmentation – Exai-1 can simulate realistic cfRNA profiles by sampling from its decoder. This synthetic data boosts classifier training, as shown by higher AUCs when using Exai-1 reconstructions. In practice, this means that rare cancer signatures can be learned more robustly despite limited real samples. In other words, the foundation model mitigates data scarcity, an important factor because “detecting rare cancers requires foundational models and adequate training data”.
- Scalable Research Collaboration – By building on Databricks, Exai’s multidisciplinary team (biologists, bioinformaticians, biostatisticians, ML scientists and data engineers) can collaborate seamlessly. Data scientists run PyTorch and Spark side by side; biostatisticians query cohorts with R; biologists log newly processed samples, and reports and dashboards are refreshed automatically. This rapid feedback loop has allowed the Exai team to demonstrate applications of their liquid biopsy and AI system in multiple cancer types, resulting in seven conference publications in 18 months. It exemplifies how enterprise-grade AI infrastructure accelerates life-sciences R&D.
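Figures such as “94% sensitivity at 87% specificity” are typically obtained by fixing a score threshold on control samples and then measuring how many cancer samples exceed it. The sketch below shows that calculation on a small set of made-up classifier scores; the numbers, the 80% target, and the function names are assumptions for illustration and are not Orion results.

```python
import math


def threshold_for_specificity(control_scores, target_specificity):
    """Smallest control score t such that at least the target fraction of
    controls score at or below t (samples are called positive if score > t)."""
    ordered = sorted(control_scores)
    needed = max(math.ceil(target_specificity * len(ordered)), 1)
    return ordered[needed - 1]


def specificity(control_scores, threshold):
    """Fraction of controls correctly called negative (score <= threshold)."""
    return sum(s <= threshold for s in control_scores) / len(control_scores)


def sensitivity(case_scores, threshold):
    """Fraction of cancer cases correctly called positive (score > threshold)."""
    return sum(s > threshold for s in case_scores) / len(case_scores)


# Made-up classifier scores (higher means more cancer-like).
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.70]
cases = [0.30, 0.45, 0.55, 0.65, 0.80, 0.90, 0.95, 0.35, 0.75, 0.85]

cutoff = threshold_for_specificity(controls, target_specificity=0.8)
achieved_specificity = specificity(controls, cutoff)
achieved_sensitivity = sensitivity(cases, cutoff)

print(f"cutoff={cutoff}, specificity={achieved_specificity}, "
      f"sensitivity={achieved_sensitivity}")
```

Reporting sensitivity at a fixed specificity makes models comparable, because each one is held to the same false-positive budget before its detection rate is measured.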
Looking Ahead
The collaboration between Exai Bio and Databricks shows how state-of-the-art AI models and modern cloud architecture together push the boundaries of cancer diagnosis. Exai Bio’s foundation and generative AI models (Exai-1 and Orion) demonstrate that deep generative learning can extract powerful signals from liquid biopsies. Behind these advances are Databricks’ Lakehouse, which integrates heterogeneous biomedical data, and its managed ML tools (MLflow, Workflows, Pipelines), which make large-scale experiments practical and reproducible. Looking ahead, we will continue to refine our models and pipelines. Together, Exai Bio and Databricks are building the foundation for AI-powered precision oncology that is scalable and clinically effective.
Sources: Exai Bio et al., “A multi-modal cfRNA language model for liquid biopsy” (Nature Machine Intelligence, 2025); Exai Bio et al., “Deep generative AI models analyze circulating orphan non-coding RNAs…” (Nat Commun, 2024); Databricks documentation and blog.
