The Google Health AI team has released MedASR, an open-source medical speech-to-text model that targets clinical dictation and doctor-patient conversations and is designed to plug directly into modern AI workflows.
What is MedASR and where does it fit in?
MedASR is a speech-to-text model built on the Conformer architecture and pretrained for medical dictation and transcription. It is positioned as a starting point for developers who want to create healthcare voice applications such as radiology dictation tools or visit note capture systems.
The model has 105 million parameters and accepts mono-channel audio at 16,000 Hz with 16-bit integer waveforms. It produces text output only, so its transcripts can feed directly into downstream natural language processing or generative models such as MedGemma.
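As a minimal sketch of meeting that input contract (the file names are placeholders, and librosa plus soundfile are this example's choice of tooling, not a requirement of the model), an arbitrary recording can be downmixed and resampled before transcription:

```python
# Convert arbitrary audio to MedASR's expected format:
# mono channel, 16 kHz sample rate, 16-bit integer PCM.
import librosa
import numpy as np
import soundfile as sf

# librosa downmixes to mono and resamples to 16 kHz in one call,
# returning float32 samples in [-1.0, 1.0].
waveform, sr = librosa.load("visit_recording.mp3", sr=16000, mono=True)

# Scale to the 16-bit integer range and write a PCM_16 WAV file.
pcm16 = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)
sf.write("visit_recording_16k.wav", pcm16, 16000, subtype="PCM_16")
```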
MedASR sits within the Health AI Developer Foundations portfolio alongside MedGemma, MedSigLIP, and other domain-specific medical models that share common terms of use and a consistent governance story.
Training data and domain expertise
MedASR is trained on a diverse collection of de-identified medical speech. The dataset contains approximately 5,000 hours of physician dictations and clinical conversations in radiology, internal medicine, and family medicine.
The training set combines audio segments with transcripts and metadata. Subsets of the conversational data are annotated with medical named entities, including symptoms, medications, and conditions. This gives the model strong coverage of the clinical terminology and phrase patterns that appear in routine documentation.
The model is English only, and most of the training audio comes from speakers for whom English is a first language and who grew up in the United States. The documentation notes that performance may be reduced for other speaker profiles or in noisy microphone conditions, and that fine-tuning is recommended for such settings.
Architecture and decoding
MedASR follows the Conformer encoder design. The Conformer combines convolution blocks with self-attention layers so that it can capture local acoustic patterns and long-range temporal dependencies in a single stack.
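As an illustration of that idea, here is a simplified Conformer-style block in PyTorch. This is a sketch, not MedASR's actual implementation: the dimensions and kernel size are arbitrary, and the macaron feed-forward layers of the full Conformer are omitted.

```python
# Simplified Conformer-style block: self-attention captures long-range
# context, a depthwise convolution module captures local acoustic patterns.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, d_model, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)  # pointwise expand
        self.glu = nn.GLU(dim=1)                       # gated activation
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()
        self.pw2 = nn.Conv1d(d_model, d_model, 1)      # pointwise project

    def forward(self, x):                  # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)   # -> (batch, d_model, time)
        y = self.pw2(self.act(self.bn(self.dw(self.glu(self.pw1(y))))))
        return x + y.transpose(1, 2)       # residual connection

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, time, d_model)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # global context
        x = self.conv(x)                                   # local context
        return self.out_norm(x)

out = ConformerBlock()(torch.randn(1, 100, 256))  # -> (1, 100, 256)
```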
The model is exposed as an automatic speech recognition model with a CTC-style interface. In the reference implementation, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to produce the token sequence. Decoding is greedy by default; the model can also be combined with an external 6-gram language model and a beam search of size 8 to improve the word error rate.
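The release does not spell out a decoding toolchain for the language-model path; as a hedged sketch, a KenLM-format n-gram model can be plugged into CTC beam search with the pyctcdecode library. The LM file name below is a placeholder, and the sketch assumes MedASR's tokenizer exposes a standard CTC vocabulary.

```python
# Sketch: CTC beam search with an external n-gram language model.
import librosa
import torch
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr")

# Order vocabulary tokens by ID so they line up with the logit columns.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# "medical_6gram.arpa" is a placeholder for a KenLM-format 6-gram LM.
decoder = build_ctcdecoder(labels, kenlm_model_path="medical_6gram.arpa")

waveform, _ = librosa.load("test_audio.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Beam search of size 8 over log-probabilities, as in the reported setup.
log_probs = torch.log_softmax(logits, dim=-1).cpu().numpy()
print(decoder.decode(log_probs, beam_width=8))
```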
MedASR was trained with JAX and ML Pathways on TPUv4p, TPUv5p, and TPUv5e hardware. These systems provide the scale needed for larger speech models and align with Google's broader foundation model training stack.
Performance on medical speech tasks
The main word error rate (WER) results, with greedy decoding and with the 6-gram language model, are:
- Radiology dictation: MedASR greedy 6.6%, MedASR + language model 4.6%, Gemini 2.5 Pro 10.0%, Gemini 2.5 Flash 24.4%, Whisper v3 Large 25.3%.
- General and internal medicine dictation: MedASR greedy 9.3%, MedASR + language model 6.9%, Gemini 2.5 Pro 16.4%, Gemini 2.5 Flash 27.1%, Whisper v3 Large 33.1%.
- Family medicine dictation: MedASR greedy 8.1%, MedASR + language model 5.8%, Gemini 2.5 Pro 14.6%, Gemini 2.5 Flash 19.9%, Whisper v3 Large 32.5%.
- Eye Gaze: radiologist dictation on 998 MIMIC chest X-ray images.
Developer workflow and deployment options
A minimal pipeline example is:
```python
from transformers import pipeline
import huggingface_hub

# Fetch the sample clip bundled with the model repository.
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")

# Build an ASR pipeline backed by the MedASR checkpoint.
pipe = pipeline("automatic-speech-recognition", model="google/medasr")

# Transcribe, chunking long audio into 20 s windows with 2 s strides.
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
```
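Here chunk_length_s=20 splits long recordings into 20-second windows and stride_length_s=2 overlaps neighboring windows so their transcripts can be stitched together, which is the standard Transformers mechanism for CTC transcription of audio longer than a single window.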
For more control, developers load AutoProcessor and AutoModelForCTC directly, resample the audio to 16,000 Hz with librosa, move the tensors to CUDA if available, run a forward pass, and decode the predicted token IDs with processor.batch_decode.
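A minimal sketch of that lower-level greedy path, assuming the processor follows the standard Transformers CTC calling convention:

```python
# Sketch: lower-level greedy CTC transcription with MedASR.
import librosa
import torch
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Resample to the 16 kHz mono input the model expects.
waveform, _ = librosa.load("test_audio.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits        # (batch, time, vocab)

# Greedy decoding: pick the most likely token per frame, then let the
# tokenizer collapse repeats and strip CTC blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```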
Key takeaways
- MedASR is a lightweight, open-weight, Conformer-based medical ASR model: it has 105M parameters, is trained specifically for medical dictation and transcription, and is released as an English-only model for healthcare developers under the Health AI Developer Foundations program.
- Domain-specific training on approximately 5,000 hours of de-identified medical audio: MedASR is pretrained on physician dictations and clinical conversations in specialties such as radiology, internal medicine, and family medicine, giving it stronger coverage of clinical terminology than general-purpose ASR systems.
- Competitive or better word error rates on medical dictation benchmarks: on the radiology, general and internal medicine, family medicine, and Eye Gaze datasets, MedASR with greedy or language-model decoding matches or outperforms larger general-purpose models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Whisper v3 Large in word error rate for English medical speech.
