MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

MedicalNarratives leverages educational medical videos to create a comprehensive dataset with spatially grounded concepts. The dataset spans multiple medical imaging modalities, including CT, MRI, X-ray, and more.

Key Takeaways

  • Created a large-scale dataset of 4.7M medical image-text pairs with 1M samples containing spatial annotations from educational medical videos
  • Developed novel methods to automatically extract and align narrators' cursor movements with speech, creating dense visual groundings
  • Dataset spans 12 medical domains including CT, MRI, X-ray, ultrasound, surgery, and more
  • Demonstrated strong performance on downstream tasks such as classification and retrieval

Abstract

We propose MedicalNarratives, a dataset curated from medical pedagogical videos, similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives: it collects grounded image-text data by curating instructors' speech and mouse cursor movements synchronized in time. MedicalNarratives enables pretraining with both semantic and dense objectives, alleviating the need to train models for medical semantic and dense tasks separately, a need that arises from the lack of reasonably sized datasets. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedCLIP, a model based on the CLIP architecture, on our dataset spanning 12 medical domains and demonstrate that it outperforms previous state-of-the-art models on a newly constructed medical imaging benchmark that comprehensively evaluates performance across all modalities.

Method Overview

We present MedicalNarratives, a framework for creating a comprehensive medical vision-language dataset by leveraging educational videos. Overall, the pipeline does the following (a minimal illustrative sketch of the alignment step follows the pipeline figure below):

  • Extracts spatially-grounded medical concepts from educational videos using narrator cursor movements
  • Processes videos to identify stable segments where medical content is being explained
  • Aligns spoken narration with visual content and cursor traces
  • Generates high-quality image-text pairs with spatial grounding

Video processing pipeline.
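
To make this concrete, the following is a minimal sketch of the alignment step: grouping ASR words and cursor points by stable video segment. The data layout (Segment, Word, and (t, x, y) cursor tuples) and the output record format are illustrative assumptions, not the released MedicalNarratives code.

# Illustrative sketch: group ASR words and cursor points into grounded
# image-text pairs, one per stable video segment. The data layout here
# is an assumption for illustration, not the released pipeline.
from dataclasses import dataclass


@dataclass
class Segment:
    start: float        # segment start time (s), where the frame is static
    end: float          # segment end time (s)
    frame_path: str     # representative frame saved for this segment


@dataclass
class Word:
    text: str
    t: float            # word onset time from ASR (s)


def align(segments: list[Segment],
          words: list[Word],
          cursor: list[tuple[float, float, float]]) -> list[dict]:
    """Attach narration and (t, x, y) cursor points to each stable segment."""
    pairs = []
    for seg in segments:
        seg_text = " ".join(w.text for w in words if seg.start <= w.t < seg.end)
        seg_trace = [(x, y) for (t, x, y) in cursor if seg.start <= t < seg.end]
        if seg_text and seg_trace:
            pairs.append({
                "image": seg.frame_path,
                "text": seg_text,
                "trace": seg_trace,   # dense grounding for the caption
            })
    return pairs


# Toy usage with made-up inputs:
segs = [Segment(0.0, 5.0, "frame_000.png")]
words = [Word("the", 1.0), Word("left", 1.2), Word("ventricle", 1.6)]
trace = [(1.3, 0.42, 0.55), (1.7, 0.44, 0.58)]
print(align(segs, words, trace))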

Dataset Overview

MedicalNarratives contains:

  • 4.7M total image-text pairs across 12 medical domains
  • 1M samples with spatial annotations from cursor traces
  • Coverage of CT, MRI, X-ray, ultrasound, surgery, and more
  • Rich metadata including UMLS entities and subdomain labels (an illustrative sample record is sketched below)

Examples of different medical imaging modalities in our dataset.
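
For concreteness, a single annotated sample might look like the record sketched below; the field names and values are illustrative assumptions rather than the released schema.

# Hypothetical example of one annotated sample; field names and values
# are illustrative, not the released MedicalNarratives schema.
sample = {
    "image": "frames/ultrasound/video_0001_t0042.png",
    "text": "the hypoechoic lesion here in the left lobe of the liver",
    "domain": "ultrasound",                 # one of the 12 medical domains
    "trace": [[0.31, 0.48], [0.33, 0.50]],  # normalized (x, y) cursor points
    "bbox": [0.28, 0.44, 0.41, 0.57],       # box derived from the trace
    "umls_entities": ["C0023884"],          # linked UMLS concept IDs (illustrative)
}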

Results

We evaluate our approach through several downstream experiments, including zero-shot classification and retrieval:

Zero-shot classification performance across different medical domains.
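
As an illustration of how a CLIP-style model trained on this data can be evaluated zero-shot, the sketch below follows the standard open_clip recipe with simple prompt templates. The checkpoint filename, architecture tag, label set, and prompt wording are assumptions for illustration, not the released GenMedCLIP configuration.

# Sketch of CLIP-style zero-shot classification. The checkpoint file,
# architecture tag, class names, and prompt template are illustrative
# assumptions, not the released GenMedCLIP configuration.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="genmedclip.pt")   # hypothetical local checkpoint
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

classes = ["pneumonia", "cardiomegaly", "no finding"]   # example label set
texts = tokenizer([f"a chest x-ray showing {c}" for c in classes])
image = preprocess(Image.open("example_cxr.png")).unsqueeze(0)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each class prompt.
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs.squeeze(0).tolist())))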

BibTeX

@article{medicalnarratives2025,
  title={MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives},
  author={Author Names},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}