Definition & Characteristics
- Foundation Models (FMs) are large neural networks pre-trained on vast, diverse datasets via self-supervision.
- They learn general-purpose patterns transferable across tasks and modalities (language, vision, multimodal).
- Often based on transformer architectures, with parameter counts reaching into the hundreds of billions.
Opportunities & Challenges in Medicine
- Opportunities: enable zero-shot and few-shot learning for imaging, text, and multimodal clinical data.
- Challenges: medical data are private, decentralized, heterogeneous, and often limited in size/type.
- Biobank sources (e.g., UK Biobank, NAKO, Million Veteran Program) offer large-scale cohorts but remain domain-specific.
FM Architecture Components
- Encoders map each modality (images, text, tabular) to embedding spaces.
- Prompt encoders (textual or spatial) guide outputs (e.g., points, boxes, masks).
- Decoders combine image embeddings and prompt embeddings into task-specific heads (minimal sketch below).
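A minimal PyTorch sketch of this encoder / prompt-encoder / decoder split; all module names, sizes, and the point-prompt format are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Maps an image to a sequence of patch embeddings (stand-in for a ViT)."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16 patches

    def forward(self, x):                      # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2)       # (B, dim, N)
        return tokens.transpose(1, 2)          # (B, N, dim)

class PromptEncoder(nn.Module):
    """Embeds spatial prompts (here: 2D points) into the same space."""
    def __init__(self, dim=256):
        super().__init__()
        self.point_embed = nn.Linear(2, dim)

    def forward(self, points):                 # points: (B, P, 2)
        return self.point_embed(points)        # (B, P, dim)

class MaskDecoder(nn.Module):
    """Fuses image and prompt tokens, then applies a task-specific head."""
    def __init__(self, dim=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 1)          # e.g. one mask logit per patch

    def forward(self, img_tokens, prompt_tokens):
        fused, _ = self.cross_attn(img_tokens, prompt_tokens, prompt_tokens)
        return self.head(fused).squeeze(-1)    # (B, N) patch-level logits

img = torch.randn(2, 3, 224, 224)
pts = torch.rand(2, 1, 2)                      # one point prompt per image
logits = MaskDecoder()(ImageEncoder()(img), PromptEncoder()(pts))
print(logits.shape)                            # torch.Size([2, 196])
```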
Vision Encoders & Self-Supervision
- Vision Transformers (ViTs) tokenize images into patches, add positional embeddings, and process the token sequence with multi-head self-attention.
- Masked Autoencoders (MAE): the encoder sees only ~25% of patches and a lightweight decoder reconstructs the masked ones; efficient pre-training that improves downstream generalization (see the masking sketch after this list).
- DINO / DINOv2: self-distillation without labels; a student/teacher paradigm trained on global and local crops yields robust, transferable features.
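The sketch below illustrates the MAE idea mentioned above: patchify an image, keep a random ~25% of patch tokens for the encoder, and (in a full implementation) reconstruct the rest with a lightweight decoder. The linear embedding and helper names are simplifying assumptions, not the original MAE code.

```python
import torch
import torch.nn as nn

def patchify(imgs, p=16):
    """(B, 3, H, W) -> (B, N, p*p*3) non-overlapping patches."""
    B, C, H, W = imgs.shape
    x = imgs.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

def random_masking(tokens, keep_ratio=0.25):
    """Keep a random subset of patch tokens; return kept tokens and their indices."""
    B, N, D = tokens.shape
    n_keep = int(N * keep_ratio)
    idx = torch.rand(B, N).argsort(dim=1)             # random permutation per image
    keep = idx[:, :n_keep]
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep

imgs = torch.randn(4, 3, 224, 224)
patches = patchify(imgs)                              # (4, 196, 768)
embed = nn.Linear(patches.shape[-1], 256)
visible, keep = random_masking(embed(patches))        # encoder input: (4, 49, 256)
print(visible.shape)

# In MAE proper, the encoder and decoder are ViT blocks, and the loss is the
# mean-squared error on the reconstructed *masked* patches only.
```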
Multimodal Pre-training
- CLIP: dual-encoder contrastive learning on 400 M image–text pairs; enables zero-shot classification via text prompts (see the contrastive-loss sketch after this list).
- FLAVA: joint vision-language alignment with unimodal + multimodal encoders on 70 M pairs.
- Frozen LLMs: map visual inputs into LLM embedding space for few-shot captioning & reasoning without updating LLM weights.
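A minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over the image–text similarity matrix; the random embeddings below stand in for real image- and text-encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs lie on the diagonal; contrast in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

img_emb = torch.randn(8, 512)   # stand-ins for image-encoder outputs
txt_emb = torch.randn(8, 512)   # stand-ins for text-encoder outputs
print(clip_contrastive_loss(img_emb, txt_emb).item())
```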
Dense Prediction & Promptable Segmentation
- SAM (Segment Anything Model): promptable mask generator trained on the SA-1B dataset (~1.1 B masks); supports point, box, and (experimental) text prompts for zero-shot segmentation (point-prompt example after this list).
- MedSAM (Medical SAM): SAM adapted on 1.57 M medical image–mask pairs spanning 10 imaging modalities and 30+ cancer types.
- DiENTeS: prompt-conditioned dynamic segmentation using local-global transformers for limited-data tasks.
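A point-prompt sketch using the public segment-anything package (SamPredictor). The checkpoint must be downloaded separately and the zero image is a placeholder for a real RGB input; treat this as an illustrative usage pattern, not a medical pipeline.

```python
# pip install segment-anything; download a SAM checkpoint (e.g. ViT-B) first.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # local checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)    # replace with an RGB image (H, W, 3)
predictor.set_image(image)                         # computes the image embedding once

# One foreground click (label 1) at pixel (x=256, y=256); box prompts work the
# same way via the box= argument.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,                         # return 3 candidate masks
)
best_mask = masks[np.argmax(scores)]               # (H, W) boolean mask
```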
Medical Foundation Model Examples
- DinoBloom: DINOv2-based FM for hematology single-cell images (380 k cells).
- OpenMind: self-supervised FM for 3D segmentation, pre-trained on 114 k head-and-neck MRI volumes.
- BiomedParse: joint segmentation/detection across nine imaging modalities, with meta-object classification and GPT-4-based text mapping.
- MedGemini: multimodal embedding fusion with web-search & chain-of-reasoning prompting.
- VoxelPrompt: vision-language agent generating executable instructions for grounded medical image analysis.
Adaptation Strategies
- FM adaptation avoids costly re-training by leveraging pre-trained weights for downstream tasks.
- Key approaches: Fine-tuning, Few-shot (in-context) learning, Instruction tuning, and hybrid methods.
Fine-tuning vs. Transfer Learning
- Transfer Learning: freeze all pre-trained layers; train only new task-specific heads.
- Fine-tuning: unfreeze selected pre-trained layers (often the later blocks) to adapt representations; balances task specificity against the risk of catastrophic forgetting.
- Layer selection depends on data volume and domain gap (see the sketch below):
  - Similar domain & small dataset → fine-tune only the last layers
  - Different modality or large domain gap → unfreeze more layers or fine-tune fully
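A short sketch of this freeze/unfreeze choice on a torchvision ViT backbone; the head size, number of unfrozen blocks, and learning rate are illustrative, not recommendations for any particular dataset.

```python
import torch
from torchvision.models import vit_b_16

model = vit_b_16(weights="IMAGENET1K_V1")          # downloads pre-trained weights
model.heads = torch.nn.Linear(768, 5)              # new task-specific head (e.g. 5 classes)

# Transfer learning: freeze every pre-trained layer, train only the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.heads.parameters():
    p.requires_grad = True

# Fine-tuning: additionally unfreeze the last k transformer blocks.
k = 2
for block in model.encoder.layers[-k:]:
    for p in block.parameters():
        p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```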
Few-Shot (In-Context) Learning
- Definition: model adapts to new tasks by conditioning on K examples in the input prompt, requiring only forward passes.
- Advantages: interpretable, no retraining, real-time adaptation, minimal labeling.
- CLIP Example: convert class names into prompts (“A photo of a tumor”) and compute image–text similarity for zero-/few-shot classification.
- Prompt Tuning & Adapters: learn soft prompt vectors or small bottleneck layers to steer FM without full weight updates.
- LoRA (Low-Rank Adaptation): inject trainable low-rank updates into frozen weight matrices; scales efficiently to very large models (demonstrated on GPT-3) with performance close to full fine-tuning (see the sketch after this list).
- FSEFT: few-shot efficient fine-tuning for volumetric organ segmentation using spatial box adapters and anatomical priors.
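A minimal LoRA layer sketch: the pre-trained linear weight stays frozen while a trainable low-rank update B·A (rank r much smaller than the feature dimension), scaled by alpha/r, is added to its output. This is simplified for clarity; libraries such as Hugging Face's peft provide full implementations.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        # W x + (alpha/r) * B A x  -- only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 16, 768))              # (batch, tokens, features)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                       # 12,288 trainable vs. ~590k frozen params
```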
Instruction Tuning
- Definition: fine-tune the FM on diverse instruction–response pairs so its outputs align with user queries (data-formatting sketch below).
- Benefits: enhances alignment, supports human-in-the-loop, improves multi-task generalization.
- Limitations: costly to collect high-quality instructions, potential bias amplification, privacy/misinformation risks if data are flawed.
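A data-formatting sketch for instruction tuning as supervised fine-tuning: each instruction–response pair is templated into a prompt, and the labels mask the prompt tokens so the loss covers only the response. The template and tokenizer choice are illustrative assumptions for a Hugging Face-style causal LM.

```python
from dataclasses import dataclass
from transformers import AutoTokenizer

TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

@dataclass
class Example:
    instruction: str
    response: str

def build_training_example(ex, tokenizer, ignore_index=-100):
    prompt_ids = tokenizer.encode(TEMPLATE.format(instruction=ex.instruction),
                                  add_special_tokens=False)
    response_ids = tokenizer.encode(ex.response, add_special_tokens=False)
    response_ids += [tokenizer.eos_token_id]
    input_ids = prompt_ids + response_ids
    # Mask the prompt so the cross-entropy loss is computed on response tokens only.
    labels = [ignore_index] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal-LM tokenizer works here
ex = Example("Summarize the key finding of the radiology report.", "No acute abnormality.")
print(build_training_example(ex, tokenizer))
```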