Lost in Plot - Contrastive Learning for Movie Retrieval
Dense retrieval system for tip-of-the-tongue movie search from vague descriptions
Overview
Developed a dense retrieval system for “tip-of-the-tongue” movie search, enabling users to find films from vague, fragmentary natural language descriptions using contrastive learning.
Course: Natural Language Processing, University of Colorado Boulder
Status: Completed
Problem Statement
Humans naturally recall movies through fragments—bits of plot, emotion, or striking visuals—rather than exact titles. We remember “that comic-horror about a haunted house at Christmas” or “the one where Brad Pitt plays Death,” not precise keywords. This creates a fundamental gap:
The Challenge:
- Traditional keyword-based search fails with vague queries
- Generative models (LLMs) hallucinate plausible but incorrect titles
- Need semantic understanding to bridge natural descriptions and movie metadata
Research Questions:
- How effective is contrastive learning-based dense retrieval for retrieving movies from vague user descriptions?
- How does a fine-tuned model compare to few-shot prompting with GPT-4 and to a vanilla BERT encoder?
Approach
System Architecture
Base Model: Pre-trained BERT encoder with contrastive multi-task learning
Components:
- Encoder: BERT produces contextualized embeddings with mean pooling
- Projection Layer: Linear projection to lower-dimensional retrieval space with ℓ2-normalization
- Classification Heads: Three auxiliary tasks for genre, decade, and theme prediction
Mathematical Formulation:
h = MeanPool(BERT(text, mask)) ∈ R^H
ẑ = normalize(W_proj · h) // retrieval embedding
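The two formulas above can be sketched in NumPy. This is a minimal stand-in (the real system runs a pre-trained BERT to get the token embeddings; here they are random, and the function names are illustrative):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, counting only non-padding positions."""
    mask = attention_mask[..., None].astype(float)   # (B, T, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (B, H)
    counts = mask.sum(axis=1).clip(min=1e-9)         # (B, 1)
    return summed / counts

def project_and_normalize(h, W_proj):
    """Linear projection to the retrieval space, then l2-normalization."""
    z = h @ W_proj.T                                 # (B, D)
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Toy example: batch of 2, 4 tokens, hidden size 8, retrieval dim 3
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])       # second row is shorter
W = rng.normal(size=(3, 8))
z = project_and_normalize(mean_pool(tokens, mask), W)
print(z.shape)  # (2, 3), each row unit-norm
```

Masking before pooling matters: averaging over padding tokens would dilute short descriptions.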
Data Preparation
Movie Metadata (TMDb/IMDb):
- Combined plot summaries, titles, and keywords
- Multi-hot encoding for genres
- Decade mapping for release years
- Latent themes extracted via BERTopic clustering
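The genre and decade encodings above are straightforward; a sketch (the genre vocabulary here is hypothetical, not the actual TMDb list):

```python
GENRES = ["Action", "Comedy", "Drama", "Horror", "Sci-Fi"]  # illustrative vocabulary

def multi_hot_genres(movie_genres, vocab=GENRES):
    """Multi-hot vector: 1 at each position whose genre the movie carries."""
    return [1 if g in movie_genres else 0 for g in vocab]

def decade_label(year):
    """Map a release year to a decade bucket, e.g. 1994 -> '1990s'."""
    return f"{(year // 10) * 10}s"

print(multi_hot_genres({"Comedy", "Horror"}))  # [0, 1, 0, 1, 0]
print(decade_label(1994))                      # 1990s
```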
Synthetic Dataset:
- Created 3,000 vague movie descriptions using GPT-4 few-shot prompting
- Indexed 100K+ movies in FAISS for efficient retrieval
- Evaluation queries mimic real “tip-of-the-tongue” scenarios
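Because the embeddings are ℓ2-normalized, retrieval reduces to an inner-product (cosine) search over the index. A brute-force NumPy equivalent of that search (the project used FAISS for the same computation at 100K+ scale; this sketch just makes the geometry explicit):

```python
import numpy as np

def build_index(movie_embeddings):
    """l2-normalize rows so that inner product equals cosine similarity."""
    norms = np.linalg.norm(movie_embeddings, axis=1, keepdims=True)
    return movie_embeddings / norms

def search(index, query, k=5):
    """Return indices of the top-k most similar movies to the query."""
    q = query / np.linalg.norm(query)
    scores = index @ q                 # cosine similarity to every movie
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(1)
index = build_index(rng.normal(size=(100, 16)))      # 100 toy "movies"
query = index[42] + 0.01 * rng.normal(size=16)       # query near movie 42
print(search(index, query, k=3))                     # movie 42 ranks first
```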
Anchor-Positive Sampling
Training pairs (aᵢ, pᵢ) where:
- Anchor (aᵢ): Movie metadata (plot, title, keywords) from a specific theme
- Positive (pᵢ): Different movie sharing same BERTopic theme
- Implicit Negatives: All other examples in the batch
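The sampling scheme above can be sketched as follows (titles and theme IDs are made up for illustration):

```python
import random
from collections import defaultdict

def sample_pair(movies, rng):
    """Pick an anchor, then a *different* movie sharing its theme.
    `movies` is a list of (title, theme_id) pairs, themes from BERTopic."""
    by_theme = defaultdict(list)
    for title, theme in movies:
        by_theme[theme].append(title)
    # only themes with at least two movies can yield a positive
    eligible = [t for t, ms in by_theme.items() if len(ms) >= 2]
    theme = rng.choice(eligible)
    anchor, positive = rng.sample(by_theme[theme], 2)
    return anchor, positive  # other batch items act as implicit negatives

movies = [("Alien", 0), ("The Thing", 0), ("Clue", 1), ("Knives Out", 1)]
a, p = sample_pair(movies, random.Random(0))
print(a, p)
```

Themes with a single movie are skipped entirely, which hints at the sparsity problem discussed in the results: rare themes contribute no training pairs at all.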
Multi-Task Loss Function
InfoNCE Contrastive Loss:
L_retr = -1/N Σᵢ log[exp(ẑᵃᵢ · ẑᵖᵢ/τ) / Σⱼ exp(ẑᵃᵢ · ẑᵖⱼ/τ)]
Auxiliary Classification Losses:
- Genre prediction (multi-label binary cross-entropy)
- Decade prediction (cross-entropy)
- Theme prediction (cross-entropy)
Combined Objective:
L = w_retr·L_retr + w_genre·L_genre + w_year·L_year + w_theme·L_theme
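The InfoNCE term, the core of the combined objective, can be sketched in NumPy. In-batch: each anchor's positive sits on the matrix diagonal, and every other positive in the batch serves as a negative (temperature value here is illustrative):

```python
import numpy as np

def info_nce(z_a, z_p, tau=0.07):
    """In-batch InfoNCE over l2-normalized anchor/positive embeddings."""
    logits = (z_a @ z_p.T) / tau                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = matching pairs

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
aligned = info_nce(z, z)                         # anchors equal positives
shuffled = info_nce(z, np.roll(z, 1, axis=0))    # mismatched positives
print(aligned < shuffled)                        # True: alignment lowers loss
```

The auxiliary genre/decade/theme terms are standard (binary) cross-entropies on the classification heads, added with the scalar weights w_genre, w_year, w_theme from the combined objective above.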
Training Details
- Optimizer: AdamW with weight decay
- Batch size: 16
- Learning rate: 2×10⁻⁵
- Scheduler: Cosine decay with 10% warm-up
- Epochs: 8 (early stopping on MRR)
- Evaluation: FAISS index of 100K movies, measured Recall@1/5/10/25 and MRR
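The schedule above (10% linear warm-up, then cosine decay) corresponds to a learning-rate curve like this; a self-contained sketch rather than the exact library call used in training:

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warm-up for the first 10% of steps, cosine decay to 0 after."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total), lr_at(100, total), lr_at(1000, total))
# ramps 0 -> 2e-5 over the first 100 steps, then decays back to 0
```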
Results
Performance Comparison
Key Findings:
- vs. GPT-4 Few-Shot: Contrastive model significantly outperformed on Recall@K and MRR
- vs. Vanilla BERT: Fine-tuned model fell slightly short of base encoder performance
Metrics: Recall@1, Recall@5, Recall@10, Recall@25, Mean Reciprocal Rank (MRR)
Why It Beats GPT-4 Few-Shot
Reliability Advantage:
- GPT-4 frequently hallucinates plausible but incorrect titles
- Dense retrieval grounds answers in pre-computed embeddings
- Cosine similarity over indexed movies ensures factual retrieval
- No generation means no fabrication
Why It Lags Behind Vanilla BERT
Analysis of Performance Gap:
- Large Theme Space:
- Too many themes introduced noisy positives
- Some themes had only a handful of movies (under-represented)
- Coarse Theme Granularity:
- Movies sharing a high-level topic can still be semantically distant
- Example: two “sci-fi” films with very different plots were treated as similar
- Weak Positive Sampling:
- Relied only on shared themes
- Ignored stronger signals: overlapping keywords, cast, directors
- Competing Classification Heads:
- Multiple auxiliary objectives pulled model in different directions
- Genre, decade, and theme heads potentially overwhelmed core retrieval signal
- Balancing multi-task losses proved delicate
Technical Contributions
- Dense Retrieval Framework: Contrastive learning system aligning vague descriptions with movie metadata in a shared semantic space
- Synthetic Evaluation Dataset: 3,000+ “tip-of-the-tongue” queries mimicking real user search patterns
- Multi-Task Architecture: Combined retrieval and classification objectives for semantic understanding
- Large-Scale Indexing: FAISS-based system for efficient retrieval over 100K+ movies
- Comparative Analysis: Systematic evaluation against strong baselines (GPT-4, vanilla BERT)
Insights and Future Directions
Key Takeaways
Contrastive Learning Promise:
- Modest fine-tuning yields practical benefits over generative approaches
- Reliable retrieval without hallucination risk
- Scalable to large movie databases
Challenges Identified:
- Theme-based positives insufficient for strong semantic signal
- Multi-task loss balancing critical but difficult
- Need for harder negative mining strategies
Future Work
Methodological Improvements:
- Hard Negative Mining: More challenging negatives during training
- Partial Fine-Tuning: Only tune top transformer layers to preserve general knowledge
- Richer Metadata Signals: Incorporate cast, directors, and cinematographers into the contrastive objective
- Adaptive Loss Weighting: Dynamic balancing of multi-task objectives
Evaluation Enhancements:
- Test on real user queries (not just synthetic descriptions)
- User studies measuring practical search effectiveness
- Cross-dataset generalization evaluation
Broader Applications:
- Extend to music retrieval (“that song about flying and loss”)
- Photo search from vague descriptions (“waterfall at sunrise”)
- General “tip-of-the-tongue” information retrieval
Impact
This work demonstrates that contrastive learning can bridge the gap between natural human descriptions and structured metadata, enabling more intuitive search systems. While challenges remain, the approach shows clear advantages over generative methods in factual grounding and retrieval reliability.
The system addresses a common real-world problem—finding content from imperfect memory—and provides a foundation for semantic retrieval in domains where vague queries are the norm.
This project was completed as part of the Natural Language Processing course at the University of Colorado Boulder.