measuring model collapse under recursive summarization training

investigating the fundamental limits of recursive training and synthetic data proliferation in foundation model ecosystems



research motivation

As the web becomes increasingly saturated with model-generated content, model collapse—a degenerative process in which successive generations of models lose the ability to represent the diversity and fidelity of the original human data distribution—becomes a critical concern for AI sustainability.

This research focuses on the theoretical and empirical foundations of this failure mode. Specifically, we investigate the autophagous (self-consuming) loops that occur when recursive training cycles progressively replace human-written golden corpora with synthetic summaries, leading to irreversible loss of knowledge fidelity.

Status: Work in progress (active research)
Advisor: Dr. Maria Leonor Pacheco, BLAST Lab

experimental framework

controlled recursive pipelines

Designing and orchestrating a recursive training environment to isolate the causal drivers of distributional shift:

  • Synthetic-Human Hybrid Corpora: Modeling the gradual transition from human-centric to synthetic-dominant data regimes.
  • Recursive Feedback Loops: Simulating multiple generations of model training where each generation is conditioned on the outputs of its predecessor.
  • Compression-Collapse Dynamics: Analyzing how summarization, a task with inherent information loss, accelerates token-level and semantic-level collapse.
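The recursive feedback loop above can be illustrated with a toy simulation (a minimal sketch, not the project's actual pipeline): each generation's corpus is resampled with replacement from its predecessor's output, and lexical diversity shrinks generation over generation. All function names here are hypothetical.

```python
import random


def next_generation(corpus, sample_size, rng):
    """One recursive-training step, modeled as resampling with replacement:
    the new corpus is drawn entirely from the previous generation's output."""
    return [rng.choice(corpus) for _ in range(sample_size)]


def simulate_collapse(vocab_size=1000, sample_size=1000, generations=20, seed=0):
    """Track how many distinct tokens survive each generation of the
    self-consuming (autophagous) loop."""
    rng = random.Random(seed)
    # Generation 0: a fully diverse "human" corpus of distinct tokens.
    corpus = list(range(vocab_size))
    diversity = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, sample_size, rng)
        diversity.append(len(set(corpus)))
    return diversity


diversity_curve = simulate_collapse()
```

Even this trivial model shows monotone narrowing in expectation: sampling tails are dropped each round and can never be recovered, which is the irreversibility the project studies in the full training setting.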

measurement & detection

early warning signals & indicators

A core goal of this research is developing a robust measurement framework to detect the onset of collapse before major downstream failures occur. We are currently investigating:

  • Entropy Decay Analysis: Tracking the reduction in model output entropy as a proxy for distributional narrowing.
  • Lexical & Semantic Diversity: Developing metrics to quantify the “bleaching” of linguistic variety across generations.
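As a concrete starting point, the two indicators above can be operationalized with simple corpus statistics. The sketch below (illustrative only; the research metrics under development are richer than this) computes Shannon entropy of the empirical token distribution and a type-token ratio:

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution.
    Decay across generations signals distributional narrowing."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(tokens):
    """Fraction of distinct tokens: a crude lexical-diversity proxy
    that drops as linguistic variety "bleaches" out."""
    return len(set(tokens)) / len(tokens)


diverse = list(range(100))               # every token distinct
collapsed = [0] * 90 + list(range(10))   # mass concentrated on one token
```

For the diverse corpus the entropy is log2(100) bits and the type-token ratio is 1.0; the collapsed corpus scores far lower on both, which is the kind of gap an early-warning threshold would key on.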

This research represents a core interest in the sustainability and groundedness of large-scale reasoning systems, serving as a foundation for future inquiries into robust foundation model training.