Model Collapse
What happens when all training data is produced by AI?
Today, we consider the reality: synthetic content is flooding the internet, making it harder for new models to learn from the real world.
Studies show that models trained too heavily on their own outputs begin to lose rare information, reinforce earlier mistakes, and grow overconfident in repeated predictions.
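To make that failure mode concrete, here is a toy simulation (our own sketch, not from the video or any particular study): the "model" is just a Gaussian fit, and each generation is trained only on samples drawn from the previous generation's fit. The heavy tails of the original data, standing in for rare information, vanish almost immediately, and the fitted parameters begin to drift on resampling noise.

```python
# Toy illustration of recursive training on model outputs.
# Each generation fits a Gaussian to the previous generation's samples,
# then generates the next training set purely from that fit.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data with heavy tails (Student's t, 3 degrees of freedom).
data = rng.standard_t(df=3, size=2_000)

for gen in range(10):
    # "Train" this generation's model: fit a Gaussian to the current data.
    mu, sigma = data.mean(), data.std()
    # How much of the extreme tail (|x| > 8) is still present?
    tail = np.mean(np.abs(data) > 8)
    print(f"gen {gen}: mu={mu:+.3f}  sigma={sigma:.3f}  P(|x|>8)={tail:.4f}")
    # The next generation trains only on this model's synthetic samples.
    data = rng.normal(mu, sigma, size=2_000)
```

Run it and the tail probability drops to zero after the very first generation: the Gaussian fit simply cannot represent the rare events it was trained near, so they never reappear downstream.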
Tools like watermarking and synthetic-data reweighting help, but they aren't enough on their own. The real solution will hinge on maintaining verifiably human datasets, which may become a major industry in its own right.
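For a sense of what reweighting can mean in practice, here is a minimal, hypothetical PyTorch sketch: examples that some provenance detector flags as likely machine-generated get a smaller weight in the training loss, so presumed-human data dominates the gradient. The detector, the linear weighting rule, the 0.2 floor, and all names here are illustrative assumptions, not a method from the post.

```python
import torch
import torch.nn.functional as F

def reweighted_loss(logits: torch.Tensor,
                    labels: torch.Tensor,
                    p_synthetic: torch.Tensor) -> torch.Tensor:
    """p_synthetic: per-example probability (from some provenance
    detector) that the example is machine-generated."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    # Linear down-weighting: confident-human examples keep weight 1.0,
    # confident-synthetic examples drop to 0.2.
    weights = 1.0 - 0.8 * p_synthetic
    return (weights * per_example).sum() / weights.sum()
```

The catch, of course, is that the whole scheme is only as good as the provenance signal feeding it, which is exactly why verifiably human data matters.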
As synthetic contamination grows, we expect AI progress to slow unless we act now. Protecting high-quality, human-generated data isn't optional: it's the foundation of the next generation of models.
Here’s our video on the topic.



