Recursive Decay: How AI Could Kill the Internet

The inevitable issues with synthetic data and recursive training in large models, and how human-only data sanctuaries, licensed corpora, human-in-the-loop methods, sandboxed synthetic data, and smaller models could stave off the death of the web as we know it.

The internet won’t get switched off by Sam Altman and Mark Zuckerberg using all their might to flip some comically large light switch hidden in an underground lab beneath Silicon Valley. But many think it will die. The dead internet theory has been around since at least 2021, when it was highlighted in an essay for The Atlantic. The theory posits that the internet will be “dead” when bots outnumber people and AI-generated content outnumbers human-made content. If you frequent TikTok, your For You page may be dotted with “reporters” whose lips don’t quite match what their monotonous voices are saying. If you click on the comments, you’ll find a crowd of User214304s posting strings of random emojis, shameless pleas to follow their pages, or pitches to buy from their shops. These findings are harbingers of the dying internet to come.

Content generated by machines rather than people is called synthetic data, and it makes up one part of the problem on the road to a dead internet. Another part is recursive learning: AI models retraining on their own or other models’ synthetic output. Put these two issues together and the human presence online is shrinking in both proportion and influence, with quality, human-made data representing a smaller and smaller share of the web. Although the dead internet theory is sometimes treated as an urban legend, recursive decay is measurable and accelerating at an alarming rate. Unfortunately, it’s deeply entangled with the economics of how modern AI is built.

Synthetic data is more popular than ever, given how many people use AI to assist with or fully generate content for their web pages. In April 2025, Ahrefs analyzed over 900,000 newly published web pages and found that 74.2% contained AI-generated text, meaning only about a quarter of new pages were purely human-written. A running tally by NewsGuard now includes more than 1,200 “news” sites operated with little or no human oversight. These sites proliferate because they are automated, require hands-off management, and still garner views and revenue.

Synthetic data isn’t inherently bad. It has a number of legitimate uses: it powers simulations, preserves privacy, and fills gaps in rare-event datasets. But once it leaks into the public web without labels, it stops being a controlled tool for particular scenarios and starts contaminating the global pool of scraped training data. Recursive learning, a method of training an AI model on generated data, isn’t inherently bad either, but repeated over and over it can end with inaccuracies represented as truth. The first cycle trains on human-written data; in the second, the model’s outputs appear online unlabeled; by the third, scrapers ingest those synthetic outputs as if they were human originals. The model gains a larger store of training data but loses sight of where human-written data ends and synthetic output begins. Researchers call the outcome of generations of recursive learning “model collapse.” In its early stages, the model loses rare, nuanced patterns. In the late stage, it can spiral into completely inaccurate output. It’s like playing the telephone game and ending up with a wild phrase, or photocopying a photocopy until the image is nothing but a blur.
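To make the statistical intuition concrete, here is a minimal sketch that uses a deliberately crude stand-in for a generative model: each “generation” fits a plain Gaussian to the previous generation’s synthetic samples and then resamples from that fit. The heavy-tailed starting data plays the role of rare, nuanced human knowledge; the numbers and names are illustrative and are not drawn from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: stand-in for "human" data, with heavy tails (rare, nuanced events).
data = rng.standard_t(df=3, size=500)

def fit_and_resample(samples, n, rng):
    """Fit a crude Gaussian 'model' to the samples, then generate a synthetic dataset from it."""
    mu, sigma = samples.mean(), samples.std()
    return rng.normal(mu, sigma, n)

print(f"gen  0  spread={data.std():.3f}  rare events |x|>5: {np.sum(np.abs(data) > 5)}")
for gen in range(1, 51):
    # Each new generation trains only on the previous generation's synthetic output.
    data = fit_and_resample(data, n=500, rng=rng)
    if gen % 10 == 0:
        print(f"gen {gen:2d}  spread={data.std():.3f}  rare events |x|>5: {np.sum(np.abs(data) > 5)}")
```

Running this, the rare tail events thin out almost immediately and the measured spread wanders rather than staying anchored to the original distribution, which is the same qualitative pattern the model-collapse literature describes at vastly larger scale.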

Because synthetic data can be optimised for SEO and GEO (generative engine optimization) far more easily than organic human-made data, we can already see the illuminated walkway to the dead internet on our own devices. Many top search results are AI-compiled “how-to” pages, trending lists on YouTube are dominated by AI-slop video channels, and, as previously mentioned, many comment sections are statistically more bot than human. If we continue along this path, humans will quickly become a statistical minority, and the data we produce will be drowned out by AI-generated content and algorithmically pushed engagement.

So why use recursive learning at all? Mainly for three reasons, all of them incentives for AI companies trying to stay competitive or gain an edge. It’s economically advantageous: generating training data costs almost nothing once the model exists. It saves time: you don’t have to wait for new human material to emerge once everything available has been scraped. And it’s adaptable: a company can manufacture edge cases or rare scenarios on demand rather than hunting for them or stumbling across them. In today’s competitive field, safety is rarely the priority when high speed and low cost translate into a significant market advantage.

There are options beyond recursive learning; they just take more money, manpower, time, and effort. These alternatives are in use by AI developers around the world today, but the projects using them are far less ubiquitous because they aren’t “winning” the AI race as the current zeitgeist has defined it: they prioritize quality, control, and reliability over cost, scope, and speed. One option is continuous human data collection, which provides cleaner training data but is much slower and more expensive. Another is licensed and closed corpora, which give strong provenance control and reliable data but may have incomplete coverage and can produce hallucinations for queries outside the topics covered. Human-in-the-loop protocols yield higher-quality data but are slower and costlier because of the human oversight built into the workflow. Synthetic augmentation with firewalls lets synthetic data be used, but it stays in a sandbox and never enters the main training set (a sketch of this idea follows below). And lastly, smaller, purpose-built models are gaining popularity for their fit in specific scenarios and their low exposure to contaminated data. All these options are on the table, but recursive training remains a key method for the larger and more competitive models.
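As a hedged illustration of the firewall idea, the sketch below shows one way a data pipeline could keep synthetic examples quarantined from the core corpus. The class and field names (Example, TrainingPools, and so on) are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    text: str
    source: str          # e.g. "licensed_corpus", "human_submission", "synthetic_generator"
    synthetic: bool

@dataclass
class TrainingPools:
    """Keeps human-provenance data and synthetic data in separate pools."""
    core: list = field(default_factory=list)      # only provenance-checked human data
    sandbox: list = field(default_factory=list)   # synthetic data, used only for targeted augmentation

    def ingest(self, example: Example):
        # The firewall: synthetic examples can never enter the core pool.
        if example.synthetic:
            self.sandbox.append(example)
        else:
            self.core.append(example)

    def pretraining_corpus(self):
        # Pretraining sees only the core pool, so recursion on model output cannot happen here.
        return self.core

    def augmentation_batch(self, scenario: str):
        # Synthetic edge cases are drawn deliberately and stay labeled as synthetic.
        return [ex for ex in self.sandbox if scenario in ex.text]
```

Under this arrangement, a pretraining run would only ever call pretraining_corpus(), while targeted fine-tuning could draw labeled edge cases from augmentation_batch(), so machine-generated text never re-enters the pool that future models retrain on.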

To mitigate data decay for those still using recursive methods, a few tools have been adopted by the largest contributors to this influx of synthetic data. These tools primarily aim to identify non-human data so that large web-scraping projects do not treat it as human-created. Ideally, this lets later training clearly identify synthetic data and either screen it out or treat it differently. Google DeepMind uses SynthID, which watermarks text and media as they are generated. Adobe, Microsoft, and the other members of the Coalition for Content Provenance and Authenticity (C2PA) have created Content Credentials, a cryptographic provenance standard for images, video, and increasingly, text. The EU AI Act requires labeling of synthetic content in certain contexts. And Dolma and similar datasets represent efforts to build open corpora with clear data statements and provenance checks, reducing the amount of synthetic data that gets labeled and treated as human-made. The drawback to labeling is that it only works if the labels survive deliberate stripping, reformatting, and simple copy-paste. Once a label becomes decoupled from the text or imagery, the effort is wasted.
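To show how provenance signals like these might be consumed on the training side, here is a minimal sketch. The helpers has_declared_ai_manifest and watermark_score are hypothetical placeholders standing in for real C2PA manifest parsing and watermark detection, which are not implemented here.

```python
# Hedged sketch: routing scraped documents by provenance before training.
# The helper functions are hypothetical stand-ins, not real C2PA or SynthID APIs.

def has_declared_ai_manifest(doc: dict) -> bool:
    """Hypothetical: True if the document carries an intact Content Credentials
    manifest that declares generative-AI involvement."""
    manifest = doc.get("c2pa_manifest") or {}
    return bool(manifest.get("generative_ai"))

def watermark_score(text: str) -> float:
    """Hypothetical: detector confidence in [0, 1] that the text is machine-generated."""
    return 0.0  # placeholder -- a real detector would go here

def route_document(doc: dict, threshold: float = 0.5) -> str:
    """Decide how a scraped document is treated when building a training corpus."""
    if has_declared_ai_manifest(doc):
        # Labeled synthetic content: exclude it from the human corpus, or sandbox it.
        return "exclude_or_sandbox"
    if watermark_score(doc["text"]) >= threshold:
        # Unlabeled but likely synthetic: quarantine or downweight it.
        return "quarantine"
    return "treat_as_human"
```

The paragraph’s caveat applies directly to this sketch: the first branch only fires if the manifest survives copy-paste and reformatting, and the second depends on a watermark detector that hostile republication hasn’t defeated.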

It could be that recursive decay is inevitable in the more prominent models and on the web at large, and that the current efforts are only slowing the rate of decay rather than staving it off indefinitely. If that happens, verified human-made content could come to be seen as a premium commodity. We could also see a homogenisation of information in which niche knowledge, such as dead languages or minority cultures whose material never gets republished online, loses its virtual presence, making its real-world propagation and protection even more important. These are plausible real-world responses to recursive decay and synthetic data, and they are logical extensions of already measurable trends.

These offline responses highlight the importance of a digital repository, kept safe from synthetic data, for posterity. The idea of a training sanctuary is one in which only human archives are used and no recursive training or synthetic data is involved. It differs from the current approach of purpose-built smaller models in that it simply bans synthetic data while keeping the breadth of larger models. An effort like this could be viewed as internet conservation: protecting nuanced and rare information that would otherwise be statistically homogenized by recursive training. If we don’t intervene with a remedy for recursive decay, our largest and most ubiquitous models, as well as the web at large, will degrade, and the synthesis, searchability, and breadth of human knowledge that we have so painstakingly digitised and collated online will be rendered too contaminated to be useful.

Works Consulted

Seddik, Mohamed El Amine, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. “How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse.” arXiv preprint, April 2024. https://arxiv.org/abs/2404.05090

Shumailov, Ilia, et al. “The Curse of Recursion: Training on Generated Data Makes Models Forget.” arXiv preprint, May 2023. https://arxiv.org/abs/2305.17493

Gerstgrasser, Matthias, et al. “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data.” arXiv preprint, April 2024. https://arxiv.org/abs/2404.01413

Imperva. Bad Bot Report 2025. Imperva Inc., 2025. https://www.imperva.com/resources/resource-library/reports/bad-bot-report-2025/

NewsGuard. “Tracking the Proliferation of AI-Generated News and Information Sites.” NewsGuard Technologies, updated 2025. https://www.newsguardtech.com/special-reports/news-bot-sites/

Ahrefs. “AI Content on the Web: Analysis of New Pages 2025.” Ahrefs Blog, May 2025. https://ahrefs.com/blog/ai-content-analysis/

Originality.AI. “Tracking AI Content in Google’s Top Results.” Originality.ai, 2025. https://originality.ai/blog/ai-detection-in-serps

Wikipedia Contributors. “Dead Internet Theory.” Wikipedia, last modified August 2025. https://en.wikipedia.org/wiki/Dead_Internet_theory

Tiffany, Kaitlyn. “Maybe You Missed It, but the Internet ‘Died’ Five Years Ago.” The Atlantic, August 30, 2021. https://www.theatlantic.com/technology/archive/2021/08/dead-internet-theory/619937/

Financial Times. “The problem of ‘model collapse’: how a lack of human data limits AI progress.” Financial Times, April 2024. https://www.ft.com/content/ae507468-7f5b-440b-8512-aea81c6bf4a5
