Can AI Companies Run Out of Training Data?

By Amara | Updated 19 June 2026

Key Numbers

Figure	What it measures	Source
2028	Median year by which Epoch AI estimates the public stock of high-quality text for LLM training will run out	Epoch AI (Villalobos et al.), 2022, as later revised
60-80%	Share of all known high-quality public text a single GPT-4-scale training run may have already used	Epoch AI and OpenAI token estimates
10%	Probability Anthropic CEO Dario Amodei puts on AI running out of data to keep scaling	Dario Amodei, 2024
$60M/year	What Google reportedly pays Reddit for training data and API access	Reported licensing deal terms, 2024
23.5%	Projected annual growth rate of the global AI training data market through 2031	Cognitive Market Research, 2024

Key Takeaways

1. Epoch AI researchers estimate the public internet holds roughly 15 to 20 trillion tokens of high-quality, human-written text. At the current pace of frontier training, that supply is likely to be exhausted somewhere between 2026 and 2032, with 2028 as the median estimate.
2. A single GPT-4-scale training run may already use 60 to 80 percent of all the high-quality public text believed to exist, and Chinchilla scaling laws suggest a GPT-5-class model would need 60 to 100 trillion tokens, far more than is left to scrape.
3. This is why OpenAI, Google, Anthropic, and Meta now pay for licensed data (Google reportedly sends Reddit about $60 million a year) and lean on synthetic data. Even so, Anthropic CEO Dario Amodei has put the probability of data scarcity actually stalling AI progress at about 10 percent.

AI companies are not running out of data in general. The internet keeps growing every day. What's running low is something narrower: the clean, high-quality, human-written text that built the last five years of progress and has not already been scraped, deduplicated, and fed into a model. Epoch AI, the research group behind the most-cited forecast on this question, puts the exhaustion date for that supply somewhere between 2026 and 2032, with 2028 as the median.

GPT-4 is estimated to have trained on up to 12 trillion tokens. The entire stock of high-quality public English text is believed to total only 15 to 20 trillion. Do the math and one model alone may have burned through 60 to 80 percent of the good English text humans have published online, which is a wild thing to sit with. Neema Raphael, Goldman Sachs' chief data officer, put it more bluntly on the bank's Exchanges podcast in 2025: "We've already run out of data."

What follows covers the math behind Epoch AI's 2028 estimate, what Dario Amodei's 10 percent figure actually means, and why synthetic data will not just paper over the gap. It also covers who's now writing checks for human data, Google included, at a reported $60 million a year to Reddit.

In This Article

1. What "Running Out of Training Data" Actually Means
2. The Data Wall Timeline: 2026 to 2032
3. Why the Open Internet Isn't Big Enough Anymore
4. The Token Math: Costs and Scale
5. Synthetic Data, Model Collapse, and What Fills the Gap
6. What People Get Wrong About the Data Wall
7. What Happens Next: Licensing, Synthetic Data, and Specialization

What "Running Out of Training Data" Actually Means

Running out of training data means frontier AI labs are approaching the point where they have used nearly all the high-quality, human-generated text that is legally and practically available for training a large language model. It does not mean data in general is disappearing. Global data volume is expanding, roughly doubling every three to four years according to industry estimates summarized by the World Economic Forum in 2025. The constraint is narrower: clean, deduplicated, high-value text suitable for training a frontier model.

Epoch AI drew the clearest line between data tiers in its 2022 paper, Will We Run Out of Data?, led by researcher Pablo Villalobos. The paper splits data into quality tiers, each with its own exhaustion timeline.

Data tier	Examples	Estimated exhaustion window
High-quality language text	Books, edited articles, curated web	Before 2026, per the original 2022 estimate
Low-quality language text	Forums, social posts, low-signal web	2030 to 2050
Image data	Photos, illustrations, scanned media	2030 to 2060
Updated public-content estimate	All public web text combined	2028 to 2032, per later Epoch revisions

Two estimates exist because Epoch revised its own numbers as more training runs became public. The original 2022 paper put high-quality text exhaustion "soon, likely before 2026." A more recent synthesis pushed the median exhaustion year to 2028, with 2032 described as the point where exhaustion becomes very likely.

The Data Wall Timeline: 2026 to 2032

The most cited deadline for the data wall sits between 2026 and 2032, and the spread exists because different forecasts use different assumptions about how fast model training keeps scaling. The earliest estimate said exhaustion was likely before 2026. The latest pushes the midpoint out to 2028.

"The median exhaustion year is 2028, and by 2032 exhaustion becomes very likely." (Epoch AI, revised estimate following Will We Run Out of Data?)

What changed between those two figures? Mostly the pace of real-world training runs. GPT-3 trained on roughly 300 billion tokens back in 2020, according to OpenAI's own research paper. By 2025 and 2026, industry trackers put frontier training runs at 15 trillion tokens or more, a fiftyfold jump in five years. At that growth rate, the available pool of clean text shrinks fast no matter which year you pick as the starting line.
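The fiftyfold figure is easy to check, and it also implies a yearly growth rate. Here is a minimal sketch using only the two token estimates quoted above; treating 2025 as the end year, and assuming smooth compound growth between the two points, are simplifications made for illustration.

```python
# Token estimates quoted in the paragraph above (both are estimates).
gpt3_tokens = 300e9        # ~300 billion tokens, 2020
frontier_tokens = 15e12    # ~15 trillion tokens, taken here as 2025
years = 2025 - 2020

fold_increase = frontier_tokens / gpt3_tokens

# Compound annual growth factor: the constant yearly multiplier that
# would turn the 2020 figure into the 2025 figure.
annual_factor = fold_increase ** (1 / years)

print(f"Fold increase: {fold_increase:.0f}x")
print(f"Implied growth: about {annual_factor:.2f}x per year")
```

On these inputs, training sets a bit more than doubled every year, which is why moving the starting year forward or back changes the exhaustion date so little.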

A separate, more conservative synthesis cited by Datanami in 2024 frames it slightly differently: tech companies will likely exhaust publicly available content for LLM training sometime between 2028 and 2032. Whichever number turns out to be right, every estimate cited here for public text lands inside the six-year window from 2026 to 2032, not decades out.

Why the Open Internet Isn't Big Enough Anymore

The open internet stops being enough once a single training run needs more clean text than the web has ever produced. That is not a hypothetical. It is the situation OpenAI, Google DeepMind, Meta, and Anthropic are already managing.

OpenAI has not published an internal token-exhaustion estimate, but it has signed multi-year licensing deals with Shutterstock for images and video, and with news publishers including News Corp and Axel Springer for current and archival content.
Google DeepMind researchers have warned models could run out of fresh, human-written text as soon as 2026, while Google itself pursues data partnerships with publishers and businesses to access material that never touched the open web.
Meta trains Llama on large public web mixtures, but its real advantage is proprietary: billions of Instagram and Facebook posts that no competitor can scrape.
Anthropic relies on filtered text corpora plus a growing share of AI-generated preference data for safety and alignment tuning, which sidesteps the public-text bottleneck for that stage, though not for pretraining itself.

Neema Raphael, Goldman Sachs' chief data officer, said it most plainly on the bank's Exchanges podcast in 2025:

"We've already run out of data." (Neema Raphael, Goldman Sachs, 2025)

Raphael's point was that frontier labs have used most of the easy, public, legally safe text. What is left requires paying for rights, tapping private data, or generating synthetic text, each with its own cost and its own risk.

The Token Math: Costs and Scale

Training a frontier model is, underneath everything else, a token-counting exercise. GPT-3 used about 300 billion tokens in 2020. GPT-4 is estimated to have used up to 12 trillion tokens. Frontier runs in 2025 and 2026 are estimated at 15 trillion tokens or more.

Model or era	Estimated training tokens	Year
GPT-3	~300 billion	2020
GPT-4	Up to ~12 trillion	2023
Frontier models, 2025-2026	~15 trillion or more	2025-2026
GPT-5-class, per Chinchilla scaling	60-100 trillion (needed)	Projected

The number most guides don't show

The total stock of high-quality public English text is estimated at 15 to 20 trillion tokens. GPT-4's training run alone may have used up to 12 trillion of them. Divide one into the other and a single model run consumed somewhere between 60 and 80 percent of all the good text that exists on the open internet, in one training pass.
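For anyone who wants to rerun that division with different assumptions, here is a short sketch. The inputs are the estimates quoted above, not confirmed figures, and the variable names are just illustrative.

```python
gpt4_tokens = 12e12                    # upper estimate for GPT-4's training run
stock_low, stock_high = 15e12, 20e12   # estimated high-quality public English text

# A larger stock means a smaller share used, so the bounds swap.
share_min = gpt4_tokens / stock_high
share_max = gpt4_tokens / stock_low

print(f"Share of stock used: {share_min:.0%} to {share_max:.0%}")
```

Because 12 trillion is itself an upper estimate, a lower GPT-4 token count would pull both ends of the range down.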

Now look forward instead of back. Chinchilla scaling laws, the rules researchers use to estimate how much data a model of a given size needs, suggest a GPT-5-class system would require 60 to 100 trillion tokens to scale the same way GPT-4 scaled from GPT-3. Even after counting every accessible source, including harder-to-reach non-English text that might push the ceiling toward 60 trillion tokens, that still leaves a shortfall of 10 to 20 trillion tokens or more. There is simply not enough human-written text left to train the next jump the old way.
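To get a feel for how a Chinchilla-style estimate works, here is a rough sketch. It assumes the commonly cited rule of thumb of about 20 training tokens per model parameter, which approximates the Chinchilla result and is not a figure from this article. The function names are hypothetical.

```python
# Assumed rule-of-thumb ratio (roughly what the Chinchilla paper found);
# it is an approximation, not a number taken from this article.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model of n_params."""
    return TOKENS_PER_PARAM * n_params

def params_for_tokens(n_tokens: float) -> float:
    """Inverse: model size for which n_tokens would be roughly compute-optimal."""
    return n_tokens / TOKENS_PER_PARAM

# Sanity check against Chinchilla itself: 70B parameters, ~1.4T tokens.
print(f"{compute_optimal_tokens(70e9) / 1e12:.1f} trillion tokens for 70B parameters")

# What model sizes would the 60 to 100 trillion token range correspond to?
for tokens in (60e12, 100e12):
    size = params_for_tokens(tokens) / 1e12
    print(f"{tokens / 1e12:.0f}T tokens ~ {size:.0f}T parameters")
```

Under that assumed ratio, the 60 to 100 trillion token range lines up with models of a few trillion parameters. Real labs tune the ratio, so treat this as an order-of-magnitude guide only.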

That shortfall is also turning into a real market. Grand View Research priced the global AI training dataset market at $2.13 billion in 2023, and Cognitive Market Research projects 23.5 percent annual growth through 2031, which would put the market north of $11 billion by then. Data, in other words, has gone from a free byproduct of the internet to a line item companies now budget for.
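The "north of $11 billion" figure comes from compounding the 2023 base at the projected rate. Here is a minimal sketch of that calculation; like the paragraph above, it assumes the two research firms' numbers can be combined and that growth runs from 2023 through 2031.

```python
base_2023_usd_bn = 2.13   # Grand View Research market size, 2023
annual_growth = 0.235     # Cognitive Market Research projected annual rate
years = 2031 - 2023

# Compound growth: multiply the base by (1 + rate) once per year.
projected_2031 = base_2023_usd_bn * (1 + annual_growth) ** years

print(f"Projected 2031 market: ${projected_2031:.1f} billion")
```

The result lands at roughly $11.5 billion, consistent with the figure quoted above.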

Synthetic Data, Model Collapse, and What Fills the Gap

Synthetic data, text and images generated by an AI model rather than written by a person, is the most common answer to the shortfall, but it is not a free substitute. Every major lab already uses some form of it, mostly for instruction tuning, safety evaluation, and preference data rather than core pretraining, which is still dominated by human-written text and code.

The risk has a name: model collapse. When a model trains on data that was itself generated by an earlier model, errors and quirks can compound instead of cancel out. Over repeated generations, outputs drift toward less diverse, less accurate text, the same way a photocopy of a photocopy loses detail. Researchers studying this describe it as a feedback loop that needs active management, not a problem that fixes itself.
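The photocopy effect can be shown with a toy simulation. This sketch is not how any lab trains a model: it simply resamples a corpus from itself, generation after generation, and counts how many distinct items survive. The function name and parameters are hypothetical, and the toy captures only the loss of diversity, not the loss of accuracy.

```python
import random

def resample_generations(vocab_size: int, sample_size: int,
                         generations: int, seed: int = 0) -> list[int]:
    """Return the number of distinct items present in each generation.

    Generation 0 is the "human" corpus: every item appears once.
    Each later generation is drawn, with replacement, only from the
    previous generation's output, so an item that is missed once can
    never come back.
    """
    rng = random.Random(seed)
    corpus = list(range(vocab_size))
    distinct_counts = [len(set(corpus))]
    for _ in range(generations):
        corpus = rng.choices(corpus, k=sample_size)
        distinct_counts.append(len(set(corpus)))
    return distinct_counts

counts = resample_generations(vocab_size=1000, sample_size=1000, generations=10)
print(counts)
```

The count of distinct items can only stay flat or fall, because nothing new enters the loop. That is the intuition behind keeping fresh human-written data in the training mix.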

"There's a 10% chance that we could run out of enough data to continue scaling models." (Dario Amodei, Anthropic CEO, 2024)

Amodei's 10 percent is a tail-risk estimate from someone running a frontier lab, not a confident prediction either way. It is also a reminder that synthetic data, licensing deals, and better filtering are treated industry-wide as mitigations, not solutions. Anyone curious why labs spend this much engineering effort on the data side rather than just buying more GPUs should see our explainer on AI training versus inference, which breaks down where the compute actually goes during a training run, separately from the data question.

What People Get Wrong About the Data Wall

"We're running out of data, period." Not true. Global data volume is expanding, roughly doubling every three to four years according to industry estimates. What is running out is the narrow slice of clean, high-quality, legally usable text that frontier models can train on. Sensor logs, internal documents, and raw video are exploding in volume; they are just not the kind of data that taught GPT-4 to write a coherent paragraph.

"Once we hit the wall, AI progress stops." Hitting the limit of public text data constrains one specific strategy: scaling bigger models on more generic web text. Progress can still come from better architectures, smarter fine-tuning, domain-specific data in fields like medicine and law, and tool-augmented reasoning that does not depend on raw token counts the same way.

"Synthetic data solves this with no downside." It extends the runway, but it carries the model collapse risk described above and tends to inherit whatever biases or blind spots the generating model already had. Labs treat it as one tool among several, not a replacement for human-written text.

"This only affects text. Images and video are basically infinite." Forecasts cover those too. The same Epoch AI research that projects high-quality text exhaustion before 2026 estimates image data will not run out until 2030 to 2060, but that is still a finite, projected number, not an infinite resource.

What Happens Next: Licensing, Synthetic Data, and Specialization

What happens next is already visible in the deals labs are signing. Google pays Reddit a reported $60 million a year for access to its content and API, a deal struck in early 2024 as Google's search and AI products leaned harder on real conversational text. OpenAI's licensing arrangement with Shutterstock, first signed in 2022 and expanded in 2023, covers images, video, and metadata for training and generation. OpenAI has also struck multi-year agreements with news publishers including News Corp and Axel Springer, reportedly worth hundreds of millions of dollars combined, for current articles and archives.

Deal	Companies	What it covers
Reddit licensing	Google and Reddit	Content and API access, reported at ~$60M/year (2024)
Shutterstock partnership	OpenAI and Shutterstock	Images, video, and metadata (2022, expanded 2023)
News publisher deals	OpenAI, News Corp, Axel Springer, and others	Current and archival news content

Three trends look set to define the next few years. Licensing deals will likely keep growing, since paying for rights is now cheaper than litigating the copyright claims already working through US courts. Synthetic data pipelines will get more sophisticated, paired with techniques like self-critique and density filtering meant to catch model collapse before it degrades output quality. And data quality engineering, picking the best 10 to 30 trillion tokens rather than scraping everything indiscriminately, will likely matter more than chasing additional raw volume.
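To make "data quality engineering" a little more concrete, here is a deliberately simple sketch of two common curation steps: dropping exact duplicates and dropping very short documents. The function names, the word-count threshold, and the sample documents are all hypothetical, and production pipelines use far more sophisticated filters than this.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def curate(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep the first copy of each document that passes a crude length check."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:   # hypothetical quality heuristic
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:                  # exact duplicate after normalization
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",   # duplicate once normalized
    "click here",                                      # too short to keep
    "Training data quality matters more than raw volume.",
]
print(curate(docs))
```

Only the first and last documents survive. Even these two crude filters shrink a corpus, which is part of why the usable stock of text is so much smaller than the raw web.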

Anyone curious how much of this trickles down to models they can run themselves should see our guide to open source LLMs, which covers which models are trained on licensed versus purely scraped data, and our walkthrough on running Ollama locally for experimenting with these models directly.

Frequently Asked Questions

Can AI companies run out of data?

Yes, in the specific sense that matters for frontier model training. AI companies are not running out of data overall, since the internet keeps growing, but they are approaching the limit of clean, high-quality, human-written text suitable for training large language models. Epoch AI estimates that supply is likely to be exhausted somewhere between 2026 and 2032, with 2028 as the median estimate.

Will AI run out of training data?

Likely within this decade, at least for the highest-quality public text. Epoch AI's original 2022 estimate put exhaustion before 2026; a later revision pushed the median to 2028, with 2032 described as the point at which exhaustion becomes very likely. Lower-quality text and image data have longer runways, lasting into the 2030s, 2040s, and beyond.

Is AI running out of data to learn from?

Largely yes, for the high-quality public text that powered recent AI progress. Frontier labs have likely already used most of the easy, high-quality, legally available public text. Goldman Sachs chief data officer Neema Raphael said in 2025 that "we've already run out of data," pointing to the shift toward synthetic and proprietary sources as evidence.

What happens when AI runs out of internet data?

Labs pay for licensed data, with Google reportedly sending Reddit about $60 million a year; lean on synthetic data despite known risks like model collapse; and shift focus from raw volume to data quality, curating smaller, denser, higher-value datasets. Expect more of the next gains to come from better architectures and fine-tuning rather than just adding more raw text.

What is synthetic data and can it replace human-written training data?

Synthetic data is text, images, or other content generated by an AI model rather than written by a person. It can extend a model's effective training supply, but it cannot fully replace human-written data because of model collapse, a risk where models trained on AI-generated content gradually lose diversity and accuracy across generations. Most labs use synthetic data heavily for fine-tuning and safety work, while pretraining still leans on human text.

What is model collapse?

Model collapse is what happens when an AI model is trained, directly or indirectly, on data generated by an earlier AI model rather than by humans. Errors and quirks in the earlier model's output get amplified instead of corrected, and over several generations the resulting model produces less diverse, less accurate text. Researchers treat it as an active risk to manage, not a one-time event.

How much training data did GPT-4 use?

GPT-4 is estimated to have trained on up to 12 trillion tokens, though OpenAI has not confirmed an exact figure publicly. That estimate matters because the total stock of high-quality public English text is believed to be only 15 to 20 trillion tokens, meaning a single GPT-4-scale run may have used 60 to 80 percent of all the good text that exists online.

Why do companies like Google pay for data licensing deals?

Because the easiest, free, publicly scrapable text has mostly already been used, and legal risk around scraping has grown alongside a wave of copyright lawsuits against AI companies. Google reportedly pays Reddit roughly $60 million a year for guaranteed access to its content and API, which is cheaper and lower-risk than relying on scraped data that could later be restricted or challenged in court.
