
How Does ChatGPT Work? How AI Learns, Thinks, and Generates Text

By Amara | Updated 11 May 2026
[Figure: transformer architecture diagram showing text tokens flowing through stacked neural network layers to generate a response]

Key Numbers

  • 175B: parameters in GPT-3, each adjusted billions of times during training (OpenAI GPT-3 paper, 2020)
  • 1.76T: estimated parameters in GPT-4 via Mixture of Experts architecture (SemiAnalysis, 2023)
  • 128K: token context window of GPT-4o, roughly 100,000 words of text (OpenAI, May 2024)
  • $4.6M–$12M: estimated cost to train GPT-3 on thousands of GPUs (Rethink Priorities, 2022)
  • 2017: year the transformer architecture was introduced in "Attention Is All You Need" (Vaswani et al., Google, 2017)

Key Takeaways

  • ChatGPT works by predicting the next most likely word fragment (token) one at a time, using a transformer neural network. GPT-3 uses 175 billion parameters trained on 300 billion tokens (OpenAI, 2020). GPT-4 is estimated at 1.76 trillion parameters in a Mixture of Experts architecture.
  • Training GPT-3 required 3.14 × 10²³ FLOPs of compute and cost an estimated $4.6M to $12M (Rethink Priorities, 2022). Running that same training on a single consumer RTX 4090 GPU would take approximately 120 years. Frontier model training requires datacenter-scale infrastructure with thousands of GPUs running in parallel.
  • ChatGPT does not understand language or know facts. It generates statistically plausible continuations of text based on patterns learned during training. RLHF shapes responses to be more helpful and safe, but the underlying mechanism is always next-token probability sampling with no truth-checking capability.

ChatGPT works by predicting the most likely next word fragment, one token at a time, based on everything that came before it. That sounds almost reductive. The scale at which it happens is what changes the result entirely.

The model underlying ChatGPT stores learned language patterns across billions of parameters. GPT-3, released in 2020, used 175 billion parameters trained on 300 billion tokens of text from the internet, books, and code (OpenAI, 2020). GPT-4 is estimated to use approximately 1.76 trillion parameters in a Mixture of Experts architecture, where different subsets of parameters handle different types of input. Each of those parameters was adjusted billions of times during training to make the model's next-token predictions more accurate.

What follows covers transformer architecture, the training pipeline from raw data to deployed model, the token-by-token generation process, and how image-generating AI uses a separate mechanism called diffusion.

How Does AI Work? The Transformer Architecture

ChatGPT and every major language model deployed today, including Claude, Gemini, and Llama, run on an architecture called the transformer. It was introduced in a 2017 paper by Google researchers titled "Attention Is All You Need" (Vaswani et al.). Before transformers, language models processed text sequentially, word by word, which caused them to lose context across long passages and made training slow. Transformers process all tokens in parallel using a mechanism called self-attention.

Self-attention works by computing, for every token in the input, how much weight to give every other token when forming a contextual representation. When the model sees the sentence "The bank was steep," self-attention pulls information from surrounding words to determine whether "bank" refers to a financial institution or a riverside. Every token gets updated based on its computed relationship to all other tokens in the context window simultaneously, not sequentially.
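To make the mechanism concrete, here is a minimal NumPy sketch of single-head self-attention. The shapes, random weights, and four-token sequence are illustrative assumptions, not values from any real model; production decoder models also apply a causal mask and run many heads per layer.

```python
# Minimal single-head self-attention sketch (illustrative values only).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token vectors; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each token: weighted mix of all tokens

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                             # e.g. "The bank was steep"
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)          # (4, 8): every token updated at once
```

The key property is visible in the last line: all four token representations are updated in one matrix multiplication, not one after another.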

A context window is the total amount of text the model can consider at once. Longer context windows allow the model to reason over more information without losing earlier content. Context length has grown considerably since GPT-3.

| Model | Context Window | Architecture | Released |
| --- | --- | --- | --- |
| GPT-3 | 4K tokens | Dense transformer | May 2020 |
| GPT-4 | 8K to 128K tokens | Mixture of Experts | March 2023 |
| GPT-4o | 128K tokens | Multimodal MoE | May 2024 |
| Claude 3 Opus | 200K tokens | Dense transformer | March 2024 |
| Gemini 1.5 Pro | 2M tokens | MoE with Ring Attention | February 2024 |

Gemini 1.5 Pro's 2 million token context window represents a 500-fold increase over GPT-3's 4,000 tokens. A 2M token window can hold roughly 20 full-length novels, several hours of audio transcripts, or an entire software codebase for analysis in one request.

For a detailed explanation of what large language models are and how they differ from earlier AI approaches, see our guide to large language models explained.

How Does AI Learn? Training, Data, and Gradient Descent

AI language models learn through pretraining: expose the model to enormous amounts of text, have it predict the next token at each position, measure how wrong it was, and adjust the model's parameters slightly in the direction that would have made the prediction more accurate. Repeat this across hundreds of billions of examples.

The adjustment process is called gradient descent. Every parameter in the model gets nudged by a tiny amount determined by how much changing that parameter would reduce the prediction error. This process, running through a technique called backpropagation, is what training actually is. Run it long enough on enough data and the parameters settle into values that reproduce the statistical patterns of human language.
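The loop is easier to see in code than in prose. Below is a toy illustration of one training step on a bigram model (predict the next token from the current token alone) with hand-derived gradients. Everything here is a stand-in: the vocabulary size, the random "token stream," and the single weight matrix are assumptions chosen to keep the sketch self-contained.

```python
# Toy pretraining loop: predict the next token, measure the error,
# nudge parameters downhill. A real model has billions of parameters;
# this bigram model has vocab * vocab of them.
import numpy as np

rng = np.random.default_rng(0)
vocab = 50
W = rng.normal(scale=0.01, size=(vocab, vocab))  # parameters: row = current token
data = rng.integers(0, vocab, size=1000)         # stand-in token stream
lr = 0.1

for step in range(100):
    i = rng.integers(0, len(data) - 1)
    x, y = data[i], data[i + 1]                  # context token, true next token
    logits = W[x]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over the vocabulary
    loss = -np.log(probs[y])                     # cross-entropy: surprise at the truth
    grad = probs.copy()
    grad[y] -= 1.0                               # d(loss)/d(logits) for softmax + CE
    W[x] -= lr * grad                            # the gradient descent step
```

Backpropagation in a real network computes that same `grad` for every parameter in every layer; the update rule at the end is unchanged.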

GPT-3 was trained on 300 billion tokens drawn from filtered Common Crawl web data, books, Wikipedia, and code (OpenAI, 2020). The full pretraining run required approximately 3.14 × 10²³ floating-point operations.

The Number Most Guides Don't Show

GPT-3's pretraining required 3.14 × 10²³ FLOPs of compute. A consumer NVIDIA RTX 4090 GPU performs about 82.6 TeraFLOPS, or 8.26 × 10¹³ operations per second. Running GPT-3's full training on a single RTX 4090 would take approximately 3.8 × 10⁹ seconds, around 120 years. OpenAI completed it in weeks by running over 10,000 GPUs in parallel. That gap explains why training frontier models remains inaccessible outside of companies with hyperscale data center infrastructure, and why the cost structure of training differs so sharply from daily inference costs, a distinction covered in detail in our AI training vs inference explained guide.
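The arithmetic behind that claim fits in a few lines, using the figures quoted above:

```python
# Back-of-the-envelope check of the "120 years" figure.
train_flops = 3.14e23              # GPT-3 pretraining compute (OpenAI, 2020)
gpu_flops_per_s = 82.6e12          # RTX 4090, ~82.6 TFLOPS
seconds = train_flops / gpu_flops_per_s
print(seconds / (365.25 * 24 * 3600))  # ~120 years on one GPU
```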

After pretraining, models go through two further refinement steps:

  • Supervised fine-tuning: human trainers write example prompt-and-response pairs. The model trains on these to learn a conversational style and instruction-following behavior.
  • RLHF (Reinforcement Learning from Human Feedback): human raters compare pairs of model responses and rank which is more helpful, safer, or more accurate. A separate "reward model" learns from those rankings. The main language model is then updated to produce outputs the reward model scores highly. OpenAI introduced this approach in the InstructGPT paper (2022), reporting a 3x improvement in human preference ratings over the base pretrained model. A minimal sketch of the reward model's pairwise loss follows below.
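As promised above, here is a minimal sketch of the pairwise loss an InstructGPT-style reward model is trained on, the Bradley–Terry form -log sigmoid(r_chosen - r_rejected). The scalar rewards passed in are hypothetical stand-ins for the outputs of a real reward model.

```python
# Pairwise preference loss for an RLHF reward model (sketch).
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): low when the reward model
    scores the human-preferred response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, -1.0))   # small loss: ranking agrees with raters
print(preference_loss(-1.0, 2.0))   # large loss: ranking disagrees
```

Minimizing this loss over many ranked pairs is what teaches the reward model to imitate human preferences; the language model is then tuned against it.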

"GPT-3 is trained using next word prediction, just the same as its GPT-2 predecessor." (Ilya Sutskever, OpenAI, GPT-3 paper, May 2020)

According to DeepMind's Chinchilla scaling laws research (Hoffmann et al., March 2022), training compute is most efficiently split roughly equally between model size and training data volume. The paper showed that GPT-3's 175 billion parameters were undertrained relative to the optimal data volume, and that a 70 billion parameter model trained on 1.4 trillion tokens could match GPT-3's performance at a fraction of the compute cost. This finding shifted how subsequent models, including GPT-4 and Llama, were designed.
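A common simplification of the Chinchilla result is roughly 20 training tokens per parameter for compute-optimal training. Treating that as a rule of thumb (it is an approximation, not the paper's exact fit) reproduces the numbers above:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter
# (a simplification of Hoffmann et al., 2022).
def chinchilla_optimal_tokens(params: float) -> float:
    return 20.0 * params

print(chinchilla_optimal_tokens(70e9) / 1e12)    # 1.4 trillion: the paper's 70B model
print(chinchilla_optimal_tokens(175e9) / 1e12)   # 3.5 trillion: GPT-3's 0.3T fell far short
```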

How Does ChatGPT Specifically Work? Tokens and Temperature

ChatGPT does not process text as words. It works with tokens, small units that can be whole words, word fragments, punctuation marks, or spaces. A word like "tokenization" splits into two or three tokens depending on the tokenizer. Numbers and uncommon words frequently split into multiple tokens. GPT-4o handles up to 128,000 tokens at once, equivalent to a full-length novel.
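You can inspect the split yourself with tiktoken, OpenAI's open-source tokenizer library (pip install tiktoken); the exact fragments vary by encoding, and the sentence below is just an example input.

```python
# How a prompt becomes tokens, using OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models
ids = enc.encode("The bank was steep.")
print(ids)                                   # integer token IDs
print([enc.decode([i]) for i in ids])        # the text fragment behind each ID
```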

Tokens matter because the model's entire operation, from reading your input to generating its response, happens in token space. When you type a message, your text is split into tokens and each token gets converted into a high-dimensional numerical vector. These vectors pass through the transformer layers, where self-attention updates each token's representation based on its relationship to all other tokens in the context.

At the end of the forward pass, the model produces a probability distribution over its entire vocabulary, roughly 100,000 possible next tokens, for what should come next. It samples from that distribution, adds the selected token to the context, and runs the entire forward pass again for the next token. This repeats until the model generates a stop token or reaches its output limit.

Temperature controls how that sampling works. At temperature 0, the model always picks the highest-probability token and generates identical outputs for the same input. At temperature 1.0, roughly ChatGPT's default, it samples proportionally to the probabilities, producing varied but coherent outputs. That is why asking ChatGPT the same question twice gives different phrasing each time.
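Here is a small sketch of temperature-scaled sampling. The five-element logits vector is a toy assumption standing in for the ~100,000-entry distribution a real model produces at each step.

```python
# Temperature-scaled sampling from a next-token distribution (sketch).
import numpy as np

def sample_next(logits, temperature, rng):
    if temperature == 0:                      # greedy: always the top token
        return int(np.argmax(logits))
    z = np.asarray(logits) / temperature      # low T sharpens, high T flattens
    p = np.exp(z - z.max())
    p /= p.sum()                              # softmax -> probabilities
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = [2.0, 1.5, 0.2, -1.0, -3.0]
print([sample_next(logits, 0.0, rng) for _ in range(5)])  # identical picks
print([sample_next(logits, 1.0, rng) for _ in range(5)])  # varied picks
```

Dividing the logits by the temperature before the softmax is the entire mechanism: at 0 the distribution collapses to its peak, and above 1 it flattens toward uniform.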

GPT-4 uses a Mixture of Experts (MoE) architecture, estimated at 1.76 trillion total parameters but only activating roughly 200 billion of them per token (SemiAnalysis, 2023). Each input gets routed to different specialized sub-networks ("experts") depending on its content. This allows a much larger total model size without a proportional increase in inference cost, because most parameters stay inactive for any given forward pass.
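The routing idea is simple enough to sketch. The layer sizes, expert count, and top-2 selection below are illustrative assumptions (GPT-4's actual configuration is unpublished); the point is that only the chosen experts' parameters do any work for a given token.

```python
# Mixture of Experts routing sketch: a gating network picks the
# top-k experts per token, so most parameters stay inactive.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d_model, n_experts))

def moe_layer(x):
    scores = x @ W_gate                        # router: one score per expert
    chosen = np.argsort(scores)[-top_k:]       # keep only the top-k experts
    w = np.exp(scores[chosen]); w /= w.sum()   # softmax over the chosen experts
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                  # (16,): only 2 of 8 experts ran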

How Does AI Work Step by Step? From Your Prompt to the Response

The full sequence from prompt to response, based on OpenAI's published documentation on how ChatGPT and foundation models are developed.

1. Your text is tokenized into subword units and filtered through a content moderation layer that blocks clearly unsafe inputs before the main model processes anything.

2. The system selects an appropriate model variant based on the complexity of the request. Simpler queries route to GPT-4o mini (128K context, $0.15 per million input tokens). More complex reasoning tasks route to models like o1, which runs additional internal chain-of-thought computation before generating a final response.

3. All tokens pass through the transformer layers in parallel, with self-attention updating each token's representation based on the full context. In a GPT-3-scale model that stack is 96 layers deep, so each token's representation is refined 96 times, building progressively richer representations of meaning and context.

4. The model generates the response one token at a time, with each new token added to the context and the full forward pass repeated. Temperature, top-p sampling, and top-k filtering shape which tokens get selected during generation; a top-p sketch follows after this list.

5. The generated output passes through a second content moderation layer before being displayed.
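As referenced in step 4, here is a sketch of top-p ("nucleus") filtering: keep the smallest set of tokens whose cumulative probability reaches p, zero out the rest, and renormalize before sampling. The five-token distribution is a toy assumption.

```python
# Top-p (nucleus) filtering sketch.
import numpy as np

def top_p_filter(probs, p=0.9):
    order = np.argsort(probs)[::-1]                      # highest probability first
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]     # the "nucleus"
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                     # renormalize over survivors

probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
print(top_p_filter(probs, p=0.9))                        # tail tokens zeroed out
```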

The OpenAI o1 series, released in September 2024, added a step before generation: internal chain-of-thought reasoning, where the model runs thousands of internal tokens of deliberation before producing a visible response. This "test-time compute" scaling improved performance on mathematics and science benchmarks by over 50% compared to standard GPT-4 (OpenAI o1 system card, 2024).

"Claude 3 uses Constitutional AI: we train with a constitution of principles to align behavior." (Dario Amodei, Anthropic CEO, Anthropic blog, March 2024)

All major deployed models share transformer architecture. What separates them is training data, training methodology, fine-tuning approach, and constitutional constraints. Those choices determine how responses feel and where the models succeed or fail.

What none of these models do is understand language in the way humans do. They have no internal world model, no persistent memory between conversations (unless given a memory tool), and no ability to verify whether a generated statement is true. They generate statistically plausible continuations of text. When that plausibility happens to match reality, responses are accurate. When it does not, they hallucinate confidently. For the full picture of what this means for the long-term trajectory of AI capabilities, see our guide to what AGI is and when it might arrive.

How Does AI Generate Images? Diffusion Models Explained

Image-generating AI uses a different mechanism from text-generating AI. Where language models predict the next token autoregressively, image generators use a process called diffusion: start with random Gaussian noise, then iteratively remove that noise in small steps guided by a learned model until a coherent image emerges.

During training, the model learns to reverse a "noising" process. A real image gets Gaussian noise added to it in small increments over 1,000 steps until it becomes pure noise. The model trains to predict, given a noisy image at step T and the original text description, what noise was added at that step. At generation time, running that prediction in reverse for 50 to 100 steps turns pure noise into a clean image that matches the prompt.
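A sketch of that training objective, under stated assumptions: the `model` function below is a hypothetical untrained noise predictor (a real one is a large U-Net or transformer), and the cosine noise schedule is a toy stand-in for the schedules used in practice.

```python
# Diffusion training objective sketch: add noise to an image,
# have the model predict that noise, penalize the squared error.
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(size=(64, 64, 3))          # stand-in for a training image
T = 1000
t = rng.integers(1, T)                         # random step on the noising schedule
alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # toy schedule: 1 -> 0 as t grows
noise = rng.normal(size=image.shape)
noisy = np.sqrt(alpha_bar) * image + np.sqrt(1 - alpha_bar) * noise

def model(noisy_image, t, text_embedding):     # hypothetical noise predictor
    return np.zeros_like(noisy_image)          # an untrained guess

pred = model(noisy, t, text_embedding=None)
loss = np.mean((pred - noise) ** 2)            # train to predict the added noise
print(loss)
```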

The text conditioning works through an embedding model, typically CLIP (Contrastive Language-Image Pre-training), that maps text descriptions and images into a shared vector space. The diffusion model uses the text embedding as a guide for which direction to remove noise during generation.

| System | Architecture | Parameters | Steps | Open Source | Key Feature |
| --- | --- | --- | --- | --- | --- |
| Stable Diffusion 1.5 | U-Net + CLIP | 1B (U-Net) | 20 to 50 | Yes | Runs on 4GB GPU |
| DALL-E 3 | Diffusion Transformer | ~12B est. | 50 to 100 | No | Integrated with ChatGPT |
| Midjourney v6 | Proprietary DiT | ~10B+ est. | ~40 | No | Aesthetic quality |

Stable Diffusion uses "latent diffusion": rather than denoising full-resolution pixel arrays, it works in a compressed latent space (downsampled 8x in each spatial dimension), then decodes back to pixels at the end. This is why it runs on consumer hardware while DALL-E and Midjourney do not.

DALL-E 3, released in October 2023 and integrated into ChatGPT, uses a Diffusion Transformer architecture and is trained on images paired with detailed GPT-4-rewritten captions, which produces better text-image alignment than earlier versions trained on scraped web captions.

GPT-4o processes images differently from both: it is a natively multimodal model where the vision encoder and language decoder share the same transformer, rather than being separate models stitched together. This allows it to reason about image content and text simultaneously rather than first describing the image and then reasoning about the description.

How Does Generative AI Work? The Common Thread

Generative AI is any model that produces new content by learning the statistical distribution of existing content and sampling from that distribution. Text generators learn the distribution of tokens. Image generators learn the distribution of pixel arrangements. Audio models learn the distribution of waveform patterns. The underlying mathematics differs by modality, but the core concept is the same: learn what patterns appear in training data, then produce new examples that fit those patterns.

For language models, this means learning which tokens tend to follow which other tokens in which contexts. For diffusion image models, it means learning how real-image distributions differ from random noise distributions. In both cases, the model is not storing examples and retrieving them; it is learning compressed representations of statistical patterns that it can then recombine during generation.

The key distinction that the research community and the media often conflate: generative AI does not "know" things in any meaningful sense. ChatGPT is not consulting a database of facts when it answers a question. It is generating text that fits the pattern of how factual answers are written. When the training data contained many correct descriptions of a topic, the generated answer will likely be correct. When it did not, or when the topic is rare or emerged after the model's training cutoff, the model generates plausible-sounding text that may be wrong.

This is the hallucination problem: not a bug in implementation but a consequence of how next-token prediction works. The model has no mechanism to distinguish "I am generating text that matches the statistical pattern of a correct answer" from "I am generating text that matches the statistical pattern of a confident but wrong answer." Both patterns exist in training data.

The post-2023 generation of reasoning models, including OpenAI's o1 series and various "chain of thought" systems, attempts to address this by adding an internal deliberation step where the model checks its own reasoning before committing to a response. This improves accuracy on structured problems like mathematics, logic, and coding, but does not eliminate hallucinations on factual or open-ended questions.

Frequently Asked Questions

How does ChatGPT work?

ChatGPT works by predicting the next most likely token (word fragment) given all the text that came before it, using a transformer neural network with billions of parameters. GPT-3 used 175 billion parameters trained on 300 billion tokens of text. GPT-4 is estimated to use approximately 1.76 trillion parameters in a Mixture of Experts architecture. The model generates responses one token at a time, with each selection added to the context before predicting the next. RLHF (reinforcement learning from human feedback) shapes its responses to be more helpful and safe.

How does AI learn?

AI language models learn through pretraining: the model predicts the next token in a text sequence, measures how wrong it was, and adjusts its billions of parameters through gradient descent to make better predictions next time. GPT-3 required 3.14 × 10²³ floating-point operations across over 10,000 GPUs for several weeks (OpenAI, 2020). After pretraining, models go through supervised fine-tuning on human-written examples and RLHF, where human raters rank responses to train a reward model that guides further improvement.

How does AI make decisions?

AI language models do not "make decisions" in any meaningful sense. At each step of generating a response, the model computes a probability distribution over all possible next tokens in its vocabulary (roughly 100,000 options for GPT-4o) and samples from that distribution. The sampling is influenced by temperature settings. Higher temperature introduces more randomness; lower temperature makes outputs more deterministic. There is no reasoning process, no goal, and no intent. The model generates statistically plausible continuations of text based on patterns learned during training.

How does AI work step by step?

For a language model like ChatGPT: (1) your text is tokenized and filtered through content moderation; (2) tokens pass through all transformer layers with self-attention updating each token's representation; (3) the model computes probabilities for every possible next token; (4) it samples a token and adds it to the context; (5) steps 3 and 4 repeat until the response is complete; (6) the output passes through a second content moderation layer. For reasoning models like o1, an internal chain-of-thought step runs before the visible response is generated.

How does generative AI work?

Generative AI learns the statistical distribution of its training data and generates new examples that fit that distribution. Text models learn which tokens follow which other tokens in context. Image models using diffusion learn to reverse a noise-addition process, guided by text descriptions. Audio models learn waveform distributions. All generative AI is fundamentally sampling from a learned distribution, not retrieving stored examples. The model has no memory of specific training documents and cannot verify whether its generated content is accurate.

How does AI generate images?

AI image generators use diffusion models: start with random noise, then iteratively remove that noise in small steps guided by a text description until a coherent image forms. The text conditioning comes from an embedding model (typically CLIP) that maps text and images into a shared vector space. Stable Diffusion works in a compressed latent space for efficiency. DALL-E 3 uses a Diffusion Transformer architecture integrated with ChatGPT. Midjourney uses proprietary diffusion with curated aesthetic training. All require 20 to 100 denoising steps to produce an image.

How does AI understand language?

AI language models do not understand language in any human sense. They represent language as high-dimensional numerical vectors and learn statistical relationships between those vectors during training. Self-attention in transformers captures contextual relationships between words, for example distinguishing "bank" (financial) from "bank" (riverbank) based on surrounding words. But this is pattern matching, not semantic understanding. The model has no internal world model, no ability to reason about truth, and no awareness of what its outputs mean.

Is ChatGPT just autocomplete?

Next-token prediction is the mathematical foundation of ChatGPT, so "autocomplete" is technically accurate but misleading in practice. Standard autocomplete on a smartphone predicts one or two words from a small vocabulary using shallow statistical models. ChatGPT predicts across a 100,000-token vocabulary using 175 billion to 1.76 trillion parameters trained on trillions of tokens of text, shaped further by RLHF to produce useful, goal-directed responses. The mechanism is similar; the scale and alignment training are what make the outputs qualitatively different.
