NVIDIA H100 vs A100: Full Comparison and When to Upgrade

Key Takeaways
- H100 SXM5 delivers 989 TFLOPS FP16 (3.2x the A100's 312 TFLOPS) and 3,350 GB/s memory bandwidth (1.6x the A100's 2,039 GB/s). The Transformer Engine, which adds FP8 support unavailable on the A100, is the key differentiator for transformer model workloads.
- H100 SXM5 costs $25,000 to $40,000 new versus $8,000 to $15,000 for an A100 SXM4 80GB as of Q1 2026. Cloud rental averages $2.29/hr for H100 versus $1.49 to $2.00/hr for A100. For large LLM training, H100's 2.4x to 9x speed advantage makes it cheaper per completed job despite the higher hourly rate.
- The A100 remains the right choice for models under 13 billion parameters, batch inference without latency constraints, and teams with existing A100 infrastructure under two years old. For pre-training at 30B+ parameters or running low-latency inference APIs, H100 is clearly the better investment.
The H100 is 3.2x faster than the A100 in raw FP16 compute. For large language model training using FP8 precision, the gap reaches 9x. For inference, NVIDIA benchmarks show up to 30x higher throughput on large models. These numbers lead most people to assume the H100 is always the right choice.
It is not. For models under 13 billion parameters, the A100 handles training and inference adequately. For batch workloads where speed does not affect your timeline, paying twice the hardware cost for faster processing has no return. For teams with A100 infrastructure purchased within the last two years, the total cost of switching rarely makes financial sense.
The actual decision is more nuanced. What workloads are you running? At what model scale? And what does your cost-per-completed-job look like when you factor in actual training speed rather than just hourly rental rates?
This article covers the complete spec comparison, benchmark data across workload types, a cloud pricing analysis, and the honest case for staying on A100 in 2026.
H100 vs A100: Core Differences at a Glance
The A100 (Ampere architecture, TSMC 7nm) launched in May 2020. The H100 (Hopper architecture, TSMC 4nm) was announced in March 2022, with volume availability following into 2023. Roughly three years and one full process node separate them, which is why the performance gap is as large as it is.
The most important architectural addition in the H100 is the Transformer Engine: dedicated hardware that dynamically adjusts computation between FP8 and FP16 precision per operation during training. The A100 has no equivalent. This single feature accounts for much of the gap on transformer-model workloads. Without it, the H100's raw performance advantage over the A100 would be closer to 3x. With it, for large language model training specifically, the gap reaches 9x.
The second key difference is memory bandwidth. H100 SXM5 uses HBM3 at 3,350 GB/s. A100 SXM4 uses HBM2e at 2,039 GB/s. Transformer inference is memory-bandwidth-bound: the bottleneck is how fast weights and KV caches can be loaded from memory, not how many floating-point operations per second the chip can perform. A 1.6x bandwidth improvement translates directly to faster inference even for workloads that do not use the Transformer Engine.
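The bandwidth-bound intuition can be made concrete with a back-of-envelope sketch. The assumptions here (a 70B-parameter model in FP16, and one full read of the weights per generated token) are simplifications for illustration, not benchmark results:

```python
# Back-of-envelope estimate of memory-bandwidth-bound decode throughput.
# Assumes each generated token requires streaming all model weights from
# HBM once, which is a reasonable first-order model for single-stream decoding.

def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when streaming weights is the bottleneck."""
    weight_gb = params_billion * bytes_per_param  # GB read per generated token
    return bandwidth_gb_s / weight_gb

# 70B model in FP16 (2 bytes/param); illustrative numbers only.
a100 = decode_tokens_per_sec(70, 2.0, 2039)  # A100 SXM4, HBM2e
h100 = decode_tokens_per_sec(70, 2.0, 3350)  # H100 SXM5, HBM3

print(f"A100 ~{a100:.1f} tok/s, H100 ~{h100:.1f} tok/s ({h100 / a100:.2f}x)")
```

The ratio comes out to the same 1.6x as the raw bandwidth figures, which is the point: for decode-bound inference, bandwidth is the spec that matters.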
The third difference is interconnect. H100 SXM5 uses NVLink 4.0 at 900 GB/s bidirectional bandwidth between GPUs in a server. A100 SXM4 uses NVLink 3.0 at 600 GB/s. For distributed training across 8 GPUs in a single DGX system, gradient synchronization goes 50% faster on H100.
| Difference | A100 | H100 | Impact |
|---|---|---|---|
| Transformer Engine | No | Yes | Up to 9x LLM training speedup |
| Memory bandwidth | 2,039 GB/s | 3,350 GB/s | 1.6x faster inference |
| FP8 precision | No | Yes | Halves memory usage during training |
| NVLink bandwidth | 600 GB/s | 900 GB/s | 50% faster multi-GPU gradient sync |
| Process node | 7nm | 4nm | More efficient per watt of compute |
Full Specs: H100 SXM5 vs A100 SXM4 80GB
The table below uses SXM variants for both, as these are the chips deployed in large training clusters. PCIe variants are available for both and have lower specs across the board.
| Specification | H100 SXM5 80GB | A100 SXM4 80GB |
|---|---|---|
| Architecture | Hopper (GH100) | Ampere (GA100) |
| Process node | TSMC 4nm | TSMC 7nm |
| Transistors | 80 billion | 54.2 billion |
| CUDA cores | 16,896 | 6,912 |
| Tensor cores | 528 (4th gen) | 432 (3rd gen) |
| Memory | 80GB HBM3 | 80GB HBM2e |
| Memory bandwidth | 3,350 GB/s | 2,039 GB/s |
| FP32 | 67.0 TFLOPS | 19.5 TFLOPS |
| TF32 Tensor Core | 494 TFLOPS | 156 TFLOPS |
| FP16 Tensor Core | 989 TFLOPS | 312 TFLOPS |
| BF16 Tensor Core | 989 TFLOPS | 312 TFLOPS |
| FP8 Tensor Core | 1,979 TFLOPS | Not supported |
| INT8 Tensor Core | 1,979 TOPS | 624 TOPS |
| FP64 | 34.0 TFLOPS | 9.7 TFLOPS |
| TDP | 700W | 400W |
| NVLink bandwidth | 900 GB/s | 600 GB/s |
| MIG instances | 7 | 7 |
| Transformer Engine | Yes | No |
| PCIe generation | PCIe 5.0 | PCIe 4.0 |
All TFLOPS figures are without sparsity. With structured sparsity (applicable when weights follow NVIDIA's 2:4 pattern, meaning two of every four consecutive weights are zero), all tensor core figures double. Most production transformer models do not meet that threshold in practice.
One specification worth noting: both GPUs carry 80GB of on-package HBM and both support Multi-Instance GPU (MIG), which splits a single chip into up to 7 isolated instances. For multi-tenant inference deployments running multiple smaller models simultaneously, the MIG capability is identical between the two generations. The H100 does not offer more MIG instances than the A100.
For the full H100 specification breakdown including all three variants, see our dedicated NVIDIA H100 specs and pricing article. For the A100, our NVIDIA A100 full specs article covers all variants including the 40GB and PCIe versions.
Performance Gap by Workload Type
The H100's advantage varies significantly by workload. The headline "up to 30x faster" applies only to large-model inference using FP8 precision at high batch sizes. Most real workloads see a more modest improvement.
| Workload | H100 Advantage | Notes |
|---|---|---|
| LLM training (13B+ params, FP8) | 6 to 9x | Transformer Engine + FP8, only available on H100 |
| LLM training (13B+ params, FP16) | 2.4 to 3x | Architectural improvement + memory bandwidth |
| LLM inference (large models) | 1.5 to 2x typical, up to 30x max | Memory bandwidth advantage + Transformer Engine for quantized models |
| Fine-tuning (7B–13B params) | 2 to 3x | Architectural improvement, Transformer Engine less impactful at smaller scale |
| HPC, FP64 scientific compute | 3.45x | H100: 34 TFLOPS vs A100: 9.7 TFLOPS |
| Standard computer vision (FP16) | 2 to 3x | Raw tensor core improvement |
| Batch inference, latency-insensitive | 1.5 to 2x | Meaningful but may not justify cost depending on throughput needs |
The 30x inference figure requires: FP8 quantization, large model (30B+ parameters), high concurrency workload, and effective use of the Transformer Engine. For a typical 7B model serving requests one at a time, the real throughput advantage is closer to 1.5x to 2x.
LLM inference throughput benchmarks show H100 delivering approximately 250 to 300 tokens per second versus around 130 tokens per second on the A100 in comparable setups. That is a 1.9x to 2.3x real-world advantage: meaningful for production APIs, but not the headline "30x" figure.
"H100 delivers up to 30x faster inference performance than A100 for large language models when using NVIDIA's FP8 Transformer Engine." (NVIDIA technical brief, 2023)
The gap is largest on the workload the H100 was specifically designed for: training very large transformer models using FP8 precision. For everything else, expect 2x to 3x, not 9x to 30x.
Cost Comparison and ROI: When H100 Pays for Itself
The H100 is roughly 2 to 3x more expensive to buy and 1.3 to 1.5x more expensive to rent per hour. Whether it is more expensive per completed job depends entirely on the workload.
Purchase Prices (Q1 2026)
| GPU | New Unit | Used Unit | DGX System (8 GPU) |
|---|---|---|---|
| A100 SXM4 80GB | $8,000–$15,000 | $4,000–$9,000 | $150,000–$200,000 |
| H100 SXM5 80GB | $25,000–$40,000 | $15,000–$22,000 | $400,000+ |
Cloud Rental Rates (Q1 2026)
| Provider | A100 Rate | H100 Rate |
|---|---|---|
| Lambda Labs | ~$1.29/hr | ~$2.49/hr |
| RunPod | ~$0.99–$1.49/hr | ~$2.39/hr |
| CoreWeave | ~$1.49/hr | ~$2.00–$3.50/hr |
| Market median | ~$1.75/hr | ~$2.29/hr |
The Number Most Guides Don't Show
Most comparisons stop at hourly rate. The actual cost per training job changes the picture entirely.
Take a training run that requires 1,000 GPU-hours on an A100. At a median of $1.75/hr, the cost is $1,750.
The same run on H100 with 2.4x speedup takes 417 GPU-hours. At $2.29/hr, the cost is $955. H100 saves $795 on a 1,000-GPU-hour A100 job despite costing 31% more per hour.
Now apply the 9x speedup for a large LLM pre-training run using FP8: 1,000 A100 GPU-hours becomes 111 H100 GPU-hours. At $2.29/hr, total cost is $254 versus $1,750 on A100. H100 costs 85% less for the same training outcome.
The crossover point: H100 becomes cheaper per completed job as soon as the speedup exceeds the cost-per-hour ratio. At 1.3x the hourly rate, any speedup above 1.3x means H100 is cheaper per job. For transformer model training, where the real speedup starts at 2.4x, H100 is always the more cost-efficient choice at cloud rental rates, regardless of the higher per-hour price.
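The arithmetic above can be packaged into a small calculator. The rates and speedups are the article's own figures, used illustratively:

```python
# Cost-per-completed-job comparison at cloud rental rates.
# A job is sized in A100 GPU-hours; the H100 finishes it speedup-times faster.

def job_cost(gpu_hours_a100: float, rate_a100: float,
             rate_h100: float, speedup: float) -> tuple[float, float]:
    """Return (A100 cost, H100 cost) in dollars for the same job."""
    cost_a100 = gpu_hours_a100 * rate_a100
    cost_h100 = (gpu_hours_a100 / speedup) * rate_h100
    return cost_a100, cost_h100

# 1,000 A100 GPU-hours at the article's median rates ($1.75 vs $2.29/hr).
for speedup in (1.3, 2.4, 9.0):
    a, h = job_cost(1000, 1.75, 2.29, speedup)
    print(f"speedup {speedup}x: A100 ${a:,.0f} vs H100 ${h:,.0f}")

# H100 wins per job whenever speedup exceeds the hourly rate ratio.
print(f"break-even speedup: {2.29 / 1.75:.2f}x")
```

At the 2.4x training speedup the H100 job comes in around $955 versus $1,750, matching the worked example above.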
When the A100 Is Still the Right Choice in 2026
The H100 is not always the correct answer. Six scenarios exist where sticking with A100 makes financial sense in 2026.
1. Models Under 13 Billion Parameters
The A100 handles full-parameter fine-tuning of 7B and 13B models comfortably in FP16. The 2x to 3x training speedup of the H100 at this scale rarely translates to business value that offsets the higher hardware cost. LoRA fine-tuning of 70B models is also feasible on A100 with gradient checkpointing.
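For a rough sense of why this is the practical ceiling, here is a memory rule-of-thumb sketch. The 16-bytes-per-parameter figure for mixed-precision Adam and the 1% adapter fraction for LoRA are common approximations, not measured values, and activations plus KV cache need extra headroom on top:

```python
# Rule of thumb for mixed-precision Adam fine-tuning memory:
# 2 bytes (bf16 weights) + 2 (grads) + 4 (fp32 master) + 8 (Adam m, v)
# = 16 bytes per parameter. LoRA freezes the base model (2 bytes/param)
# and keeps optimizer state only for the small adapter.

def full_finetune_gb(params_b: float) -> float:
    return params_b * 16  # 16 GB per billion parameters

def lora_gb(params_b: float, adapter_frac: float = 0.01) -> float:
    frozen = params_b * 2                    # bf16 base weights
    trained = params_b * adapter_frac * 16   # optimizer state for adapters only
    return frozen + trained

print(f"7B full fine-tune: ~{full_finetune_gb(7):.0f} GB (multi-GPU even on 80GB cards)")
print(f"70B LoRA:          ~{lora_gb(70):.0f} GB (a pair of A100 80GB, tighter with checkpointing)")
```

The point of the sketch: full fine-tuning at 7B to 13B already spans a few A100s, and LoRA is what makes 70B reachable on A100 hardware at all.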
2. Existing A100 Infrastructure Under Two Years Old
If your A100 cluster was deployed in 2024 or later, replacing it in 2026 means writing off hardware with 2 to 3 years of useful life remaining. The upgrade economics rarely close unless training speed is directly on the critical path of your business.
3. Batch Inference Without Latency Requirements
If your inference jobs run overnight, process documents in bulk, or serve an internal tool with no user-facing latency requirement, the A100's 1.5x to 2x slower throughput makes no practical difference. You pay twice for speed you do not use.
4. Mixed HPC and AI Workloads
Organizations running both AI training and scientific computing (molecular dynamics, climate simulation, FP64 workloads) may find A100's broader optimization more practical. The H100 is more specialized.
5. Budget-Constrained Research Groups
A100 cloud rental at $1.29/hr on Lambda Labs versus $2.49/hr for H100 means 1.9x more compute hours per dollar spent. For exploration, iteration on small models, and academic research that does not require frontier-scale training, A100 remains a capable and significantly cheaper option.
6. Cloud Spot Instances for Interruptible Jobs
A100 spot pricing on some platforms drops to $0.49 to $0.79/hr when demand is low. For fault-tolerant training jobs using periodic checkpointing, spot A100s at half the price of spot H100s can outperform H100 on total cost even with the slower training speed.
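A quick sketch of that tradeoff, where the 15% interruption overhead (re-done work between checkpoints plus restart time) and the $0.65/hr mid-range spot rate are assumptions for illustration:

```python
# Spot vs on-demand cost sketch for an interruptible, checkpointed training job.
# overhead_frac models wasted GPU-hours from preemptions and restarts.

def total_cost(gpu_hours: float, rate: float, overhead_frac: float = 0.0) -> float:
    return gpu_hours * (1 + overhead_frac) * rate

# 1,000 A100 GPU-hours of work; H100 assumed 2.4x faster (article figure).
a100_spot = total_cost(1000, 0.65, overhead_frac=0.15)  # assumed spot rate + redo overhead
h100_ondemand = total_cost(1000 / 2.4, 2.29)            # median on-demand rate

print(f"A100 spot: ${a100_spot:,.0f} vs H100 on-demand: ${h100_ondemand:,.0f}")
```

Under these assumptions the spot A100 run still comes in cheaper despite the 2.4x slower training, which is why fault-tolerant jobs change the calculus.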
For context on the hyperscale data centers that house these GPU clusters and how infrastructure choices affect total AI compute costs, see our overview of what hyperscalers are and how they operate.
H100 vs H200: Should You Go Even Newer?
The H200 uses the same GH100 chip as the H100 SXM5 but replaces 80GB HBM3 with 141GB HBM3e at 4,800 GB/s bandwidth. The compute specs (TFLOPS, CUDA cores, Transformer Engine) are identical to the H100 SXM5. The difference is entirely in memory.
| Spec | H100 SXM5 | H200 SXM5 | B200 SXM6 |
|---|---|---|---|
| Memory | 80GB HBM3 | 141GB HBM3e | 192GB HBM3e |
| Memory bandwidth | 3,350 GB/s | 4,800 GB/s | 8,000+ GB/s |
| FP16 compute | 989 TFLOPS | 989 TFLOPS | ~2,250 TFLOPS |
| TDP | 700W | 700W | ~1,000W |
| Est. price (Q1 2026) | $25,000–$40,000 | $35,000–$45,000 | $30,000–$40,000 |
For inference on models between 70B and 130B parameters, the H200 is meaningfully better than the H100. These models fit in a single H200's 141GB but require splitting across two H100s, which adds communication overhead. The H200 removes that overhead and increases bandwidth by 43%.
For training, where models are already split across multiple GPUs regardless of single-GPU memory, the H200's memory advantage is less decisive. The bandwidth improvement (43% higher) does help, but the compute ceiling is the same.
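A weights-only fit check makes the single-GPU boundary concrete. This ignores KV cache and activation memory, which need real headroom in production, so treat the results as optimistic; the precision choices below are illustrative:

```python
# Does a model's weight footprint fit on one GPU?
# Weights only: KV cache and activations require additional headroom.

def weight_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

def fits(params_b: float, mem_gb: float, bytes_per_param: float) -> bool:
    return weight_gb(params_b, bytes_per_param) <= mem_gb

# 70B in FP16 needs 140 GB: two H100 80GB cards, but a single H200 141GB.
print(fits(70, 80, 2.0), fits(70, 141, 2.0))     # H100 vs H200, FP16
# 130B needs FP8 quantization (130 GB) to squeeze into one H200.
print(fits(130, 141, 2.0), fits(130, 141, 1.0))  # H200: FP16 vs FP8
```

This is the arithmetic behind the 70B-to-130B sweet spot: those models cross the H100's 80GB line but stay under the H200's 141GB once precision is chosen appropriately.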
The B200 (Blackwell) is a full generational change, not a memory upgrade. At roughly 2.3x the FP16 compute of the H100 and 192GB HBM3e, it is the chip to target for new large-scale training infrastructure built in 2026. The limitation is tooling maturity: CUDA and framework support for Blackwell is still maturing, and deployment documentation is less complete than for the well-established H100 ecosystem.
For teams choosing between H100 and H200 for inference in 2026: if you regularly serve models above 70B parameters and your current setup requires multi-GPU for single inference requests, H200 is the better purchase. For models at 7B to 13B, H100 at lower cost is the practical choice. For a broader view of all AI accelerator options including AMD and Google alternatives, see our guide to AI accelerator card types.
Frequently Asked Questions
Is the H100 worth upgrading from A100?
It depends on your workload. For training large language models (13B+ parameters), the H100 is 2.4x to 9x faster than the A100, which often makes it cheaper per completed training job despite a higher hourly cloud rental rate. For inference on large models, H100 delivers 1.5x to 2x higher throughput. For models under 13B parameters, batch inference without latency requirements, or teams with A100 infrastructure purchased within the last two years, the upgrade is rarely financially justified. The hardware cost gap is roughly 2 to 3x in favor of A100.
How much faster is the H100 compared to the A100?
The answer depends on workload. For large LLM training using FP8 precision and the Transformer Engine, H100 is up to 9x faster than A100. For general AI training in mixed precision (FP16/BF16), the speedup is 2.4x. For LLM inference, real-world throughput runs 1.5x to 2x higher on H100 (approximately 250-300 tokens/second versus 130 tokens/second on A100). For FP64 scientific computing, H100 is 3.45x faster (34 TFLOPS versus 9.7 TFLOPS). The headline "30x faster" figures apply only to large-model inference under optimal FP8 conditions.
What is the price difference between H100 and A100?
As of Q1 2026, A100 SXM4 80GB units cost $8,000 to $15,000 new (or $4,000 to $9,000 used). H100 SXM5 80GB units cost $25,000 to $40,000 new (or $15,000 to $22,000 used). The H100 is roughly 2 to 3x more expensive to purchase. For cloud rental, A100 averages $1.29 to $1.75 per GPU-hour, while H100 averages $2.00 to $2.49 per GPU-hour (approximately 1.3 to 1.5x more). A DGX H100 system (8 GPUs) costs $400,000 or more versus $150,000 to $200,000 for a comparable DGX A100 system.
Can the A100 still train large language models in 2026?
Yes, but with practical limits. A100 handles full-parameter fine-tuning of models up to 13B parameters in FP16 and LoRA fine-tuning of 70B models with gradient checkpointing. For pre-training models above 70B parameters, the A100 is significantly slower than H100 and lacks FP8 support, which roughly doubles effective training throughput on H100. Many cloud providers still run A100 clusters for Llama-class model fine-tuning and batch inference in 2026. The A100 is not obsolete, but it is no longer competitive for frontier model training.
What is the memory bandwidth difference between H100 and A100?
H100 SXM5 delivers 3,350 GB/s of HBM3 memory bandwidth. A100 SXM4 delivers 2,039 GB/s of HBM2e bandwidth. The H100 is 1.6x faster on memory bandwidth. For transformer inference, which is memory-bandwidth-bound (the bottleneck is loading weights and KV caches from memory, not raw compute), this 1.6x bandwidth advantage directly translates to 1.6x faster inference throughput per GPU, independent of the Transformer Engine. This is why even non-quantized FP16 inference runs meaningfully faster on H100.
What are the H800 and A800 chips?
The H800 and A800 are China-specific variants of the H100 and A100, created by NVIDIA to comply with US export control rules. The H800 reduces NVLink bandwidth from 900 GB/s to 400 GB/s and lowers interconnect performance to fall below the Bureau of Industry and Security (BIS) thresholds for restricted AI chips. The A800 similarly derates the A100's chip-to-chip interconnect. The H800 was sold in China through 2023. Updated BIS rules in November 2023 restricted the H800 as well. As of 2026, NVIDIA has no H100 or A100-equivalent product available for the Chinese market.
Should I choose H100 or H200 for inference in 2026?
It depends on model size. The H200 uses the same GH100 chip as the H100 SXM5 but adds 141GB HBM3e memory at 4,800 GB/s bandwidth (versus 80GB at 3,350 GB/s on H100). For serving models between 70B and 130B parameters, H200 is significantly better because these models fit in a single H200 without multi-GPU sharding, whereas H100 requires splitting across two GPUs with added communication overhead. For serving models at 7B to 30B parameters, where a single H100 already holds the weights comfortably, H200's advantages are marginal and the H100 is the more cost-efficient choice. The B200 (Blackwell) is the option for teams building new large-scale training infrastructure from scratch in 2026.