NVIDIA H100 vs A100: Full Comparison and When to Upgrade

Key Takeaways
- H100 SXM5 delivers 989 TFLOPS FP16 (3.2x the A100's 312 TFLOPS) and 3,350 GB/s memory bandwidth (1.6x the A100's 2,039 GB/s). The Transformer Engine, which adds FP8 support unavailable on the A100, is the key differentiator for transformer model workloads.
- H100 SXM5 costs $25,000 to $40,000 new versus $8,000 to $15,000 for an A100 SXM4 80GB as of Q1 2026. Cloud rental averages $2.29/hr for H100 versus $1.49 to $2.00/hr for A100. For large LLM training, H100's 2.4x to 9x speed advantage makes it cheaper per completed job despite the higher hourly rate.
- The A100 remains the right choice for models under 13 billion parameters, batch inference without latency constraints, and teams with existing A100 infrastructure under two years old. For pre-training at 30B+ parameters or running low-latency inference APIs, H100 is clearly the better investment.
The H100 is 3.2x faster than the A100 in raw FP16 compute. For large language model training using FP8 precision, the gap reaches 9x. For inference, NVIDIA benchmarks show up to 30x higher throughput on large models. These numbers lead most people to assume the H100 is always the right choice.
It is not. For models under 13 billion parameters, the A100 handles training and inference adequately. For batch workloads where speed does not affect your timeline, paying twice the hardware cost for faster processing has no return. For teams with A100 infrastructure purchased within the last two years, the total cost of switching rarely makes financial sense.
The actual decision is more nuanced. What workloads are you running? At what model scale? And what does your cost-per-completed-job look like when you factor in actual training speed rather than just hourly rental rates?
This article covers the complete spec comparison, benchmark data across workload types, a cloud pricing analysis, and the honest case for staying on A100 in 2026.
H100 vs A100: Core Differences at a Glance
The A100 (Ampere architecture, TSMC 7nm) launched in May 2020. The H100 (Hopper architecture, TSMC 4nm) was announced in March 2022, with volume availability following into 2023. Roughly three years and one full process node separate them, which is why the performance gap is as large as it is.
The most important architectural addition in the H100 is the Transformer Engine: dedicated hardware that dynamically adjusts computation between FP8 and FP16 precision per operation during training. The A100 has no equivalent. This single feature accounts for much of the gap on transformer-model workloads. Without it, the H100's raw performance advantage over the A100 would be closer to 3x. With it, for large language model training specifically, the gap reaches 9x.
The second key difference is memory bandwidth. H100 SXM5 uses HBM3 at 3,350 GB/s. A100 SXM4 uses HBM2e at 2,039 GB/s. Transformer inference is memory-bandwidth-bound: the bottleneck is how fast weights and KV caches can be loaded from memory, not how many floating-point operations per second the chip can perform. A 1.6x bandwidth improvement translates directly to faster inference even for workloads that do not use the Transformer Engine.
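The bandwidth-bound intuition can be made concrete with a back-of-envelope sketch. The assumptions here (a 70B-parameter model in FP16, and one full read of the weights per generated token) are simplifications for illustration, not benchmark results:

```python
# Back-of-envelope estimate of memory-bandwidth-bound decode throughput.
# Assumes each generated token requires streaming all model weights from
# HBM once, which is a reasonable first-order model for single-stream decoding.

def decode_tokens_per_sec(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when streaming weights is the bottleneck."""
    weight_gb = params_billion * bytes_per_param  # GB read per generated token
    return bandwidth_gb_s / weight_gb

# 70B model in FP16 (2 bytes/param); illustrative numbers only.
a100 = decode_tokens_per_sec(70, 2.0, 2039)  # A100 SXM4, HBM2e
h100 = decode_tokens_per_sec(70, 2.0, 3350)  # H100 SXM5, HBM3

print(f"A100 ~{a100:.1f} tok/s, H100 ~{h100:.1f} tok/s ({h100 / a100:.2f}x)")
```

The ratio comes out to the same 1.6x as the raw bandwidth figures, which is the point: for decode-bound inference, bandwidth is the spec that matters.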
The third difference is interconnect. H100 SXM5 uses NVLink 4.0 at 900 GB/s bidirectional bandwidth between GPUs in a server. A100 SXM4 uses NVLink 3.0 at 600 GB/s. For distributed training across 8 GPUs in a single DGX system, gradient synchronization goes 50% faster on H100.
| Difference | A100 | H100 | Impact |
|---|---|---|---|
| Transformer Engine | No | Yes | Up to 9x LLM training speedup |
| Memory bandwidth | 2,039 GB/s | 3,350 GB/s | 1.6x faster inference |
| FP8 precision | No | Yes | Halves memory usage during training |
| NVLink bandwidth | 600 GB/s | 900 GB/s | 50% faster multi-GPU gradient sync |
| Process node | 7nm | 4nm | More efficient per watt of compute |
Full Specs: H100 SXM5 vs A100 SXM4 80GB
The table below uses SXM variants for both, as these are the chips deployed in large training clusters. PCIe variants are available for both and have lower specs across the board.
| Specification | H100 SXM5 80GB | A100 SXM4 80GB |
|---|---|---|
| Architecture | Hopper (GH100) | Ampere (GA100) |
| Process node | TSMC 4nm | TSMC 7nm |
| Transistors | 80 billion | 54.2 billion |
| CUDA cores | 16,896 | 6,912 |
| Tensor cores | 528 (4th gen) | 432 (3rd gen) |
| Memory | 80GB HBM3 | 80GB HBM2e |
| Memory bandwidth | 3,350 GB/s | 2,039 GB/s |
| FP32 | 67.0 TFLOPS | 19.5 TFLOPS |
| TF32 Tensor Core | 494 TFLOPS | 156 TFLOPS |
| FP16 Tensor Core | 989 TFLOPS | 312 TFLOPS |
| BF16 Tensor Core | 989 TFLOPS | 312 TFLOPS |
| FP8 Tensor Core | 1,979 TFLOPS | Not supported |
| INT8 Tensor Core | 1,979 TOPS | 624 TOPS |
| FP64 | 34.0 TFLOPS | 9.7 TFLOPS |
| TDP | 700W | 400W |
| NVLink bandwidth | 900 GB/s | 600 GB/s |
| MIG instances | 7 | 7 |
| Transformer Engine | Yes | No |
| PCIe generation | PCIe 5.0 | PCIe 4.0 |
All TFLOPS figures are without sparsity. With structured sparsity (applicable when weights follow NVIDIA's 2:4 pattern, meaning two of every four consecutive weights are zero), all tensor core figures double. Most production transformer models do not meet that threshold in practice.
One specification worth noting: both GPUs carry 80GB of on-package HBM and both support Multi-Instance GPU (MIG), which splits a single chip into up to 7 isolated instances. For multi-tenant inference deployments running multiple smaller models simultaneously, the MIG capability is identical between the two generations. The H100 does not offer more MIG instances than the A100.
For the full H100 specification breakdown including all three variants, see our dedicated NVIDIA H100 specs and pricing article. For the A100, our NVIDIA A100 full specs article covers all variants including the 40GB and PCIe versions.
Performance Gap by Workload Type
The H100's advantage varies significantly by workload. The headline "up to 30x faster" applies only to large-model inference using FP8 precision at high batch sizes. Most real workloads see a more modest improvement.
| Workload | H100 Advantage | Notes |
|---|---|---|
| LLM training (13B+ params, FP8) | 6 to 9x | Transformer Engine + FP8, only available on H100 |
| LLM training (13B+ params, FP16) | 2.4 to 3x | Architectural improvement + memory bandwidth |
| LLM inference (large models) | 1.5 to 2x typical, up to 30x max | Memory bandwidth advantage + Transformer Engine for quantized models |
| Fine-tuning (7B–13B params) | 2 to 3x | Architectural improvement, Transformer Engine less impactful at smaller scale |
| HPC, FP64 scientific compute | 3.45x | H100: 34 TFLOPS vs A100: 9.7 TFLOPS |
| Standard computer vision (FP16) | 2 to 3x | Raw tensor core improvement |
| Batch inference, latency-insensitive | 1.5 to 2x | Meaningful but may not justify cost depending on throughput needs |
The 30x inference figure requires: FP8 quantization, large model (30B+ parameters), high concurrency workload, and effective use of the Transformer Engine. For a typical 7B model serving requests one at a time, the real throughput advantage is closer to 1.5x to 2x.
LLM inference throughput benchmarks show H100 delivering approximately 250 to 300 tokens per second versus around 130 tokens per second on the A100 in comparable setups. That is a 1.9x to 2.3x real-world advantage: meaningful for production APIs, but not the headline "30x" figure.
"H100 delivers up to 30x faster inference performance than A100 for large language models when using NVIDIA's FP8 Transformer Engine." (NVIDIA technical brief, 2023)
The gap is largest on the workload the H100 was specifically designed for: training very large transformer models using FP8 precision. For everything else, expect 2x to 3x, not 9x to 30x.
Cost Comparison and ROI: When H100 Pays for Itself
The H100 is roughly 2 to 3x more expensive to buy and 1.3 to 1.5x more expensive to rent per hour. Whether it is more expensive per completed job depends entirely on the workload.
Purchase Prices (Q1 2026)
| GPU | New Unit | Used Unit | DGX System (8 GPU) |
|---|---|---|---|
| A100 SXM4 80GB | $8,000–$15,000 | $4,000–$9,000 | $150,000–$200,000 |
| H100 SXM5 80GB | $25,000–$40,000 | $15,000–$22,000 | $400,000+ |
Cloud Rental Rates (Q1 2026)
| Provider | A100 Rate | H100 Rate |
|---|---|---|
| Lambda Labs | ~$1.29/hr | ~$2.49/hr |
| RunPod | ~$0.99–$1.49/hr | ~$2.39/hr |
| CoreWeave | ~$1.49/hr | ~$2.00–$3.50/hr |
| Market median | ~$1.75/hr | ~$2.29/hr |
The Number Most Guides Don't Show
Most comparisons stop at hourly rate. The actual cost per training job changes the picture entirely.
Take a training run that requires 1,000 GPU-hours on an A100. At a median of $1.75/hr, the cost is $1,750.
The same run on H100 with 2.4x speedup takes 417 GPU-hours. At $2.29/hr, the cost is $955. H100 saves $795 on a 1,000-GPU-hour A100 job despite costing 31% more per hour.
Now apply the 9x speedup for a large LLM pre-training run using FP8: 1,000 A100 GPU-hours becomes 111 H100 GPU-hours. At $2.29/hr, total cost is $254 versus $1,750 on A100. H100 costs 85% less for the same training outcome.
The crossover point: H100 becomes cheaper per completed job as soon as the speedup exceeds the cost-per-hour ratio. At 1.3x the hourly rate, any speedup above 1.3x means H100 is cheaper per job. For transformer model training, where the real speedup starts at 2.4x, H100 is always the more cost-efficient choice at cloud rental rates, regardless of the higher per-hour price.
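The arithmetic above can be packaged into a small calculator. The rates and speedups are the article's own figures, used illustratively:

```python
# Cost-per-completed-job comparison at cloud rental rates.
# A job is sized in A100 GPU-hours; the H100 finishes it speedup-times faster.

def job_cost(gpu_hours_a100: float, rate_a100: float,
             rate_h100: float, speedup: float) -> tuple[float, float]:
    """Return (A100 cost, H100 cost) in dollars for the same job."""
    cost_a100 = gpu_hours_a100 * rate_a100
    cost_h100 = (gpu_hours_a100 / speedup) * rate_h100
    return cost_a100, cost_h100

# 1,000 A100 GPU-hours at the article's median rates ($1.75 vs $2.29/hr).
for speedup in (1.3, 2.4, 9.0):
    a, h = job_cost(1000, 1.75, 2.29, speedup)
    print(f"speedup {speedup}x: A100 ${a:,.0f} vs H100 ${h:,.0f}")

# H100 wins per job whenever speedup exceeds the hourly rate ratio.
print(f"break-even speedup: {2.29 / 1.75:.2f}x")
```

At the 2.4x training speedup the H100 job comes in around $955 versus $1,750, matching the worked example above.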
When the A100 Is Still the Right Choice in 2026
The H100 is not always the correct answer. Six scenarios exist where sticking with A100 makes financial sense in 2026.
1. Models Under 13 Billion Parameters
The A100 handles full-parameter fine-tuning of 7B and 13B models comfortably in FP16. The 2x to 3x training speedup of the H100 at this scale rarely translates to business value that offsets the higher hardware cost. LoRA fine-tuning of 70B models is also feasible on A100 with gradient checkpointing.
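For a rough sense of why this is the practical ceiling, here is a memory rule-of-thumb sketch. The 16-bytes-per-parameter figure for mixed-precision Adam and the 1% adapter fraction for LoRA are common approximations, not measured values, and activations plus KV cache need extra headroom on top:

```python
# Rule of thumb for mixed-precision Adam fine-tuning memory:
# 2 bytes (bf16 weights) + 2 (grads) + 4 (fp32 master) + 8 (Adam m, v)
# = 16 bytes per parameter. LoRA freezes the base model (2 bytes/param)
# and keeps optimizer state only for the small adapter.

def full_finetune_gb(params_b: float) -> float:
    return params_b * 16  # 16 GB per billion parameters

def lora_gb(params_b: float, adapter_frac: float = 0.01) -> float:
    frozen = params_b * 2                    # bf16 base weights
    trained = params_b * adapter_frac * 16   # optimizer state for adapters only
    return frozen + trained

print(f"7B full fine-tune: ~{full_finetune_gb(7):.0f} GB (multi-GPU even on 80GB cards)")
print(f"70B LoRA:          ~{lora_gb(70):.0f} GB (a pair of A100 80GB, tighter with checkpointing)")
```

The point of the sketch: full fine-tuning at 7B to 13B already spans a few A100s, and LoRA is what makes 70B reachable on A100 hardware at all.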
2. Existing A100 Infrastructure Under Two Years Old
If your A100 cluster was deployed in 2024 or later, replacing it in 2026 means writing off hardware with 2 to 3 years of useful life remaining. The upgrade economics rarely close unless training speed is directly on the critical path of your business.
3. Batch Inference Without Latency Requirements
If your inference jobs run overnight, process documents in bulk, or serve an internal tool with no user-facing latency requirement, the A100's 1.5x to 2x slower throughput makes no practical difference. You pay twice for speed you do not use.
4. Mixed HPC and AI Workloads
Organizations running both AI training and scientific computing (molecular dynamics, climate simulation, FP64 workloads) may find A100's broader optimization more practical. The H100 is more specialized.
5. Budget-Constrained Research Groups
A100 cloud rental at $1.29/hr on Lambda Labs versus $2.49/hr for H100 means 1.9x more compute hours per dollar spent. For exploration, iteration on small models, and academic research that does not require frontier-scale training, A100 remains a capable and significantly cheaper option.
6. Cloud Spot Instances for Interruptible Jobs
A100 spot pricing on some platforms drops to $0.49 to $0.79/hr when demand is low. For fault-tolerant training jobs using periodic checkpointing, spot A100s at half the price of spot H100s can outperform H100 on total cost even with the slower training speed.
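A quick sketch of that tradeoff, where the 15% interruption overhead (re-done work between checkpoints plus restart time) and the $0.65/hr mid-range spot rate are assumptions for illustration:

```python
# Spot vs on-demand cost sketch for an interruptible, checkpointed training job.
# overhead_frac models wasted GPU-hours from preemptions and restarts.

def total_cost(gpu_hours: float, rate: float, overhead_frac: float = 0.0) -> float:
    return gpu_hours * (1 + overhead_frac) * rate

# 1,000 A100 GPU-hours of work; H100 assumed 2.4x faster (article figure).
a100_spot = total_cost(1000, 0.65, overhead_frac=0.15)  # assumed spot rate + redo overhead
h100_ondemand = total_cost(1000 / 2.4, 2.29)            # median on-demand rate

print(f"A100 spot: ${a100_spot:,.0f} vs H100 on-demand: ${h100_ondemand:,.0f}")
```

Under these assumptions the spot A100 run still comes in cheaper despite the 2.4x slower training, which is why fault-tolerant jobs change the calculus.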
For context on the hyperscale data centers that house these GPU clusters and how infrastructure choices affect total AI compute costs, see our overview of what hyperscalers are and how they operate.
H100 vs H200: Should You Go Even Newer?
The H200 uses the same GH100 chip as the H100 SXM5 but replaces 80GB HBM3 with 141GB HBM3e at 4,800 GB/s bandwidth. The compute specs (TFLOPS, CUDA cores, Transformer Engine) are identical to the H100 SXM5. The difference is entirely in memory.
| Spec | H100 SXM5 | H200 SXM5 | B200 SXM6 |
|---|---|---|---|
| Memory | 80GB HBM3 | 141GB HBM3e | 192GB HBM3e |
| Memory bandwidth | 3,350 GB/s | 4,800 GB/s | 8,000+ GB/s |
| FP16 compute | 989 TFLOPS | 989 TFLOPS | ~2,250 TFLOPS |
| TDP | 700W | 700W | ~1,000W |
| Est. price (Q1 2026) | $25,000–$40,000 | $35,000–$45,000 | $30,000–$40,000 |
For inference on models between 70B and 130B parameters, the H200 is meaningfully better than the H100. These models fit in a single H200's 141GB but require splitting across two H100s, which adds communication overhead. The H200 removes that overhead and increases bandwidth by 43%.
For training, where models are already split across multiple GPUs regardless of single-GPU memory, the H200's memory advantage is less decisive. The bandwidth improvement (43% higher) does help, but the compute ceiling is the same.
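A weights-only fit check makes the single-GPU boundary concrete. This ignores KV cache and activation memory, which need real headroom in production, so treat the results as optimistic; the precision choices below are illustrative:

```python
# Does a model's weight footprint fit on one GPU?
# Weights only: KV cache and activations require additional headroom.

def weight_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

def fits(params_b: float, mem_gb: float, bytes_per_param: float) -> bool:
    return weight_gb(params_b, bytes_per_param) <= mem_gb

# 70B in FP16 needs 140 GB: two H100 80GB cards, but a single H200 141GB.
print(fits(70, 80, 2.0), fits(70, 141, 2.0))     # H100 vs H200, FP16
# 130B needs FP8 quantization (130 GB) to squeeze into one H200.
print(fits(130, 141, 2.0), fits(130, 141, 1.0))  # H200: FP16 vs FP8
```

This is the arithmetic behind the 70B-to-130B sweet spot: those models cross the H100's 80GB line but stay under the H200's 141GB once precision is chosen appropriately.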
The B200 (Blackwell) is a full generational change, not a memory upgrade. At roughly 2.3x the FP16 compute of the H100 and 192GB HBM3e, it is the chip to target for new large-scale training infrastructure built in 2026. The limitation is tooling maturity: CUDA and framework support for Blackwell is still maturing, and deployment documentation is less complete than for the well-established H100 ecosystem.
For teams choosing between H100 and H200 for inference in 2026: if you regularly serve models above 70B parameters and your current setup requires multi-GPU for single inference requests, H200 is the better purchase. For models at 7B to 13B, H100 at lower cost is the practical choice. For a broader view of all AI accelerator options including AMD and Google alternatives, see our guide to AI accelerator card types.
Frequently Asked Questions
Is the H100 worth upgrading from A100?
It depends on your workload. For training large language models (13B+ parameters), the H100 is 2.4x to 9x faster than the A100, which often makes it cheaper per completed training job despite a higher hourly cloud rental rate. For inference on large models, H100 delivers 1.5x to 2x higher throughput. For models under 13B parameters, batch inference without latency requirements, or teams with A100 infrastructure purchased within the last two years, the upgrade is rarely financially justified. The hardware cost gap is roughly 2 to 3x in favor of A100.
How much faster is the H100 compared to the A100?
The answer depends on workload. For large LLM training using FP8 precision and the Transformer Engine, H100 is up to 9x faster than A100. For general AI training in mixed precision (FP16/BF16), the speedup is 2.4x. For LLM inference, real-world throughput runs 1.5x to 2x higher on H100 (approximately 250-300 tokens/second versus 130 tokens/second on A100). For FP64 scientific computing, H100 is 3.45x faster (34 TFLOPS versus 9.7 TFLOPS). The headline "30x faster" figures apply only to large-model inference under optimal FP8 conditions.
What is the price difference between H100 and A100?
As of Q1 2026, A100 SXM4 80GB units cost $8,000 to $15,000 new (or $4,000 to $9,000 used). H100 SXM5 80GB units cost $25,000 to $40,000 new (or $15,000 to $22,000 used). The H100 is roughly 2 to 3x more expensive to purchase. For cloud rental, A100 averages $1.29 to $1.75 per GPU-hour, while H100 averages $2.00 to $2.49 per GPU-hour (approximately 1.3 to 1.5x more). A DGX H100 system (8 GPUs) costs $400,000 or more versus $150,000 to $200,000 for a comparable DGX A100 system.
Can the A100 still train large language models in 2026?
Yes, but with practical limits. A100 handles full-parameter fine-tuning of models up to 13B parameters in FP16 and LoRA fine-tuning of 70B models with gradient checkpointing. For pre-training models above 70B parameters, the A100 is significantly slower than H100 and lacks FP8 support, which roughly doubles effective training throughput on H100. Many cloud providers still run A100 clusters for Llama-class model fine-tuning and batch inference in 2026. The A100 is not obsolete, but it is no longer competitive for frontier model training.
What is the memory bandwidth difference between H100 and A100?
H100 SXM5 delivers 3,350 GB/s of HBM3 memory bandwidth. A100 SXM4 delivers 2,039 GB/s of HBM2e bandwidth. The H100 is 1.6x faster on memory bandwidth. For transformer inference, which is memory-bandwidth-bound (the bottleneck is loading weights and KV caches from memory, not raw compute), this 1.6x bandwidth advantage directly translates to 1.6x faster inference throughput per GPU, independent of the Transformer Engine. This is why even non-quantized FP16 inference runs meaningfully faster on H100.
What are the H800 and A800 chips?
The H800 and A800 are China-specific variants of the H100 and A100, created by NVIDIA to comply with US export control rules. The H800 reduces NVLink bandwidth from 900 GB/s to 400 GB/s and lowers interconnect performance to fall below the Bureau of Industry and Security (BIS) thresholds for restricted AI chips. The A800 similarly derates the A100's chip-to-chip interconnect. The H800 was sold in China through 2023. Updated BIS rules in November 2023 restricted the H800 as well. As of 2026, NVIDIA has no H100 or A100-equivalent product available for the Chinese market.
Should I choose H100 or H200 for inference in 2026?
It depends on model size. The H200 uses the same GH100 chip as the H100 SXM5 but adds 141GB HBM3e memory at 4,800 GB/s bandwidth (versus 80GB at 3,350 GB/s on H100). For serving models between 70B and 130B parameters, H200 is significantly better because these models fit in a single H200 without multi-GPU sharding, whereas H100 requires splitting across two GPUs with added communication overhead. For serving models at 7B to 30B parameters, where a single H100 already holds the weights comfortably, H200's advantages are marginal and the H100 is the more cost-efficient choice. The B200 (Blackwell) is the option for teams building new large-scale training infrastructure from scratch in 2026.