
How to Run Nemotron 3 Ultra on Ollama (2026 Cloud Guide)

Nemotron 3 Ultra, a 550B-parameter MoE model, is cloud-only on Ollama. Sign in, run `nemotron-3-ultra:cloud`, set up API access, and find local Nemotron picks.

By Amara | Updated 22 June 2026

[Image: Terminal showing the `ollama run nemotron-3-ultra:cloud` command output for NVIDIA Nemotron 3 Ultra on Ollama]

NVIDIA Nemotron 3 Ultra is the largest model NVIDIA has released as open weights, and Ollama only offers it as a cloud model. There is no local download tag, no matter how aggressively you quantize, because the model has 550 billion total parameters with 55 billion active per token. This guide covers signing in to Ollama Cloud, running `nemotron-3-ultra:cloud`, wiring it into your own scripts and agents through the API, and where the smaller Nemotron 3 Nano and Super models fit in if you want something that actually runs on your own hardware.

NVIDIA announced Nemotron 3 Ultra at Computex 2026 on June 1 and released it three days later. According to Ollama's official Nemotron 3 Ultra announcement, it uses a hybrid Mamba-Transformer architecture NVIDIA calls LatentMoE, mixing Mamba-2 layers, mixture-of-experts layers, and a smaller number of attention layers, plus dedicated layers for multi-token prediction that speed up generation through native speculative decoding. NVIDIA reports a score of 48 on the Artificial Analysis Intelligence Index, the highest of any US-developed open-weight model as of June 2026, and throughput above 400 output tokens per second.

By the end of this guide you will have Nemotron 3 Ultra running through the Ollama CLI, a working API setup with your own key, and a clear picture of which Nemotron-family model to reach for when you actually need something running locally instead of in NVIDIA's cloud.

Prerequisites

  • Ollama installed (version 0.12 or higher, required for cloud model support)
  • A free Ollama account (no payment information needed to use cloud models)
  • An internet connection (Nemotron 3 Ultra runs entirely on NVIDIA and Ollama infrastructure)
  • Basic terminal familiarity for running commands and reading output
  • Optional: an Ollama API key if you plan to call the model from your own scripts
  • Optional: a rented GPU if you want to try the local Nemotron 3 Nano or Super alternatives covered below

What Nemotron 3 Ultra Is and Why Ollama Runs It in the Cloud

NVIDIA Nemotron 3 Ultra has 550 billion total parameters with only 55 billion active per token through its mixture-of-experts design. NVIDIA pretrained it natively in NVFP4, its own 4-bit floating point format, across roughly 20 trillion tokens of crawled and synthetic code, math, science, and general knowledge data, rather than training at higher precision and shrinking it afterward.

The reason Nemotron 3 Ultra only shows up on Ollama as a cloud model comes down to size, the same constraint that keeps Kimi K2.6 cloud-only. Even with NVFP4's smaller memory footprint, a 550-billion-parameter model is well beyond what a workstation or consumer GPU setup can hold. Ollama's `:cloud` tag sends each prompt through Ollama's servers to NVIDIA's infrastructure instead of downloading any weights, so `ollama run nemotron-3-ultra:cloud` works on any machine with an internet connection.
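The size argument can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a figure from NVIDIA: it multiplies the parameter count by the bits stored per weight and ignores quantization scale factors, KV cache, and activations, so real footprints would be higher. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


ULTRA_PARAMS = 550e9  # total parameters, as stated in this guide

if __name__ == "__main__":
    # 4 bits per weight approximates NVFP4; 16 bits approximates bf16.
    print(f"4-bit weights:  {weight_memory_gb(ULTRA_PARAMS, 4):.0f} GB")
    print(f"16-bit weights: {weight_memory_gb(ULTRA_PARAMS, 16):.0f} GB")
```

Under these assumptions the 4-bit weights alone come to roughly 275 GB, against roughly 1,100 GB at 16 bits. Both are far above the memory of any single consumer GPU.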

NVIDIA built Nemotron 3 Ultra for long-running agent work: orchestrating other agents, coding agents, deep research, and enterprise workflows that span hundreds of tool calls. The Ollama cloud tag exposes a 256K-token context window, and NVIDIA's broader materials reference up to 1 million tokens of capability for the underlying model. For a broader look at what people actually use agent tooling like this for, see our guide to the best AI agents Reddit actually uses.

NVIDIA released two smaller siblings alongside Ultra that do run locally. Here is how the family breaks down on Ollama as of June 2026:

| Model | Parameters | Context | Local Ollama Tag | Smallest Local Size |
|---|---|---|---|---|
| Nemotron 3 Nano | 4B dense or 30B-A3B MoE | 256K | `nemotron-3-nano:4b` / `:30b` | 2.8 GB (4B, Q4) |
| Nemotron 3 Super | 120B total / 12B active MoE | 256K | `nemotron-3-super:120b` | 87 GB (Q4_K_M) |
| Nemotron 3 Ultra | 550B total / 55B active MoE | 256K cloud (up to 1M referenced) | Cloud only: `nemotron-3-ultra:cloud` | Not available locally |

If your goal is a Nemotron-family model running entirely on hardware you control, the Nano and Super options covered further down this guide fit that need. For Nemotron 3 Ultra on Ollama specifically, `nemotron-3-ultra:cloud` is the only option that exists, and that is by design rather than an oversight.

Set Up Ollama Cloud and Run Nemotron 3 Ultra

Running Nemotron 3 Ultra through Ollama takes three steps (install Ollama, sign in, and run the model), followed by a quick check that the model is registered. None of this downloads a 550-billion-parameter file. The whole setup takes under five minutes on any machine with a working internet connection.

Step 1: Install Ollama

# Linux and macOS, one-command installer
curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the installer from ollama.com/download, or use winget:

powershell
winget install Ollama.Ollama

Verify the installation:

ollama --version
# Expected output resembles: ollama version is 0.12.x (or higher)

Cloud models require a recent Ollama release. If your version is older, re-run the install command to update.
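If you want a script to perform the same version check, something like the following could work. It is a minimal sketch: the helper names are hypothetical, the minimum of 0.12 is taken from the prerequisites in this guide, and the parser simply grabs the first dotted number in whatever `ollama --version` prints.

```python
import re
import subprocess

MIN_VERSION = (0, 12)  # minimum stated in this guide's prerequisites


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number found in `text`."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def supports_cloud_models(version_output: str) -> bool:
    """Compare major.minor against the minimum; the patch level is ignored."""
    return parse_version(version_output)[:2] >= MIN_VERSION


if __name__ == "__main__":
    out = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    ).stdout
    print("cloud-ready" if supports_cloud_models(out) else "update Ollama")
```

Keeping the parsing separate from the subprocess call means the comparison can be tested without Ollama installed.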

Step 2: Sign In to Ollama Cloud

ollama signin

This prints a sign-in URL and opens your browser. Create a free account at ollama.com (or log in if you already have one) and approve the device. The terminal confirms with a message similar to:

Signing in to ollama.com...
Signed in as your-username
Note: `ollama signin` links your machine's key to your ollama.com account. As of June 2026, Nemotron 3 Ultra's `:cloud` tag is listed as "High Usage," meaning it draws more from your free cloud quota per request than smaller cloud models like Nemotron 3 Super. Check ollama.com/settings for current limits.

Step 3: Run Nemotron 3 Ultra From the Terminal

ollama run nemotron-3-ultra:cloud

Ollama fetches a small manifest (a few KB, not the model weights, since inference happens remotely), then drops you into a prompt:

pulling manifest
pulling 802a2bec181a... 100% ▕████████████████▏  3.4 KB
success
>>> Send a message (/? for help)

Type a prompt to test it:

>>> Plan a 5-step approach to refactor a Django monolith into separate services, and list the files you'd touch first.

The first response after signing in can take 10-30 seconds while Ollama establishes the cloud session. After that, responses stream back at normal speed, and NVIDIA reports throughput above 400 tokens per second on its own infrastructure.

Step 4: Verify the Model

ollama list

`nemotron-3-ultra:cloud` appears in the list at only a few KB rather than hundreds of gigabytes. That is expected: this is a cloud passthrough entry, not a downloaded model.
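The same check can be scripted against the local server's `/api/tags` endpoint, which returns the models the server knows about. The sketch below assumes the entry's `size` field reflects the few-KB manifest described here. The helper names and the 1 MB threshold are illustrative choices, not anything Ollama defines.

```python
import json
import urllib.request
from typing import Optional


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server which models it has registered."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


def find_model(tags: dict, name: str) -> Optional[dict]:
    """Return the entry whose name matches, or None if it is not registered."""
    for entry in tags.get("models", []):
        if entry.get("name") == name:
            return entry
    return None


def looks_like_cloud_stub(entry: dict, max_bytes: int = 1_000_000) -> bool:
    # Assumption: a manifest-only entry is far smaller than any real weights file.
    return entry.get("size", 0) < max_bytes


if __name__ == "__main__":
    entry = find_model(fetch_tags(), "nemotron-3-ultra:cloud")
    print("registered" if entry else "not registered yet")
```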

Warning: There is no local download tag for Nemotron 3 Ultra. `ollama pull nemotron-3-ultra` without `:cloud` returns an error. If you want a Nemotron-family model that runs on your own hardware, use `nemotron-3-nano` or `nemotron-3-super` instead, covered in the alternatives section below.

Switching Between Local and Cloud Models

`ollama run` works the same way for local and cloud models, so you can keep both on one machine. Pull a small local model alongside Nemotron 3 Ultra:

ollama pull nemotron-3-nano:4b

`ollama list` now shows both `nemotron-3-nano:4b`, a 2.8 GB local download, and `nemotron-3-ultra:cloud`, a manifest-only cloud entry. Switch between them by changing the model name in `ollama run` or in your application's API request. This is useful for keeping a fast local model for routine tasks and saving Nemotron 3 Ultra's larger context and agent orchestration for harder jobs.
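One way to put that split into practice is a small routing function in your application. This is only a sketch: the `pick_model` helper, the `needs_tools` flag, and the 8,000-character cutoff are hypothetical choices to adjust for your own workload.

```python
LOCAL_MODEL = "nemotron-3-nano:4b"
CLOUD_MODEL = "nemotron-3-ultra:cloud"


def pick_model(prompt: str, needs_tools: bool = False,
               long_prompt_chars: int = 8_000) -> str:
    """Send routine prompts to the local model and harder jobs to the cloud tag."""
    if needs_tools or len(prompt) > long_prompt_chars:
        return CLOUD_MODEL
    return LOCAL_MODEL


def chat_payload(prompt: str, needs_tools: bool = False) -> dict:
    """Build an /api/chat request body; only the model name differs per route."""
    return {
        "model": pick_model(prompt, needs_tools),
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
```

Because both models sit behind the same local endpoint, the rest of the request stays identical whichever route is chosen.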

Use Nemotron 3 Ultra in Your Own Scripts and Agents

Beyond the interactive `ollama run` session, Nemotron 3 Ultra is reachable through Ollama's REST API. Any tool that already talks to a local Ollama instance, or to the OpenAI API format, can use it with a one-line model name change.

Generate an API Key

Visit ollama.com/settings/keys while signed in, click "Create API key," and copy the value. Set it as an environment variable:

export OLLAMA_API_KEY=your_api_key_here
💡
Tip:An API key is only needed for direct requests to `https://ollama.com/api`. If your application talks to `localhost:11434`, the standard local Ollama server, `ollama signin` already authenticated that machine and no separate key is required.
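The local-versus-direct distinction can be captured in one helper that returns the right URL and headers. This is a sketch under the assumptions in this guide: the function name `request_target` is hypothetical, and the key is read from the `OLLAMA_API_KEY` variable exported above.

```python
import os


def request_target(use_local: bool, env=os.environ) -> tuple[str, dict]:
    """Return (chat URL, headers) for the local server or for ollama.com."""
    headers = {"Content-Type": "application/json"}
    if use_local:
        # The signed-in local server handles authentication itself.
        return "http://localhost:11434/api/chat", headers
    key = env.get("OLLAMA_API_KEY")
    if not key:
        raise RuntimeError(
            "OLLAMA_API_KEY is not set; export it before calling ollama.com directly"
        )
    headers["Authorization"] = f"Bearer {key}"
    return "https://ollama.com/api/chat", headers
```

Failing early on a missing key gives a clearer error than waiting for a 401 from the remote endpoint.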

Call Nemotron 3 Ultra From the Local Endpoint

curl http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-ultra:cloud",
  "messages": [
    { "role": "user", "content": "Summarize the tradeoff between NVFP4 and bf16 for inference cost." }
  ],
  "stream": false
}'

Expected output (truncated):

json
{
  "model": "nemotron-3-ultra:cloud",
  "message": {
    "role": "assistant",
    "content": "NVFP4 packs weights into 4-bit floating point, cutting memory and bandwidth versus bf16's 16-bit format, which lowers serving cost and raises throughput at a small accuracy tradeoff that NVIDIA's training recipe is built to absorb."
  },
  "done": true
}
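The same local call can be made from Python with only the standard library. The sketch below mirrors the curl request above. The helper names are illustrative, and it assumes the non-streaming response has the `message.content` shape shown in the expected output.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "nemotron-3-ultra:cloud",
                       url: str = "http://localhost:11434/api/chat") -> urllib.request.Request:
    """Prepare a non-streaming POST to the local chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a non-streaming /api/chat response."""
    return response["message"]["content"]


def ask(prompt: str) -> str:
    """Send the request and return the assistant's text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return extract_reply(json.load(resp))
```

Calling `ask("Summarize the tradeoff between NVFP4 and bf16 for inference cost.")` needs a signed-in local Ollama server. The two helper functions can be exercised without one.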

Call Nemotron 3 Ultra Directly From ollama.com

For a server or serverless function without a local Ollama install, send requests straight to ollama.com using your API key:

curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "nemotron-3-ultra:cloud",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
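The request above does not set `"stream": false`, and Ollama's chat endpoint streams by default, returning one JSON object per line until a final object with `"done": true`. A minimal sketch for stitching those chunks back into one reply follows. The function name `join_stream` is hypothetical, and the chunk shape is assumed to match the non-streaming response shown earlier.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)
```

An HTTP response object from `urllib.request.urlopen` can be iterated line by line, so it can be passed to `join_stream` directly.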

Python Example

python
import os

from ollama import Client

# Read the key exported earlier as OLLAMA_API_KEY
api_key = os.environ["OLLAMA_API_KEY"]

client = Client(host="https://ollama.com", headers={"Authorization": "Bearer " + api_key})

response = client.chat(
    model="nemotron-3-ultra:cloud",
    messages=[{"role": "user", "content": "Outline a test plan for a payments API migration."}],
)
print(response["message"]["content"])

Connecting Agent Tools

Ollama's own documentation calls out `ollama launch` as a shortcut for pointing existing agent tools at a cloud model, including Claude Code, the Codex App, and OpenClaw. Hermes Agent and OpenClaw both connect through Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1`:

yaml
model:
  default: nemotron-3-ultra:cloud
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 256000
Note: Set `context_length` to 256000 to match the cloud tag's window. NVIDIA's own materials reference up to 1 million tokens for Nemotron 3 Ultra, but the publicly documented `:cloud` tag on Ollama tops out at 256K as of June 2026, so configure agents against that figure to avoid requests that exceed what the endpoint accepts.
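An agent can also guard against oversized requests before sending them. The sketch below is a rough pre-flight check only. The four-characters-per-token ratio is a common heuristic rather than the model's real tokenizer, and the helper names and the 4,096-token reply reserve are assumptions for illustration.

```python
CONTEXT_LIMIT = 256_000   # tokens, matching the context_length in the config above
CHARS_PER_TOKEN = 4       # rough heuristic, NOT the model's real tokenizer


def estimate_tokens(messages: list[dict]) -> int:
    """Estimate the token count from total character length, rounding up."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return -(-total_chars // CHARS_PER_TOKEN)  # ceiling division


def fits_context(messages: list[dict], reserve_for_reply: int = 4_096) -> bool:
    """True if the estimated prompt plus a reply budget stays within the window."""
    return estimate_tokens(messages) + reserve_for_reply <= CONTEXT_LIMIT
```

When `fits_context` returns False, the agent would need to trim or summarize older messages before calling the endpoint.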

Troubleshooting

`ollama pull nemotron-3-ultra` (without `:cloud`) returns an error

Cause: There is no local download tag for Nemotron 3 Ultra, only the cloud passthrough.

Fix: Use `nemotron-3-ultra:cloud` for the full model, or pull `nemotron-3-nano` or `nemotron-3-super` if you specifically need a Nemotron model that runs on your own hardware.

`ollama run nemotron-3-ultra:cloud` returns "model not found"

Cause: The installed Ollama version predates this model release.

Fix: Update Ollama by re-running the install command (`curl -fsSL https://ollama.com/install.sh | sh` on Linux/macOS, or re-download on Windows), then retry.

"Unauthorized" error or repeated sign-in prompts

Cause: The machine is not signed in, or the session expired.

Fix: Run `ollama signin` again and complete the browser approval. Check ollama.com/settings/connections to confirm the device is listed as connected.

First response takes 10-30 seconds or longer

Cause: Cold start while Ollama establishes a session with NVIDIA's cloud infrastructure.

Fix: This is normal for the first request after signing in or after an idle period. Requests later in the same session stream back at normal speed.

`ollama list` shows nemotron-3-ultra:cloud at only a few KB

Cause: This is expected. Cloud models store only a manifest locally; the weights run on NVIDIA's and Ollama's servers.

Fix: No action needed. If you want a model that runs entirely on your own hardware, see the alternatives section below.

API requests to `https://ollama.com/api` return 401

Cause: Missing or invalid `OLLAMA_API_KEY`.

Fix: Generate a new key at ollama.com/settings/keys and re-export the environment variable: `export OLLAMA_API_KEY=your_new_key`.
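The fixes above can be folded into a script's error handling. The mapping below is a hypothetical sketch. Only the 401 on direct `https://ollama.com/api` requests and the "model not found" message come from this guide, and the 404 status and the fallback advice are assumptions.

```python
def suggest_fix(status_code: int, body: str = "") -> str:
    """Map an error response to the closest fix from the troubleshooting list."""
    text = body.lower()
    if status_code == 401 or "unauthorized" in text:
        return ("Run `ollama signin` again, or regenerate OLLAMA_API_KEY "
                "for direct ollama.com requests.")
    if status_code == 404 or "not found" in text:
        return "Check for the `:cloud` suffix, then update Ollama and retry."
    return "No matching entry; retry once in case of a cold start."
```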

Alternatives to Consider

| Tool | Type | Price | Best For |
|---|---|---|---|
| Nemotron 3 Nano (4B / 30B-A3B) | Local (Ollama) | Free | Hardware with 3-24 GB RAM that needs a Nemotron-family model running entirely offline. |
| Nemotron 3 Super (120B-A12B) | Local (Ollama) or VPS | Free | Single high-VRAM rigs (87 GB or more at Q4_K_M) that want Nemotron-level reasoning without a cloud dependency. |
| DeepSeek R1 | Local (Ollama) or VPS | Free | Reasoning-heavy tasks with visible chain-of-thought output, on hardware from 4 GB distilled up to 70B-class rigs. |
| Nemotron 3 Ultra via Ollama Cloud | Cloud (Ollama) | Free within Ollama Cloud limits | 256K context on the `:cloud` tag and long-running multi-tool agent orchestration without any local hardware requirement. |

Frequently Asked Questions

Can I run Nemotron 3 Ultra locally with Ollama?

Not in any practical sense. Nemotron 3 Ultra has 550 billion total parameters with 55 billion active per token, and NVIDIA has not published a local download tag for it on Ollama, only the `nemotron-3-ultra:cloud` passthrough.

`ollama run nemotron-3-ultra:cloud` works on any machine because the model itself never downloads, the same approach Ollama uses for other very large models like Kimi K2.6.

If you want a Nemotron-family model that runs entirely on your own hardware, see Nemotron 3 Nano or Nemotron 3 Super in the alternatives section, or check the best local LLM models guide for hardware-to-model matching.

Is Nemotron 3 Ultra free to use through Ollama?

Yes, within Ollama's free usage limits as of June 2026. `ollama signin` does not require payment information, and `ollama run nemotron-3-ultra:cloud` works immediately after signing in.

Ollama lists the cloud tag as "High Usage," meaning it draws more from your free quota per request than smaller cloud models. Check ollama.com/settings for the current numbers if you are running large batches of requests.

What is the difference between Nemotron 3 Ultra, Super, and Nano?

Nemotron 3 Nano is the smallest, available in a 4B dense version and a 30B-A3B mixture-of-experts version, both of which run locally on Ollama starting at 2.8 GB.

Nemotron 3 Super is a 120B total parameter MoE model with 12B active per token, downloadable locally starting at 87 GB at Q4_K_M quantization.

Nemotron 3 Ultra is the flagship: 550B total parameters with 55B active, available on Ollama only as `nemotron-3-ultra:cloud`. All three are built on NVIDIA's hybrid Mamba-Transformer design, scaled differently, with the 4B Nano variant being dense rather than mixture-of-experts.

How much RAM or VRAM do I need to run Nemotron 3 Ultra with Ollama?

For the cloud setup in this guide, effectively none beyond what Ollama itself needs to run, which is a few hundred MB. Inference happens on NVIDIA's and Ollama's servers, not your machine.

There is no way to run the full 550-billion-parameter Nemotron 3 Ultra locally as of June 2026. If your goal is a Nemotron model that fits in 3-24 GB, use Nemotron 3 Nano. For 87 GB or more, Nemotron 3 Super is the local option.
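Those thresholds translate into a simple lookup. The sketch below uses only the figures quoted in this guide (about 3 GB for the smallest Nano and 87 GB for Super at Q4_K_M). The guide gives no figure for machines between 24 GB and 87 GB, so the fallback to Nano there is an assumption, and the function name is hypothetical.

```python
def suggest_nemotron(memory_gb: float) -> str:
    """Pick a Nemotron 3 option from available RAM or VRAM in gigabytes."""
    if memory_gb >= 87:
        return "Nemotron 3 Super (local)"
    if memory_gb >= 3:
        # Assumption: anything between Nano's minimum and Super's minimum uses Nano.
        return "Nemotron 3 Nano (local)"
    return "Nemotron 3 Ultra (cloud)"
```

For example, `suggest_nemotron(16)` returns the Nano option, while a machine with less than 3 GB free is pointed back at the cloud tag.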

Is Nemotron 3 Ultra better than DeepSeek R1 or Kimi K2 for agentic coding?

On NVIDIA's published numbers, Nemotron 3 Ultra scores 48 on the Artificial Analysis Intelligence Index, the highest of any US-developed open-weight model as of June 2026. NVIDIA also reports that it leads on agent productivity, instruction following, and long-context benchmarks, and that it delivers over 400 output tokens per second.

DeepSeek R1 and Kimi K2.6 cover different tradeoffs. DeepSeek R1 can run entirely on your own hardware down to 4 GB at the distilled sizes, and Kimi K2.6 adds multimodal input and swarm orchestration across up to 300 sub-agents. If raw benchmark scores and throughput matter most, Nemotron 3 Ultra leads on NVIDIA's reported figures. If local control or a different agent feature set matters more, the others are worth comparing directly.

Can I use Nemotron 3 Ultra with Claude Code, OpenClaw, or Hermes Agent?

Yes. Ollama's own documentation specifically calls out `ollama launch` as a way to point Claude Code, the Codex App, and OpenClaw at Nemotron 3 Ultra. Hermes Agent and OpenClaw both connect to Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1`.

Point the agent's model configuration at `nemotron-3-ultra:cloud` and set `context_length` to 256000, and it runs through your existing Ollama setup with no other configuration changes needed.

Does `ollama pull nemotron-3-ultra:cloud` download the 550 billion parameter model?

No. For a `:cloud` tag, `ollama pull` (or the pull step that runs automatically before `ollama run`) downloads only a small manifest, typically a few KB.

The actual 550 billion parameter weights for Nemotron 3 Ultra stay on NVIDIA's and Ollama's infrastructure. Your machine sends prompts and receives responses over the network, which is why `ollama list` shows the model at only a few KB instead of hundreds of gigabytes.

Do I need a VPS or GPU to use Nemotron 3 Ultra with Ollama?

No. Inference runs on the cloud side regardless of where you run `ollama`, so a laptop or desktop with no GPU is enough for the `:cloud` tag covered in this guide.

The only reason to add hardware is if you want to try the local alternatives instead, such as Nemotron 3 Nano or Nemotron 3 Super, which need real VRAM to run well. For that, renting a GPU on Vast.ai by the hour costs less than buying a card, and you can shut it down once you are done.

Where can I try Nemotron 3 Ultra without installing Ollama?

NVIDIA hosts Nemotron 3 Ultra on its own NIM playground at build.nvidia.com, a web interface that needs no terminal or installation. It is also listed as a free model on OpenRouter for developers who want to test it through a unified API before committing to a setup.

The Ollama setup in this guide is for developers who want Nemotron 3 Ultra inside scripts, agents, or tools that already use Ollama's API, such as Hermes Agent or OpenClaw, rather than through a web playground.

What is NVFP4 and why does Nemotron 3 Ultra use it?

NVFP4 is NVIDIA's 4-bit floating point format. Instead of training in a higher-precision format and shrinking it afterward, NVIDIA pretrained Nemotron 3 Ultra natively in NVFP4 across roughly 20 trillion tokens, which keeps the weights small enough in memory to serve a 550-billion-parameter model at a reasonable cost.

NVIDIA reports this approach saves up to 30% on serving costs compared to other leading open models, and it is part of why Nemotron 3 Ultra reaches over 400 output tokens per second on cloud infrastructure.
