How to Run Kimi K2 on Ollama: Cloud Setup Guide (2026)
Kimi K2 isn't a local Ollama pull. It's a 1T-parameter cloud model. Learn to sign in, run kimi-k2.6:cloud, set up API access, and find local alternatives.

Kimi K2 is Moonshot AI's open-weight agentic model, and on Ollama it only runs as a cloud model. The original Kimi K2 (1 trillion total parameters, 32 billion active per token via mixture-of-experts) was never something a laptop or single-GPU workstation could load. Even at 1-bit quantization it needs roughly 250 GB of combined RAM and VRAM. Ollama's answer was a `:cloud` tag: `ollama run` still works the same way, but the prompt is sent to Moonshot's servers through Ollama's infrastructure instead of running on your machine.
There's a timing issue worth knowing about before you set anything up. The original cloud tag, `kimi-k2:1t-cloud`, along with the `kimi-k2-thinking` reasoning variant, retires on June 16, 2026. Moonshot's newest release, Kimi K2.6 (1.04 trillion parameters, 256K context, multimodal), takes over as `kimi-k2.6:cloud`. If you're searching for how to install Kimi K2 in Ollama right now, that's the tag and command you actually want.
This guide covers the full Ollama Cloud setup: installing Ollama, signing in, running `kimi-k2.6:cloud` from the terminal, and generating an API key for your own scripts and agents. If you already had something pointed at the old tags, there's a short section on what to change. And if your hardware can actually run a model on its own, the alternatives section near the end covers Qwen3, DeepSeek R1, and GLM 4.6.
Prerequisites
- Ollama 0.6.x or later, installed on Linux, macOS, or Windows (no GPU or high-RAM machine required)
- A free account at ollama.com for the `ollama signin` step
- A stable internet connection. Inference for cloud models runs on Ollama and Moonshot's servers, not your hardware
- Basic terminal familiarity for running `ollama run` and `curl` commands
- (Optional) An API key from ollama.com/settings/keys if you plan to call Kimi K2 from your own scripts or agents
- (Optional) A rented GPU if you want to run the local alternatives (Qwen3, DeepSeek R1, GLM 4.6) on more VRAM than your own machine has
What Kimi K2 Is and Why Ollama Runs It in the Cloud
Kimi K2 is an open-weight large language model from Moonshot AI, a Beijing-based AI lab. The original Kimi K2 (released in 2025) uses a mixture-of-experts architecture with 1 trillion total parameters and 32 billion active per token, pretrained on 15.5 trillion tokens. Moonshot released the weights under a Modified MIT License on GitHub at moonshotai/Kimi-K2. The model is built for agentic work: long multi-step coding sessions, tool use, and autonomous task execution rather than single-turn chat.
The reason Kimi K2 only shows up on Ollama as a cloud model comes down to size. Even at the most aggressive 1-bit quantization, it needs around 250 GB of combined RAM and VRAM to run at a usable speed, which rules out almost every desktop, laptop, and single-GPU workstation. Ollama's solution is the `:cloud` model tag. Instead of downloading weights, `ollama run kimi-k2.6:cloud` sends your prompt to Moonshot's infrastructure through Ollama's servers and streams the response back to your terminal, using the same commands and API as a local model.
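To see where a number like that comes from, here is a rough back-of-the-envelope sketch. It counts only raw weight storage (parameters times bits per weight) and ignores the KV cache and runtime overhead. The function name is made up for illustration. A pure 1-bit encoding of 1 trillion parameters works out to about 125 GB, so the roughly 250 GB figure is consistent with an effective average near 2 bits per weight. That reading is an assumption, not a published breakdown.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 1e12  # Kimi K2: 1 trillion total parameters (from this guide)

# Print the storage needed at a few common precisions.
for bits in (16, 8, 4, 2, 1):
    size = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>2}-bit: {size:,.0f} GB")
```

Only 32 billion parameters are active per token, but any expert can be selected for the next token, so the full set of weights still has to be available to the runtime. That is why the total parameter count, not the active count, drives the memory requirement.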
Moonshot has shipped several Kimi K2 versions since the original release. Here's what's live on Ollama as of June 2026:
| Model | Parameters | Context | Notes | Ollama Cloud Tag |
|---|---|---|---|---|
| Kimi K2 (original) | 1T total / 32B active | 256K | Retiring June 16, 2026 | `kimi-k2:1t-cloud` (deprecated) |
| Kimi K2 Thinking | 1T total / 32B active | 256K | Reasoning variant, retiring June 16, 2026 | `kimi-k2-thinking` (deprecated) |
| Kimi K2.5 | 1T total / 32B active | 256K | Earlier 2026 update | `kimi-k2.5:cloud` |
| Kimi K2.6 | 1.04T total | 256K | Current flagship, multimodal (text and image), swarm orchestration up to 300 sub-agents and 4,000 steps | `kimi-k2.6:cloud` |
For most people searching for "Kimi K2 Ollama" today, `kimi-k2.6:cloud` is the tag to use. It's the actively maintained version, has the longest support runway, and adds image input on top of everything the original K2 could already do. Moonshot also offers a hosted Kimi chat assistant if you want to try K2.6 through a web interface without touching a terminal.
Set Up Ollama Cloud and Run Kimi K2.6
Running Kimi K2 through Ollama takes three steps: install Ollama, sign in, and run the model. A fourth step confirms the result with `ollama list`. Nothing here downloads a multi-hundred-gigabyte file. The whole setup takes under five minutes on any machine with a working internet connection.
Step 1: Install Ollama
# Linux and macOS, one-command installer
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download the installer from ollama.com/download, or use winget:
winget install Ollama.Ollama
Verify the installation:
ollama --version
# Expected: ollama version 0.6.x or higher
Cloud models require Ollama 0.6.x or later. If `ollama --version` returns an older release, re-run the install command to update.
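If you'd rather check the version from a script (in a provisioning step, say), something like the following sketch could work. The helper names and the `MIN_VERSION` constant are hypothetical, and the minimum is simply the 0.6.x figure quoted in this guide. The regex grabs the first x.y.z number it finds, because the exact wording of the `ollama --version` output can differ between releases.

```python
import re
import subprocess

MIN_VERSION = (0, 6, 0)  # minimum quoted in this guide for cloud models


def parse_ollama_version(output):
    """Extract the first x.y.z version number from `ollama --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def supports_cloud_models(output):
    """True if the reported version is at least MIN_VERSION."""
    return parse_ollama_version(output) >= MIN_VERSION


def read_installed_version():
    """Run `ollama --version` and return its raw output (needs Ollama on PATH)."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout + result.stderr


# Offline demonstration with example strings; no Ollama install required.
print(supports_cloud_models("ollama version is 0.6.2"))
print(supports_cloud_models("ollama version is 0.5.13"))
```

On a machine with Ollama installed, `supports_cloud_models(read_installed_version())` performs the same check against the real binary.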
Step 2: Sign In to Ollama Cloud
ollama signin
This prints a sign-in URL and opens your browser. Create a free account at ollama.com (or log in if you already have one), then approve the device. The terminal confirms with a message similar to:
Signing in to ollama.com...
Signed in as your-username
Step 3: Run Kimi K2.6 from the Terminal
ollama run kimi-k2.6:cloud
Ollama fetches a small manifest (a few KB, not the model weights, since inference happens remotely), then drops you into a prompt:
pulling manifest
pulling 4f3b2a1c... 100% ▕████████████████▏ 3.1 KB
success
>>> Send a message (/? for help)
Type a prompt to test it:
>>> Write a Python function that returns the nth Fibonacci number using memoization.
The first response after signing in can take 10-30 seconds while Ollama establishes the cloud session. After that, responses stream back at normal speed.
Step 4: Verify the Model and Migrate from Older Tags
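ollama list
`kimi-k2.6:cloud` appears in the list with a size of a few KB rather than hundreds of gigabytes. That's expected: this is a cloud passthrough entry, not a downloaded model.
If an existing script, Modelfile, or agent config still points at `kimi-k2:1t-cloud` or `kimi-k2-thinking`, change the model name to `kimi-k2.6:cloud` (or `kimi-k2.5:cloud`) before the June 16, 2026 retirement date.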
Switching Between Local and Cloud Models
`ollama run` works the same way for local and cloud models, so you can keep both on one machine. Pull a small local model alongside Kimi K2.6:
ollama pull qwen3:8b
`ollama list` now shows both `qwen3:8b` (a multi-gigabyte local download) and `kimi-k2.6:cloud` (a manifest-only cloud entry). Switch between them by changing the model name in `ollama run` or in your application's API request. This is useful for keeping a fast local model for routine tasks and reserving Kimi K2.6's larger context and agentic capabilities for harder jobs.
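The same local-versus-cloud split can be read programmatically from the local server's `/api/tags` endpoint, which returns the list that `ollama list` prints. The sketch below separates entries by name; the helper names are hypothetical, the sample data is invented for illustration, and the "name ends in cloud" rule is an assumption that would miss a cloud tag without that suffix (such as `kimi-k2-thinking` as listed in the table above).

```python
import json
from urllib.request import urlopen


def fetch_models(host="http://localhost:11434"):
    """Return the model list from a running local Ollama server (GET /api/tags)."""
    with urlopen(f"{host}/api/tags", timeout=10) as resp:
        return json.load(resp)["models"]


def split_by_location(models):
    """Separate cloud passthrough entries from local downloads by tag suffix."""
    cloud = [m["name"] for m in models if m["name"].endswith("cloud")]
    local = [m["name"] for m in models if not m["name"].endswith("cloud")]
    return cloud, local


# Hypothetical sample shaped like an /api/tags response; sizes are illustrative.
sample = [
    {"name": "kimi-k2.6:cloud", "size": 3_100},
    {"name": "qwen3:8b", "size": 5_200_000_000},
]
cloud, local = split_by_location(sample)
print("cloud:", cloud)
print("local:", local)
```

With Ollama running, `split_by_location(fetch_models())` applies the same split to the real list.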
Use Kimi K2 in Your Own Scripts and Agents (API Access)
Beyond the interactive `ollama run` session, Kimi K2.6 is reachable through Ollama's REST API. Any tool that already talks to a local Ollama instance, or to the OpenAI API format, can use it with a one-line model name change.
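As a minimal sketch of that one-line change, the standard-library code below builds a request for Ollama's `/api/chat` endpoint where the model name is the only thing that differs between a local and a cloud call. The function names are hypothetical, and the 60-second default timeout is an assumption chosen to leave room for the 10-30 second cold start mentioned earlier.

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(model, prompt, host="http://localhost:11434"):
    """Build a non-streaming POST to Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urlopen(build_chat_request(model, prompt), timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]


# Same code path for both models; only the model name changes.
local_req = build_chat_request("qwen3:8b", "Hello")
cloud_req = build_chat_request("kimi-k2.6:cloud", "Hello")
print(json.loads(local_req.data)["model"], "->", json.loads(cloud_req.data)["model"])
```

Calling `chat("kimi-k2.6:cloud", "Hello")` sends the request, which requires a running, signed-in Ollama instance.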
Generate an API Key
Visit ollama.com/settings/keys while signed in, click "Create API key", and copy the value. Set it as an environment variable:
export OLLAMA_API_KEY=your_api_key_here
Call Kimi K2.6 from the Local Endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "kimi-k2.6:cloud",
  "messages": [
    { "role": "user", "content": "Summarize the difference between Kimi K2 and Kimi K2.6 in two sentences." }
  ],
  "stream": false
}'
Expected output (truncated):
{
  "model": "kimi-k2.6:cloud",
  "message": {
    "role": "assistant",
    "content": "Kimi K2.6 is Moonshot's 1.04T-parameter successor to Kimi K2, adding multimodal image input and swarm-style multi-agent orchestration on top of K2's original 256K-context agentic coding capabilities."
  },
  "done": true
}
Call Kimi K2.6 Directly from ollama.com
For a server or serverless function without a local Ollama install, send requests straight to ollama.com using your API key:
curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "kimi-k2.6:cloud",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
Python Example
# Requires the ollama package: pip install ollama
import os
from ollama import Client
# Read the key exported earlier as OLLAMA_API_KEY
api_key = os.environ["OLLAMA_API_KEY"]
client = Client(host="https://ollama.com", headers={"Authorization": "Bearer " + api_key})
response = client.chat(
    model="kimi-k2.6:cloud",
    messages=[{"role": "user", "content": "Outline a plan to refactor a Flask app into FastAPI."}],
)
print(response["message"]["content"])
OpenAI-Compatible Endpoint for Existing Agent Tools
Ollama exposes an OpenAI-compatible layer at `http://localhost:11434/v1`, the same endpoint used in the Hermes Agent and OpenClaw setups. Point that configuration at `kimi-k2.6:cloud` instead of a local model name, and the agent runs on Kimi K2.6's 256K context and swarm orchestration without any other config changes:
model:
  default: kimi-k2.6:cloud
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 256000
Troubleshooting
`ollama run kimi-k2.6:cloud` returns "model not found"
Cause: The installed Ollama version predates cloud model support
Fix: Update Ollama by re-running the install command (`curl -fsSL https://ollama.com/install.sh | sh` on Linux/macOS, or re-download on Windows), then retry. Cloud models require Ollama 0.6.x or later.
"unauthorized" error or repeated sign-in prompts
Cause: The machine is not signed in, or the session expired
Fix: Run `ollama signin` again and complete the browser approval. Check ollama.com/settings/connections to confirm the device is listed as connected.
Requests to `kimi-k2:1t-cloud` or `kimi-k2-thinking` start failing after June 16, 2026
Cause: Both tags are retired in favor of Kimi K2.6
Fix: Replace the model name with `kimi-k2.6:cloud` (or `kimi-k2.5:cloud`) in every config file, Modelfile, and script that references the old tags.
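For a project with many files, the replacement can be scripted. The sketch below is one hedged way to do it: the function names are hypothetical, it maps both retired tags to `kimi-k2.6:cloud`, and it also matches `kimi-k2-thinking:cloud` on the assumption that some configs spell the tag with an explicit suffix. Current tags such as `kimi-k2.5:cloud` are left alone.

```python
import re
from pathlib import Path

NEW_TAG = "kimi-k2.6:cloud"
# Retired tags from this guide; the optional ":cloud" suffix is an assumption.
OLD_TAG_PATTERN = re.compile(r"kimi-k2:1t-cloud|kimi-k2-thinking(?::cloud)?")


def migrate_tags(text):
    """Replace retired tags in text; return (new_text, replacement_count)."""
    return OLD_TAG_PATTERN.subn(NEW_TAG, text)


def migrate_file(path):
    """Rewrite one file in place if it references a retired tag; return the count."""
    original = Path(path).read_text(encoding="utf-8")
    updated, count = migrate_tags(original)
    if count:
        Path(path).write_text(updated, encoding="utf-8")
    return count


# Offline demonstration on an in-memory config snippet.
before = "model: kimi-k2:1t-cloud\nfallback: kimi-k2-thinking\n"
after, changed = migrate_tags(before)
print(changed, "replacement(s)")
print(after)
```

Because `migrate_file` overwrites files, run it on a copy or inside version control so the changes can be reviewed before they ship.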
First response takes 20-30 seconds or longer
Cause: Cold start while Ollama establishes a session with the cloud infrastructure
Fix: This is normal for the first request after signing in or after an idle period. Subsequent requests in the same session stream back at normal speed.
`ollama list` shows kimi-k2.6:cloud at only a few KB instead of a multi-gigabyte download
Cause: This is expected. Cloud models store only a manifest locally; the weights stay on Moonshot's and Ollama's servers
Fix: No action needed. If you want a model that runs entirely on your own hardware, see the alternatives section below.
API requests to `https://ollama.com/api` return 401
Cause: Missing or invalid `OLLAMA_API_KEY`
Fix: Generate a new key at ollama.com/settings/keys and re-export the environment variable: `export OLLAMA_API_KEY=your_new_key`.
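A small pre-flight check can catch a missing key before the request ever goes out. This is only a sketch: the function name is hypothetical, and it verifies that the variable is set and non-empty, not that ollama.com will accept the key.

```python
import os


def auth_headers(env=os.environ):
    """Build the Authorization header from OLLAMA_API_KEY, or fail early."""
    key = env.get("OLLAMA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OLLAMA_API_KEY is not set; create a key at ollama.com/settings/keys "
            "and export it before calling https://ollama.com/api"
        )
    return {"Authorization": f"Bearer {key}"}


# Offline demonstration with a placeholder value instead of a real key.
print(auth_headers({"OLLAMA_API_KEY": "your_new_key"}))
```

Passing a plain dictionary, as in the demonstration, makes the check easy to exercise without touching the real environment.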
Alternatives to Consider
| Tool | Type | Price | Best For |
|---|---|---|---|
| Qwen3 8B / Qwen3.5 27B | Local (Ollama) | Free | Hardware with 8-24 GB RAM that needs a model running entirely offline, with reliable tool-calling. |
| DeepSeek R1 | Local (Ollama) or VPS | Free | Reasoning-heavy tasks (math, coding, logic) with visible chain-of-thought output, on hardware from 4 GB (1.5B distilled) up to 64 GB or more (70B). |
| GLM 4.6 | Local (Ollama) or cloud | Free (local) / cloud pricing varies | Agentic coding workloads similar to Kimi K2, with smaller variants that fit on a single high-VRAM GPU instead of requiring a cloud connection. |
| Kimi K2.6 via Ollama Cloud | Cloud (Ollama) | Free within Ollama Cloud limits | 256K context, multimodal input, and swarm-style multi-agent orchestration without any local hardware requirement. |
Frequently Asked Questions
Can I run Kimi K2 locally with Ollama?
Not in any practical sense. Kimi K2 has 1 trillion total parameters with 32 billion active per token, and even at 1-bit quantization it needs around 250 GB of combined RAM and VRAM to run at a usable speed. That's beyond almost every desktop, laptop, and single-GPU workstation.
Ollama distributes Kimi K2 only as cloud tags such as `kimi-k2.6:cloud`, which pass prompts through to Moonshot's infrastructure. `ollama run kimi-k2.6:cloud` works on any machine because the model itself never downloads.
If you want a model that runs entirely on your own hardware, see Qwen3, DeepSeek R1, or GLM 4.6 in the alternatives section, or check the best local LLM models guide for hardware-to-model matching.
Is Kimi K2 free to use through Ollama?
Yes, within Ollama's free usage limits as of June 2026. `ollama signin` does not require payment information, and `ollama run kimi-k2.6:cloud` works immediately after signing in.
Ollama applies usage limits to cloud models to manage server load, and these limits change periodically. Check ollama.com/settings for the current numbers if you're running large batches of requests.
What happened to kimi-k2:1t-cloud and kimi-k2-thinking?
Both tags retire on June 16, 2026. `kimi-k2:1t-cloud` was the original Kimi K2 cloud tag (1T total parameters, 32B active), and `kimi-k2-thinking` was its reasoning-focused variant.
Moonshot's Kimi K2.6 (1.04T total parameters, 256K context, multimodal) replaces both as `kimi-k2.6:cloud`. Update any script, Modelfile, or agent config that references the old tags before the retirement date, or requests will start failing.
What is the difference between Kimi K2, K2.5, and K2.6?
Kimi K2 (the original 2025 release) is a 1 trillion parameter mixture-of-experts model with 32 billion active parameters per token and a 256K context window, focused on agentic coding and tool use.
Kimi K2.5 was an interim 2026 update with the same parameter profile and context window.
Kimi K2.6 (1.04T total parameters, 256K context) is the current flagship. It adds multimodal input (text and image) and swarm-style orchestration that can coordinate up to 300 sub-agents across 4,000 steps for long-horizon coding and automation tasks.
How much RAM do I need to run Kimi K2 with Ollama?
For the cloud setup in this guide, effectively none beyond what Ollama itself needs to run, which is a few hundred MB. Inference happens on Moonshot and Ollama's servers, not your machine.
For a true local install of the full Kimi K2 model, plan on roughly 250 GB of combined RAM and VRAM at 1-bit quantization, which is why almost nobody runs it locally. If your goal is a model that fits in 8-64 GB of RAM, see the alternatives section for Qwen3, DeepSeek R1, and GLM 4.6.
Is Kimi K2 better than DeepSeek R1 or Qwen3 for coding?
For long-horizon agentic coding (tasks that span many files and steps), Kimi K2.6 leads on Moonshot's published benchmarks, helped by its 256K context and swarm orchestration across up to 300 sub-agents.
DeepSeek R1 and Qwen3 run entirely on your own hardware and are strong for single-session coding and reasoning tasks. If privacy, offline use, or zero ongoing dependency on a cloud connection matters more than agentic scale, a local model is the better fit.
Can I use Kimi K2 with an agent like Hermes Agent or OpenClaw?
Yes. Both Hermes Agent and OpenClaw connect to Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1`. Point the agent's model configuration at `kimi-k2.6:cloud` and set `context_length` to 256000.
The agent then runs on Kimi K2.6's 256K context and agentic capabilities through your existing Ollama setup, with no other configuration changes needed.
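Under the hood, an agent configured this way sends OpenAI-style chat completion requests to that endpoint. The standard-library sketch below shows roughly what such a request and response look like; the function names are hypothetical, the sample response is invented, and real agent tools add their own fields (tools, temperature, and so on) on top.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible layer


def build_completion_request(prompt, model="kimi-k2.6:cloud", base_url=BASE_URL):
    """Build an OpenAI-style chat completion request for the local endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Pull the assistant text out of an OpenAI-style response body."""
    return body["choices"][0]["message"]["content"]


def complete(prompt, timeout=60):
    """Send the request to a running Ollama instance and return the reply text."""
    with urlopen(build_completion_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Hypothetical response body, trimmed to the fields used here.
sample_body = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(extract_reply(sample_body))
```

Note the different response shape compared with `/api/chat`: the reply sits under `choices[0].message.content` rather than `message.content`.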
Does `ollama pull kimi-k2.6:cloud` download the 1 trillion parameter model?
No. `ollama pull` (or the pull step that runs automatically before `ollama run`) for a `:cloud` tag downloads only a small manifest, typically a few KB.
The actual 1.04 trillion parameter weights for Kimi K2.6 stay on Moonshot and Ollama's infrastructure. Your machine sends prompts and receives responses over the network, which is why `ollama list` shows the model at only a few KB instead of hundreds of gigabytes.
Do I need a VPS or GPU to use Kimi K2 with Ollama?
No. Inference runs on the cloud side regardless of where you run `ollama`, so a laptop or desktop with no GPU is enough for the `:cloud` tags covered in this guide.
The only reason to add hardware is if you want to try the local alternatives instead, such as DeepSeek R1, GLM 4.6, or Qwen3.5 27B, which need real VRAM to run well. For occasional use, renting a GPU on Vast.ai by the hour can be cheaper than buying a card, and you can shut it down when you're done.
Where can I try Kimi K2.6 without installing Ollama?
Moonshot AI offers Kimi K2.6 through its own Kimi chat assistant, a web interface that requires no terminal or installation.
The Ollama setup in this guide is for developers who want Kimi K2.6 inside scripts, agents, or tools that already use Ollama's API, such as Hermes Agent or OpenClaw, rather than through a chat window.