Local AI | Beginner | 15 min to complete | 12 min read

How to Run Kimi K2 on Ollama: Cloud Setup Guide (2026)

Q: Can I run Kimi K2 locally with Ollama?

No. Kimi K2 has 1T total parameters and needs around 250 GB of combined RAM and VRAM even at 1-bit quantization. Ollama only offers it as `kimi-k2.6:cloud`, which runs on Moonshot's servers. For a model that runs on your own hardware, see Qwen3, DeepSeek R1, or GLM 4.6.

Q: Is Kimi K2 free to use through Ollama?

Yes, within Ollama's free usage limits as of June 2026. `ollama signin` requires no payment information, and `ollama run kimi-k2.6:cloud` works right after signing in. Limits on cloud model usage may apply and change over time.

Q: What happened to kimi-k2:1t-cloud and kimi-k2-thinking?

`kimi-k2:1t-cloud` and `kimi-k2-thinking` retire on June 16, 2026, replaced by `kimi-k2.6:cloud` (1.04T parameters, 256K context, multimodal). Update any script, Modelfile, or agent config that references the old tag names before that date.

Q: What is the difference between Kimi K2, K2.5, and K2.6?

Kimi K2 (1T total / 32B active, 256K context) was the 2025 original. K2.5 was an interim update with the same profile. K2.6 (1.04T parameters, 256K context) is current, adding multimodal text and image input plus swarm orchestration of up to 300 sub-agents across 4,000 steps.

Q: How much RAM do I need to run Kimi K2 with Ollama?

For the Ollama Cloud setup, almost no local RAM is needed beyond what Ollama itself uses (a few hundred MB), since inference runs remotely. A true local install of Kimi K2 needs roughly 250 GB of combined RAM and VRAM at 1-bit quantization.

Q: Is Kimi K2 better than DeepSeek R1 or Qwen3 for coding?

For long-horizon agentic coding across many files and steps, Kimi K2.6 leads on Moonshot's benchmarks due to its 256K context and swarm orchestration. DeepSeek R1 and Qwen3 run entirely locally and are strong for single-session coding and reasoning if privacy or offline use matters more.

Q: Can I use Kimi K2 with an agent like Hermes Agent or OpenClaw?

Yes. Hermes Agent and OpenClaw both connect to Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1`. Set the agent's model to `kimi-k2.6:cloud` and `context_length` to 256000, and it runs through your existing Ollama setup with no other changes.

Q: Does `ollama pull kimi-k2.6:cloud` download the 1 trillion parameter model?

No. `ollama pull` for a `:cloud` tag downloads only a small manifest (a few KB), not the model weights. Kimi K2.6's 1.04T parameters stay on Moonshot and Ollama's servers, and your machine just sends prompts and receives responses over the network.

Q: Do I need a VPS or GPU to use Kimi K2 with Ollama?

No. Kimi K2 inference runs in the cloud regardless of where `ollama` runs, so no GPU or VPS is needed for the `:cloud` tags. To run local alternatives like DeepSeek R1 or GLM 4.6 instead, rent a GPU on Vast.ai by the hour rather than buying hardware.

Q: What is Kimi K2.7 Code and how is it different from K2.6?

Kimi K2.7 Code is a coding-specialized agentic model built on top of K2.6, not a general replacement. It cuts thinking-token usage by about 30% and strengthens MCP tool calling for long-horizon coding tasks. K2.6 remains better for general chat and multimodal use. Run it with `ollama run kimi-k2.7-code:cloud`.

Kimi K2 isn't a local Ollama pull. Run kimi-k2.6:cloud for general use, kimi-k2.7-code:cloud for agentic coding, set up API access, and find local alternatives.

By Amara | Updated 1 July 2026

[Figure: Terminal showing the `ollama run kimi-k2.6:cloud` command output for Kimi K2 on Ollama Cloud]

Kimi K2 is Moonshot AI's open-weight agentic model, and on Ollama it only runs as a cloud model. The original Kimi K2 (1 trillion total parameters, 32 billion active per token via mixture-of-experts) was never something a laptop or single-GPU workstation could load. Even at 1-bit quantization it needs roughly 250 GB of combined RAM and VRAM. Ollama's answer was a `:cloud` tag: `ollama run` still works the same way, but the prompt is sent to Moonshot's servers through Ollama's infrastructure instead of running on your machine.

There's a timing issue worth knowing about before you set anything up. The original cloud tag, `kimi-k2:1t-cloud`, along with the `kimi-k2-thinking` reasoning variant, retires on June 16, 2026. Moonshot's newest release, Kimi K2.6 (1.04 trillion parameters, 256K context, multimodal), takes over as `kimi-k2.6:cloud`. If you're searching for how to install Kimi K2 in Ollama right now, that's the tag and command you actually want.

Moonshot also shipped `kimi-k2.7-code:cloud` shortly after K2.6. It is not a general-purpose successor. It is a coding-specialized agentic model built on top of K2.6 that cuts thinking-token usage by roughly 30% and strengthens multi-step tool calling and MCP-based workflows. If your use case is specifically agentic coding rather than general chat or multimodal tasks, that tag is worth running instead, and this guide covers both.

This guide covers the full Ollama Cloud setup: installing Ollama, signing in, running `kimi-k2.6:cloud` from the terminal, generating an API key for your own scripts and agents, and where `kimi-k2.7-code:cloud` fits in for coding-focused workloads. If you already had something pointed at the old tags, there's a short section on what to change. And if your hardware can actually run a model on its own, the alternatives section near the end covers Qwen3, DeepSeek R1, and GLM 4.6.

Prerequisites

Ollama 0.6.x or later, installed on Linux, macOS, or Windows (no GPU or high-RAM machine required)
A free account at ollama.com for the `ollama signin` step
A stable internet connection. Inference for cloud models runs on Ollama and Moonshot's servers, not your hardware
Basic terminal familiarity for running `ollama run` and `curl` commands
(Optional) An API key from ollama.com/settings/keys if you plan to call Kimi K2 from your own scripts or agents
(Optional) A rented GPU if you want to run the local alternatives (Qwen3, DeepSeek R1, GLM 4.6) on more VRAM than your own machine has

In This Guide

1. What Kimi K2 Is and Why Ollama Runs It in the Cloud
2. Set Up Ollama Cloud and Run Kimi K2.6
3. Use Kimi K2 in Your Own Scripts and Agents (API Access)
4. Troubleshooting
5. FAQ

What Kimi K2 Is and Why Ollama Runs It in the Cloud

Kimi K2 is an open-weight large language model from Moonshot AI, a Beijing-based AI lab. The original Kimi K2 (released in 2025) uses a mixture-of-experts architecture with 1 trillion total parameters and 32 billion active per token, pretrained on 15.5 trillion tokens. Moonshot released the weights under a Modified MIT License on GitHub at moonshotai/Kimi-K2. The model is built for agentic work: long multi-step coding sessions, tool use, and autonomous task execution rather than single-turn chat.

The reason Kimi K2 only shows up on Ollama as a cloud model comes down to size. Even at the most aggressive 1-bit quantization, it needs around 250 GB of combined RAM and VRAM to run at a usable speed, which rules out almost every desktop, laptop, and single-GPU workstation. Ollama's solution is the `:cloud` model tag. Instead of downloading weights, `ollama run kimi-k2.6:cloud` sends your prompt to Moonshot's infrastructure through Ollama's servers and streams the response back to your terminal, using the same commands and API as a local model.
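
To make the size argument concrete, here is a rough back-of-the-envelope sketch of how much storage the weights alone would take at different bit widths. It is naive arithmetic (parameters times bits per weight), not a measurement. The ~250 GB figure quoted above is higher than the 1-bit floor this prints. One plausible reason, and this is an assumption rather than something the sources here state, is that practical low-bit quantizations keep some tensors at higher precision and the runtime needs extra memory for the KV cache.

```python
def raw_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal gigabytes (no runtime overhead)."""
    total_bits = total_params * bits_per_weight
    return total_bits / 8 / 1e9


KIMI_K2_PARAMS = 1e12  # 1 trillion total parameters, as quoted in this guide

# Print the naive lower bound for a few common bit widths.
for bits in (16, 8, 4, 2, 1):
    print(f"{bits:>2}-bit: {raw_weight_gb(KIMI_K2_PARAMS, bits):,.0f} GB")
```

Even the theoretical floor is well beyond a typical workstation, which is the gap the `:cloud` tag is meant to bridge.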

Moonshot has shipped several Kimi K2 versions since the original release. Here's what's live on Ollama as of June 2026:

Model	Parameters	Context	Notes	Ollama Cloud Tag
Kimi K2 (original)	1T total / 32B active	256K	Retiring June 16, 2026	`kimi-k2:1t-cloud` (deprecated)
Kimi K2 Thinking	1T total / 32B active	256K	Reasoning variant, retiring June 16, 2026	`kimi-k2-thinking` (deprecated)
Kimi K2.5	1T total / 32B active	256K	Earlier 2026 update	`kimi-k2.5:cloud`
Kimi K2.6	1.04T total	256K	Current flagship, multimodal (text and image), swarm orchestration up to 300 sub-agents and 4,000 steps	`kimi-k2.6:cloud`
Kimi K2.7 Code	1.04T total	256K	Coding-specialized agentic model built on K2.6, image and video input via MoonViT, roughly 30% fewer thinking tokens, stronger MCP tool calling	`kimi-k2.7-code:cloud`

For most people searching for "Kimi K2 Ollama" today, `kimi-k2.6:cloud` is the tag to use for general chat, multimodal input, and broad agent orchestration. If your work is specifically long-horizon coding across 10+ languages and full production tech stacks, `kimi-k2.7-code:cloud` is the newer, narrower option built for exactly that, covered in its own section below. Moonshot also offers a hosted Kimi chat assistant if you want to try K2.6 through a web interface without touching a terminal.

Set Up Ollama Cloud and Run Kimi K2.6

Running Kimi K2 through Ollama takes three steps: install Ollama, sign in, and run the model. Nothing here downloads a multi-hundred-gigabyte file. The whole setup takes under five minutes on any machine with a working internet connection.

Step 1: Install Ollama

# Linux and macOS, one-command installer
curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the installer from ollama.com/download, or use winget:

powershell

winget install Ollama.Ollama

Verify the installation:

ollama --version
# Expected: ollama version 0.6.x or higher

Cloud models require Ollama 0.6.x or later. If `ollama --version` returns an older release, re-run the install command to update.
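
If you want to automate that version check in a setup script, a minimal sketch could look like the following. It assumes the version string contains a dotted `major.minor.patch` number somewhere in the output. The helper names and the `0.6.0` floor are illustrative and taken from this guide's stated minimum, not from Ollama's own tooling.

```python
import re
import subprocess

MIN_VERSION = (0, 6, 0)  # the "0.6.x or later" floor used in this guide


def parse_version(output: str) -> tuple:
    """Extract the first major.minor.patch number from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def supports_cloud_models(output: str) -> bool:
    """True when the reported version meets this guide's minimum."""
    return parse_version(output) >= MIN_VERSION


def installed_version_output() -> str:
    """Run `ollama --version` and return its text (requires Ollama on PATH)."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `supports_cloud_models(installed_version_output())` in a provisioning script lets you fail early with a clear message instead of hitting a "model not found" error later.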

Step 2: Sign In to Ollama

ollama signin

This prints a sign-in URL and opens your browser. Create a free account at ollama.com (or log in if you already have one), then approve the device. The terminal confirms with a message similar to:

Signing in to ollama.com...
Signed in as your-username

ℹ️

Note: `ollama signin` links your local machine's key to your ollama.com account. As of June 2026, no payment information is required for cloud models within Ollama's free usage limits. Check ollama.com/settings for current limits, since these are adjusted from time to time.

Step 3: Run Kimi K2.6 from the Terminal

ollama run kimi-k2.6:cloud

Ollama fetches a small manifest (a few KB, not the model weights, since inference happens remotely), then drops you into a prompt:

pulling manifest
pulling 4f3b2a1c... 100% ▕████████████████▏  3.1 KB
success
>>> Send a message (/? for help)

Type a prompt to test it:

>>> Write a Python function that returns the nth Fibonacci number using memoization.

The first response after signing in can take 10-30 seconds while Ollama establishes the cloud session. After that, responses stream back at normal speed.

Step 4: Verify the Model and Migrate from Older Tags

ollama list

`kimi-k2.6:cloud` appears in the list with a size of a few KB rather than hundreds of gigabytes. That's expected: this is a cloud passthrough entry, not a downloaded model.

⚠️

Warning: If you previously set up `kimi-k2:1t-cloud` or `kimi-k2-thinking`, both tags retire on June 16, 2026. Update any script, agent config, or Modelfile that references these names to `kimi-k2.6:cloud` (or `kimi-k2.5:cloud` if you specifically need that version) before that date, or requests will start failing.
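
For projects with more than a couple of config files, the rename can be scripted. The sketch below is one possible approach, not an official migration tool. The regular expression, the helper names, and the choice to also catch a `kimi-k2-thinking:<tag>` suffix are assumptions made for illustration. Review the diff before committing anything it rewrites.

```python
import re
from pathlib import Path

CURRENT_TAG = "kimi-k2.6:cloud"

# Matches `kimi-k2:1t-cloud`, and `kimi-k2-thinking` with or without a `:tag` suffix.
# It does not match `kimi-k2.5:cloud`, `kimi-k2.6:cloud`, or `kimi-k2.7-code:cloud`.
RETIRED_TAG = re.compile(r"kimi-k2(?::1t-cloud|-thinking(?::[\w.-]+)?)")


def migrate_tags(text: str, replacement: str = CURRENT_TAG) -> str:
    """Return `text` with every retired tag replaced by the current one."""
    return RETIRED_TAG.sub(replacement, text)


def migrate_file(path: Path) -> bool:
    """Rewrite one file in place; return True if anything changed."""
    original = path.read_text(encoding="utf-8")
    updated = migrate_tags(original)
    if updated != original:
        path.write_text(updated, encoding="utf-8")
    return updated != original
```

Pass `replacement="kimi-k2.5:cloud"` to `migrate_tags` if you need that version instead.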

Switching Between Local and Cloud Models

`ollama run` works the same way for local and cloud models, so you can keep both on one machine. Pull a small local model alongside Kimi K2.6:

ollama pull qwen3:8b

`ollama list` now shows both `qwen3:8b` (a multi-gigabyte local download) and `kimi-k2.6:cloud` (a manifest-only cloud entry). Switch between them by changing the model name in `ollama run` or in your application's API request. This is useful for keeping a fast local model for routine tasks and reserving Kimi K2.6's larger context and agentic capabilities for harder jobs.
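
One way to wire up that split in an application is a small routing function that picks the model name per request. This is only a sketch: the 8,000-token cutoff and the four-characters-per-token estimate are hypothetical placeholders to tune for your own hardware, and sending image prompts to the cloud model simply follows this guide's description of K2.6 as multimodal.

```python
LOCAL_MODEL = "qwen3:8b"
CLOUD_MODEL = "kimi-k2.6:cloud"
LOCAL_TOKEN_BUDGET = 8_000  # hypothetical cutoff; adjust for your machine


def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly four characters per token for English text."""
    return max(1, len(text) // 4)


def pick_model(prompt: str, has_images: bool = False) -> str:
    """Send large or image-bearing prompts to the cloud model, the rest locally."""
    if has_images or estimate_tokens(prompt) > LOCAL_TOKEN_BUDGET:
        return CLOUD_MODEL
    return LOCAL_MODEL


print(pick_model("Rename this variable."))
print(pick_model("x" * 40_000))
```

The returned string can be dropped straight into the `model` field of an `ollama run` call or an API request.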

Running Kimi K2.7 Code for Agentic Coding Workloads

If your work is long-horizon coding rather than general chat, run `kimi-k2.7-code:cloud` instead of K2.6:

ollama run kimi-k2.7-code:cloud

pulling manifest
pulling 7c1d4f9a... 100% ▕████████████████▏  3.2 KB
success
>>> Send a message (/? for help)

Test it with a multi-step coding prompt:

>>> Refactor this Express route to use async/await error handling, add input validation with Zod, and write three Jest tests for it.

Moonshot built K2.7 Code on top of K2.6 specifically for this kind of work. On Moonshot's published benchmarks it scores 62.0 on Kimi Code Bench v2, 76.0 on MCP Atlas, and 81.1 on MCP Mark Verified, and it uses roughly 30% fewer thinking tokens than K2.6 on comparable tasks, which translates to lower latency and cost for long agentic sessions that call tools repeatedly.

ℹ️

Note: K2.7 Code is not a strict replacement for K2.6. K2.6 remains the broader multimodal flagship for general chat, design-to-code, and swarm orchestration. K2.7 Code is the narrower option specifically for coding agents and MCP-based tool use. Pick the tag that matches the task rather than always defaulting to the newest one.

Use Kimi K2 in Your Own Scripts and Agents (API Access)

Beyond the interactive `ollama run` session, Kimi K2.6 is reachable through Ollama's REST API. Any tool that already talks to a local Ollama instance, or to the OpenAI API format, can use it with a one-line model name change.

Generate an API Key

Visit ollama.com/settings/keys while signed in, click "Create API key", and copy the value. Set it as an environment variable:

export OLLAMA_API_KEY=your_api_key_here

💡

Tip: An API key is only needed for direct requests to `https://ollama.com/api`. If your application talks to `localhost:11434` (the standard local Ollama server), `ollama signin` already authenticated that machine and no separate key is required.
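
That rule can be captured in a small helper so the rest of your code does not need to care which path it is on. The function name and the `direct` flag are illustrative, not part of any Ollama library. The host values and the `OLLAMA_API_KEY` variable are the ones used in this guide.

```python
import os

LOCAL_HOST = "http://localhost:11434"
CLOUD_HOST = "https://ollama.com"


def connection_settings(direct: bool, env=os.environ) -> tuple:
    """Return (host, headers) for the local server or for ollama.com."""
    if not direct:
        # The local server was already authenticated by `ollama signin`.
        return LOCAL_HOST, {}
    key = env.get("OLLAMA_API_KEY")
    if not key:
        raise RuntimeError(
            "OLLAMA_API_KEY is not set; create a key at ollama.com/settings/keys"
        )
    return CLOUD_HOST, {"Authorization": f"Bearer {key}"}
```

The returned pair is shaped so it can be passed on as the `host` and `headers` arguments of the `ollama` Python client.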

Call Kimi K2.6 from the Local Endpoint

curl http://localhost:11434/api/chat -d '{
  "model": "kimi-k2.6:cloud",
  "messages": [
    { "role": "user", "content": "Summarize the difference between Kimi K2 and Kimi K2.6 in two sentences." }
  ],
  "stream": false
}'

Expected output (truncated):

json

{
  "model": "kimi-k2.6:cloud",
  "message": {
    "role": "assistant",
    "content": "Kimi K2.6 is Moonshot's 1.04T-parameter successor to Kimi K2, adding multimodal image input and swarm-style multi-agent orchestration on top of K2's original 256K-context agentic coding capabilities."
  },
  "done": true
}
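
The request above sets `"stream": false`, so the reply arrives as a single JSON object. With streaming left on, the chat endpoint sends a sequence of newline-delimited JSON objects, each carrying a fragment of the message, ending with one where `done` is true. A minimal sketch of stitching those fragments together, assuming each line has the same `message.content` shape as the response shown above:

```python
import json


def assemble_stream(lines) -> str:
    """Join the content fragments from a newline-delimited JSON chat stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Hand-written sample lines for illustration, not captured model output.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(assemble_stream(sample))
```

In a real client you would pass the HTTP response's line iterator instead of the sample list, and print fragments as they arrive rather than joining them at the end.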

Call Kimi K2.6 Directly from ollama.com

For a server or serverless function without a local Ollama install, send requests straight to ollama.com using your API key:

curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "kimi-k2.6:cloud",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'

Python Example

python

import os

from ollama import Client

# Read the key exported earlier as OLLAMA_API_KEY rather than hard-coding it.
api_key = os.environ["OLLAMA_API_KEY"]

client = Client(host="https://ollama.com", headers={"Authorization": "Bearer " + api_key})

response = client.chat(
    model="kimi-k2.6:cloud",
    messages=[{"role": "user", "content": "Outline a plan to refactor a Flask app into FastAPI."}],
)
print(response["message"]["content"])

OpenAI-Compatible Endpoint for Existing Agent Tools

Ollama exposes an OpenAI-compatible layer at `http://localhost:11434/v1`, the same endpoint used in the Hermes Agent and OpenClaw setups. Point that configuration at `kimi-k2.6:cloud` instead of a local model name, and the agent runs on Kimi K2.6's 256K context and swarm orchestration without any other config changes:

yaml

model:
  default: kimi-k2.6:cloud
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 256000

ℹ️

Note: Set `context_length` to 256000 to match Kimi K2.6's full window. Agents that default to a smaller context, like Hermes's 64K minimum, work fine at this setting too, since 256K sits well above any minimum requirement.
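
If you are not using an agent framework, the same OpenAI-compatible layer can be called directly. The sketch below uses only the Python standard library and assumes the standard OpenAI chat-completions request and response shapes (`/chat/completions`, `choices[0].message.content`). The helper names are illustrative, and the generous timeout is a guess meant to cover the slow first response described earlier.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:11434/v1",
    model: str = "kimi-k2.6:cloud",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(request, timeout=120) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

With Ollama running and signed in, `send(build_chat_request("Hello"))` should return the model's reply as a string.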

Troubleshooting

`ollama run kimi-k2.6:cloud` returns "model not found"

Cause: The installed Ollama version predates cloud model support

Fix: Update Ollama by re-running the install command (`curl -fsSL https://ollama.com/install.sh | sh` on Linux/macOS, or re-download on Windows), then retry. Cloud models require Ollama 0.6.x or later.

"unauthorized" error or repeated sign-in prompts

Cause: The machine is not signed in, or the session expired

Fix: Run `ollama signin` again and complete the browser approval. Check ollama.com/settings/connections to confirm the device is listed as connected.

Requests to `kimi-k2:1t-cloud` or `kimi-k2-thinking` start failing after June 16, 2026

Cause: Both tags are retired in favor of Kimi K2.6

Fix: Replace the model name with `kimi-k2.6:cloud` (or `kimi-k2.5:cloud`) in every config file, Modelfile, and script that references the old tags.

First response takes 20-30 seconds or longer

Cause: Cold start while Ollama establishes a session with the cloud infrastructure

Fix: This is normal for the first request after signing in or after an idle period. Subsequent requests in the same session stream back at normal speed.

`ollama list` shows kimi-k2.6:cloud at only a few KB instead of a multi-gigabyte download

Cause: This is expected. Cloud models store only a manifest locally; the weights run on Moonshot and Ollama's servers

Fix: No action needed. If you want a model that runs entirely on your own hardware, see the alternatives section below.

API requests to `https://ollama.com/api` return 401

Cause: Missing or invalid `OLLAMA_API_KEY`

Fix: Generate a new key at ollama.com/settings/keys and re-export the environment variable: `export OLLAMA_API_KEY=your_new_key`.

Alternatives to Consider

Tool	Type	Price	Best For
Qwen3 8B / Qwen3.5 27B	Local (Ollama)	Free	Hardware with 8-24 GB RAM that needs a model running entirely offline, with reliable tool-calling.
DeepSeek R1	Local (Ollama) or VPS	Free	Reasoning-heavy tasks (math, coding, logic) with visible chain-of-thought output, on hardware from 4 GB (1.5B distilled) up to 64 GB or more (70B).
GLM 4.6	Local (Ollama) or cloud	Free (local) / cloud pricing varies	Agentic coding workloads similar to Kimi K2, with smaller variants that fit on a single high-VRAM GPU instead of requiring a cloud connection.
Kimi K2.6 via Ollama Cloud	Cloud (Ollama)	Free within Ollama Cloud limits	256K context, multimodal input, and swarm-style multi-agent orchestration without any local hardware requirement.
Kimi K2.7 Code via Ollama Cloud	Cloud (Ollama)	Free within Ollama Cloud limits	Long-horizon agentic coding and MCP tool use with roughly 30% fewer thinking tokens than K2.6 on comparable tasks.

Frequently Asked Questions

Can I run Kimi K2 locally with Ollama?

Not in any practical sense. Kimi K2 has 1 trillion total parameters with 32 billion active per token, and even at 1-bit quantization it needs around 250 GB of combined RAM and VRAM to run at a usable speed. That's beyond almost every desktop, laptop, and single-GPU workstation.

Ollama only distributes Kimi K2 as `kimi-k2.6:cloud`, a passthrough to Moonshot's infrastructure. `ollama run kimi-k2.6:cloud` works on any machine because the model itself never downloads.

If you want a model that runs entirely on your own hardware, see Qwen3, DeepSeek R1, or GLM 4.6 in the alternatives section, or check the best local LLM models guide for hardware-to-model matching.

Is Kimi K2 free to use through Ollama?

Yes, within Ollama's free usage limits as of June 2026. `ollama signin` does not require payment information, and `ollama run kimi-k2.6:cloud` works immediately after signing in.

Ollama applies usage limits to cloud models to manage server load, and these limits change periodically. Check ollama.com/settings for the current numbers if you're running large batches of requests.

What happened to kimi-k2:1t-cloud and kimi-k2-thinking?

Both tags retire on June 16, 2026. `kimi-k2:1t-cloud` was the original Kimi K2 cloud tag (1T total parameters, 32B active), and `kimi-k2-thinking` was its reasoning-focused variant.

Moonshot's Kimi K2.6 (1.04T total parameters, 256K context, multimodal) replaces both as `kimi-k2.6:cloud`. Update any script, Modelfile, or agent config that references the old tags before the retirement date, or requests will start failing.

What is the difference between Kimi K2, K2.5, and K2.6?

Kimi K2 (the original 2025 release) is a 1 trillion parameter mixture-of-experts model with 32 billion active parameters per token and a 256K context window, focused on agentic coding and tool use.

Kimi K2.5 was an interim 2026 update with the same parameter profile and context window.

Kimi K2.6 (1.04T total parameters, 256K context) is the current flagship. It adds multimodal input (text and image) and swarm-style orchestration that can coordinate up to 300 sub-agents across 4,000 steps for long-horizon coding and automation tasks.

How much RAM do I need to run Kimi K2 with Ollama?

For the cloud setup in this guide, effectively none beyond what Ollama itself needs to run, which is a few hundred MB. Inference happens on Moonshot and Ollama's servers, not your machine.

For a true local install of the full Kimi K2 model, plan on roughly 250 GB of combined RAM and VRAM at 1-bit quantization, which is why almost nobody runs it locally. If your goal is a model that fits in 8-64 GB of RAM, see the alternatives section for Qwen3, DeepSeek R1, and GLM 4.6.

Is Kimi K2 better than DeepSeek R1 or Qwen3 for coding?

For long-horizon agentic coding (tasks that span many files and steps), Kimi K2.6 leads on Moonshot's published benchmarks, helped by its 256K context and swarm orchestration across up to 300 sub-agents.

DeepSeek R1 and Qwen3 run entirely on your own hardware and are strong for single-session coding and reasoning tasks. If privacy, offline use, or zero ongoing dependency on a cloud connection matters more than agentic scale, a local model is the better fit.

Can I use Kimi K2 with an agent like Hermes Agent or OpenClaw?

Yes. Both Hermes Agent and OpenClaw connect to Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1`. Point the agent's model configuration at `kimi-k2.6:cloud` and set `context_length` to 256000.

The agent then runs on Kimi K2.6's 256K context and agentic capabilities through your existing Ollama setup, with no other configuration changes needed.

Does `ollama pull kimi-k2.6:cloud` download the 1 trillion parameter model?

No. `ollama pull` (or the pull step that runs automatically before `ollama run`) for a `:cloud` tag downloads only a small manifest, typically a few KB.

The actual 1.04 trillion parameter weights for Kimi K2.6 stay on Moonshot and Ollama's infrastructure. Your machine sends prompts and receives responses over the network, which is why `ollama list` shows the model at only a few KB instead of hundreds of gigabytes.

Do I need a VPS or GPU to use Kimi K2 with Ollama?

No. Inference runs on the cloud side regardless of where you run `ollama`, so a laptop or desktop with no GPU is enough for the `:cloud` tags covered in this guide.

The only reason to add hardware is if you want to try the local alternatives instead, such as DeepSeek R1, GLM 4.6, or Qwen3.5 27B, which need real VRAM to run well. For that, renting a GPU on Vast.ai by the hour is cheaper than buying a card, and you can shut it down when you're done.

What is Kimi K2.7 Code and how is it different from K2.6?

Kimi K2.7 Code is Moonshot's coding-specialized agentic model, built on top of K2.6 rather than replacing it. It targets long-horizon coding tasks across 10+ programming languages and full production tech stacks, and supports image and video input through Moonshot's MoonViT encoder.

Compared to K2.6, K2.7 Code reduces thinking-token usage by roughly 30% on comparable tasks and strengthens multi-step tool calling and MCP-based workflows. K2.6 stays the better choice for general chat, design-to-code, and swarm orchestration. K2.7 Code is the option to reach for specifically when the task is agentic coding.

Run it with `ollama run kimi-k2.7-code:cloud`, covered earlier in this guide.

Where can I try Kimi K2.6 without installing Ollama?

Moonshot AI offers Kimi K2.6 through its own Kimi chat assistant, a web interface that requires no terminal or installation.

The Ollama setup in this guide is for developers who want Kimi K2.6 inside scripts, agents, or tools that already use Ollama's API, such as Hermes Agent or OpenClaw, rather than through a chat window.