Local AI | Beginner | 15 min to complete | 13 min read

How to Run GLM 5.2 on Ollama: Cloud Setup Guide (2026)

GLM 5.2 isn't a local Ollama pull. It's Z.ai's 744B-parameter cloud model. Learn to run glm-5.2:cloud, set up API access, and find true local alternatives.

By Amara | Updated 20 June 2026

[Image: Terminal output of `ollama run glm-5.2:cloud` connecting to the GLM 5.2 cloud model on Ollama]

GLM 5.2 is the flagship open-weight model from Z.ai (formerly Zhipu AI), and as with Kimi K2, Ollama only runs it in the cloud. It's a mixture-of-experts design with 744 billion total parameters, roughly 40 billion of them active per token across 384 experts, paired with a 1 million token context window that Z.ai describes as the first to be usable at that size rather than just technically supported. Even at the most aggressive 1-bit quantization, the full model needs around 223 GB of combined RAM and VRAM, which rules out almost every desktop and single-GPU workstation. Ollama's answer, the same one it used for Kimi K2, is a `:cloud` tag: `ollama run glm-5.2:cloud` sends your prompt to Z.ai's infrastructure through Ollama's servers instead of loading anything locally.

There's a rollout detail worth knowing before you set anything up. Z.ai released GLM 5.2 on June 13, 2026, but only through its own GLM Coding Plan at first. The open weights followed on Hugging Face under `zai-org/GLM-5.2` and an MIT license within days, and Ollama added the `glm-5.2:cloud` tag the same week. As of June 2026, Ollama's official library still doesn't host a quantized local pull tag for GLM 5.2, only the cloud one, so `ollama pull glm-5.2` on its own returns a manifest error instead of a multi-hundred-gigabyte download.

This guide covers the full Ollama Cloud setup: installing Ollama, signing in, running `glm-5.2:cloud` from the terminal, and generating an API key for your own scripts and agents. If your hardware can genuinely handle a model this size, there's a section on running GLM 5.2 locally through Unsloth and llama.cpp instead of Ollama. And if 744 billion parameters is more than you need, the alternatives section near the end covers GLM 4.6, Qwen3.5, DeepSeek R1, and Kimi K2.6.

Prerequisites

Ollama 0.6.x or later, installed on Linux, macOS, or Windows (no GPU or high-RAM machine required for the cloud setup)
A free account at ollama.com for the `ollama signin` step
A stable internet connection, since inference for cloud models runs on Ollama's and Z.ai's servers, not your hardware
Basic terminal familiarity for running `ollama run` and `curl` commands
(Optional) An API key from ollama.com/settings/keys if you plan to call GLM 5.2 from your own scripts or agents
(Optional) 256 GB or more of combined RAM/VRAM and a 24 GB+ GPU if you want to attempt the true local install via Unsloth and llama.cpp covered later in this guide


In This Guide

1. What GLM 5.2 Is and Why Ollama Runs It in the Cloud
2. Set Up Ollama Cloud and Run GLM 5.2
3. Running GLM 5.2 Locally Instead of in the Cloud
4. Use GLM 5.2 in Your Own Scripts and Agents (API Access)
5. Troubleshooting
6. FAQ

What GLM 5.2 Is and Why Ollama Runs It in the Cloud

GLM 5.2 is an open-weight large language model from Z.ai, the renamed consumer and enterprise arm of Zhipu AI, a Beijing-based lab. The model uses a mixture-of-experts architecture with 744 billion total parameters and roughly 40 billion active per token across 384 experts, built on a DeepSeek Sparse Attention mechanism and pretrained on 28.5 trillion tokens. Z.ai released the weights on Hugging Face at `zai-org/GLM-5.2` under an MIT license, more permissive than the custom terms some earlier GLM releases shipped under. The model targets long-horizon coding: large refactors, multi-file changes, and agentic tasks that run for hours instead of a single prompt-response exchange.

The reason GLM 5.2 only shows up on Ollama as a cloud model comes down to the same math that applies to Kimi K2. At 1-bit quantization, the smallest practical version still needs around 223 GB of combined RAM and VRAM, and the full unquantized model sits closer to 1.5 TB. Ollama's `:cloud` tag sidesteps that entirely. `ollama run glm-5.2:cloud` sends your prompt to Z.ai's infrastructure through Ollama's servers and streams the response back, using the same command syntax as a model that actually lives on your disk.
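The arithmetic behind those figures is easy to check. The following is a minimal sketch, assuming the weights dominate memory use and ignoring KV cache and runtime overhead; the helper names `weight_footprint_gb` and `effective_bits` are illustrative and not part of any library.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


def effective_bits(total_params: float, footprint_gb: float) -> float:
    """Average bits per weight implied by a quoted memory footprint."""
    return footprint_gb * 1e9 * 8 / total_params


GLM_52_PARAMS = 744e9  # total parameters, from the figures above

# 16-bit weights: 744e9 * 2 bytes, in line with the "closer to 1.5 TB" figure
print(round(weight_footprint_gb(GLM_52_PARAMS, 16)))  # 1488

# The ~223 GB "1-bit" figure implies more than 1 bit per weight on average
print(round(effective_bits(GLM_52_PARAMS, 223), 1))  # 2.4
```

An average above 1 bit per weight would be consistent with a dynamic quant that keeps some layers at higher precision, though the quoted footprint also includes working memory, so treat the result as a rough estimate.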

Z.ai has shipped several GLM generations in 2026. Here's what's live on Ollama as of June 2026:

Model	Parameters	Context	Notes	Ollama Tag
GLM 4.6	357B total / 32B active	200K	Smaller footprint, fits a single high-VRAM GPU at low quantization	`glm-4.6` (local pull available)
GLM 5	Not fully disclosed	128K	Earlier 2026 release	Cloud only
GLM 5.1	756B total	198K	Predecessor to 5.2, smaller context window	`glm-5.1:cloud`
GLM 5.2	744B total / ~40B active	1M (976K usable)	Current flagship, DeepSeek Sparse Attention, effort-level control	`glm-5.2:cloud`

For most people searching for "GLM 5.2 Ollama" today, `glm-5.2:cloud` is the only option in Ollama's official library. It also has the largest context window in the GLM line by a wide margin. GLM 5.2 includes an effort-level control with High and Max settings, trading latency and compute cost for deeper reasoning on harder coding problems, similar in spirit to the reasoning-effort parameters other frontier APIs expose.

Set Up Ollama Cloud and Run GLM 5.2

Running GLM 5.2 through Ollama takes three steps (install Ollama, sign in, and run the model), followed by a quick check that the model registered. Nothing here downloads a multi-hundred-gigabyte file. The whole setup takes under five minutes on any machine with a working internet connection.

Step 1: Install Ollama

# Linux and macOS, one-command installer
curl -fsSL https://ollama.com/install.sh | sh

On Windows, download the installer from ollama.com/download, or use winget:

powershell

winget install Ollama.Ollama

Verify the installation:

ollama --version
# Expected: ollama version 0.6.x or higher
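If you script the setup, the version check can be automated. This is a rough sketch that assumes the CLI prints a dotted `x.y.z` version somewhere in its output; the exact wording of that line may differ between releases, and the function names are hypothetical.

```python
import re
import subprocess

MIN_VERSION = (0, 6, 0)  # minimum version stated in this guide


def parse_version(text: str) -> tuple[int, int, int]:
    """Extract the first x.y.z version number from `ollama --version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def supports_cloud_models(version_output: str) -> bool:
    """True when the reported version meets the guide's minimum."""
    return parse_version(version_output) >= MIN_VERSION


def installed_version_output() -> str:
    """Run the real CLI; raises FileNotFoundError if ollama is not on PATH."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Usage, on a machine with Ollama installed:
#   supports_cloud_models(installed_version_output())
```

Comparing versions as integer tuples avoids the string-comparison trap where "0.10.0" sorts before "0.6.0".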

Step 2: Sign In to Ollama

ollama signin

This prints a sign-in URL and opens your browser. Create a free account at ollama.com, or log in if you already have one, then approve the device. The terminal confirms with a message similar to:

Signing in to ollama.com...
Signed in as your-username

ℹ️

Note: `ollama signin` links your local machine's key to your ollama.com account. As of June 2026, no payment information is required for cloud models within Ollama's free usage limits. Check ollama.com/settings for current limits, since these change from time to time.

Step 3: Run GLM 5.2 from the Terminal

ollama run glm-5.2:cloud

Ollama fetches a small manifest, a few KB rather than the model weights, since inference happens remotely, then drops you into a prompt:

pulling manifest
pulling 7c2a9f1e... 100% ▕████████████████▏  4.2 KB
success
>>> Send a message (/? for help)

Type a prompt to test it:

>>> Refactor this Python function to use type hints and explain the change.

The first response after signing in can take 10-30 seconds while Ollama establishes the cloud session. After that, responses stream back at normal speed.

⚠️

Warning: Running `ollama pull glm-5.2` or `ollama pull glm-5.2:latest` without the `:cloud` suffix fails with a "pull model manifest: file does not exist" error. As of June 2026, Ollama's official library only hosts `glm-5.2:cloud`. For true local inference, skip ahead to the "Running GLM 5.2 Locally Instead of in the Cloud" section below.

Step 4: Verify the Model

ollama list

`glm-5.2:cloud` appears in the list at a few KB rather than hundreds of gigabytes. That's expected: this is a cloud passthrough entry, not a downloaded model.
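The same check can be done programmatically through the local server's `/api/tags` endpoint, which returns the models `ollama list` shows. A minimal sketch, assuming Ollama is listening on its default port; `has_model` and `fetch_tags` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local Ollama address


def has_model(tags_response: dict, name: str) -> bool:
    """Check an /api/tags response for a model with the given name."""
    return any(m.get("name") == name for m in tags_response.get("models", []))


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


# Usage, with Ollama running locally:
#   has_model(fetch_tags(), "glm-5.2:cloud")
```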

Switching Between Local and Cloud Models

`ollama run` works the same way for local and cloud models, so you can keep both on one machine. Pull a small local model alongside GLM 5.2:

ollama pull qwen3.5:8b

`ollama list` now shows both `qwen3.5:8b` (a multi-gigabyte local download) and `glm-5.2:cloud` (a manifest-only cloud entry). Switch between them by changing the model name in `ollama run` or in your application's API request, keeping a fast local model for routine tasks and reserving GLM 5.2's 1M context and effort-level control for harder, longer-running jobs.
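One way to express that split in application code is a small routing function. This is only a sketch: the token budget and the four-characters-per-token estimate are assumptions chosen for illustration, not values from Ollama or Z.ai.

```python
LOCAL_MODEL = "qwen3.5:8b"
CLOUD_MODEL = "glm-5.2:cloud"

# Hypothetical budget: how many prompt tokens to allow on the local model
# before routing to the cloud. Tune this for your own hardware.
LOCAL_TOKEN_BUDGET = 8_000


def estimate_tokens(text: str) -> int:
    """Crude estimate, assuming roughly 4 characters per token."""
    return len(text) // 4


def pick_model(prompt: str, needs_deep_reasoning: bool = False) -> str:
    """Route short, routine prompts locally and everything else to the cloud."""
    if needs_deep_reasoning or estimate_tokens(prompt) > LOCAL_TOKEN_BUDGET:
        return CLOUD_MODEL
    return LOCAL_MODEL
```

The returned string can be passed straight to `ollama run` or used as the `model` field in an API request.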

Running GLM 5.2 Locally Instead of in the Cloud

If you have the hardware, GLM 5.2 can run entirely on your own machine, just not through Ollama's official library yet. Unsloth publishes dynamic GGUF quantizations of the model that work with llama.cpp and Unsloth's own inference server.

Memory requirements by quantization

Quantization	Combined RAM/VRAM	Approx. download size
1-bit	~223 GB	~217 GB
2-bit (UD-IQ2_M)	~245 GB	~239-280 GB
4-bit	~372-475 GB	Larger
8-bit	~810 GB	Largest

The full, unquantized model is roughly 1.5 TB. The 2-bit dynamic quant (`UD-IQ2_M`) is the practical starting point: Unsloth describes it as the best balance of accessibility and accuracy. It runs either on a machine with 256 GB of unified memory, the kind of setup a high-end Mac Studio provides, or on a single 24 GB GPU plus 256 GB of system RAM using MoE layer offloading.
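To see where a given machine lands, the table can be turned into a quick lookup. A minimal sketch using the approximate figures above; the function name is hypothetical, and the 4-bit row uses the upper end of its range to stay conservative.

```python
from typing import Optional

# Combined RAM/VRAM needed per quantization, in GB, from the table above.
# The 4-bit row is a range; the upper bound is used to stay conservative.
REQUIRED_GB = {"1-bit": 223, "2-bit": 245, "4-bit": 475, "8-bit": 810}


def largest_fitting_quant(ram_gb: float, vram_gb: float) -> Optional[str]:
    """Return the highest-precision quant that fits, or None if none do."""
    available = ram_gb + vram_gb
    fitting = [q for q, need in REQUIRED_GB.items() if need <= available]
    # Dict order runs from smallest to largest, so the last match is the best.
    return fitting[-1] if fitting else None


print(largest_fitting_quant(ram_gb=256, vram_gb=24))  # 2-bit
print(largest_fitting_quant(ram_gb=64, vram_gb=24))   # None
```

A result of `None` means the cloud tag or a rented instance is the realistic option.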

Download and run with Unsloth

pip install unsloth

Download the GGUF weights from Hugging Face (`unsloth/GLM-5.2-GGUF`), then start the inference server:

unsloth studio -H 0.0.0.0 -p 8888

This serves a local OpenAI-compatible API and a chat UI at `http://127.0.0.1:8888`. For a connection reachable outside your own machine, add HTTPS:

unsloth studio --secure

💡

Tip: Most people don't own a 256 GB Mac or a server with that much system RAM sitting idle. Renting a GPU instance with enough combined VRAM and RAM on Vast.ai by the hour is the realistic path for testing the local route without buying hardware, and you can shut the instance down the moment you're done.

This local setup is separate from Ollama. If your goal is specifically to use Ollama's command syntax and ecosystem, the `glm-5.2:cloud` tag from the previous section is currently the only way to do that, and it's the simpler option for almost everyone.

Use GLM 5.2 in Your Own Scripts and Agents (API Access)

Beyond the interactive `ollama run` session, GLM 5.2's cloud tag is reachable through Ollama's REST API. Any tool that already talks to a local Ollama instance, or to the OpenAI API format, can use it with a one-line model name change.

Generate an API Key

Visit ollama.com/settings/keys while signed in, click "Create API key", and copy the value. Set it as an environment variable:

export OLLAMA_API_KEY=your_api_key_here

💡

Tip: An API key is only needed for direct requests to `https://ollama.com/api`. If your application talks to `localhost:11434` (the standard local Ollama server), `ollama signin` already authenticated that machine and no separate key is required.

Call GLM 5.2 from the Local Endpoint

curl http://localhost:11434/api/chat -d '{
  "model": "glm-5.2:cloud",
  "messages": [
    { "role": "user", "content": "Summarize the difference between GLM 5.1 and GLM 5.2 in two sentences." }
  ],
  "stream": false
}'

Expected output (truncated):

json

{
  "model": "glm-5.2:cloud",
  "message": {
    "role": "assistant",
    "content": "GLM 5.2 expands the context window from GLM 5.1's 198K to a usable 1M tokens, adds DeepSeek Sparse Attention, and introduces effort-level control for trading latency against reasoning depth on harder coding tasks."
  },
  "done": true
}
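For scripts that would rather not shell out to curl, the same request can be made with Python's standard library. This is a sketch under the assumption that Ollama is running on its default local port; `build_chat_payload` and `chat` are illustrative names, and the 120-second timeout is an arbitrary allowance for a slow first response.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"  # default local Ollama address


def build_chat_payload(prompt: str, model: str = "glm-5.2:cloud") -> dict:
    """Build the same JSON body the curl example sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt: str, timeout: float = 120.0) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```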

Call GLM 5.2 Directly from ollama.com

For a server or serverless function without a local Ollama install, send requests straight to ollama.com using your API key:

curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "glm-5.2:cloud",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
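Ollama's chat endpoint streams by default, so a request like the one above that doesn't set `"stream": false` should come back as a series of JSON objects, one per line. A minimal sketch of stitching those chunks back into a single reply; `join_stream` is a hypothetical helper, and the chunk shape mirrors the non-streaming response shown earlier.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate message content from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked with "done": true
    return "".join(parts)
```

If you only need the final text, adding `"stream": false` to the request body is simpler than parsing chunks.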

Python Example

python

import os

from ollama import Client

# Read the key exported earlier as OLLAMA_API_KEY
api_key = os.environ["OLLAMA_API_KEY"]

client = Client(host="https://ollama.com", headers={"Authorization": "Bearer " + api_key})

response = client.chat(
    model="glm-5.2:cloud",
    messages=[{"role": "user", "content": "Plan a migration from a Flask monolith to FastAPI microservices."}],
)
print(response["message"]["content"])

OpenAI-Compatible Endpoint for Existing Agent Tools

Ollama exposes an OpenAI-compatible layer at `http://localhost:11434/v1`, the same endpoint used in the Hermes Agent and OpenClaw setups. Point that configuration at `glm-5.2:cloud` instead of a local model name, and the agent runs on GLM 5.2's 1M context window without any other config changes:

yaml

model:
  default: glm-5.2:cloud
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 1000000
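To confirm the OpenAI-compatible layer works before wiring up an agent, you can hit it directly. The sketch below assumes the standard OpenAI chat-completions request and response shapes and uses only the standard library; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # same base_url as the config above


def build_completion_request(
    prompt: str, model: str = "glm-5.2:cloud"
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for Ollama's /v1 layer."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-format response."""
    return response["choices"][0]["message"]["content"]


# Usage, with Ollama running locally:
#   with urllib.request.urlopen(build_completion_request("Hello")) as resp:
#       print(extract_reply(json.load(resp)))
```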

ℹ️

Note: Z.ai also sells GLM 5.2 access directly through its own GLM Coding Plan, for tools and editors that integrate with Z.ai rather than going through Ollama's cloud passthrough. The Ollama route in this guide is for anyone who wants GLM 5.2 inside scripts or agents that already speak Ollama's API.

Troubleshooting

`ollama run glm-5.2:cloud` returns "model not found"

Cause: The installed Ollama version predates cloud model support

Fix: Update Ollama by re-running the install command (`curl -fsSL https://ollama.com/install.sh | sh` on Linux/macOS, or re-download on Windows), then retry. Cloud models require Ollama 0.6.x or later.

`ollama pull glm-5.2` fails with "pull model manifest: file does not exist"

Cause: Ollama's official library does not host a local quantized tag for GLM 5.2, only `glm-5.2:cloud`

Fix: Use `ollama run glm-5.2:cloud` for the cloud-hosted version. For true local inference on your own hardware, use the Unsloth GGUF and llama.cpp setup covered in the "Running GLM 5.2 Locally" section.

"unauthorized" error or repeated sign-in prompts

Cause: The machine is not signed in, or the session expired

Fix: Run `ollama signin` again and complete the browser approval. Check ollama.com/settings/connections to confirm the device is listed as connected.

First response takes 20-30 seconds or longer

Cause: Cold start while Ollama establishes a session with the cloud infrastructure

Fix: This is normal for the first request after signing in or after an idle period. Subsequent requests in the same session stream back at normal speed.

`ollama list` shows `glm-5.2:cloud` at only a few KB instead of a multi-gigabyte download

Cause: This is expected. Cloud models store only a manifest locally; the weights run on Z.ai and Ollama's servers

Fix: No action needed. If you want a model that runs entirely on your own hardware, see the local install section or the alternatives below.

API requests to `https://ollama.com/api` return 401

Cause: Missing or invalid `OLLAMA_API_KEY`

Fix: Generate a new key at ollama.com/settings/keys and re-export the environment variable: `export OLLAMA_API_KEY=your_new_key`.
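In scripts, a missing key is easier to diagnose if it fails before any request is sent. A small sketch, assuming the key lives in the `OLLAMA_API_KEY` environment variable as set earlier; `auth_headers` is a hypothetical helper name.

```python
import os


def auth_headers(env=None) -> dict:
    """Build the Authorization header, failing early if the key is missing."""
    env = os.environ if env is None else env
    key = env.get("OLLAMA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OLLAMA_API_KEY is not set; create a key at ollama.com/settings/keys"
        )
    return {"Authorization": f"Bearer {key}"}
```

Stripping whitespace guards against a stray newline picked up when the key was pasted into a shell profile.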

Unsloth or llama.cpp throws an out-of-memory error during the local install

Cause: GLM 5.2 exceeds the available combined RAM and VRAM on the machine, even at 2-bit quantization

Fix: Drop to the 1-bit quantization (~223 GB combined RAM/VRAM) if your hardware allows it, or rent a GPU instance with enough memory on Vast.ai rather than buying hardware. Otherwise, fall back to `glm-5.2:cloud` through Ollama.

Alternatives to Consider

Tool	Type	Price	Best For
GLM 4.6	Local (Ollama) or cloud	Free (local) / cloud pricing varies	Long-context agentic coding similar to GLM 5.2, in a 357B model small enough to run locally at low quantization on a single high-VRAM GPU.
Qwen3.5	Local (Ollama)	Free	Hardware with 8-24 GB RAM that needs a model running entirely offline, with reliable tool-calling and no cloud dependency.
DeepSeek R1	Local (Ollama) or VPS	Free	Reasoning-heavy tasks (math, coding, logic) with visible chain-of-thought output, on hardware from 4 GB (1.5B distilled) up to 64 GB or more (70B).
Kimi K2.6 via Ollama Cloud	Cloud (Ollama)	Free within Ollama Cloud limits	256K context and swarm-style multi-agent orchestration as a cloud-only alternative if GLM 5.2's 1M context and 744B parameters are more than your use case needs.

Frequently Asked Questions

Can I run GLM 5.2 locally with Ollama?

Not through Ollama's official library as it stands. GLM 5.2 has 744 billion total parameters, and even at 1-bit quantization it needs around 223 GB of combined RAM and VRAM, which rules out almost every desktop, laptop, and single-GPU workstation. Ollama currently only distributes GLM 5.2 as `glm-5.2:cloud`, a passthrough to Z.ai's infrastructure.

If you have genuinely high-end hardware (256 GB of unified memory or a large RAM pool plus a 24 GB+ GPU), you can run a true local install through Unsloth's GGUF quantizations and llama.cpp, covered in the "Running GLM 5.2 Locally" section of this guide. Otherwise, see GLM 4.6, Qwen3.5, or DeepSeek R1 in the alternatives section for models that fit on more ordinary hardware.

Is GLM 5.2 free to use through Ollama?

Yes, within Ollama's free usage limits as of June 2026. `ollama signin` does not require payment information, and `ollama run glm-5.2:cloud` works immediately after signing in.

Z.ai also sells its own GLM Coding Plan for using GLM 5.2 directly in editors and CLIs outside Ollama: Lite at roughly $10/month for about 400 prompts a week, Pro at roughly $30/month for about 2,000 prompts a week, and Max at roughly $80/month for about 8,000 prompts a week, with a Team tier billed per seat. None of that is required for the Ollama setup in this guide.
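For a rough sense of how those tiers compare, the per-prompt cost can be worked out from the approximate figures above. This sketch assumes the full weekly allowance is used and averages a month to 52/12 weeks; actual pricing and limits may differ.

```python
WEEKS_PER_MONTH = 52 / 12  # average, about 4.33

# (monthly price in USD, prompts per week), using the rough figures above
PLANS = {"Lite": (10, 400), "Pro": (30, 2000), "Max": (80, 8000)}


def cost_per_prompt(monthly_usd: float, prompts_per_week: float) -> float:
    """USD per prompt if the weekly allowance is fully used."""
    return monthly_usd / (prompts_per_week * WEEKS_PER_MONTH)


for name, (price, weekly) in PLANS.items():
    print(f"{name}: ${cost_per_prompt(price, weekly):.4f} per prompt")
```

On these numbers the per-prompt cost falls as the tier rises, from a bit over half a cent on Lite to under a quarter of a cent on Max.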

What is the difference between GLM 5.1 and GLM 5.2?

GLM 5.1 has 756 billion total parameters and a 198K context window, and like GLM 5.2 it's only available on Ollama as a cloud tag (`glm-5.1:cloud`).

GLM 5.2 (744 billion total parameters, roughly 40 billion active per token) expands the context window to 1 million tokens, adds a DeepSeek Sparse Attention mechanism, and introduces effort-level control with High and Max settings for balancing reasoning depth against latency and compute cost on harder coding tasks.

How much RAM do I need to run GLM 5.2 with Ollama?

For the cloud setup in this guide, effectively none beyond what Ollama itself needs to run, a few hundred MB. Inference happens on Z.ai and Ollama's servers, not your machine.

For a true local install outside Ollama, plan on roughly 223 GB of combined RAM and VRAM at 1-bit quantization, up to around 810 GB at 8-bit. The full unquantized model is closer to 1.5 TB. If your goal is a model that fits in 8-64 GB of RAM, see the alternatives section for GLM 4.6, Qwen3.5, and DeepSeek R1.

Is GLM 5.2 better than Claude Opus or GPT-5.5 for coding?

On Z.ai's own vendor-reported numbers, GLM 5.2 scores 62.1 on SWE-bench Pro and beats GPT-5.5 by 2.5 points on HLE with Tools, while trailing Claude Opus 4.8 by 3.2 points on the same benchmark. Independent third-party benchmark results were not yet widely published at launch, so treat vendor-reported scores as a starting point rather than a final answer.

For agentic, long-horizon coding work specifically, GLM 5.2's 1M context window and effort-level control are real differentiators over its own predecessor, GLM 5.1, regardless of how it ultimately stacks up against closed models.

Can I use GLM 5.2 with an agent like Hermes Agent or OpenClaw?

Yes. Both Hermes Agent and OpenClaw connect to Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1`. Point the agent's model configuration at `glm-5.2:cloud` and set `context_length` to 1000000.

The agent then runs on GLM 5.2's 1M context window and effort-level control through your existing Ollama setup, with no other configuration changes needed.

Does `ollama pull glm-5.2:cloud` download the 744 billion parameter model?

No. `ollama pull` (or the pull step that runs automatically before `ollama run`) for a `:cloud` tag downloads only a small manifest, typically a few KB.

The actual 744 billion parameter weights for GLM 5.2 stay on Z.ai and Ollama's infrastructure. Your machine sends prompts and receives responses over the network, which is why `ollama list` shows the model at only a few KB instead of hundreds of gigabytes.

Do I need a VPS or GPU to use GLM 5.2 with Ollama?

No, not for the `:cloud` tag covered in most of this guide. Inference runs on Z.ai's side regardless of where you run `ollama`, so a laptop or desktop with no GPU is enough.

The only reason to add hardware is if you want the true local install instead, which needs roughly 223 GB or more of combined RAM and VRAM. For that, renting a GPU on Vast.ai by the hour is cheaper than buying enough hardware outright, and you can shut it down when you're done.

Where can I try GLM 5.2 without installing Ollama?

Z.ai offers its own hosted chat interface and the GLM Coding Plan subscription for using GLM 5.2 directly in supported editors and CLIs, without touching a terminal or Ollama at all.

The Ollama setup in this guide is specifically for developers who want GLM 5.2 inside scripts, agents, or tools that already use Ollama's API, such as Hermes Agent or OpenClaw, rather than through Z.ai's own interface.
