Local AI | Advanced | 30 min to complete | 15 min read

How to Run Mistral Medium 3.5 Locally with Ollama (2026 Guide)

Mistral Medium 3.5 is a 128B dense model with vision and 256K context. Install Ollama, pull the right quant tag, and run it locally or via rented GPUs.

By Amara | Updated 24 June 2026

[Image: terminal running `ollama run mistral-medium-3.5` next to a Mistral AI badge and image-analysis icon]

Mistral Medium 3.5 is Mistral AI's first merged flagship model, released April 29, 2026. It combines instruction-following, reasoning, and coding in one 128-billion-parameter dense model with a 256K context window and native vision input. On Ollama, the default tag downloads roughly 80GB of weights. There is no cloud-only shortcut on Ollama: this is a real local pull, not a manifest-only cloud tag like several other recent flagship releases covered on this site.

Mistral itself says serious self-hosting needs at least four NVIDIA H100 80GB or H200 141GB class GPUs. That is production-server territory, not a single gaming GPU. For solo use through Ollama, the realistic minimum is one 80GB-plus datacenter GPU for the default q4_K_M tag, or a rented multi-GPU instance if you want the larger q8_0 or bf16 tags.

This guide covers installing Ollama, choosing the right quantization tag for your hardware, pulling and running the model, testing its vision input and configurable reasoning effort, and setting up API access. The alternatives section near the end covers what Mistral Medium 3.5 actually replaces: Mistral Medium 3.1, Magistral, and Devstral 2.

Prerequisites

Ollama 0.12 or later installed
At least 80GB of combined VRAM and system RAM free for the default q4_K_M tag, more for q8_0 or bf16
80GB to 255GB of free disk space depending on which quantization tag you pull
An NVIDIA GPU with 80GB or more VRAM, multiple GPUs, or a rented GPU instance; CPU-only inference works but is slow for a 128B dense model
Basic command line familiarity

🖥️

Need more GPU power?

Rent an H100 80GB on Vast.ai from $1.80/hr. On-demand GPU rentals are billed by the hour, which is useful for running larger models without buying hardware.

In This Guide

1. What Is Mistral Medium 3.5?
2. Install Ollama and Choose Your Quantization Tag
3. Run Your First Inference
4. Set Up API Access
5. Troubleshooting
6. FAQ

What Is Mistral Medium 3.5?

Mistral Medium 3.5 is Mistral AI's first "merged" model, meaning a single set of weights covers instruction-following, reasoning, and coding instead of splitting them across separate models. Mistral AI released it on April 29, 2026, explicitly positioning it to replace Mistral Medium 3.1, Magistral, and Devstral 2 in its production lineup.

Tag	Disk Size	Precision	Notes
mistral-medium-3.5:latest / :128b	80GB	Q4_K_M (default)	Same file as the q4_K_M tag below
mistral-medium-3.5:128b-q4_K_M	80GB	4-bit	Default quantization, lowest hardware bar
mistral-medium-3.5:128b-q8_0	138GB	8-bit	Closer to full quality, needs more VRAM/RAM
mistral-medium-3.5:128b-bf16	255GB	16-bit	Full precision, multi-GPU territory

Unlike most of the other recent flagship releases covered on this site (Kimi K2.6, GLM 5.2, MiniMax M3), Mistral Medium 3.5 is dense rather than mixture-of-experts. Every one of its 128 billion parameters activates on every forward pass, which is part of why even the smallest quantized tag needs 80GB just for the weights.

ℹ️

Note: Dense models trade efficiency for consistency. An MoE model like Kimi K2.6 has more total parameters (1.04T) but only activates a fraction per token, so it can run lighter despite the bigger number on paper. Mistral Medium 3.5's 128B is fully dense, so the entire model has to fit in memory at once.
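The disk sizes in the table above follow roughly from parameter count times bits per weight. The sketch below shows that arithmetic. The 16 bits for bf16 is exact; the 8.5 and 4.85 bits-per-weight figures for the quantized formats are approximations assumed here for illustration, and real files also carry metadata, so the results land near the table's numbers rather than on them.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


# bf16 stores exactly 16 bits per weight. The quantized figures are assumed
# averages (quantized values plus per-block scales), not published numbers.
for name, bits in [("bf16", 16.0), ("q8_0", 8.5), ("q4_K_M", 4.85)]:
    print(f"{name}: ~{weight_size_gb(128, bits):.0f} GB")
```

Because the model is dense, all of that has to be resident at once, which is why the smallest tag still sits near 80GB.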

On Mistral's published benchmarks, it scores 77.6% on SWE-Bench Verified, ahead of Mistral's earlier coding-focused models, and over 90% on Mistral's internal tool-use benchmark for enterprise agent tasks. It supports 40-plus languages, function calling, JSON output, and a configurable reasoning effort setting, plus native vision input through an encoder trained to handle variable image sizes and aspect ratios, useful for documents, diagrams, and UI screenshots. It ships under a Modified MIT license: free for commercial and noncommercial use, with exceptions for companies above a certain revenue threshold.

Install Ollama and Choose Your Quantization Tag

Step 1: Match a tag to your hardware

Available VRAM/RAM	Recommended tag	What to expect
80-96GB on one GPU or unified pool	128b-q4_K_M (default)	Usable for single-user inference, tight headroom for long context
140-160GB across 2 GPUs or a high-RAM server	128b-q8_0	Noticeably better output quality, more comfortable headroom
260GB+ across multiple GPUs	128b-bf16	Full precision, production-grade serving
Under 80GB, no GPU	Not realistic locally	Use the Vast.ai rental option above, or see the alternatives section below

If you are not sure, start with the default q4_K_M tag. It is the only realistic option on a single 80GB-class GPU and is what the rest of this guide uses.
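If you script your setup, the table above can be reduced to a small lookup. This is a minimal sketch, not an Ollama feature: `recommend_tag` is a hypothetical helper, and the thresholds are the guide's own 80GB, 140GB, and 260GB figures for combined VRAM and RAM.

```python
from typing import Optional

# Minimum combined VRAM/RAM in GB per tag, highest precision first.
# Thresholds come from this guide's hardware table, not from Ollama itself.
TAG_MIN_MEMORY_GB = [
    ("128b-bf16", 260),
    ("128b-q8_0", 140),
    ("128b-q4_K_M", 80),
]


def recommend_tag(available_gb: float) -> Optional[str]:
    """Return the highest-precision tag that fits, or None if nothing fits."""
    for tag, min_gb in TAG_MIN_MEMORY_GB:
        if available_gb >= min_gb:
            return tag
    return None


print(recommend_tag(96))   # one 80GB-class GPU plus some headroom
print(recommend_tag(48))   # below the floor for any tag
```

A `None` result corresponds to the last row of the table: rent a GPU or pick a lighter model.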

Step 2: Install Ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

macOS:

brew install ollama

Windows: download the installer from ollama.com and run it. WSL2 users can run the Linux command above inside their distro.

Verify the install:

ollama --version

ollama version is 0.12.3
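If you want to check the 0.12 prerequisite from a setup script, a version comparison like the one below is enough. It is a sketch that assumes the output keeps the `ollama version is X.Y.Z` shape shown above; `meets_minimum` is a hypothetical helper name.

```python
import re


def meets_minimum(version_output: str, minimum: tuple = (0, 12)) -> bool:
    """Compare the major.minor found in `ollama --version` output to a minimum."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    found = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (0, 12) >= (0, 12) and (0, 9) < (0, 12).
    return found >= minimum


print(meets_minimum("ollama version is 0.12.3"))
```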

Step 3: Pull Mistral Medium 3.5

ollama pull mistral-medium-3.5

pulling manifest
pulling 9a3f21c8... 100% ▕████████████████▏  80 GB
pulling 4b7e8d12... 100% ▕████████████████▏  12 KB
verifying sha256 digest
writing manifest
success

⚠️

Warning: At 80GB, this download takes a while even on a fast connection. On a 500 Mbps connection, expect roughly 25-30 minutes. On a typical home connection closer to 100 Mbps, budget 2 hours or more. Confirm you have the disk space free before starting. Re-running the pull usually resumes an interrupted download, but that is not guaranteed, so a failure late in an 80GB transfer can cost you most of that time.
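The download estimates above are simple bandwidth arithmetic, sketched below so you can plug in your own connection speed. The `efficiency` factor is an assumption added here to stand in for protocol overhead and real-world throughput; at 1.0 the function returns the ideal line-rate time, which is why it comes out a little under the guide's figures.

```python
def download_minutes(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Minutes to transfer size_gb (decimal GB) over a link_mbps connection."""
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = size_gb * 8 * 1000  # 1 GB = 8000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 60


print(f"500 Mbps, ideal:   {download_minutes(80, 500):.0f} min")
print(f"500 Mbps, 80% eff: {download_minutes(80, 500, 0.8):.0f} min")
print(f"100 Mbps, ideal:   {download_minutes(80, 100):.0f} min")
```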

Step 4: Verify the pull

ollama list

NAME                          ID              SIZE      MODIFIED
mistral-medium-3.5:latest     a1b2c3d4e5f6    80 GB     2 minutes ago

If you want a higher-precision tag instead, pull it explicitly: `ollama pull mistral-medium-3.5:128b-q8_0` or `ollama pull mistral-medium-3.5:128b-bf16`.

Run Your First Inference

Step 5: Start a chat session

ollama run mistral-medium-3.5

>>> Write a Python function that deduplicates a list of dictionaries by a given key, and explain the time complexity.

Mistral Medium 3.5 walks through the dictionary keys, builds the function using a seen-set to track which key values it has already encountered, and returns the deduplicated list. Expect an explanation that the approach runs in O(n) time, since each item is checked once against the seen-set rather than compared against every other item.
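For reference, here is a hand-written sketch of the seen-set approach described above. It is not captured model output, and the function name and sample data are illustrative, but it gives you something to compare the model's answer against.

```python
from typing import Any, Hashable


def dedupe_by_key(items: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Keep the first dictionary seen for each distinct value of `key`."""
    seen: set[Hashable] = set()
    unique = []
    for item in items:
        value = item[key]      # raises KeyError if a dictionary lacks the key
        if value not in seen:  # average O(1) set lookup, so O(n) overall
            seen.add(value)
            unique.append(item)
    return unique


users = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Lin"},
    {"id": 1, "name": "Ada L."},
]
print(dedupe_by_key(users, "id"))
```

The first occurrence wins and input order is preserved, two details worth checking in whatever the model returns.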

Step 6: Test vision input

>>> What's wrong with this UI screenshot? ./screenshot.png

Ollama detects the file path in your input and attaches it as an image automatically. This works for any vision-capable model in Ollama, not just Mistral Medium 3.5. The model reads the screenshot through its native vision encoder and responds based on what it sees, so no separate upload step is needed.

Step 7: Control reasoning effort

For complex, multi-step problems, ask for deeper reasoning through Ollama's `think` parameter, the same mechanism used for other reasoning models like DeepSeek R1 and Qwen3:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral-medium-3.5",
  "messages": [{"role": "user", "content": "Debug this race condition and explain the fix"}],
  "think": "high"
}'

Set `think` to `"low"` for short, low-latency answers on simple prompts, and `"high"` when you need deeper multi-step reasoning for debugging, math, or planning tasks.
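The same request can be built from Python using only the standard library. This is a minimal sketch: `build_chat_payload` and `chat` are hypothetical helper names, the accepted effort values are limited to the two this guide names, and `"stream": false` is added so the reply arrives as a single JSON object.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt: str, effort: str = "low") -> dict:
    """Build a chat request body with the reasoning effort set via `think`."""
    if effort not in {"low", "high"}:
        raise ValueError("effort must be 'low' or 'high'")
    return {
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "think": effort,
        "stream": False,  # one JSON object instead of a stream of chunks
    }


def chat(prompt: str, effort: str = "low") -> str:
    """POST the payload to a running local Ollama server and return the answer text."""
    data = json.dumps(build_chat_payload(prompt, effort)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"]
```

Calling `chat("Debug this race condition and explain the fix", effort="high")` mirrors the curl example, provided the server is running and the model is pulled.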

💡

Tip: Higher reasoning effort means more tokens spent thinking before the model answers. For routine completions, leave it on the default or set it to `"low"` to save time and compute.

Set Up API Access

Ollama exposes Mistral Medium 3.5 through its standard REST API on `localhost:11434`. No separate account or API key is needed for local use.

curl http://localhost:11434/api/generate -d '{
  "model": "mistral-medium-3.5",
  "prompt": "Summarize the tradeoffs between dense and mixture-of-experts model architectures",
  "stream": false
}'

Python example:

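import requests

# Send one non-streaming chat request to the local Ollama server.
response = requests.post('http://localhost:11434/api/chat', json={
    'model': 'mistral-medium-3.5',
    'messages': [{'role': 'user', 'content': 'Write a SQL query to find duplicate rows by email'}],
    'stream': False
})
response.raise_for_status()  # surface HTTP errors such as a model that was never pulled

# With streaming disabled, the full reply is in message.content.
print(response.json()['message']['content'])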

For JSON-mode output, set the `format` field to `json` in the request body. For function calling, pass a `tools` array following the OpenAI-compatible schema, the same pattern covered in the Ollama with Python guide. If you are wiring Mistral Medium 3.5 into an agent framework, point its OpenAI-compatible base URL at `http://localhost:11434/v1`, the same setup used in the Hermes agent guide.
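To make the two request shapes concrete, here is a sketch of each body for the native `/api/chat` endpoint. The `get_weather` tool is hypothetical and exists only to show the schema. The response shape read by `extract_tool_calls` (`message.tool_calls`, each with `function.name` and `function.arguments`) is an assumption about Ollama's native chat API; the OpenAI-compatible `/v1` endpoint may differ.

```python
def build_json_payload(prompt: str) -> dict:
    """Chat request body that asks for JSON-formatted output."""
    return {
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "format": "json",
        "stream": False,
    }


def build_tools_payload(prompt: str) -> dict:
    """Chat request body that offers the model one callable tool."""
    get_weather = {  # hypothetical tool, for illustration only
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "City name"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [get_weather],
        "stream": False,
    }


def extract_tool_calls(response_body: dict) -> list:
    """Return (name, arguments) pairs from a chat response, or [] if there are none."""
    calls = response_body.get("message", {}).get("tool_calls") or []
    return [(call["function"]["name"], call["function"]["arguments"]) for call in calls]
```

Your code is responsible for running the named function and sending its result back in a follow-up message; the model only proposes the call.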

Troubleshooting

ollama pull mistral-medium-3.5 fails partway through or times out

Cause: The 80GB default download is large enough that unstable connections or aggressive proxy timeouts interrupt it before completion.

Fix: Re-run the same pull command. Ollama resumes from the last verified layer in most cases rather than restarting from zero. If it keeps failing at the same point, switch to a wired connection or pull during off-peak hours.

Model loads but inference is extremely slow, minutes per response

Cause: The model is running on CPU and system RAM instead of GPU VRAM, common when your GPU does not have enough free memory to hold the full 80GB-plus tag.

Fix: Check GPU memory with nvidia-smi while the model is loaded. If VRAM usage is near zero, Ollama fell back to CPU. Free up VRAM, use a smaller tag, or move to a GPU with more memory.
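If you would rather check this from a script than read the `nvidia-smi` table by eye, the sketch below queries used and total memory per GPU in CSV form and parses it. The helper names are hypothetical, and the parser assumes every field is a plain integer in MiB.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU as bare CSV numbers (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Turn 'used, total' lines into a list of (used_mib, total_mib) pairs."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def read_gpu_memory() -> list:
    """Run nvidia-smi and return one (used_mib, total_mib) pair per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Call `read_gpu_memory()` while the model is loaded: used values near zero on every GPU point to the CPU fallback described above.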

"model requires more system memory" error on pull or run

Cause: Combined VRAM and RAM is below what the selected tag needs, for example trying to run the q8_0 (138GB) tag on a system with only 96GB total.

Fix: Switch to a smaller quantization tag (for example, q4_K_M instead of q8_0), or add more RAM or VRAM before retrying.

Vision input is ignored or returns a text-only response

Cause: The image path was not detected in the prompt, often because of a typo in the path or an unsupported image format.

Fix: Use an absolute path to a JPEG or PNG file and confirm the file exists. When in doubt, pass the image through the API as a base64-encoded string in the images array instead of the CLI.
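The API route mentioned in that fix looks roughly like this. It is a sketch assuming the `/api/chat` message format with an `images` list of base64 strings; `build_vision_payload` is a hypothetical helper, and the file path is whatever image you want to send.

```python
import base64
from pathlib import Path


def build_vision_payload(image_path: str, question: str) -> dict:
    """Build a chat request body with one image attached as base64 text."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "mistral-medium-3.5",
        "messages": [
            {"role": "user", "content": question, "images": [image_b64]}
        ],
        "stream": False,
    }
```

POST the resulting dictionary as JSON to `http://localhost:11434/api/chat`. Because the file is read explicitly, a wrong path raises `FileNotFoundError` instead of silently producing a text-only answer.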

Output quality feels weaker than expected from a flagship model

Cause: Running the default q4_K_M tag at 4-bit quantization trades some accuracy for the lower hardware requirement.

Fix: If your hardware allows it, pull the 128b-q8_0 tag instead. The quality difference between 4-bit and 8-bit is noticeable on complex reasoning and coding tasks.

Unsure whether commercial use of the model is allowed

Cause: Mistral Medium 3.5 ships under a Modified MIT license, which is permissive but includes exceptions for companies above a certain revenue threshold.

Fix: Check the exact license terms on the model card before deploying in a commercial product at scale. For most individuals and small teams, standard MIT terms apply without restriction.

Alternatives to Consider

Tool	Type	Price	Best For
Devstral 2	Self-hosted / API	Free (open weights), API pricing varies	Mistral Medium 3.5 explicitly replaces this in production. Smaller and faster if you only need coding, not full general reasoning.
Magistral	Self-hosted / API	Free (open weights), API pricing varies	Also replaced by Medium 3.5. Was Mistral's dedicated reasoning-focused model.
DeepSeek R1	Self-hosted (Ollama)	Free, local install	A much lighter local reasoning model if you do not have 80GB or more of VRAM/RAM to spare.
GLM 5.2 via Ollama Cloud	Cloud (Ollama)	Free within Ollama Cloud limits	A cloud-only alternative with no local hardware requirement at all.

Frequently Asked Questions

Can I run Mistral Medium 3.5 without a data center GPU?

Technically yes, but not at a usable speed. CPU-only inference works if you have enough system RAM, but it is slow for a 128B dense model. Mistral's own guidance calls for at least four NVIDIA H100 80GB or H200 141GB class GPUs for production serving. For solo use through Ollama, the default q4_K_M tag (80GB) can run on a single 80GB-class datacenter GPU, such as an A100 80GB or H100 80GB.

There is no realistic way to run this model well on a typical consumer GPU with 16-24GB of VRAM. If you do not have access to an 80GB-plus GPU, renting one through a service like Vast.ai is the more practical route than buying hardware.

Is Mistral Medium 3.5 free to use?

Yes. Mistral Medium 3.5 ships under a Modified MIT license, which allows commercial and noncommercial use for free, with exceptions for companies above a certain revenue threshold. Running it locally through Ollama costs nothing beyond your own hardware or rented GPU time.

If you would rather not self-host, Mistral also offers it through its API, Le Chat, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice, each with its own pricing.

How does Mistral Medium 3.5 compare to Mistral Medium 3.1?

Mistral Medium 3.5 replaces Mistral Medium 3.1 in Mistral's production lineup. It is the first "merged" model, meaning it combines what used to be split across Medium 3.1 (general instruction-following), Magistral (dedicated reasoning), and Devstral 2 (dedicated coding) into one set of weights.

In practice this means you no longer need to pick between three separate models depending on the task. Medium 3.5 handles general chat, multi-step reasoning, and coding in a single 128B dense model.

What does Mistral Medium 3.5 replace?

Mistral Medium 3.5 explicitly replaces three models in Mistral's lineup: Mistral Medium 3.1 (general instruction-following), Magistral (dedicated reasoning), and Devstral 2 (dedicated coding). Mistral built Medium 3.5 as a single merged model covering all three.

If your existing workflow points at any of those three older models, Medium 3.5 is the direct upgrade path, though it requires meaningfully more hardware to self-host than Devstral 2 did on its own.

Does Mistral Medium 3.5 support image input?

Yes. Mistral Medium 3.5 has a vision encoder trained from scratch that handles variable image sizes and aspect ratios natively. It accepts text and image input and produces text output, which makes it useful for document analysis, diagrams, and UI screenshots.

Through Ollama, drop an image file path into your prompt and Ollama attaches it automatically, the same mechanism used by any vision-capable model in Ollama.

How do I control how much the model reasons before answering?

Use Ollama's `think` parameter in your API request. Set it to `"high"` for complex, multi-step coding, debugging, or planning tasks where you want deeper reasoning before the answer. Set it to `"low"` for simple prompts where speed matters more than depth.

This is the same mechanism Ollama uses for other reasoning models like DeepSeek R1 and Qwen3, not a Mistral-specific flag.

What is the minimum VRAM or RAM needed?

For the default q4_K_M tag, plan on at least 80GB of combined VRAM and system RAM, since the quantized weights alone are 80GB. For the q8_0 tag, plan for at least 140GB. For the full-precision bf16 tag, plan for 260GB or more, typically split across multiple GPUs.

Add extra headroom on top of the weight size for the KV cache, which grows with how much context you actually use.
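To get a feel for how that headroom scales, the standard KV cache estimate is two tensors (keys and values) per layer, each sized by KV heads, head dimension, bytes per value, and token count. The sketch below applies that formula. The layer count, KV head count, and head dimension in the example are hypothetical placeholders, not Mistral Medium 3.5's published architecture, so substitute the real values from the model card.

```python
def kv_cache_gb(tokens: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in decimal GB for a given context length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 1e9


# Hypothetical architecture values, chosen only to illustrate the scaling.
for context in (8_192, 32_768, 262_144):
    size = kv_cache_gb(context, n_layers=80, n_kv_heads=8, head_dim=128)
    print(f"{context:>7} tokens: ~{size:.1f} GB")
```

The cost is linear in context length, which is why a tag that fits comfortably at short context can run out of memory as you approach the full 256K window.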

Can I try Mistral Medium 3.5 without installing Ollama?

Yes. Mistral offers Medium 3.5 through its own API, the Le Chat assistant, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice. Any of those let you test the model without downloading 80GB of weights or owning capable hardware.

Ollama is the right choice once you want the model running on your own machine or server, with no per-token API costs and full control over the deployment.

Related Guides

Beginner | 20 min | How to Run Ollama Locally: Complete Setup Guide (2026)
Beginner | 10 min | Best Local LLM Models to Run in 2026 (Benchmarks + Use Cases)
Intermediate | 25 min | How to Run DeepSeek R1 Locally with Ollama (2026 Guide)
Beginner | 15 min | How to Run Kimi K2 on Ollama: Cloud Setup Guide (2026)
Beginner | 15 min | How to Run GLM 5.2 on Ollama: Cloud Setup Guide (2026)
Intermediate | 25 min | How to Use Ollama with Python: API Integration Tutorial (2026)