
How to Run Mistral Medium 3.5 Locally with Ollama (2026 Guide)

Mistral Medium 3.5 is a 128B dense model with vision and 256K context. Install Ollama, pull the right quant tag, and run it locally or via rented GPUs.

By Amara | Updated 24 June 2026

Mistral Medium 3.5 is Mistral AI's first merged flagship model, released April 29, 2026. It combines instruction-following, reasoning, and coding in one 128 billion parameter dense model with a 256K context window and native vision input. On Ollama, the default tag downloads roughly 80GB of weights. This is a real local pull rather than a manifest-only cloud tag, unlike several other recent flagship releases covered on this site.

Mistral itself says serious self-hosting needs at least four NVIDIA H100 80GB or H200 141GB class GPUs. That is production-server territory, not a single gaming GPU. For solo use through Ollama, the realistic minimum is one 80GB-plus datacenter GPU for the default q4_K_M tag, or a rented multi-GPU instance if you want the larger q8_0 or bf16 tags.

This guide covers installing Ollama, choosing the right quantization tag for your hardware, pulling and running the model, testing its vision input and configurable reasoning effort, and setting up API access. The alternatives section near the end covers what Mistral Medium 3.5 actually replaces: Mistral Medium 3.1, Magistral, and Devstral 2.

Prerequisites

  • Ollama 0.12 or later installed
  • At least 80GB of combined VRAM and system RAM free for the default q4_K_M tag (the weights alone are 80GB, so leave extra room for the KV cache), more for q8_0 or bf16
  • 80GB to 255GB of free disk space depending on which quantization tag you pull
  • An NVIDIA GPU with 80GB or more VRAM, multiple GPUs, or a rented GPU instance; CPU-only inference works but is slow for a 128B dense model
  • Basic command line familiarity
🖥️

Need more GPU power?

Rent an H100 80GB on Vast.ai from $1.80/hr. On-demand GPU rentals by the hour, useful for running larger models without buying hardware.

What Is Mistral Medium 3.5?

Mistral Medium 3.5 is Mistral AI's first "merged" model, meaning a single set of weights covers instruction-following, reasoning, and coding instead of splitting them across separate models. Mistral AI released it on April 29, 2026, explicitly positioning it to replace Mistral Medium 3.1, Magistral, and Devstral 2 in its production lineup.

| Tag | Disk Size | Precision | Notes |
|---|---|---|---|
| mistral-medium-3.5:latest / :128b | 80GB | Q4_K_M (default) | Same file as the q4_K_M tag below |
| mistral-medium-3.5:128b-q4_K_M | 80GB | 4-bit | Default quantization, lowest hardware bar |
| mistral-medium-3.5:128b-q8_0 | 138GB | 8-bit | Closer to full quality, needs more VRAM/RAM |
| mistral-medium-3.5:128b-bf16 | 255GB | 16-bit | Full precision, multi-GPU territory |

Unlike most of the other recent flagship releases covered on this site (Kimi K2.6, GLM 5.2, MiniMax M3), Mistral Medium 3.5 is dense rather than mixture-of-experts. Every one of its 128 billion parameters activates on every forward pass, which is part of why even the smallest quantized tag needs 80GB just for the weights.

ℹ️
Note: Dense models trade efficiency for consistency. A MoE model like Kimi K2.6 has more total parameters (1.04T) but only activates a fraction per token, so it can run lighter despite the bigger number on paper. Mistral Medium 3.5's 128B is fully dense, so the entire model has to fit in memory at once.
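The tag sizes in the table can be sanity-checked with simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate only. The bits-per-weight figures are approximations assumed for llama.cpp-style quantization formats, not numbers published by Mistral or Ollama, and real files also carry some metadata.

```python
# Rough weight-size estimate for a dense model: parameters x bits per weight.
# The bits-per-weight values are ASSUMED approximations, not official figures.
PARAMS = 128e9  # 128 billion parameters, from the model description

APPROX_BITS_PER_WEIGHT = {
    "q4_K_M": 4.85,  # assumption: mixed-precision blocks average just under 5 bits
    "q8_0": 8.5,     # assumption: 8-bit values plus per-block scale factors
    "bf16": 16.0,    # two bytes per weight
}


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_size_gb(PARAMS, bits):.0f} GB")
```

Under these assumptions the estimates land within a few gigabytes of the listed tag sizes, which is the point: for a dense model, memory scales directly with parameter count and precision.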

On Mistral's published benchmarks, Medium 3.5 scores 77.6% on SWE-Bench Verified, ahead of Mistral's earlier coding-focused models, and over 90% on Mistral's internal tool-use benchmark for enterprise agent tasks. It supports 40-plus languages, function calling, JSON output, and a configurable reasoning effort setting. Native vision input runs through an encoder trained to handle variable image sizes and aspect ratios, which is useful for documents, diagrams, and UI screenshots. The model ships under a Modified MIT license: free for commercial and noncommercial use, with exceptions for companies above a certain revenue threshold.

Install Ollama and Choose Your Quantization Tag

Step 1: Match a tag to your hardware

| Available VRAM/RAM | Recommended tag | What to expect |
|---|---|---|
| 80-96GB on one GPU or unified pool | 128b-q4_K_M (default) | Usable for single-user inference, tight headroom for long context |
| 140-160GB across 2 GPUs or a high-RAM server | 128b-q8_0 | Noticeably better output quality, more comfortable headroom |
| 260GB+ across multiple GPUs | 128b-bf16 | Full precision, production-grade serving |
| Under 80GB, no GPU | Not realistic locally | Use the Vast.ai rental option above, or see the alternatives section below |

If you are not sure, start with the default q4_K_M tag. It is the only realistic option on a single 80GB-class GPU and is what the rest of this guide uses.
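If you script your setup, the hardware table reduces to a small lookup. This is a minimal sketch: the thresholds are copied from the table, while the function name `pick_tag` and the idea of passing in a single "available GB" figure are illustrative assumptions, not part of Ollama.

```python
from typing import Optional

# Minimum combined memory (GB) per tag, copied from the hardware table.
# Ordered from most to least demanding so the best fit is found first.
TAG_MINIMUMS_GB = [
    (260, "mistral-medium-3.5:128b-bf16"),
    (140, "mistral-medium-3.5:128b-q8_0"),
    (80, "mistral-medium-3.5:128b-q4_K_M"),
]


def pick_tag(available_gb: float) -> Optional[str]:
    """Return the highest-precision tag that fits, or None if nothing fits."""
    for minimum_gb, tag in TAG_MINIMUMS_GB:
        if available_gb >= minimum_gb:
            return tag
    return None  # under 80GB: rent a GPU or pick a lighter model


print(pick_tag(96))
print(pick_tag(48))
```

A `None` result corresponds to the last row of the table, where running locally is not realistic.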

Step 2: Install Ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

macOS:

brew install ollama

Windows: download the installer from ollama.com and run it. WSL2 users can run the Linux command above inside their distro.

Verify the install:

ollama --version
ollama version is 0.12.3

Step 3: Pull Mistral Medium 3.5

ollama pull mistral-medium-3.5
pulling manifest
pulling 9a3f21c8... 100% ▕████████████████▏  80 GB
pulling 4b7e8d12... 100% ▕████████████████▏  12 KB
verifying sha256 digest
writing manifest
success
⚠️
Warning: At 80GB, this download takes a while even on a fast connection. On a 500 Mbps connection, expect roughly 25-30 minutes. On a typical home connection closer to 100 Mbps, budget 2 hours or more. Confirm you have the disk space free before starting. Ollama usually resumes an interrupted pull when you re-run the command, but resuming a failed multi-gigabyte download is not guaranteed.
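The timing figures are easy to re-derive for your own connection, and the disk check is worth automating before an 80GB pull. The sketch below computes a best-case transfer time at full line rate, so real downloads will take longer. The path passed to `has_free_space` is an assumption: point it at whichever drive holds your Ollama model directory.

```python
import shutil


def download_minutes(size_gb: float, link_mbps: float) -> float:
    """Best-case transfer time in minutes at full line rate."""
    size_megabits = size_gb * 8 * 1000  # decimal gigabytes -> megabits
    return size_megabits / link_mbps / 60


def has_free_space(path: str, needed_gb: float) -> bool:
    """Check whether the filesystem containing `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


print(f"500 Mbps: ~{download_minutes(80, 500):.0f} min at line rate")
print(f"100 Mbps: ~{download_minutes(80, 100):.0f} min at line rate")
print("Room for the default tag here:", has_free_space(".", 80))
```

The line-rate numbers come out a little under the estimates in the warning, which leaves room for protocol overhead and real-world throughput.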

Step 4: Verify the pull

ollama list
NAME                          ID              SIZE      MODIFIED
mistral-medium-3.5:latest     a1b2c3d4e5f6    80 GB     2 minutes ago

If you want a higher-precision tag instead, pull it explicitly: `ollama pull mistral-medium-3.5:128b-q8_0` or `ollama pull mistral-medium-3.5:128b-bf16`.
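The same check can be done programmatically. Ollama's REST API lists local models at `/api/tags`. The sketch below separates the HTTP call from the lookup so the lookup can be tried without a running server. The helper names and the sample payload are illustrative, and the payload values are made up.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


def has_model(tags_payload: dict, name: str) -> bool:
    """Return True if a model with exactly this name appears in the payload."""
    return any(m.get("name") == name for m in tags_payload.get("models", []))


# Illustrative payload shaped like the /api/tags response (values are made up).
sample = {"models": [{"name": "mistral-medium-3.5:latest", "size": 80_000_000_000}]}
print(has_model(sample, "mistral-medium-3.5:latest"))
```

With the server running, `has_model(fetch_tags(), "mistral-medium-3.5:latest")` performs the live check.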

Run Your First Inference

Step 5: Start a chat session

ollama run mistral-medium-3.5
>>> Write a Python function that deduplicates a list of dictionaries by a given key, and explain the time complexity.

Mistral Medium 3.5 walks through the dictionary keys, builds the function using a seen-set to track which key values it has already encountered, and returns the deduplicated list. Expect an explanation that the approach runs in O(n) time, since each item is checked once against the seen-set rather than compared against every other item.
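For reference, a seen-set solution of the kind described looks roughly like this. It is a hand-written sketch of the technique, not captured model output, and the model's actual answer will vary in naming and detail.

```python
from typing import Any, Hashable


def dedupe_by_key(items: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Keep the first dictionary seen for each distinct value of `key`."""
    seen: set[Hashable] = set()
    unique = []
    for item in items:
        value = item[key]      # raises KeyError if the key is missing
        if value not in seen:  # average O(1) set membership test
            seen.add(value)
            unique.append(item)
    return unique


rows = [
    {"email": "a@example.com", "id": 1},
    {"email": "b@example.com", "id": 2},
    {"email": "a@example.com", "id": 3},
]
print(dedupe_by_key(rows, "email"))
```

Each item is visited once and each set lookup is constant time on average, which is where the O(n) figure comes from. The key values must be hashable for this approach to work.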

Step 6: Test vision input

>>> What's wrong with this UI screenshot? ./screenshot.png

Ollama detects the file path in your input and attaches it as an image automatically. This works for any vision-capable model in Ollama, not just Mistral Medium 3.5. The model reads the screenshot through its native vision encoder and responds based on what it sees, with no separate upload step.

Step 7: Control reasoning effort

For complex, multi-step problems, ask for deeper reasoning through Ollama's `think` parameter, the same mechanism used for other reasoning models like DeepSeek R1 and Qwen3:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral-medium-3.5",
  "messages": [{"role": "user", "content": "Debug this race condition and explain the fix"}],
  "think": "high"
}'

Set `think` to `"low"` for short, low-latency answers on simple prompts, and `"high"` when you need deeper multi-step reasoning for debugging, math, or planning tasks.

💡
Tip: Higher reasoning effort means more tokens spent thinking before the model answers. For routine completions, leave it on the default or set it to `"low"` to save time and compute.
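The curl request above translates directly to Python. This sketch uses only the standard library. The helper names are illustrative, and the validation accepts only the two effort values this guide mentions, which is an assumption about what you want to allow rather than a statement of everything Ollama accepts.

```python
import json
import urllib.request

ALLOWED_EFFORT = {"low", "high"}  # the two values discussed in this guide


def build_chat_request(prompt: str, effort: str = "high") -> dict:
    """Build an /api/chat request body with an explicit reasoning effort."""
    if effort not in ALLOWED_EFFORT:
        raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORT)}")
    return {
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "think": effort,
        "stream": False,
    }


def send_chat(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST the payload to a running Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)


payload = build_chat_request("Debug this race condition and explain the fix", "high")
print(json.dumps(payload, indent=2))
```

Calling `send_chat(payload)` against a running server returns the full response. Keeping the effort level as a function argument makes it easy to default routine prompts to `"low"`.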

Set Up API Access

Ollama exposes Mistral Medium 3.5 through its standard REST API on `localhost:11434`. No separate account or API key is needed for local use.

curl http://localhost:11434/api/generate -d '{
  "model": "mistral-medium-3.5",
  "prompt": "Summarize the tradeoffs between dense and mixture-of-experts model architectures",
  "stream": false
}'

Python example:

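import requests

# Send a single non-streaming chat request to the local Ollama server.
response = requests.post(
    'http://localhost:11434/api/chat',
    json={
        'model': 'mistral-medium-3.5',
        'messages': [{'role': 'user', 'content': 'Write a SQL query to find duplicate rows by email'}],
        'stream': False,
    },
    timeout=600,  # a large model can take a while to load and answer
)
response.raise_for_status()  # surface HTTP errors before reading the body

# The reply text is under message.content in the JSON response.
print(response.json()['message']['content'])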
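If you leave `stream` at its default instead of setting it to false, Ollama returns the reply as a series of JSON objects, one per line, with the final object marked `"done": true`. A minimal sketch of reassembling that stream follows. The sample lines are hand-written to mimic the response shape, not captured output.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk carries done=true
    return "".join(parts)


# Hand-written sample chunks shaped like a streamed /api/chat response.
sample_lines = [
    b'{"message": {"role": "assistant", "content": "SELECT "}, "done": false}',
    b'{"message": {"role": "assistant", "content": "email"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(sample_lines))
```

With `requests`, passing `stream=True` to `requests.post` and feeding `response.iter_lines()` into `join_stream` gives the same result against a live server.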

For function calling and JSON-mode output, set the `format` field to `json` in the request body or pass a `tools` array following the OpenAI-compatible schema, the same pattern covered in the Ollama with Python guide. If you are wiring Mistral Medium 3.5 into an agent framework, point its OpenAI-compatible base URL at `http://localhost:11434/v1`, the same setup used in the Hermes agent guide.
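As a rough illustration of the `tools` array, the sketch below builds a chat request with one tool and pulls any tool calls out of a response. The tool `get_gpu_price` is hypothetical and exists only for this example. The sample response is hand-written to show the expected shape, not output from the model.

```python
def build_tool_request(prompt: str) -> dict:
    """Build an /api/chat body that offers the model one callable tool."""
    get_gpu_price = {
        "type": "function",
        "function": {
            "name": "get_gpu_price",  # HYPOTHETICAL tool, for illustration only
            "description": "Look up the hourly rental price for a GPU model",
            "parameters": {
                "type": "object",
                "properties": {
                    "gpu": {"type": "string", "description": "GPU name, e.g. H100 80GB"},
                },
                "required": ["gpu"],
            },
        },
    }
    return {
        "model": "mistral-medium-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [get_gpu_price],
        "stream": False,
    }


def extract_tool_calls(response: dict) -> list:
    """Return the tool calls the model requested, or an empty list."""
    return response.get("message", {}).get("tool_calls", [])


# Hand-written sample response showing the shape of a tool call.
sample_response = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_gpu_price", "arguments": {"gpu": "H100 80GB"}}}
        ],
    }
}
for call in extract_tool_calls(sample_response):
    print(call["function"]["name"], call["function"]["arguments"])
```

Your code is responsible for actually running the named function and sending its result back as a follow-up message. The model only proposes the call.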

Troubleshooting

ollama pull mistral-medium-3.5 fails partway through or times out

Cause: The 80GB default download is large enough that unstable connections or aggressive proxy timeouts interrupt it before completion.

Fix: Re-run the same pull command. Ollama resumes from the last verified layer in most cases rather than restarting from zero. If it keeps failing at the same point, switch to a wired connection or pull during off-peak hours.
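Since the fix is simply to run the same command again, a small retry wrapper can do that for you on a flaky connection. This is a minimal sketch. The function names, attempt count, and wait time are arbitrary choices, and it assumes the `ollama` binary is on your PATH.

```python
import subprocess
import time
from typing import Callable


def run_pull(model: str) -> bool:
    """Run `ollama pull` once and report whether it exited successfully."""
    return subprocess.run(["ollama", "pull", model]).returncode == 0


def pull_with_retries(
    model: str,
    attempts: int = 3,
    wait_seconds: float = 30.0,
    pull: Callable[[str], bool] = run_pull,
) -> bool:
    """Retry the pull until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if pull(model):
            return True
        if attempt < attempts:
            time.sleep(wait_seconds)  # give the connection time to recover
    return False
```

Call `pull_with_retries("mistral-medium-3.5")` to use it. The `pull` argument exists so the retry logic can be exercised with a stand-in function instead of a real 80GB download.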

Model loads but inference is extremely slow, minutes per response

Cause: The model is running on CPU and system RAM instead of GPU VRAM, common when your GPU does not have enough free memory to hold the full 80GB-plus tag.

Fix: Check GPU memory with nvidia-smi while the model is loaded. If VRAM usage is near zero, Ollama fell back to CPU. Free up VRAM, use a smaller tag, or move to a GPU with more memory.
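The `nvidia-smi` check can be scripted using its CSV query mode. The sketch below parses used and total memory per GPU and flags a likely CPU fallback. The 1024 MiB threshold is an arbitrary assumption for "near zero", and the helper names are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (in MiB) into a list of integer pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def looks_like_cpu_fallback(rows: list[tuple[int, int]], threshold_mib: int = 1024) -> bool:
    """True when every GPU is nearly empty while the model is supposedly loaded."""
    return all(used < threshold_mib for used, _ in rows)


def read_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return parsed memory usage (requires an NVIDIA driver)."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(output)


# Hand-written sample of what the query prints for one mostly idle 80GB card.
print(looks_like_cpu_fallback(parse_gpu_memory("412, 81559\n")))
```

Run `read_gpu_memory()` while the model is loaded. If the result trips `looks_like_cpu_fallback`, the weights are sitting in system RAM rather than VRAM.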

"model requires more system memory" error on pull or run

Cause: Combined VRAM and RAM is below what the selected tag needs, for example trying to run the q8_0 (138GB) tag on a system with only 96GB total.

Fix: Switch to a smaller quantization tag (for example, q4_K_M instead of q8_0), or add more RAM or VRAM before retrying.

Vision input is ignored or returns a text-only response

Cause: The image path was not detected in the prompt, often because of a typo in the path or an unsupported image format.

Fix: Use an absolute path to a JPEG or PNG file and confirm the file exists. When in doubt, pass the image through the API as a base64-encoded string in the images array instead of the CLI.
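A minimal sketch of the base64 route: encode the image bytes and place them in the `images` array of a chat message. The helper name is illustrative, and the three-byte input below stands in for real file contents so the example runs without an image on disk.

```python
import base64


def build_vision_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a user message with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


def build_vision_request(prompt: str, image_bytes: bytes) -> dict:
    """Wrap the message in an /api/chat request body."""
    return {
        "model": "mistral-medium-3.5",
        "messages": [build_vision_message(prompt, image_bytes)],
        "stream": False,
    }


# Placeholder bytes; in practice use open("/absolute/path/screenshot.png", "rb").read()
request_body = build_vision_request("What's wrong with this UI screenshot?", b"abc")
print(request_body["messages"][0]["images"])
```

Because the image travels inside the request body, this route sidesteps path detection entirely, which makes it the more predictable option for scripts.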

Output quality feels weaker than expected from a flagship model

Cause: Running the default q4_K_M tag at 4-bit quantization trades some accuracy for the lower hardware requirement.

Fix: If your hardware allows it, pull the 128b-q8_0 tag instead. The quality difference between 4-bit and 8-bit is noticeable on complex reasoning and coding tasks.

Unsure whether commercial use of the model is allowed

Cause: Mistral Medium 3.5 ships under a Modified MIT license, which is permissive but includes exceptions for companies above a certain revenue threshold.

Fix: Check the exact license terms on the model card before deploying in a commercial product at scale. For most individuals and small teams, standard MIT terms apply without restriction.

Alternatives to Consider

| Tool | Type | Price | Best For |
|---|---|---|---|
| Devstral 2 | Self-hosted / API | Free (open weights), API pricing varies | Mistral Medium 3.5 explicitly replaces this in production. Smaller and faster if you only need coding, not full general reasoning. |
| Magistral | Self-hosted / API | Free (open weights), API pricing varies | Also replaced by Medium 3.5. Was Mistral's dedicated reasoning-focused model. |
| DeepSeek R1 | Self-hosted (Ollama) | Free, local install | A much lighter local reasoning model if you do not have 80GB or more of VRAM/RAM to spare. |
| GLM 5.2 via Ollama Cloud | Cloud (Ollama) | Free within Ollama Cloud limits | A cloud-only alternative with no local hardware requirement at all. |

Frequently Asked Questions

Can I run Mistral Medium 3.5 without a data center GPU?

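Only slowly. Mistral's own guidance calls for at least four NVIDIA H100 80GB or H200 141GB class GPUs for production serving. For solo use through Ollama, the default q4_K_M tag (80GB) can run on a single 80GB-class GPU, such as an A100 80GB or H100 80GB, but those are data center cards themselves. Without one, Ollama falls back to CPU and system RAM, which works but is slow for a 128B dense model.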

There is no realistic way to run this model well on a typical consumer GPU with 16-24GB of VRAM. If you do not have access to an 80GB-plus GPU, renting one through a service like Vast.ai is the more practical route than buying hardware.

Is Mistral Medium 3.5 free to use?

Yes. Mistral Medium 3.5 ships under a Modified MIT license, which allows commercial and noncommercial use for free, with exceptions for companies above a certain revenue threshold. Running it locally through Ollama costs nothing beyond your own hardware or rented GPU time.

If you would rather not self-host, Mistral also offers it through its API, Le Chat, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice, each with its own pricing.

How does Mistral Medium 3.5 compare to Mistral Medium 3.1?

Mistral Medium 3.5 replaces Mistral Medium 3.1 in Mistral's production lineup. It is the first "merged" model, meaning it combines what used to be split across Medium 3.1 (general instruction-following), Magistral (dedicated reasoning), and Devstral 2 (dedicated coding) into one set of weights.

In practice this means you no longer need to pick between three separate models depending on the task. Medium 3.5 handles general chat, multi-step reasoning, and coding in a single 128B dense model.

What does Mistral Medium 3.5 replace?

Mistral Medium 3.5 explicitly replaces three models in Mistral's lineup: Mistral Medium 3.1 (general instruction-following), Magistral (dedicated reasoning), and Devstral 2 (dedicated coding). Mistral built Medium 3.5 as a single merged model covering all three.

If your existing workflow points at any of those three older models, Medium 3.5 is the direct upgrade path, though it requires meaningfully more hardware to self-host than Devstral 2 did on its own.

Does Mistral Medium 3.5 support image input?

Yes. Mistral Medium 3.5 has a vision encoder trained from scratch that handles variable image sizes and aspect ratios natively. It accepts text and image input and produces text output, which makes it useful for document analysis, diagrams, and UI screenshots.

Through Ollama, drop an image file path into your prompt and Ollama attaches it automatically, the same mechanism used by any vision-capable model in Ollama.

How do I control how much the model reasons before answering?

Use Ollama's `think` parameter in your API request. Set it to `"high"` for complex, multi-step coding, debugging, or planning tasks where you want deeper reasoning before the answer. Set it to `"low"` for simple prompts where speed matters more than depth.

This is the same mechanism Ollama uses for other reasoning models like DeepSeek R1 and Qwen3, not a Mistral-specific flag.

What is the minimum VRAM or RAM needed?

For the default q4_K_M tag, plan on at least 80GB of combined VRAM and system RAM, since the quantized weights alone are 80GB. For the q8_0 tag, plan for at least 140GB. For the full-precision bf16 tag, plan for 260GB or more, typically split across multiple GPUs.

Add extra headroom on top of the weight size for the KV cache, which grows with how much context you actually use.
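To see why context length matters, here is the standard KV cache estimate for a transformer: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and tokens in context. The architecture numbers below are hypothetical placeholders chosen only to show the scaling. They are not Mistral Medium 3.5's actual configuration, which this guide does not list.

```python
def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes assumes a 16-bit cache
) -> float:
    """Approximate KV cache size in GiB for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# HYPOTHETICAL architecture values, NOT the real model configuration.
HYPOTHETICAL = {"layers": 88, "kv_heads": 8, "head_dim": 128}

for context in (8_192, 32_768, 262_144):
    size = kv_cache_gib(context_tokens=context, **HYPOTHETICAL)
    print(f"{context:>7} tokens: {size:.1f} GiB")
```

Whatever the real values are, the cache grows linearly with context, so filling the full 256K window costs far more memory than a short chat. Substitute the figures from the model card for a usable estimate.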

Can I try Mistral Medium 3.5 without installing Ollama?

Yes. Mistral offers Medium 3.5 through its own API, the Le Chat assistant, Mistral Vibe, and as an NVIDIA NIM containerized inference microservice. Any of those let you test the model without downloading 80GB of weights or owning capable hardware.

Ollama is the right choice once you want the model running on your own machine or server, with no per-token API costs and full control over the deployment.
