How to Run DeepSeek R1 Locally with Ollama (2026 Guide)
Run DeepSeek R1 locally using Ollama on Linux, macOS, or a VPS. Model size comparison, RAM requirements, and step-by-step setup for R1 7B through 70B.

DeepSeek R1 is a reasoning model released by Chinese AI lab DeepSeek in January 2025. It uses chain-of-thought reasoning, producing a visible thinking process before its final answer. On math, coding, and logic benchmarks, R1 matches or exceeds OpenAI o1 at a fraction of the inference cost. The model is fully open-weight and available to run on your own hardware.
This guide covers running DeepSeek R1 locally using Ollama. You will choose the right model size for your hardware, pull the model, run inference from the command line, access the REST API, and optionally connect a web interface. The guide covers R1 variants from the 1.5B parameter distillation (runs on 4 GB RAM) through the 70B distillation (requires 48 GB RAM or a high-VRAM GPU).
For users who want DeepSeek R1 on a cloud server without local hardware requirements, Contabo Cloud VPS 40 provides 48 GB RAM at €30.25/month, which is enough to run R1 32B comfortably. The 70B model needs 48 GB of RAM plus headroom for the OS — Contabo Cloud VPS 50 at €37.00/month covers that.
Prerequisites
- Linux (Ubuntu 20.04+), macOS 12+, or Windows 10 with WSL2
- RAM requirement varies by model size — see the model comparison table in this guide
- 10-60 GB free disk space depending on model variant
- Ollama 0.6.x or higher (installation covered in this guide)
- (Optional) NVIDIA GPU with 8+ GB VRAM for hardware-accelerated inference
Need a VPS?
Run this on a Contabo Cloud VPS 40 starting at €30.25/mo. Reliable Linux VPS with NVMe storage, ideal for self-hosted AI workloads.
In This Guide
- What DeepSeek R1 Is and How It Differs from Standard LLMs
- Install Ollama
- Choose Your Model Size and Pull It
- Run Your First Inference
- Use the Ollama REST API
- Connect Open-WebUI for a Chat Interface
- Enable GPU Acceleration
- Ollama Configuration for DeepSeek R1
- Troubleshooting
- Alternatives to Consider
- Frequently Asked Questions
What DeepSeek R1 Is and How It Differs from Standard LLMs
DeepSeek R1 is a reasoning model, not a standard autoregressive language model. The difference is the inference process: R1 generates a step-by-step thinking chain (wrapped in `<think>...</think>` tags) before committing to a final answer.
DeepSeek trained R1 using reinforcement learning from its own generated reasoning traces, without relying on supervised fine-tuning on human-labelled reasoning data. The full R1 model has 671 billion parameters with a Mixture-of-Experts (MoE) architecture, but DeepSeek also released distilled versions based on Llama 3 and Qwen 2.5 backbones. These are the versions most practical to run locally.
DeepSeek R1 variants available on Ollama (as of March 2026):
| Model | Parameters | Architecture | RAM Required | Disk Size | Best For |
|---|---|---|---|---|---|
| deepseek-r1:1.5b | 1.5B | Qwen 2.5 distill | 2 GB | 1.1 GB | Testing, low-RAM devices |
| deepseek-r1:7b | 7B | Qwen 2.5 distill | 8 GB | 4.7 GB | General use, good balance |
| deepseek-r1:8b | 8B | Llama 3 distill | 8 GB | 4.9 GB | Coding tasks, instruction following |
| deepseek-r1:14b | 14B | Qwen 2.5 distill | 16 GB | 9.0 GB | Better reasoning, 16 GB RAM machines |
| deepseek-r1:32b | 32B | Qwen 2.5 distill | 32 GB | 20 GB | High-quality reasoning, 32 GB servers |
| deepseek-r1:70b | 70B | Llama 3 distill | 48 GB | 43 GB | Near-full R1 quality |
| deepseek-r1:671b | 671B | MoE (full model) | 400+ GB | 404 GB | Research / multi-GPU clusters only |
The distilled models (1.5B through 70B) are trained to replicate R1's reasoning behaviour in smaller dense architectures. They retain most of R1's reasoning improvement over standard Llama and Qwen models while running on consumer hardware.
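As a rough sanity check, the table above can be encoded as a small helper. This is an illustrative sketch, not part of Ollama: the tags and RAM thresholds simply mirror the table (the 8B Llama distill shares the 8 GB tier with the 7B, so only one entry per tier is listed).

```python
# Illustrative helper mirroring the table above: pick the largest R1 variant
# whose RAM requirement fits the machine. Thresholds come from the table.
VARIANTS = [  # (Ollama tag, minimum RAM in GB)
    ("deepseek-r1:70b", 48),
    ("deepseek-r1:32b", 32),
    ("deepseek-r1:14b", 16),
    ("deepseek-r1:8b", 8),
    ("deepseek-r1:1.5b", 2),
]

def pick_variant(ram_gb: float):
    """Return the largest variant that fits, or None if even 1.5B will not."""
    for tag, min_ram in VARIANTS:
        if ram_gb >= min_ram:
            return tag
    return None

print(pick_variant(16))  # deepseek-r1:14b
print(pick_variant(48))  # deepseek-r1:70b
```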
Install Ollama
Ollama is the runtime that manages DeepSeek R1 model files and handles inference. Install it on your local machine or VPS.
Linux and macOS
```bash
# One-command installer
curl -fsSL https://ollama.com/install.sh | sh
```

Verify installation:

```bash
ollama --version
# Expected: ollama version 0.6.x
```

On Linux, the installer creates a systemd service. Check it is active:

```bash
systemctl status ollama
# Expected: active (running)
```

Windows (WSL2)
Install Ubuntu via WSL2, then run the Linux installer inside the WSL2 terminal. Ollama on Windows also has a native installer available at ollama.com/download — download the `.exe` file, run it, and verify in PowerShell:
```powershell
ollama --version
```

Docker (for VPS deployments)
```bash
# Pull and run Ollama in a container with model persistence
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```

Choose Your Model Size and Pull It
Choose the largest model your RAM supports. Larger models produce significantly better reasoning output on complex tasks.
Quick decision guide:
- 8 GB RAM: use `deepseek-r1:7b` or `deepseek-r1:8b`
- 16 GB RAM: use `deepseek-r1:14b`
- 32 GB RAM: use `deepseek-r1:32b`
- 48 GB RAM: use `deepseek-r1:70b`
Check your available RAM before pulling:
```bash
# Linux
free -h

# macOS
vm_stat | grep "Pages free"

# Example output (Linux, 32 GB machine):
#               total        used        free
# Mem:           31Gi       4.2Gi        27Gi
```

Pull the model (example: 14B variant):
```bash
ollama pull deepseek-r1:14b
# Expected output:
# pulling manifest
# pulling 6e9f90f02bb3... 100% ████████ 9.0 GB / 9.0 GB
# pulling 11ce4ee3e170... 100% ████████ 1.8 KB / 1.8 KB
# success
```

Download times vary by connection speed. The 14B model (9 GB) takes approximately 8 minutes on a 150 Mbps connection.
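The 8-minute figure is straightforward arithmetic, shown here as a quick sketch (decimal units assumed, with 1 GB treated as 8,000 megabits):

```python
# Rough download-time estimate: model size in GB against line speed in Mbps.
def download_minutes(size_gb: float, mbps: float) -> float:
    megabits = size_gb * 8_000  # decimal units: 1 GB = 8,000 Mbit
    return megabits / mbps / 60

print(round(download_minutes(9.0, 150)))  # ~8 minutes for the 14B model
```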
After the pull completes, confirm the model is available:
```bash
ollama list
# Expected output:
# NAME                ID              SIZE      MODIFIED
# deepseek-r1:14b     ea35dfe18182    9.0 GB    2 minutes ago
```

Run Your First Inference
Interactive CLI Session
Start an interactive session directly in the terminal:
```bash
ollama run deepseek-r1:14b
```

After the model loads (10-20 seconds on first run), type a prompt:

```
>>> Solve this step by step: A train travels 120 km at 60 km/h, then 80 km at 40 km/h. What is the average speed for the entire journey?
```

DeepSeek R1 generates a visible reasoning block before the final answer:
```
<think>
To find average speed, I need total distance divided by total time.
Total distance = 120 km + 80 km = 200 km
Time for first leg = 120 km / 60 km/h = 2 hours
Time for second leg = 80 km / 40 km/h = 2 hours
Total time = 4 hours
Average speed = 200 km / 4 hours = 50 km/h
</think>

The average speed for the entire journey is 50 km/h.
Here is the calculation:
- Leg 1: 120 km at 60 km/h = 2 hours
- Leg 2: 80 km at 40 km/h = 2 hours
- Total: 200 km in 4 hours = 50 km/h
```

The `<think>...</think>` block is part of the model's normal output; the final answer follows after the closing tag.
Exit the interactive session:
```
>>> /bye
```

Single-Shot Inference (Non-Interactive)

```bash
ollama run deepseek-r1:14b "What is the time complexity of merge sort and why?"
```

Check Inference Speed
After running a prompt, Ollama prints performance stats:
```
eval count:       312 token(s)
eval duration:    22.4s
eval rate:        13.9 tokens/s
```

On CPU-only inference with the 14B model: expect 8-15 tokens/second on a 4-core VPS. With an NVIDIA GPU: 40-120 tokens/second depending on VRAM.
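The eval rate is simply the eval count divided by the eval duration. A small sketch of the calculation (note that the CLI reports the duration in seconds, while the REST API's `eval_duration` field is in nanoseconds):

```python
# Ollama's eval rate: tokens generated divided by generation time in seconds.
def eval_rate(eval_count: int, eval_duration_s: float) -> float:
    return eval_count / eval_duration_s

print(f"{eval_rate(312, 22.4):.1f} tokens/s")               # the CLI stats above
print(f"{eval_rate(198, 14200000000 / 1e9):.1f} tokens/s")  # an API-style nanosecond duration
```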
Use the Ollama REST API
Ollama exposes a REST API at `http://localhost:11434` and an OpenAI-compatible endpoint at `http://localhost:11434/v1`. Both work for DeepSeek R1.
Native Ollama API
```bash
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:14b",
    "prompt": "Explain how quicksort works in three sentences.",
    "stream": false
  }'
```

The response JSON includes the reasoning trace in the `response` field, along with token counts and timing:
```json
{
  "model": "deepseek-r1:14b",
  "response": "<think>\nQuicksort works by...\n</think>\nQuicksort is a divide-and-conquer algorithm...",
  "done": true,
  "eval_count": 198,
  "eval_duration": 14200000000
}
```

Note that `eval_duration` is reported in nanoseconds (14200000000 ns = 14.2 s).

OpenAI-Compatible API
Applications built for the OpenAI API work with Ollama's compatible endpoint:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "deepseek-r1:14b",
    "messages": [
      {"role": "user", "content": "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."}
    ]
  }'
```

Stripping the Think Block from API Responses
If your application only needs the final answer (not the reasoning trace), filter the response with a simple Python snippet:
```python
import re

def strip_think(response: str) -> str:
    """Remove DeepSeek R1 <think>...</think> block from response."""
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()

# Example usage
raw = "<think>\nLet me work through this...\n</think>\n\nThe answer is 42."
clean = strip_think(raw)
print(clean)
# Output: The answer is 42.
```

Connect Open-WebUI for a Chat Interface
Open-WebUI gives DeepSeek R1 a ChatGPT-style browser interface, including support for rendering the `<think>` reasoning block.
Quick Start with Docker
```bash
docker run -d \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Open `http://localhost:3000` in your browser. Select `deepseek-r1:14b` from the model dropdown. Open-WebUI renders the `<think>` reasoning block separately from the final answer.
For a detailed Open-WebUI setup guide, see How to Set Up Open-WebUI with Ollama.
Enable GPU Acceleration
GPU inference reduces response time from 10-15 tokens/second (CPU) to 40-120 tokens/second (GPU), depending on the model size and GPU VRAM.
NVIDIA GPU (CUDA)
Ollama detects NVIDIA GPUs automatically on Linux when the NVIDIA CUDA Toolkit is installed. Verify GPU detection:
```bash
ollama run deepseek-r1:14b "Hello"

# Check GPU usage in a second terminal:
nvidia-smi
# Expected: Python or ollama process using GPU memory
```

If Ollama is not using the GPU, install the CUDA toolkit:

```bash
# Ubuntu 22.04
sudo apt install nvidia-cuda-toolkit
```

Partial GPU Offloading (Mixed CPU + GPU)
If the model does not fully fit in VRAM, Ollama offloads as many layers as possible to the GPU and runs the rest on CPU. This is automatic. For the 14B model (9 GB) on a GPU with 8 GB VRAM, Ollama offloads approximately 28 of 40 layers to the GPU, resulting in 25-35 tokens/second instead of the full-GPU 60+ tokens/second.
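The layer count can be sanity-checked with back-of-the-envelope arithmetic. The 1.5 GB overhead reserve below is an assumed figure, not something Ollama reports, and the real split also depends on context length and KV cache size:

```python
# Rough estimate of how many transformer layers fit in VRAM.
model_gb, n_layers = 9.0, 40   # deepseek-r1:14b on disk, repeating layers
vram_gb, overhead_gb = 8.0, 1.5  # assumed reserve for KV cache / CUDA buffers

per_layer_gb = model_gb / n_layers
layers_on_gpu = min(n_layers, int((vram_gb - overhead_gb) / per_layer_gb))
print(layers_on_gpu)  # roughly 28, matching the figure above
```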
Check how many layers are GPU-offloaded by reviewing the Ollama server log:
```bash
journalctl -u ollama -f | grep "offload"
# Example output:
# llama_new_context_with_model: n_ctx = 16384
# llm_load_tensors: offloading 28 repeating layers to GPU
# llm_load_tensors: offloaded 28/41 layers to GPU
```

Apple Silicon (Metal)
On Apple Silicon Macs, Ollama uses the Metal GPU framework automatically. No configuration is required — install Ollama and pull the model as normal. The M2 Pro with 16 GB unified memory runs deepseek-r1:14b at approximately 20-30 tokens/second.
Ollama Configuration for DeepSeek R1
Ollama's behaviour is controlled via environment variables set before starting the service.
| Variable | Default | Purpose |
|---|---|---|
| `OLLAMA_HOST` | `127.0.0.1:11434` | Set to `0.0.0.0:11434` to accept connections from other machines |
| `OLLAMA_MODELS` | `~/.ollama/models` | Custom path for model storage (useful if /home is small) |
| `OLLAMA_NUM_PARALLEL` | `1` | Number of simultaneous inference requests |
| `OLLAMA_MAX_LOADED_MODELS` | `1` | Maximum models kept in memory at once |
| `OLLAMA_KEEP_ALIVE` | `5m` | How long to keep a model in memory after last use |
| `OLLAMA_FLASH_ATTENTION` | `0` | Set to `1` to enable Flash Attention (reduces VRAM use) |
Set Variables on Linux (systemd)
```bash
sudo systemctl edit ollama
```

Add to the override file:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_FLASH_ATTENTION=1"
```

Restart the service:

```bash
sudo systemctl restart ollama
```

Troubleshooting
Model pull fails mid-download with "context deadline exceeded"
Cause: Network timeout during large model download — common with 14B+ models over slow or unstable connections
Fix: Run `ollama pull deepseek-r1:14b` again. Ollama resumes incomplete downloads from where it left off using the cached partial file. No need to restart from the beginning.
Inference is extremely slow (under 3 tokens/second)
Cause: Model does not fit in RAM — system is using swap memory for inference
Fix: Check RAM usage: `free -h`. If swap is active, switch to a smaller model variant. deepseek-r1:7b requires 8 GB RAM. deepseek-r1:1.5b requires only 2 GB. Alternatively, add more RAM or upgrade to a VPS with more memory.
CUDA out of memory error when loading model
Cause: GPU VRAM is insufficient for the selected model size
Fix: Enable Flash Attention: set `OLLAMA_FLASH_ATTENTION=1`. If the error persists, switch to a smaller model. Ollama automatically does partial GPU offloading — if the full model does not fit in VRAM, it loads what it can on GPU and the rest on CPU.
The <think> block is very long (1000+ tokens) before answering
Cause: R1 over-thinks simple prompts — the reasoning model explores multiple paths even for trivial questions
Fix: Add "Answer directly." or "Be concise." to the prompt. This reduces think block length significantly. For tasks that do not benefit from reasoning (simple factual questions), standard Llama or Qwen models are faster.
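To quantify how much of a response the reasoning consumes, the think block can be measured with a small helper (a hypothetical utility, using word count as a rough proxy for token count):

```python
import re

def think_words(response: str) -> int:
    """Rough size of the <think> block, in whitespace-separated words."""
    m = re.search(r'<think>(.*?)</think>', response, flags=re.DOTALL)
    return len(m.group(1).split()) if m else 0

sample = "<think>\nIs this trivial? Yes. Answer directly.\n</think>\nParis."
print(think_words(sample))  # 6
```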
API returns 404 on `http://localhost:11434/v1/chat/completions`
Cause: Ollama version is older than 0.1.24 — the OpenAI-compatible endpoint was added in that version
Fix: Update Ollama: `curl -fsSL https://ollama.com/install.sh | sh` — the installer updates an existing installation. Verify with `ollama --version` after update.
Model loads on first run but subsequent runs start from scratch
Cause: OLLAMA_KEEP_ALIVE is set too low, unloading the model between requests
Fix: Increase keep-alive: set `OLLAMA_KEEP_ALIVE=30m` or `-1` to keep the model loaded indefinitely. Edit the systemd override file and restart the service.
Alternatives to Consider
| Tool | Type | Price | Best For |
|---|---|---|---|
| LM Studio | Desktop app | Free | Windows and macOS users who prefer a GUI over a CLI. Supports DeepSeek R1 GGUF models directly. Easier for beginners but less flexible for server deployments. |
| Jan | Desktop app | Free / open-source | A privacy-focused Electron app for running local models. Supports DeepSeek R1 via GGUF. Good for personal use on desktop, not suited for server or API deployments. |
| llama.cpp (direct) | CLI | Free / open-source | Maximum control and performance tuning. Ollama uses llama.cpp internally — running llama.cpp directly removes the Ollama abstraction layer. Suitable for developers who need custom quantisation or build options. |
| Together AI (DeepSeek R1 API) | Cloud | $0.18 per 1M tokens (input) | Running R1 70B or the full 671B model without local hardware. Together AI hosts DeepSeek R1 and offers an OpenAI-compatible API. Cost-effective for low-volume usage compared to running a high-RAM VPS. |
Frequently Asked Questions
What is the minimum RAM to run DeepSeek R1 locally?
The smallest DeepSeek R1 variant — deepseek-r1:1.5b — requires approximately 2 GB of RAM. It is a 1.5 billion parameter distillation of the full R1 model, trained on Qwen 2.5. Reasoning quality is noticeably lower than the larger variants, but it runs on almost any machine and is useful for testing the R1 setup before committing to a larger model.
For practical everyday use, the 7B or 8B variants are the recommended minimum — these require 8 GB RAM and produce reasoning output that is meaningfully better than standard (non-reasoning) LLMs of similar size.
Is DeepSeek R1 safe to run locally — are there privacy concerns?
Running DeepSeek R1 locally via Ollama means all inference happens on your hardware. No prompts, responses, or data leave your machine. There are no telemetry calls from the Ollama runtime or the model files themselves.
The privacy concern around DeepSeek relates to their cloud API and web interface (chat.deepseek.com), which is subject to Chinese data law and DeepSeek's privacy policy. When you run the open-weight model locally through Ollama, you are using the model weights only — there is no connection to DeepSeek's servers.
How does DeepSeek R1 compare to GPT-4o for coding tasks?
On standard coding benchmarks (HumanEval, SWE-Bench), DeepSeek R1 distillations perform competitively with GPT-4o, and the 70B variant exceeds GPT-4o on several benchmarks. The reasoning chain is particularly useful for debugging and algorithm design, where seeing the step-by-step problem decomposition helps verify correctness.
In practice, R1's advantage is most visible on problems that require multi-step reasoning: algorithm optimisation, debugging logic errors, and mathematical proof-writing. For simple code completion tasks (writing a function from a clear specification), standard models like Llama 3.1 8B are faster and comparable in quality.
Can I use DeepSeek R1 as the backend for n8n or other automation tools?
Yes. Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1`, which n8n's AI nodes accept as a custom OpenAI base URL. Set the base URL to your Ollama endpoint, set the API key to any non-empty string (Ollama ignores it), and select `deepseek-r1:14b` as the model.
The reasoning trace in R1's output can interfere with structured automation workflows — if n8n expects a clean JSON response, R1's `<think>` block must be stripped before the output is parsed (for example with the strip_think regex shown earlier in this guide).
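A sketch of that strip-then-parse step (the sample R1 output string here is hypothetical):

```python
import json
import re

def parse_json_answer(raw: str) -> dict:
    """Strip the <think> block, then parse the remaining text as JSON."""
    answer = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
    return json.loads(answer)

# Hypothetical model output: reasoning first, then a JSON payload.
raw = '<think>\nThe user wants JSON...\n</think>\n{"status": "ok", "count": 3}'
print(parse_json_answer(raw))  # {'status': 'ok', 'count': 3}
```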
Which DeepSeek R1 variant should I choose for a 16 GB RAM machine?
Use deepseek-r1:14b. At 9 GB model size, it leaves approximately 7 GB for the operating system and other processes on a 16 GB machine. This is the sweet spot for 16 GB RAM: the 8B models leave too much RAM unused, while the 32B model requires 32 GB and will cause heavy swap usage on 16 GB.
The 14B variant is a Qwen 2.5 distillation and produces consistently good reasoning on math, coding, and logic tasks. It runs at approximately 10-15 tokens/second on CPU and 30-50 tokens/second on a mid-range NVIDIA GPU.
How much does it cost to run DeepSeek R1 on a cloud VPS versus the DeepSeek API?
DeepSeek's own API charges $0.55 per million input tokens for R1. At 500 tokens per query, that is $0.000275 per query — for 1,000 queries per month, the cost is $0.28. For light usage, the API is cheaper than a VPS.
The crossover point depends on usage volume. A Contabo Cloud VPS 40 at €30.25/month runs deepseek-r1:32b 24/7. If you send more than 110,000 queries per month (roughly 3,600 per day), the VPS becomes cheaper than the API. For personal use, the API is more cost-effective. For businesses processing documents or running batch jobs, the fixed-cost VPS wins.
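The break-even figure follows directly from the numbers above, treating EUR and USD at parity for a rough comparison:

```python
# Crossover arithmetic: fixed-cost VPS vs. per-token API pricing.
vps_per_month = 30.25     # Contabo Cloud VPS 40
api_per_mtok = 0.55       # DeepSeek API, USD per 1M input tokens
tokens_per_query = 500

cost_per_query = api_per_mtok * tokens_per_query / 1_000_000
crossover = vps_per_month / cost_per_query
print(f"${cost_per_query:.6f} per query; break-even near {crossover:,.0f} queries/month")
# Output: $0.000275 per query; break-even near 110,000 queries/month
```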
Does DeepSeek R1 work offline once downloaded?
Yes. Once you have pulled the model with `ollama pull deepseek-r1:14b`, all inference runs locally with no internet connection. Ollama does not phone home during inference, and the model weights are stored in `~/.ollama/models` on your machine.
The only network requirements are the initial pull (9 GB for the 14B model) and any future model updates. After that, the setup works fully offline — useful for air-gapped environments, travel, or areas with unreliable connectivity.
Can I run multiple DeepSeek R1 models simultaneously?
Ollama supports loading multiple models simultaneously with the `OLLAMA_MAX_LOADED_MODELS` environment variable (default is 1). Set it to 2 or 3 to keep multiple models in memory at once.
In practice, running two R1 models simultaneously requires enough RAM for both. Running deepseek-r1:7b and deepseek-r1:14b at the same time needs approximately 23 GB RAM. Simultaneous loading is mainly useful if you are serving multiple users or applications with different model preferences from the same Ollama instance.