Tool DiscoveryTool Discovery
Docker & VPSIntermediate30 min to complete16 min read

How to Set Up LiteLLM Proxy with Docker (2026)

Set up LiteLLM proxy with Docker to use 100+ LLMs via a single OpenAI-compatible API. Covers config file, virtual keys, cost tracking, and load balancing.

AmaraBy Amara|Updated 2 June 2026
LiteLLM proxy setup diagram: OpenAI, Anthropic, Ollama, and Groq providers routing through a single LiteLLM proxy endpoint to your application

LiteLLM is an open-source proxy that sits between your application and whatever LLM you're calling. Send it an OpenAI-format chat completions request, tell it which model you want, and it handles provider authentication and request translation itself. GPT-4o, Claude Sonnet, Gemini 1.5 Pro, Groq-hosted Llama, a local Ollama instance. Same request format for all of them, different model name in the payload.

The project had 26,000+ GitHub stars by mid-2026. Teams use it mostly for two things: keeping API keys out of application code, and setting per-developer spend limits through virtual keys. The third use case, provider fallback, is less common but genuinely useful when your primary provider starts rate-limiting you mid-run.

This guide covers the full Docker-based setup. You'll start with a single-provider config, add multiple providers including a local Ollama model, generate budget-capped virtual keys, connect PostgreSQL for persistent cost tracking, and finish with load balancing and fallback routing. All examples use LiteLLM v1.40.x and Docker Compose v2.

Prerequisites

  • Docker 24.x or later with Docker Compose v2 (run `docker compose version` to confirm)
  • At least one LLM API key (OpenAI, Anthropic, Groq, or similar) OR Ollama running locally on port 11434
  • 1 GB free RAM minimum for the proxy process (2 GB recommended for concurrent team usage)
  • Basic familiarity with YAML config files
  • (Optional) PostgreSQL 14+ for persistent cost tracking across restarts
  • (Optional) A VPS for deploying LiteLLM as a shared team gateway accessible outside your machine
🖥️

Need a VPS?

Run this on a Contabo Cloud VPS 10 starting at €5.45/mo. Reliable Linux VPS with NVMe storage, ideal for self-hosted AI workloads.

What is LiteLLM?

LiteLLM (github.com/BerriAI/litellm, MIT license) is a Python proxy that accepts OpenAI-format API requests and routes them to the right provider SDK. When your code calls `http://localhost:4000/v1/chat/completions` with model `claude-sonnet`, LiteLLM intercepts it, translates the request to Anthropic's Messages API format, adds your Anthropic API key, and returns the response in the OpenAI format your application expects.

Every provider uses the same config syntax. Switching from GPT-4o to Gemini 1.5 Pro means changing one field in `litellm_config.yaml`. No SDK calls to update, no auth logic to rewrite.

Where LiteLLM fits in a typical AI stack

Without LiteLLM, your application carries direct dependencies on multiple provider SDKs. Each has its own auth format, rate limit behavior, and error structure. With LiteLLM, your application calls one local endpoint. Provider credentials, model routing, retry logic, and error normalization all live in the proxy config. Your application code doesn't need to know which provider it's talking to.

FeatureLiteLLM (self-hosted)OpenRouter (cloud)Direct API calls
Unified OpenAI-format endpointYesYesNo
Cost tracking per virtual keyYesLimitedNo
Budget caps per key or teamYesNoNo
Load balancing across providersYesNoNo
Ollama / local model supportYesNoYes
Data leaves your networkNoYesYes
Setup requiredDockerNoneNone

The setup makes most sense in a few situations: when a team shares API keys across developers, when you need automatic failover if a provider starts returning 429s, or when you're mixing cloud APIs with a local Ollama instance and want one endpoint for both.

Step 1: Create the Configuration File

The `litellm_config.yaml` file defines which models LiteLLM exposes and which provider credentials to use for each. It is the only required configuration file.

Create the project directory

mkdir litellm-proxy
cd litellm-proxy

Write the minimal config

Create `litellm_config.yaml` with one model entry to start:

yaml
model_list:
  - model_name: gpt-5.2
    litellm_params:
      model: openai/gpt-5.2
      api_key: os.environ/OPENAI_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

The `model_name` field is what your application sends in the `model` parameter. The `litellm_params.model` value uses the provider-prefix format: `provider/model-id`. The `os.environ/VARIABLE_NAME` syntax tells LiteLLM to read the value from an environment variable at runtime instead of hardcoding it in the YAML.

The `master_key` is the admin key required on every API request. It also lets you generate budget-capped virtual keys for team members. Use a long random string.

Create the environment file

Store secrets in a `.env` file in the same directory:

# .env
OPENAI_API_KEY=sk-your-actual-openai-key
LITELLM_MASTER_KEY=sk-your-chosen-master-key
⚠️
Warning:Add `.env` to `.gitignore` before your first commit. API keys committed to version control are routinely harvested by automated scanners within minutes.
echo ".env" >> .gitignore

Step 2: Start LiteLLM with Docker Compose

Create `docker-compose.yaml` in the same directory as `litellm_config.yaml`:

yaml
version: "3.8"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "8"]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health/liveliness"]
      interval: 30s
      timeout: 10s
      retries: 3

The `main-stable` tag pulls the latest stable LiteLLM release. The `--num_workers 8` flag sets the number of Gunicorn workers. For a single developer, 4 workers is enough. For a team gateway handling concurrent requests, 8 to 16 workers handles load more cleanly.

Start the proxy:

docker compose up -d

Check the startup logs:

docker compose logs litellm
# Expected:
# INFO: Application startup complete.
# Uvicorn running on http://0.0.0.0:4000 (Press CTRL+C to quit)

The first run pulls the LiteLLM Docker image, which is approximately 2 GB. On a standard 100 Mbps connection, this takes 2 to 4 minutes. Subsequent starts are instant since the image is cached locally.

ℹ️
Note:LiteLLM v1.40.x uses `/health/liveliness` as the health check endpoint. Earlier versions used `/health`. If you run an older image and the healthcheck fails, update the test URL in the compose file. Check github.com/BerriAI/litellm/releases for the changelog before switching image tags.

Step 3: Test the Proxy and Send Your First Request

Confirm the proxy is healthy before making model calls:

curl http://localhost:4000/health
# Expected:
# {"status":"healthy","litellm_version":"1.40.x","environment_variables":{...}}

Send a test chat completion. Replace the master key with what you set in `.env`:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-your-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.2",
    "messages": [{"role": "user", "content": "Reply with the word OK only."}],
    "max_tokens": 5
  }'

A working proxy returns a standard OpenAI-format JSON response with a `choices` array. The `model` field in the response shows which provider model handled the request.

Using the Python OpenAI SDK

Because LiteLLM is OpenAI-compatible, any code that uses the OpenAI Python SDK works unchanged. Swap only `base_url` and `api_key`:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-master-key"
)

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Reply with OK only."}]
)
print(response.choices[0].message.content)
# Output: OK

This pattern works with any library that accepts an OpenAI client: LangChain, LlamaIndex, Instructor, the Vercel AI SDK, and the raw `openai` package in Python, Node.js, or TypeScript. Nothing in the calling code needs to change when you add, remove, or swap providers in the LiteLLM config.

If you prefer using Ollama's native Python library without a proxy layer, see the Ollama Python API guide.

Step 4: Add Multiple LLM Providers

Extend `litellm_config.yaml` with additional providers. Each entry uses the same structure; only the provider prefix and API key variable change.

Add OpenAI, Anthropic, and Groq

yaml
model_list:
  - model_name: gpt-5.2
    litellm_params:
      model: openai/gpt-5.2
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: llama-groq
    litellm_params:
      model: groq/llama-3.3-70b-versatile
      api_key: os.environ/GROQ_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

Add the new keys to `.env`:

ANTHROPIC_API_KEY=sk-ant-your-key
GROQ_API_KEY=gsk_your-groq-key

Update the `environment` block in `docker-compose.yaml`:

yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GROQ_API_KEY=${GROQ_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}

Restart to apply the config change:

docker compose restart litellm

Add a local Ollama model

If Ollama is running on the same machine, use `host.docker.internal` so the container can reach the host's port 11434:

yaml
  - model_name: llama-local
    litellm_params:
      model: ollama/llama3.3:8b
      api_base: http://host.docker.internal:11434

On Linux, `host.docker.internal` is not automatically available. Add this to the `litellm` service block in `docker-compose.yaml`:

yaml
    extra_hosts:
      - "host.docker.internal:host-gateway"

After restarting, the `llama-local` model is accessible through the proxy at no API cost. Requests to it never leave your machine.

💡
Tip:Ollama must have the model already pulled before LiteLLM routes to it. Run `ollama pull llama3.3:8b` on the host first. See the Ollama setup guide for full installation steps if Ollama is not yet running.

Step 5: Generate and Use Virtual API Keys

Virtual keys let each developer or application use their own API key with independent budget caps and model restrictions. The master key creates and manages virtual keys. Virtual keys are what end users and CI systems use for day-to-day requests.

Generate a virtual key with a budget cap

curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-your-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-5.2", "claude-sonnet", "llama-local"],
    "max_budget": 20.0,
    "budget_duration": "30d",
    "key_alias": "dev-alice"
  }'

The response includes a `key` field with the new virtual key:

json
{
  "key": "sk-abc123generated...",
  "models": ["gpt-5.2", "claude-sonnet", "llama-local"],
  "max_budget": 20.0,
  "budget_duration": "30d",
  "key_alias": "dev-alice"
}

Give `sk-abc123generated...` to the developer. They use it exactly like an OpenAI API key, pointed at `http://localhost:4000/v1`. When the key reaches the $20 budget, subsequent requests return a 429 until the 30-day period resets.

Key parameters

ParameterTypePurpose
`models`string[]Whitelist of model names this key can access
`max_budget`numberMaximum USD spend before the key is blocked
`budget_duration`stringReset period: `1d`, `7d`, `30d`, `1mo`
`key_alias`stringHuman-readable label in the dashboard
`team_id`stringGroup keys under a team for aggregate spend reporting
`rpm_limit`numberMaximum requests per minute for this key

List and revoke keys

# List all active keys
curl http://localhost:4000/key/list \
  -H "Authorization: Bearer sk-your-master-key"

# Revoke a specific key
curl -X DELETE http://localhost:4000/key/delete \
  -H "Authorization: Bearer sk-your-master-key" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["sk-abc123generated..."]}'

Step 6: Enable Cost Tracking with PostgreSQL

By default, LiteLLM tracks spend in an in-memory SQLite database that resets when the container restarts. Connect PostgreSQL to persist spend data across restarts and deployments.

Update docker-compose.yaml to add PostgreSQL

yaml
version: "3.8"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${DB_PASSWORD}@db:5432/litellm
    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "8"]
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped

  db:
    image: postgres:16
    environment:
      - POSTGRES_DB=litellm
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:

Add `DB_PASSWORD` to `.env`:

DB_PASSWORD=a-strong-random-password-here

Start the full stack:

docker compose down
docker compose up -d

LiteLLM runs its own database migrations on first startup with a new PostgreSQL URL. Check the logs to confirm migration completed:

docker compose logs litellm | grep -i "migration"
# Expected: INFO: Running migrations...
# Expected: INFO: Migration complete

Query spend data

# Spend summary for a specific virtual key
curl "http://localhost:4000/key/info?key=sk-abc123" \
  -H "Authorization: Bearer sk-your-master-key"

# Global spend by model (specify date range)
curl "http://localhost:4000/global/spend/models?start_date=2026-06-01&end_date=2026-06-30" \
  -H "Authorization: Bearer sk-your-master-key"
ℹ️
Note:The LiteLLM UI at `http://localhost:4000/ui` shows a visual breakdown of spend per key, model, and time period. Log in with your master key. The dashboard is available in LiteLLM v1.40.x and later.

Step 7: Configure Load Balancing and Fallbacks

LiteLLM load-balances automatically when multiple config entries share the same `model_name`. Requests to `gpt-5.2` distribute across all `gpt-5.2` entries using the routing strategy you set.

Load balance across multiple API keys

Use this to spread requests across several OpenAI API keys and stay within per-key rate limits:

yaml
model_list:
  - model_name: gpt-5.2
    litellm_params:
      model: openai/gpt-5.2
      api_key: os.environ/OPENAI_API_KEY_1

  - model_name: gpt-5.2
    litellm_params:
      model: openai/gpt-5.2
      api_key: os.environ/OPENAI_API_KEY_2

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  retry_after: 5

The `least-busy` strategy sends each request to the entry with the fewest active in-flight requests. Other available strategies:

StrategyBehavior
`simple-shuffle`Random selection across entries
`latency-based-routing`Routes to the historically fastest entry
`usage-based-routing`Routes to the entry with the most remaining budget

Fallback to a backup provider

Fallbacks automatically retry failed requests using a different model or provider:

yaml
router_settings:
  routing_strategy: least-busy
  num_retries: 3
  retry_after: 5
  fallbacks:
    - {"gpt-5.2": ["claude-sonnet"]}
    - {"claude-sonnet": ["llama-groq"]}

With this config, a 429 rate limit or 500 error from `gpt-5.2` after 3 retries automatically escalates to `claude-sonnet`. If that also fails, the request goes to `llama-groq`. Your application sees a normal successful response in all cases.

Context window fallbacks

For document processing where input length is unpredictable, fall back to a model with a larger context window:

yaml
router_settings:
  context_window_fallbacks:
    - {"gpt-5.2": ["claude-sonnet"]}

When a request exceeds GPT-5.2's context limit, LiteLLM automatically retries with Claude Sonnet, which supports a 200,000-token context window. The fallback model must already exist in your `model_list`.

Apply any config change by restarting the proxy:

docker compose restart litellm

Configuration Reference

Key environment variables

VariableDefaultPurpose
`LITELLM_MASTER_KEY`None (required)Admin key for all API requests and key management
`DATABASE_URL`SQLite in-memoryPostgreSQL URL for persistent storage: `postgresql://user:pass@host/db`
`LITELLM_LOG``INFO`Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`
`PORT``4000`Listening port (also set via `--port` CLI flag)

Provider-specific keys (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GROQ_API_KEY`, etc.) are read from environment variables by the `os.environ/VARIABLE_NAME` syntax in `litellm_config.yaml`.

Key config.yaml settings

yaml
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

  # Disable auth for local-only development (never use in production)
  no_auth: false

  # Global maximum spend across all keys
  max_budget: 100.0

  # Rate limit for all requests (requests per minute)
  rpm_limit: 200

litellm_settings:
  # Retries on provider errors before declaring failure
  num_retries: 3

  # Request timeout in seconds per provider call
  request_timeout: 120

  # Drop unsupported parameters instead of erroring (useful when mixing providers)
  drop_params: true

  # Enable verbose logging for request/response debugging
  set_verbose: false

Available provider prefixes (selected)

ProviderConfig prefixExample model value
OpenAI`openai/``openai/gpt-5.2`
Anthropic`anthropic/``anthropic/claude-sonnet-4-6`
Google Gemini`gemini/``gemini/gemini-1.5-pro`
Groq`groq/``groq/llama-3.3-70b-versatile`
Mistral`mistral/``mistral/mistral-large-latest`
Ollama`ollama/``ollama/llama3.3:8b`
Azure OpenAI`azure/``azure/gpt-5.2`
AWS Bedrock`bedrock/``bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0`

Full provider list and config options are at docs.litellm.ai/docs/providers.

Troubleshooting

AuthenticationError: 401 on every request

Cause: The Bearer token in the Authorization header does not match the LITELLM_MASTER_KEY or any active virtual key

Fix: Verify the request uses `Authorization: Bearer sk-your-key` with no extra spaces. Confirm LITELLM_MASTER_KEY in .env matches what the container received: run `docker compose exec litellm env | grep LITELLM_MASTER_KEY` to inspect the container's environment.

`NotFoundError`: model not found after adding it to config

Cause: The container was not restarted after editing config.yaml, or the model_name in the request does not exactly match the entry in model_list

Fix: Run `docker compose restart litellm`. Then call `GET http://localhost:4000/v1/models` with your master key to list all loaded model names and confirm the exact string your application should send.

Ollama requests fail with connection refused or timeout

Cause: `host.docker.internal` does not resolve on Linux, so the Docker container cannot reach Ollama on the host machine's port 11434

Fix: Add `extra_hosts: ["host.docker.internal:host-gateway"]` to the litellm service block in docker-compose.yaml. Restart the stack. Confirm Ollama is running on the host: `curl http://localhost:11434/api/tags`.

Config changes have no effect after `docker compose restart`

Cause: The volume mount path in docker-compose.yaml does not match the `--config` flag path, or there is a YAML syntax error in litellm_config.yaml silently ignored at startup

Fix: Confirm the volume line reads `./litellm_config.yaml:/app/config.yaml` and the command includes `--config /app/config.yaml`. Validate the YAML file with `python -c "import yaml; yaml.safe_load(open('litellm_config.yaml'))"` before restarting.

Virtual key returns 429 BudgetExceededError

Cause: The key has consumed its max_budget for the current billing period

Fix: Check remaining budget: `GET /key/info?key=sk-abc123` with the master key. Raise the limit: `POST /key/update` with `{"key": "sk-abc123", "max_budget": 50.0}`. Budget resets automatically at the end of the budget_duration period.

`No healthy deployments available` under concurrent load

Cause: All entries in a load-balancing group are simultaneously rate-limited or failing, and num_retries was exhausted

Fix: Add a fallback model from a different provider in router_settings. Increase num_retries to 5. Check OpenAI (status.openai.com) and Anthropic (status.anthropic.com) status pages for active incidents. Reduce --num_workers if the proxy is overwhelming providers with parallel requests.

Alternatives to Consider

ToolTypePriceBest For
OpenRouterCloudPay-per-token (no base fee)Teams that want access to 100+ models through one API without running any infrastructure. OpenRouter handles routing, provider failover, and billing. No server required. All requests go through OpenRouter's cloud rather than your own network, which matters for data privacy requirements.
PortKeyCloudFree tier, $49/month for teamsTeams who want LiteLLM-style features (fallbacks, budgets, logging) managed as a SaaS product. PortKey has a polished observability dashboard and zero self-hosting overhead. All traffic routes through PortKey servers. Self-hosting is possible but less commonly used.
HeliconeCloudFree tier (10k requests/month), $25/monthTeams focused on LLM observability: logging every request with latency, cost, and user-level tracking. Less capable at routing and load balancing than LiteLLM, but the analytics dashboard and prompt version tracking are more developed. Open-source self-hosting option available.
Cloudflare AI GatewayCloudFree tier availableProjects already using Cloudflare infrastructure. The AI Gateway proxies OpenAI and Anthropic requests through Cloudflare's network with caching and request logging. No Docker setup needed, but it supports fewer providers than LiteLLM and has no virtual key management or team budget controls.

Frequently Asked Questions

Is LiteLLM free to self-host?

Yes. The proxy is MIT-licensed. You pay for the LLM API calls it routes, billed directly by OpenAI, Anthropic, Groq, or whichever provider you use, and the server it runs on.

BerriAI also offers managed LiteLLM Cloud with enterprise features. The self-hosted version has no restrictions on proxy and routing functionality versus the cloud offering. For most teams, a Contabo Cloud VPS 10 at €5.45/month (4 vCPU, 8 GB RAM) handles the proxy, PostgreSQL, and 8 Gunicorn workers without needing a larger plan.

Can LiteLLM proxy local Ollama models alongside cloud APIs?

Yes. Add any Ollama model using the `ollama/` prefix and set `api_base: http://host.docker.internal:11434` in the model entry. On Linux, also add `extra_hosts: ["host.docker.internal:host-gateway"]` to the Docker Compose service definition.

You can mix Ollama entries and cloud API entries in the same model list freely. For example, your application can route creative writing tasks to `claude-sonnet`, coding tasks to a local `llama-local` Ollama model, and fast summarization to `llama-groq` on Groq's inference API, all through one LiteLLM endpoint with the model name as the only differentiator.

See the Ollama setup guide if Ollama is not yet running on your machine.

How do I add Anthropic Claude models to LiteLLM?

Add a model entry with the `anthropic/` prefix:

yaml
- model_name: claude-sonnet
  litellm_params:
    model: anthropic/claude-sonnet-4-6
    api_key: os.environ/ANTHROPIC_API_KEY

Get an Anthropic API key at console.anthropic.com. Add `ANTHROPIC_API_KEY=sk-ant-your-key` to `.env` and add `- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}` to the environment section of `docker-compose.yaml`. Restart the proxy.

For Claude's 200,000-token context window, no additional config is needed. LiteLLM passes the full messages array to the Anthropic API as-is after translating the format.

Does LiteLLM work with LangChain, LlamaIndex, and other AI frameworks?

Yes. Any framework that accepts an OpenAI-compatible endpoint works with LiteLLM. Change only the `base_url` and `api_key` when initializing the OpenAI client:

LangChain:

python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-your-master-key",
    model="gpt-5.2"
)

LlamaIndex:

python
from llama_index.llms.openai import OpenAI
llm = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-master-key", model="gpt-5.2")

All chains, agents, memory classes, and retrieval pipelines work without modification. This is the core benefit of the OpenAI-compatible API design.

Can I run LiteLLM with only Ollama and no cloud API keys?

Yes. A minimal Ollama-only config requires no cloud credentials:

yaml
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3.3:8b
      api_base: http://host.docker.internal:11434

general_settings:
  master_key: sk-local-dev-key

With this config, no request leaves your network. The proxy accepts OpenAI-format requests, routes them to Ollama, and returns standard responses. Useful for local development, privacy-sensitive applications, and air-gapped environments. The master key is still required for LiteLLM's internal authentication even in local-only setups.

What is the latency overhead added by the LiteLLM proxy?

The proxy adds 2 to 10 milliseconds per request for parsing, routing, and response formatting. Most LLM calls take 500 ms to 30 seconds of inference time, so that overhead is not the thing you'll be measuring.

Under high concurrency, the real bottleneck is `--num_workers` and RAM. Each Gunicorn worker uses roughly 200 to 400 MB. With 2 GB dedicated to LiteLLM, you're looking at 4 to 6 workers before memory pressure kicks in. A Contabo VPS 10 (8 GB total) comfortably runs 8 workers alongside PostgreSQL.

For streaming at 20+ concurrent users, set `--num_workers 16` and make sure the server has at least 6 GB free.

How do I update LiteLLM to a newer version?

Pull the latest stable image and recreate the container:

docker compose pull litellm
docker compose up -d --force-recreate litellm

The `main-stable` tag always points to the latest stable release. LiteLLM ships multiple releases per week, so the image may update frequently if you pull regularly. Review the changelog at github.com/BerriAI/litellm/releases before updating production deployments. Breaking config changes appear occasionally in minor versions, particularly around the `router_settings` and `general_settings` keys.

If you pin to a specific version tag (e.g., `ghcr.io/berriai/litellm:v1.40.5`), update the tag in `docker-compose.yaml` and run `docker compose up -d`.

How do I view per-model and per-key spend in LiteLLM?

With PostgreSQL configured, LiteLLM persists spend data and exposes it through the API:

# Spend for a specific virtual key
curl "http://localhost:4000/key/info?key=sk-abc123" \
  -H "Authorization: Bearer sk-your-master-key"

# Global spend broken down by model over a date range
curl "http://localhost:4000/global/spend/models?start_date=2026-06-01&end_date=2026-06-30" \
  -H "Authorization: Bearer sk-your-master-key"

The UI dashboard at `http://localhost:4000/ui` shows the same data visually per key, model, and time period. Log in with your master key. Without PostgreSQL, spend data is in-memory only and resets when the container stops. In-memory tracking is fine for single-session testing but not for ongoing production monitoring.

Related Guides