
Local AI,
no more waiting on your Mac.

macOS-native MLX server with smart caching. Claude Code, OpenClaw, and Cursor respond in 5 seconds, not 90.

Download DMG View on GitHub →

Apache 2.0  ·  Apple Silicon  ·  macOS 14+  ·  150+ GitHub stars

[oMLX dashboard screenshots — dark and light mode]

Prompt processing (tok/s)  ·  batched generation throughput  ·  <5s TTFT from the 2nd turn  ·  SSD KV cache (no eviction)
Qwen3.5-122B-A10B-4bit · M3 Ultra 512GB

Why oMLX

Built for the way
agents actually work.

Coding agents invalidate the KV cache dozens of times per session. oMLX persists every cache block to SSD — so when the agent circles back to a previous prefix, it's restored from disk in milliseconds, not recomputed from scratch.

01 — CORE
Paged SSD KV caching
Cache blocks are persisted to disk in safetensors format. Two-tier architecture: hot blocks stay in RAM, cold blocks go to SSD with LRU policy. Previously seen prefixes are restored across requests and server restarts — never recomputed.
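The two-tier idea can be sketched in a few lines. This is a toy model, not oMLX's actual implementation: block keys, capacities, and the dict standing in for safetensors files on SSD are all illustrative.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Toy two-tier KV cache: hot blocks in RAM (LRU order),
    cold blocks spilled to an SSD-backed store and never evicted."""

    def __init__(self, ram_capacity=2):
        self.ram = OrderedDict()   # hot tier, LRU order
        self.ssd = {}              # cold tier (stand-in for safetensors on disk)
        self.ram_capacity = ram_capacity

    def put(self, block_id, kv_block):
        self.ram[block_id] = kv_block
        self.ram.move_to_end(block_id)
        while len(self.ram) > self.ram_capacity:
            cold_id, cold_block = self.ram.popitem(last=False)
            self.ssd[cold_id] = cold_block   # spill LRU block to disk

    def get(self, block_id):
        if block_id in self.ram:
            self.ram.move_to_end(block_id)
            return self.ram[block_id]
        if block_id in self.ssd:
            block = self.ssd[block_id]       # restore from disk in milliseconds
            self.put(block_id, block)        # promote back to hot tier
            return block
        return None                          # true miss: must recompute

cache = TwoTierKVCache(ram_capacity=2)
for i in range(4):
    cache.put(f"prefix-{i}", f"kv-{i}")
# prefix-0/1 were spilled to "SSD"; a later request for prefix-0
# restores it instead of recomputing
assert cache.get("prefix-0") == "kv-0"
```

Because the cold tier never evicts, a prefix seen once stays recoverable for the rest of the session, which is what keeps TTFT low when an agent circles back.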
02 — THROUGHPUT
Continuous batching
Handles concurrent requests through mlx-lm's BatchGenerator. Up to 4.14× generation speedup at 8× concurrency. No more queuing behind a single request.
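The scheduling idea behind continuous batching is that a new request joins the running batch as soon as a slot frees up, instead of waiting for the whole previous batch to drain. A toy decode loop (not mlx-lm's BatchGenerator; requests are just (id, tokens-to-generate) pairs):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy decode loop: each step generates one token per active
    request, and finished requests free their slot immediately."""
    queue = deque(requests)
    active = {}            # request id -> tokens remaining
    steps, finished = 0, []
    while queue or active:
        # admit waiting requests into free batch slots
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # one decode step serves every active request at once
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]    # slot frees for the next waiting request
                finished.append(rid)
    return steps, finished

steps, order = continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2)
# 3 decode steps instead of the 6 a sequential server would need
```

The speedup columns in the benchmark tables below measure this effect on real hardware.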
03 — APP
Native macOS menu bar app
Start, stop, and monitor the server from your menu bar. Web dashboard for model management, chat, and real-time metrics. Signed, notarized, with in-app auto-update. Not Electron.
04 — MODELS
Multi-model serving
LLM, VLM, embedding, and reranker models loaded simultaneously. LRU eviction when memory runs low. Browse and download models directly from the admin dashboard.
05 — API
OpenAI + Anthropic drop-in
Compatible with Claude Code, OpenClaw, Cursor, and any OpenAI-compatible client. Native /v1/messages Anthropic endpoint. Web dashboard generates the exact config command for each tool.
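Both endpoints accept the standard request schemas, so pointing a client at the server is just a base-URL change. A sketch of the two request shapes — the model name is illustrative, not a fixed identifier:

```python
import json

BASE = "http://localhost:8000"

# OpenAI-style chat completion request
openai_req = {
    "url": f"{BASE}/v1/chat/completions",
    "body": {
        "model": "qwen3.5-122b-a10b-4bit",   # hypothetical model id
        "messages": [{"role": "user", "content": "Hello"}],
    },
}

# Anthropic Messages request against the same server
anthropic_req = {
    "url": f"{BASE}/v1/messages",
    "body": {
        "model": "qwen3.5-122b-a10b-4bit",   # hypothetical model id
        "max_tokens": 256,                    # required by the Messages schema
        "messages": [{"role": "user", "content": "Hello"}],
    },
}

payload = json.dumps(openai_req["body"])
```

Any OpenAI SDK configured with `base_url="http://localhost:8000/v1"` talks to the first endpoint; Anthropic-style clients such as Claude Code use the second.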
06 — TOOLS
Tool calling + MCP
Supports all major tool calling formats: JSON, Qwen, Gemma, GLM, MiniMax. MCP tool integration and tool result trimming for oversized outputs. Configurable per model.
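The two halves of this feature — parsing a tool call out of model output and trimming an oversized tool result — can be sketched as below. The JSON wrapper and the head-plus-tail trimming strategy are illustrative; the real per-model formats (Qwen, Gemma, GLM, ...) each use their own markers.

```python
import json

def parse_json_tool_call(text):
    """Parse a JSON-format tool call from model output (sketch)."""
    call = json.loads(text)
    return call["name"], call.get("arguments", {})

def trim_tool_result(result, max_chars=2000):
    """Trim an oversized tool result to fit in context, keeping the
    head and tail (a simple stand-in for tool-result trimming)."""
    if len(result) <= max_chars:
        return result
    marker = "…[trimmed]…"
    half = (max_chars - len(marker)) // 2
    return result[:half] + marker + result[-half:]

name, args = parse_json_tool_call('{"name": "read_file", "arguments": {"path": "a.py"}}')
```

Trimming matters with coding agents in particular: a single `cat` of a large file would otherwise blow past the context window and invalidate the cache.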

Performance

Real numbers,
real hardware.

All benchmarks on M3 Ultra 512GB. Single request and continuous batching across four popular models.

Single request performance
Qwen3.5-122B-A10B-4bit  ·  M3 Ultra 512GB
Context | Prompt TPS | Token TPS | Peak Mem
1k  | 768 tok/s | 56.6 tok/s | 65.5 GB
8k  | 941 tok/s | 54.0 tok/s | 69 GB
16k | 886 tok/s | 48.3 tok/s | 71 GB
32k | 765 tok/s | 42.4 tok/s | 73 GB
Continuous batching
pp1024 / tg128  ·  no cache reuse
Batch | Token TPS | Speedup
1 | 56.6 tok/s  | 1.00×
2 | 92.1 tok/s  | 1.63×
4 | 135.1 tok/s | 2.39×
8 | 190.2 tok/s | 3.36×
Single request performance
Qwen3-Coder-Next-8bit  ·  M3 Ultra 512GB
Context | Prompt TPS | Token TPS | Peak Mem
1k  | 1,462 tok/s | 58.7 tok/s | 80 GB
8k  | 2,009 tok/s | 54.9 tok/s | 83 GB
16k | 1,896 tok/s | 52.3 tok/s | 83 GB
32k | 1,624 tok/s | 45.1 tok/s | 85 GB
Continuous batching
pp1024 / tg128  ·  no cache reuse
Batch | Token TPS | Speedup
1 | 58.7 tok/s  | 1.00×
2 | 100.5 tok/s | 1.71×
4 | 164.0 tok/s | 2.79×
8 | 243.3 tok/s | 4.14×
Single request performance
MiniMax-M2.5-8bit  ·  M3 Ultra 512GB
Context | Prompt TPS | Token TPS | Peak Mem
1k  | 588 tok/s | 34.0 tok/s | 227 GB
4k  | 704 tok/s | 30.3 tok/s | 228 GB
8k  | 663 tok/s | 26.3 tok/s | 229 GB
32k | 426 tok/s | 14.9 tok/s | 235 GB
Continuous batching
pp1024 / tg128  ·  no cache reuse
Batch | Token TPS | Speedup
1 | 34.0 tok/s  | 1.00×
2 | 49.7 tok/s  | 1.46×
4 | 109.8 tok/s | 3.23×
8 | 126.3 tok/s | 3.71×
Single request performance
GLM-5-4bit  ·  M3 Ultra 512GB
Context | Prompt TPS | Token TPS | Peak Mem
1k  | 187 tok/s | 16.7 tok/s | 392 GB
4k  | 180 tok/s | 13.7 tok/s | 394 GB
16k | 117 tok/s | 12.0 tok/s | 403 GB
32k | 78 tok/s  | 10.7 tok/s | 415 GB
Continuous batching
pp1024 / tg128  ·  no cache reuse
Batch | Token TPS | Speedup
1 | 16.7 tok/s | 1.00×
2 | 23.7 tok/s | 1.42×
4 | 47.0 tok/s | 2.81×
8 | 60.3 tok/s | 3.61×

"The Qwen3.5 models running on oMLX is so fast that it makes running local AI on Mac worthwhile. It is so much faster than LMStudio and the tool calling is so much more reliable."
— GitHub comment, issue #62

FAQ

Common questions.

How is this different from Ollama or LM Studio?
Ollama and LM Studio cache the KV state in memory, but when the context shifts mid-session — which happens constantly with coding agents — the entire cache gets invalidated and recomputed from scratch. oMLX persists every KV cache block to SSD, so previously cached portions are always recoverable. TTFT drops from 30–90 seconds to under 5 seconds on long contexts.

What hardware do I need?
Apple Silicon (M1 or later) with macOS 14+. 16GB RAM is the minimum, but 64GB+ is recommended for comfortable use with larger models. The sweet spot for daily coding work is an M-series Pro/Max with 64GB+.

Does it work with Claude Code, OpenClaw, and Cursor?
Yes. oMLX provides both OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages) API endpoints. It works as a drop-in backend for all three. The web dashboard has a one-click config generator — select your model, copy the command, paste into terminal.

Do I have to re-download my models?
No. oMLX reuses your existing LM Studio model directory — just point it at your models folder. You can also browse and download models directly from the built-in HuggingFace downloader in the admin dashboard.

Which models are supported?
Any MLX-format model from HuggingFace. This includes Qwen, LLaMA, Mistral, Gemma, DeepSeek, MiniMax, GLM, and more. Reasoning models (DeepSeek, MiniMax, Qwen) get automatic <think> tag handling. Vision-Language Models are supported since v0.2.0 with the same paged SSD caching.
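The automatic <think> tag handling amounts to splitting the reasoning trace away from the final answer. A simplified sketch — real servers also handle streaming output and unclosed tags:

```python
import re

def split_think(text):
    """Split a reasoning model's output into (reasoning, answer).
    Simplified: assumes at most one complete <think>...</think> block."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, answer

r, a = split_think("<think>2+2 is 4</think>The answer is 4.")
# r == "2+2 is 4", a == "The answer is 4."
```

Clients that don't understand reasoning traces just receive the answer; the reasoning can be surfaced separately in the dashboard or API response.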

Up and running
in two minutes.

Download the DMG or install from source. Reuses your existing LM Studio model directory — no re-download needed.

macOS App Recommended
Drag to Applications. The welcome screen walks you through model directory, server start, and first model download. Signed and notarized.
Download DMG
From source
Requires Python 3.10+ and Apple Silicon. Serves any OpenAI-compatible client on localhost:8000.
# clone and install
git clone https://github.com/jundot/omlx
cd omlx && pip install -e .

# start serving
omlx serve --model-dir ~/models