MLX Inference Benchmark: 4 Frameworks on Apple M5 Max with 35B LLM

LargitData
May 10, 2026, 2:43 p.m.

快速摘要

The MLX inference framework benchmark from ywchiu/mlx_benchmark_lab evaluates five MLX-derived inference engines on an Apple M5 Max with 64 GB unified memory: rapid-mlx, omlx, dflash-mlx, mlx-vlm, and MTPLX. The first four are tested on Qwen3.6-35B-A3B-4bit (35B MoE); MTPLX is tested on Qwen3.6-27B-MTPLX-Optimized-Speed (27B dense + MTP depth=3). Key findings: omlx is the top choice for long-context enterprise workloads on the 35B MoE model — leading from 4K tokens onward and still reaching 82.1 tps at 32K. dflash-mlx peaks at 167.3 tps for short contexts but collapses to 12.6 tps at 32K. rapid-mlx is a flexible middle-ground option; mlx-vlm is the only multimodal framework. The MTPLX addendum shows that 27B + MTP fits 32K context with only 22.12 GB peak memory on a 64 GB Mac — a far smaller footprint than 35B MoE. Enterprise on-premise AI deployment should choose frameworks based on input length and multimodal needs; QubicX's framework abstraction layer can switch engines automatically.

Benchmark data source: github.com/ywchiu/mlx_benchmark_lab
Full JSONL results, plotting scripts, and bilingual reports are all open-sourced in this repository.

MLX is Apple's machine learning framework purpose-built for its Silicon chips, enabling Mac systems to run large language model (LLM) inference directly through Unified Memory without requiring NVIDIA GPUs. With five inference engines now available — rapid-mlx, omlx, dflash-mlx, mlx-vlm, and MTPLX — Apple Silicon has become an attractive option for enterprise on-premise AI deployment. Based on the public ywchiu/mlx_benchmark_lab benchmark (Apple M5 Max, 64 GB unified memory), this article analyzes the performance and stability of all five frameworks across context lengths. The first four frameworks are tested on a 35B-parameter quantized MoE model; the newly added MTPLX is tested on its officially recommended Qwen3.6-27B-MTPLX-Optimized-Speed (27B dense + MTP). Enterprise on-premise AI selection guidance follows.

What Is MLX? The LLM Inference Framework Built for Apple Silicon

MLX is an open-source numerical computing framework released by Apple's machine learning research team in late 2023, deeply optimized for the unified memory architecture of Apple Silicon (M1/M2/M3/M4/M5 series). Unlike traditional CUDA workflows that must move tensor data between CPU and GPU, MLX tensors can be shared zero-copy across CPU, GPU, and Neural Engine, dramatically reducing memory bandwidth bottlenecks. This architectural advantage makes Mac Studio and MacBook Pro with 64 GB or more of unified memory viable platforms for running 30B–70B parameter LLMs on-premise.

On top of MLX, the community has developed multiple inference engines specialized for LLMs, each making different trade-offs. The five mainstream frameworks covered in this benchmark are: rapid-mlx (flexible features with paged KV cache and multi-token prediction), omlx (balanced long-context stability and overall performance), dflash-mlx (fastest at short context using speculative decoding), mlx-vlm (the only framework supporting image, video, and audio multimodal input), and MTPLX (built-in Multi-Token Prediction with the prefill-ladder automated context-scaling benchmark tool, targeting sustained high-throughput inference).

Benchmark Methodology and Hardware Setup

Testing uses an Apple M5 Max with 64 GB unified memory. The first four frameworks (rapid-mlx, omlx, dflash-mlx, mlx-vlm) are tested on mlx-community/Qwen3.6-35B-A3B-4bit — a 4-bit quantized mixture-of-experts (MoE) model with 35B total parameters and 3B active parameters. MTPLX is tested on its officially recommended Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed (27B dense) with sustained mode, MTP (depth=3), disable-thinking, and 128 generated tokens per context. All frameworks are launched as local servers via OpenAI-compatible APIs with prefix caching explicitly disabled to measure genuine cold-start prefill performance. Each context length is tested with five repeated runs, with median, mean, and standard deviation calculated. The first four frameworks are tested at 64, 512, 2,048, 4,096, 8,192, 16,384, and 32,768 tokens; MTPLX at 512, 1,024, 2,048, 4,096, 8,192, 16,384, and 32,768 tokens. Note: MTPLX uses a different model (27B dense + MTP) from the others (35B MoE-A3B), so absolute decode tok/s figures are not directly comparable — only context-length scaling behaviour and memory footprint should be inferred from the MTPLX row.

Benchmark Results: Decode Speed and Long-Context Performance

MLX inference framework decode TPS chart — Figure 1: Decode tokens/sec across seven context lengths for the four MLX inference frameworks.

Context Length	rapid-mlx	omlx	dflash-mlx	mlx-vlm	MTPLX¹
64 tokens	124.9	123.7	167.3	95.5	—
512 tokens	119.5	119.4	122.9	94.8	59.8
1,024 tokens	—	—	—	—	49.6
2,048 tokens	102.5	121.1	160.1	88.5	55.7
4,096 tokens	97.6	120.4	104.5	91.4	43.3
8,192 tokens	90.3	118.0	96.3	87.2	43.1
16,384 tokens	83.2	105.3	84.1	83.1	41.4
32,768 tokens	72.3	82.1	12.6 ⚠️	67.7	31.3

Unit: tokens/sec (median); higher is better. Source: ywchiu/mlx_benchmark_lab, 2026-05-09 (first four) / 2026-05-16 (MTPLX addendum).
¹ MTPLX uses Qwen3.6-27B-MTPLX-Optimized-Speed (27B dense + MTP depth=3) — different model size and architecture from the 35B MoE-A3B used by the others. Absolute numbers are not directly comparable; treat MTPLX values as a context-scaling trend reference only.

Observation 1: dflash-mlx delivers roughly a 35% decode-speed advantage at short context (≤ 2K) thanks to speculative decoding, but suffers catastrophic degradation to 12.6 tps at 32K — roughly six times slower than the others — revealing its architecture does not scale to long context.

Observation 2: omlx is the all-around long-context champion: it leads from 4K tokens onward, sustains over 100 tps at 16K, reaches 82.1 tps at 32K, and exhibits the lowest standard deviation — making it the most stable framework.

Observation 3: Every framework degrades when scaling from 64 to 32,768 tokens, but the magnitude varies enormously — from 34% for omlx to 92% for dflash-mlx. TTFT shows even greater divergence, spanning nearly three orders of magnitude — dflash-mlx spikes to 31 seconds at 32K context, more than double the next worst.

Observation 4 (MTPLX addendum): MTPLX with 27B dense + MTP sustains roughly 50–60 tok/s at short context, holds 41 tok/s at 16K, and only drops to 31 tok/s at 32K. Absolute throughput is lower than the 35B MoE figures above — but this is mostly a model-size effect, not a framework defect. Where MTPLX shines is memory efficiency: 32K context peaks at only 22.12 GB (versus the 40 GB-class footprint typical of 35B MoE), leaving substantial headroom on a 64 GB Mac for running multiple models concurrently. Prefill throughput degrades gracefully from 800 to 530 tok/s, giving a fairly linear scaling profile.

Context	Decode tok/s	Prefill tok/s	TTFT	Peak memory
512	59.76	800.21	0.65s	15.58 GB
1k	49.56	879.06	1.17s	16.18 GB
2k	55.69	720.81	2.84s	17.29 GB
4k	43.28	693.90	5.90s	17.73 GB
8k	43.09	664.82	12.32s	18.37 GB
16k	41.40	646.68	25.35s	19.62 GB
32k	31.34	530.79	61.74s	22.12 GB

MTPLX prefill-ladder run: sustained mode, MTP depth=3, disable-thinking, 128 tokens generated per context. Source: ywchiu/mlx_benchmark_lab, 2026-05-16.

MLX inference framework decode speed box plot — Figure 2: Decode-speed box plot per framework — omlx has the tightest distribution; dflash-mlx shows extreme outliers at long context.

MLX inference framework long-context degradation curve — Figure 5: Decode-speed degradation versus the 64-token baseline. omlx degrades only 34%, while dflash-mlx degrades 92% — a dramatic gap.

MLX inference framework TTFT chart — Figure 4: Time-to-first-token (TTFT) vs context length — spans nearly three orders of magnitude. dflash-mlx spikes to 31 seconds at 32K, more than double the next worst.

MLX inference framework decode stddev chart — Figure 3: Decode standard deviation vs context length. omlx exhibits the lowest variance, making it the most SLA-friendly; rapid-mlx shows TTFT jitter at short context.

MLX inference framework prefill TPS chart — Figure 6: Prefill throughput (tokens/sec) per framework — prefill processes the input prompt and directly affects first-token latency.

How to Choose the Right MLX Framework

omlx — Top choice for long context and enterprise production: leads across the board from 4K tokens onward with the lowest standard deviation, making it best for enterprise RAG, document summarization, and long-form legal or financial analysis. High stability translates directly to easier SLA attainment.

dflash-mlx — Specialized for short-context high throughput: hits 167.3 tps at 64-token context, ideal for applications where input length is predictable and tightly bounded under 2K tokens — structured data classification, SQL generation, short customer-service replies. Long-context use must be strictly avoided.

rapid-mlx — Flexible middle-ground option: never the fastest at any context length but stays consistently competitive while offering paged KV cache, prefix cache, and multi-token prediction. Suits R&D teams exploring experimental techniques.

mlx-vlm — The only multimodal option: roughly 25–30% slower on pure-text workloads, but the only one of the four supporting image, video, and audio input. For post-OCR image understanding, video summarization, or multimodal customer-service bots, it is currently the only choice.

MTPLX — Multi-Token Prediction for sustained throughput: ships with the prefill-ladder automated benchmark tool and built-in MTP acceleration, paired with its official Optimized-Speed model series (e.g. Qwen3.6-27B-MTPLX-Optimized-Speed). At 27B size it can sustain 32K context with only 22 GB peak memory — well suited to running multiple 27B-class models concurrently on a single Mac with predictable input lengths. The trade-off: model variety is currently narrow, limited mostly to the official Optimized-Speed family.

Implications for Enterprise On-Premise AI Deployment

The key insight for enterprise on-premise AI is that hardware and software must be evaluated together — surface-level GPU TFLOPS numbers are insufficient. With unified memory and the right MLX framework, Apple Silicon can run a 35B-parameter quantized model at 80–120 tokens/sec on hardware costing as little as USD 4,000 per Mac Studio — more than sufficient for many internal enterprise RAG or smart customer-service use cases.

Compared to a single NVIDIA H100 starting above USD 30,000, Apple Silicon offers an extremely attractive alternative path for SMBs and distributed deployments. To bring Apple Silicon into formal production, however, enterprises must establish a complete MLOps process covering framework selection, version management, health checks, and performance monitoring.

LargitData's QubicX on-premise AI platform supports multiple hardware backends (NVIDIA GPU, AMD GPU, and Apple Silicon) with a built-in framework abstraction layer that automatically selects the most suitable inference engine based on submitted context length — eliminating the need for enterprise IT to make and switch this judgment manually. For Taiwanese enterprises that demand data sovereignty, low latency, and predictable costs, this kind of automated on-premise AI orchestration is the crucial bridge between benchmarks and production deployment.

Full raw JSONL results, plotting script (plot_results.py), and bilingual reports are open-sourced — reproduce on your own Mac: github.com/ywchiu/mlx_benchmark_lab