faster-whisper vs Whisper.cpp vs WhisperX: 2026 Shootout
OpenAI’s Whisper changed what was possible with local speech-to-text, but the reference implementation is slow enough to be impractical for most production use. Three open-source projects fixed that problem in completely different ways, and choosing the wrong one costs you portability, speed, or features you actually need.
Versions covered: faster-whisper v1.2.1 (October 31, 2025), Whisper.cpp v1.8.4 (March 19, 2025), WhisperX v3.8.5 (April 1, 2025).
The quick answer
| Situation | Best choice |
|---|---|
| Python transcription pipeline on NVIDIA GPU | faster-whisper |
| macOS, iOS, Android, or Windows without Python | Whisper.cpp |
| Word-level timestamps for subtitles or search | WhisperX |
| Speaker diarization (who said what) | WhisperX |
| Raspberry Pi, mobile, or browser via WebAssembly | Whisper.cpp |
| Apple Silicon laptop, Metal or Core ML acceleration | Whisper.cpp |
| Batched high-throughput audio processing | faster-whisper |
| Embedding transcription in a C++ application | Whisper.cpp |
| Production audio pipeline, Python data stack | faster-whisper or WhisperX |
| Transcribe a file on macOS right now | Whisper.cpp |
WhisperX wraps faster-whisper, so it inherits most of its performance characteristics. The real decision is: (a) faster-whisper alone for raw throughput, (b) WhisperX when you need timestamps or speaker labels, or (c) Whisper.cpp when Python isn’t available or you need a platform the others don’t support.
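The three-way rule above can be written down as a small function. This is only a sketch of the article's decision logic: the function name, the platform labels, and the `WHISPER_CPP_ONLY` set are hypothetical names invented for illustration, not part of any of the three projects.

```python
# Hypothetical platform labels for targets this article lists as Whisper.cpp-only.
WHISPER_CPP_ONLY = {"ios", "android", "raspberry-pi", "wasm", "apple-metal", "amd-gpu"}


def choose_whisper_tool(
    platform: str,
    python_available: bool = True,
    needs_word_timestamps: bool = False,
    needs_diarization: bool = False,
) -> str:
    """Return the tool the (a)/(b)/(c) rule above would pick."""
    # (c) No Python runtime, or a platform the Python tools do not target.
    if not python_available or platform in WHISPER_CPP_ONLY:
        return "Whisper.cpp"
    # (b) Word-level timestamps or speaker labels are required.
    if needs_word_timestamps or needs_diarization:
        return "WhisperX"
    # (a) Raw throughput on a Python + NVIDIA stack.
    return "faster-whisper"


print(choose_whisper_tool("nvidia-cuda", needs_diarization=True))
```

Note that the platform check runs first: on a Whisper.cpp-only target the function returns Whisper.cpp even when timestamps are requested, which mirrors the trade-off the article describes rather than resolving it.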
What each tool actually is
faster-whisper (SYSTRAN/faster-whisper, MIT license) reimplements OpenAI’s Whisper using CTranslate2 — a C++ inference engine for transformer models that runs computations in INT8 or FP16 instead of full FP32. The result is up to 4× faster inference with equivalent accuracy and meaningfully lower VRAM usage. It’s a Python library, installs via pip, and requires NVIDIA CUDA 12 for GPU acceleration. v1.2.1 added Silero-VAD V6 for improved voice activity detection and fixed a batched-inference bug where <|nocaptions|> tokens were incorrectly generated, causing hallucinated text on borderline audio segments.
Whisper.cpp (ggml-org/whisper.cpp, MIT license) is a C/C++ port built on the ggml tensor library — the same runtime behind llama.cpp. It compiles to a standalone binary with no Python runtime required. The supported hardware list is the widest of any Whisper implementation: NVIDIA CUDA, Apple Metal and Core ML (including the Neural Engine), AMD Vulkan, Intel OpenVINO, WebAssembly, Raspberry Pi, iOS, and Android. It allocates zero memory at runtime after model load. v1.8.4 is a maintenance release incorporating ggml performance improvements across all supported backends.
WhisperX (m-bain/whisperX, BSD-2-Clause license) is a Python layer on top of faster-whisper that adds three capabilities the base implementation lacks: voice activity detection preprocessing (via Silero-VAD, to avoid transcribing silence), word-level forced alignment using wav2vec2 models (reducing timestamp drift from ~1 second to under 100ms), and speaker diarization using pyannote.audio. The project claims 70× realtime transcription speed using batched inference on large-v2 with GPU. The practical result is that WhisperX is slower than bare faster-whisper per audio minute — the alignment pass costs time — but it produces output that actually tells you when each word was spoken and who said it.
The dependency chain matters: WhisperX calls faster-whisper under the hood. Whisper.cpp is a separate codebase with no shared code.
Hardware and system requirements
| | faster-whisper v1.2.1 | Whisper.cpp v1.8.4 | WhisperX v3.8.5 |
|---|---|---|---|
| Language/runtime | Python 3.9+ | C/C++ binary | Python 3.9+ |
| License | MIT | MIT | BSD-2-Clause |
| GPU required? | No (CPU fallback) | No (CPU fallback) | No (CPU fallback) |
| NVIDIA CUDA | CUDA 12 (cuBLAS, cuDNN 9) | Yes | CUDA 12.8 |
| Apple Silicon (Metal) | No | Yes | No |
| Apple Neural Engine (Core ML) | No | Yes | No |
| AMD GPU | No | Vulkan | No |
| Windows | Yes | Yes | Yes |
| iOS / Android | No | Yes | No |
| Raspberry Pi | No | Yes | No |
| WebAssembly | No | Yes | No |
| Word-level timestamps | No | No | Yes |
| Speaker diarization | No | No | Yes |
VRAM usage for a 13-minute audio clip benchmarked by SYSTRAN on an RTX 3070 Ti (8 GB):
| Configuration | VRAM | Transcription time |
|---|---|---|
| large-v3 FP16 (standard) | ~4.5 GB | ~1m03s |
| large-v3 INT8 (quantized) | ~2.9 GB | ~59s |
| large-v3 FP16 batched (batch=8) | ~4.5 GB | ~17s |
| small CPU INT8 (i7-12700K) | n/a | ~1m42s |
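To compare these timings with the "× realtime" figures quoted elsewhere, divide audio duration by processing time. The sketch below does that for the three GPU rows. It assumes the clip is exactly 13 minutes (780 seconds), which the table only states approximately, and `realtime_factor` is a helper name chosen for illustration.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


CLIP_SECONDS = 13 * 60  # assumption: the benchmark clip is exactly 13 minutes

# Processing times in seconds, taken from the GPU rows of the table above.
gpu_timings = {
    "large-v3 FP16": 63,
    "large-v3 INT8": 59,
    "large-v3 FP16 batched (batch=8)": 17,
}

for name, seconds in gpu_timings.items():
    print(f"{name}: {realtime_factor(CLIP_SECONDS, seconds):.1f}x realtime")
```

Under that assumption the batched run works out to roughly 46× realtime, against roughly 12× to 13× for the unbatched runs.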
Whisper.cpp approximate runtime memory usage by model (the on-disk files are smaller):
| Model | Memory |
|---|---|
| tiny | ~273 MB |
| base | ~388 MB |
| small | ~852 MB |
| medium | ~2.1 GB |
| large-v2/v3/v3-turbo | ~3.9 GB |
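One way to use these figures is to pick the largest model that fits a memory budget. The following is a minimal sketch: the numbers are copied from the table above, while the function name and the 80% headroom factor are assumptions made for illustration, not a Whisper.cpp recommendation.

```python
from typing import Optional

# Approximate memory figures in MB, copied from the table above.
MODEL_MEMORY_MB = {
    "tiny": 273,
    "base": 388,
    "small": 852,
    "medium": 2100,
    "large": 3900,  # large-v2 / large-v3 / large-v3-turbo
}


def largest_model_that_fits(available_mb: float, headroom: float = 0.8) -> Optional[str]:
    """Return the largest model whose footprint fits within the budget.

    `headroom` is an assumed safety margin: only this fraction of the
    available memory is treated as usable by the model.
    """
    budget = available_mb * headroom
    fitting = [(mem, name) for name, mem in MODEL_MEMORY_MB.items() if mem <= budget]
    return max(fitting)[1] if fitting else None


print(largest_model_that_fits(4096))  # a device with 4 GB free
```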
On an M2 Pro with Whisper.cpp and Metal acceleration, a 60-second clip processes in roughly 6 seconds using large-v3-turbo — approximately 10× realtime. Enable Core ML to run the encoder on the Apple Neural Engine and you gain an additional ~3× speedup over Metal-only. For Apple Silicon users, Whisper.cpp is the fastest local transcription option available in 2026.
WhisperX requires under 8 GB VRAM for large-v2 with beam_size=5, consistent with the faster-whisper numbers since it uses the same backend. The additional pyannote diarization model adds modest overhead on top.
For testing GPU-heavy transcription workloads before committing to hardware, RunPod rents A100 and H100 instances by the hour. For guidance on selecting a GPU for local AI work, runaihome.com covers hardware tradeoffs in depth.
Installation
faster-whisper
pip install faster-whisper
CUDA 12 with cuBLAS and cuDNN 9 is required for GPU acceleration. If you’re on CUDA 11, downgrade ctranslate2 to version 3.24.0.
Basic usage:
from faster_whisper import WhisperModel

# Load large-v3 on the GPU with INT8 quantization
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# segments is a generator: transcription runs as you iterate over it
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
For batched inference — significantly faster on long files or when processing many files:
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)
segments, info = pipeline.transcribe("audio.mp3", batch_size=16)
Whisper.cpp
Build from source — the only setup path:
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
# Download the model
bash ./models/download-ggml-model.sh large-v3-turbo
# Transcribe a file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav
On macOS with Metal acceleration:
cmake -B build -DWHISPER_METAL=1
cmake --build build -j --config Release
For Core ML (Apple Neural Engine — runs the encoder ~3× faster than Metal alone on M-series):
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
The binary accepts wav input directly. For mp3/m4a/other formats, ffmpeg handles conversion: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav.
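If you do need to drive the binary from Python, the two commands above can be wrapped with `subprocess`. This is a rough sketch under a few assumptions: `ffmpeg` is on the PATH, the binary sits at the build path shown earlier, and the function names are hypothetical. The `-y` flag (overwrite the output without prompting) is added here so ffmpeg does not block waiting for input; it is not part of the command in the text.

```python
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src, dst):
    """Build the ffmpeg command that converts any input to 16 kHz mono WAV."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]


def whisper_cli_cmd(model, wav, binary="./build/bin/whisper-cli"):
    """Build the whisper-cli command using the -m and -f flags shown above."""
    return [binary, "-m", str(model), "-f", str(wav)]


def transcribe_file(src, model):
    """Convert `src` to WAV, run whisper-cli on it, and return its stdout."""
    src = Path(src)
    wav = src.with_name(src.stem + ".16k.wav")  # avoid overwriting the input
    subprocess.run(ffmpeg_to_wav_cmd(src, wav), check=True)
    result = subprocess.run(
        whisper_cli_cmd(model, wav), check=True, capture_output=True, text=True
    )
    return result.stdout


print(" ".join(ffmpeg_to_wav_cmd("input.mp3", "output.wav")))
```

Building the commands as lists rather than shell strings avoids quoting problems with file names that contain spaces.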
WhisperX
pip install whisperx
CUDA 12.8 is required. ffmpeg must be installed separately (brew install ffmpeg on macOS, apt install ffmpeg on Ubuntu). Speaker diarization requires accepting the pyannote.audio model license on Hugging Face and generating an access token.
Command-line transcription with diarization:
whisperx audio.mp3 --model large-v2 --diarize --hf_token YOUR_HF_TOKEN
Programmatic usage with word-level timestamps:
import whisperx
device = "cuda"
audio = whisperx.load_audio("audio.mp3")
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
# Align to get word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
print(result["word_segments"]) # [{word, start, end, score}, ...]
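Once you have a list of word entries shaped like the comment above (`word`, `start`, `end`, `score`), looking up where a phrase was spoken is a short scan. The sketch below runs on hand-written sample data rather than real WhisperX output, and `find_phrase` is a hypothetical helper, not part of the whisperx package. It skips matches whose boundary words have no timing, since some words may come back without timestamps.

```python
import string


def _normalize(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation and spaces."""
    return token.strip(string.punctuation + " ").lower()


def find_phrase(word_segments, phrase):
    """Return (start, end) times for every occurrence of `phrase`."""
    target = [_normalize(t) for t in phrase.split()]
    words = [_normalize(w["word"]) for w in word_segments]
    hits = []
    n = len(target)
    if n == 0:
        return hits
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            first, last = word_segments[i], word_segments[i + n - 1]
            # Skip matches whose boundary words carry no timing information.
            if "start" in first and "end" in last:
                hits.append((first["start"], last["end"]))
    return hits


# Hand-written sample in the assumed word_segments shape.
sample = [
    {"word": "Hello", "start": 0.10, "end": 0.42, "score": 0.98},
    {"word": "world,", "start": 0.50, "end": 0.90, "score": 0.95},
    {"word": "hello", "start": 1.20, "end": 1.50, "score": 0.97},
    {"word": "again.", "start": 1.60, "end": 2.00, "score": 0.93},
]
print(find_phrase(sample, "hello world"))
```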
Accuracy and feature depth
Core ASR accuracy is effectively the same across all three. All use OpenAI’s Whisper model weights, so identical audio and model size yield near-identical transcript text; small differences come from quantization, decoding defaults, and implementation-specific edge cases. Where they diverge:
Voice activity detection. Whisper hallucinates when fed silence or non-speech audio. faster-whisper v1.2.1 includes Silero-VAD V6 for pre-filtering. WhisperX also applies VAD before the model sees the audio. Whisper.cpp has a basic --vad flag but less sophisticated preprocessing than either Python implementation. For clean speech recordings the difference is minor. For interview audio with long pauses or music beds, VAD makes a measurable difference in output cleanliness.
Timestamp precision. All three produce segment-level timestamps (the start and end of each text chunk). Only WhisperX produces word-level timestamps via forced alignment with wav2vec2. The native Whisper segment timestamps have ~1-second drift; WhisperX’s word-level alignment is under 100ms. If you’re generating subtitle files, building searchable audio indexes, or need to clip audio to specific spoken phrases, WhisperX is the only option here.
Diarization. Whisper doesn’t know who is speaking. WhisperX integrates pyannote.audio to add speaker labels to segments. The diarization works well on clean, clearly-separated speech — podcast interviews, recorded meetings with distinct speakers. It degrades on overlapping speech and requires a Hugging Face token to pull the pyannote model.
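Conceptually, adding speaker labels means matching each transcript segment to the diarizer's speaker turns. A common approach is to give each segment the speaker whose turns overlap it the most. The sketch below illustrates that idea on made-up data; it is not taken from WhisperX's source, and the function names and dictionary keys are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turns overlap it most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + ov
        # No overlapping turn at all leaves the segment unlabeled (None).
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": speaker})
    return labeled


# Made-up transcript segments and diarizer turns.
segments = [
    {"start": 0.0, "end": 4.0, "text": "Thanks for joining."},
    {"start": 4.0, "end": 9.0, "text": "Happy to be here."},
]
turns = [
    {"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
    {"start": 4.5, "end": 10.0, "speaker": "SPEAKER_01"},
]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

The overlap rule also shows why overlapping speech is hard: when two speakers talk at once, a segment has substantial overlap with both turns and a single label cannot be right.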
Multilingual support. All three support Whisper’s 99 languages. WhisperX requires language-specific wav2vec2 models for alignment, and not all languages have good coverage. Words containing digits, symbols, or non-Latin characters may not receive word-level timestamps in v3.8.5 — a known limitation in the alignment model.
Platform reach. This is where Whisper.cpp is in a different category. It compiles and runs on hardware the Python implementations can’t target at all: an iPhone app, a Raspberry Pi 4, a browser-based transcription demo in WebAssembly. That matters for an entire class of projects.
When NOT to use each
Don’t use faster-whisper if:
- You’re on macOS or Apple Silicon and want GPU acceleration. faster-whisper doesn’t support Metal. Whisper.cpp uses it natively.
- You’re building a non-Python application. The library dependency is mandatory.
- You need speaker diarization or word-level timestamps. That’s WhisperX’s job.
- Your GPU is AMD. No ROCm or Vulkan support.
Don’t use Whisper.cpp if:
- You need word-level timestamps or diarization. The output is segment-level, the same as vanilla Whisper.
- You’re working in a Python pipeline and don’t want to manage a compiled binary or subprocess calls.
- You’re batching many large audio files on NVIDIA and care about throughput — faster-whisper’s batched pipeline is faster per-file on CUDA hardware.
- You need the output as a Python data structure without writing subprocess wrappers.
Don’t use WhisperX if:
- You’re on macOS or don’t have NVIDIA hardware. WhisperX is effectively Linux/Windows + CUDA.
- You only need a transcript with no timestamps or speaker labels. The alignment pass adds latency for no benefit.
- You’re processing high volumes of audio where raw throughput matters — the alignment step costs time compared to bare faster-whisper.
- Your environment restricts Hugging Face model license acceptance, or you can’t use pyannote’s gated models.
Recommended use by project type
Content pipeline or podcast transcription: faster-whisper with INT8 quantization. Fast, installs in one command, handles long files cleanly. Upgrade to WhisperX if you need chapter markers or searchable transcript indexes.
Meeting transcription with speaker IDs: WhisperX with diarization. This is the use case it was built for. Set --min-speakers and --max-speakers to help the diarizer produce cleaner results.
macOS audio tool or CLI utility: Whisper.cpp. Build once, runs on any Mac including M-series, no Python dependency to manage.
Mobile app (iOS or Android): Whisper.cpp. There are no usable mobile Python runtimes for this workload. Whisper.cpp ships with iOS and Android integration examples.
Browser-based transcription: Whisper.cpp via WebAssembly. Smaller models (tiny, base) run acceptably in a browser. Larger models are impractical.
Subtitle generation: WhisperX. The word-level timestamps are the difference between “good enough” subtitles and accurate ones. Feed the output through ffmpeg or a subtitle library to generate SRT/VTT.
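As an illustration of that last step, word-level entries can be grouped into SRT cues in a few lines of Python. This is a minimal sketch, assuming a list of dicts with `word`, `start`, and `end` keys; the helper names and the seven-words-per-cue limit are arbitrary choices, and a real subtitle pipeline would also consider line length and pauses.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7) -> str:
    """Group word entries into numbered SRT cues of up to `max_words` words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # Words without timing stay in the text but cannot set cue boundaries.
        timed = [w for w in chunk if "start" in w and "end" in w]
        if not timed:
            continue
        text = " ".join(w["word"].strip() for w in chunk)
        start = srt_timestamp(timed[0]["start"])
        end = srt_timestamp(timed[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


demo = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.6, "end": 1.0},
]
print(words_to_srt(demo))
```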
Automated audio pipelines with downstream AI processing: faster-whisper or WhisperX feeding into a local RAG system. See AnythingLLM’s local RAG setup for how transcribed content can flow into a document index. For building multi-step automation around transcription, Flowise vs n8n vs LangGraph 2026 covers the workflow tooling options.
The verdict
faster-whisper v1.2.1 is the right default for Python developers on NVIDIA hardware who want fast local transcription without friction. WhisperX v3.8.5 is the answer when the output format matters — timestamps, speaker labels, subtitle-quality alignment. The two aren’t really competing; they form a natural progression.
Whisper.cpp v1.8.4 competes in a different dimension: portability. It’s not trying to be the fastest Python library. It’s the option when you need Whisper on hardware or platforms where Python is impractical — Apple Silicon with Metal, mobile, WebAssembly, or embedded systems. On macOS specifically, Whisper.cpp with Core ML is faster than anything the Python implementations offer today.
If you’re starting a new project on a CUDA Linux or Windows box, start with faster-whisper. Add WhisperX if you need timestamps or diarization. Reach for Whisper.cpp if you leave the Python/NVIDIA ecosystem.
Sources
- SYSTRAN/faster-whisper — GitHub repository, README, and benchmarks
- faster-whisper v1.2.1 release notes
- ggml-org/whisper.cpp — GitHub repository and README
- whisper.cpp v1.8.4 release
- m-bain/whisperX — GitHub repository and README
- WhisperX v3.8.5 release
- Whisper.cpp Apple Silicon Metal and Core ML benchmarks
- Modal: Choosing between Whisper variants (faster-whisper, WhisperX)