May 20, 2026

faster-whisper vs Whisper.cpp vs WhisperX: 2026 Shootout

By AIFoss · 12 min read


OpenAI’s Whisper changed what was possible with local speech-to-text. The reference implementation, though, is slow enough to be impractical for most production use. Three open-source projects fixed that problem in completely different ways, and choosing the wrong one costs you portability, speed, or the features you actually need.

Versions covered: faster-whisper v1.2.1 (October 31, 2025), Whisper.cpp v1.8.4 (March 19, 2025), WhisperX v3.8.5 (April 1, 2025).

The quick answer

Situation	Best choice
Python transcription pipeline on NVIDIA GPU	faster-whisper
macOS, iOS, Android, or Windows without Python	Whisper.cpp
Word-level timestamps for subtitles or search	WhisperX
Speaker diarization (who said what)	WhisperX
Raspberry Pi, mobile, or browser via WebAssembly	Whisper.cpp
Apple Silicon laptop, Metal or Core ML acceleration	Whisper.cpp
Batched high-throughput audio processing	faster-whisper
Embedding transcription in a C++ application	Whisper.cpp
Production audio pipeline, Python data stack	faster-whisper or WhisperX
Transcribe a file on macOS right now	Whisper.cpp

WhisperX wraps faster-whisper, so it inherits most of its performance characteristics. The real decision is: (a) faster-whisper alone for raw throughput, (b) WhisperX when you need timestamps or speaker labels, or (c) Whisper.cpp when Python isn’t available or you need a platform the others don’t support.
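The three-way decision above can be written down as a small function. This is only a sketch of the article's rule of thumb: the `Needs` fields and the `pick_tool` name are hypothetical and do not come from any of the three libraries.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Hypothetical project checklist; field names are illustrative only."""
    python_available: bool = True
    nvidia_gpu: bool = True
    word_timestamps: bool = False
    diarization: bool = False


def pick_tool(needs: Needs) -> str:
    # (c) No Python runtime on the target: only the compiled binary fits.
    if not needs.python_available:
        return "Whisper.cpp"
    # (b) Output must carry word timing or speaker labels.
    if needs.word_timestamps or needs.diarization:
        return "WhisperX"
    # (c) again: Python exists, but the GPU is not NVIDIA (e.g. Apple Silicon).
    if not needs.nvidia_gpu:
        return "Whisper.cpp"
    # (a) Plain transcripts on CUDA hardware: raw throughput wins.
    return "faster-whisper"


print(pick_tool(Needs()))
print(pick_tool(Needs(diarization=True)))
print(pick_tool(Needs(python_available=False)))
```

Real projects rarely reduce to four booleans, so treat this as a starting checklist rather than a rule.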

What each tool actually is

faster-whisper (SYSTRAN/faster-whisper, MIT license) reimplements OpenAI’s Whisper using CTranslate2 — a C++ inference engine for transformer models that runs computations in INT8 or FP16 instead of full FP32. The result is up to 4× faster inference with equivalent accuracy and meaningfully lower VRAM usage. It’s a Python library, installs via pip, and requires NVIDIA CUDA 12 for GPU acceleration. v1.2.1 added Silero-VAD V6 for improved voice activity detection and fixed a batched-inference bug where <|nocaptions|> tokens were incorrectly generated, causing hallucinated text on borderline audio segments.

Whisper.cpp (ggml-org/whisper.cpp, MIT license) is a C/C++ port built on the ggml tensor library — the same runtime behind llama.cpp. It compiles to a standalone binary with no Python runtime required. The supported hardware list is the widest of any Whisper implementation: NVIDIA CUDA, Apple Metal and Core ML (including the Neural Engine), AMD Vulkan, Intel OpenVINO, WebAssembly, Raspberry Pi, iOS, and Android. It allocates zero memory at runtime after model load. v1.8.4 is a maintenance release incorporating ggml performance improvements across all supported backends.

WhisperX (m-bain/whisperX, BSD-2-Clause license) is a Python layer on top of faster-whisper that adds three capabilities the base implementation lacks: voice activity detection preprocessing (via Silero-VAD, to avoid transcribing silence), word-level forced alignment using wav2vec2 models (reducing timestamp drift from ~1 second to under 100ms), and speaker diarization using pyannote.audio. The project claims 70× realtime transcription speed using batched inference on large-v2 with GPU. The practical result is that WhisperX is slower than bare faster-whisper per audio minute — the alignment pass costs time — but it produces output that actually tells you when each word was spoken and who said it.

The dependency chain matters: WhisperX calls faster-whisper under the hood. Whisper.cpp is a separate codebase with no shared code.

Hardware and system requirements

	faster-whisper v1.2.1	Whisper.cpp v1.8.4	WhisperX v3.8.5
Language/runtime	Python 3.9+	C/C++ binary	Python 3.9+
License	MIT	MIT	BSD-2-Clause
GPU required?	No (CPU fallback)	No (CPU fallback)	No (CPU fallback)
NVIDIA CUDA	CUDA 12 (cuBLAS, cuDNN 9)	Yes	CUDA 12.8
Apple Silicon (Metal)	No	Yes	No
Apple Neural Engine (Core ML)	No	Yes	No
AMD GPU	No	Vulkan	No
Windows	Yes	Yes	Yes
iOS / Android	No	Yes	No
Raspberry Pi	No	Yes	No
WebAssembly	No	Yes	No
Word-level timestamps (forced alignment)	No	No	Yes
Speaker diarization	No	No	Yes

VRAM usage and transcription time for a 13-minute audio clip, as benchmarked by SYSTRAN on an RTX 3070 Ti (8 GB); the CPU row was measured on an i7-12700K:

Configuration	VRAM	Transcription time
large-v3 FP16 (standard)	~4.5 GB	~1m03s
large-v3 INT8 (quantized)	~2.9 GB	~59s
large-v3 FP16 batched (batch=8)	~4.5 GB	~17s
small INT8, CPU (i7-12700K)	n/a	~1m42s

Whisper.cpp approximate memory footprint by model (the on-disk files are smaller):

Model	Memory
tiny	~273 MB
base	~388 MB
small	~852 MB
medium	~2.1 GB
large-v2/v3/v3-turbo	~3.9 GB

On an M2 Pro with Whisper.cpp and Metal acceleration, a 60-second clip processes in roughly 6 seconds using large-v3-turbo — approximately 10× realtime. Enable Core ML to run the encoder on the Apple Neural Engine and you gain an additional ~3× speedup over Metal-only. For Apple Silicon users, Whisper.cpp is the fastest local transcription option available in 2026.
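The "10× realtime" figure is audio duration divided by wall-clock time. A minimal sketch of that arithmetic, applied to the timings quoted in this article (the `realtime_factor` helper is a name made up for illustration):

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Timings quoted in this article.
m2_pro = realtime_factor(60, 6)              # 60 s clip in ~6 s (Whisper.cpp, Metal)
fp16 = realtime_factor(13 * 60, 63)          # 13 min clip in ~1m03s (FP16)
fp16_batched = realtime_factor(13 * 60, 17)  # same clip, batch=8, ~17 s

print(f"M2 Pro, large-v3-turbo: {m2_pro:.1f}x")        # 10.0x
print(f"RTX 3070 Ti, FP16:      {fp16:.1f}x")          # 12.4x
print(f"RTX 3070 Ti, batched:   {fp16_batched:.1f}x")  # 45.9x
```

These factors come from different models, clips, and hardware, so they show how to compute the metric rather than a like-for-like ranking.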

WhisperX requires under 8 GB VRAM for large-v2 with beam_size=5, consistent with the faster-whisper numbers since it uses the same backend. The additional pyannote diarization model adds modest overhead on top.

For testing GPU-heavy transcription workloads before committing to hardware, RunPod rents A100 and H100 instances by the hour. For guidance on selecting a GPU for local AI work, runaihome.com covers hardware tradeoffs in depth.

Installation

faster-whisper

pip install faster-whisper

CUDA 12 with cuBLAS and cuDNN 9 is required for GPU acceleration. If you’re on CUDA 11, downgrade ctranslate2 to version 3.24.0.
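Because the library falls back to CPU when no usable CUDA device is present, one option is to pick the `device` and `compute_type` arguments at startup. The sketch below assumes `ctranslate2.get_cuda_device_count()` reports the devices CTranslate2 can see; the two helper names are hypothetical, and the FP16-on-GPU, INT8-on-CPU policy is one reasonable choice rather than a library recommendation.

```python
def pick_device_and_compute_type(cuda_device_count: int) -> tuple[str, str]:
    """Return (device, compute_type) values to pass to WhisperModel."""
    if cuda_device_count > 0:
        return "cuda", "float16"
    # CPU fallback: INT8 keeps memory use and latency down.
    return "cpu", "int8"


def detect_cuda_device_count() -> int:
    """Ask CTranslate2 how many CUDA devices it sees; 0 if it cannot be queried."""
    try:
        import ctranslate2  # pulled in as a faster-whisper dependency
        return ctranslate2.get_cuda_device_count()
    except Exception:
        return 0


if __name__ == "__main__":
    device, compute_type = pick_device_and_compute_type(detect_cuda_device_count())
    print(device, compute_type)
```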

Basic usage:

from faster_whisper import WhisperModel

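# INT8 quantization keeps VRAM low; use compute_type="float16" for FP16.
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# transcribe() returns a lazy generator; decoding runs as you iterate.
segments, info = model.transcribe("audio.mp3", beam_size=5)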

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
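One detail worth knowing: `segments` is a generator, so the transcription work happens while you iterate and the results can only be consumed once. A small sketch of collecting them in a single pass follows. The helper names are hypothetical, and the demo uses stand-in objects with the same `start`, `end`, and `text` attributes so it runs without a model or GPU.

```python
from types import SimpleNamespace  # only needed for the offline demo


def format_segment(segment) -> str:
    """Render one segment in the same style as the loop above."""
    return f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text.strip()}"


def collect_transcript(segments) -> tuple[list[str], str]:
    """Consume the lazy segment iterable once; return timed lines and plain text."""
    lines, parts = [], []
    for segment in segments:
        lines.append(format_segment(segment))
        parts.append(segment.text.strip())
    return lines, " ".join(parts)


# Stand-in segments that mimic the attributes used above.
fake_segments = (
    SimpleNamespace(start=s, end=e, text=t)
    for s, e, t in [(0.0, 2.5, " Hello there."), (2.5, 4.0, " General remarks.")]
)
lines, text = collect_transcript(fake_segments)
print(lines[0])  # [0.00s -> 2.50s] Hello there.
print(text)      # Hello there. General remarks.
```

With a real model, pass the `segments` value returned by `model.transcribe(...)` instead of `fake_segments`.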

For batched inference — significantly faster on long files or when processing many files:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)
segments, info = pipeline.transcribe("audio.mp3", batch_size=16)

Whisper.cpp

Build from source:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release

# Download the model
bash ./models/download-ggml-model.sh large-v3-turbo

# Transcribe a file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav

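On macOS with Metal acceleration (recent releases enable Metal by default on Apple Silicon; the explicit switch is GGML_METAL, which replaced the older WHISPER_METAL name):

cmake -B build -DGGML_METAL=1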
cmake --build build -j --config Release

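For Core ML (Apple Neural Engine, which runs the encoder ~3× faster than Metal alone on M-series), build with the flag below; the Core ML path also needs a converted encoder model, generated separately with the repository's models/generate-coreml-model.sh script: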

cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

The binary reads 16-bit WAV input directly. For mp3, m4a, or other formats, convert with ffmpeg first: ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav.
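Python callers usually reach Whisper.cpp through a subprocess wrapper. The sketch below chains the ffmpeg conversion and the whisper-cli call. It assumes the build layout from the commands above and uses only the `-m` and `-f` flags already shown; the function names and the `_16k.wav` suffix are hypothetical.

```python
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src: str, dst: str) -> list[str]:
    """Build the ffmpeg call: 16 kHz, mono, 16-bit PCM WAV (-y overwrites dst)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", dst]


def whisper_cli_cmd(wav: str, model: str,
                    binary: str = "./build/bin/whisper-cli") -> list[str]:
    """Build the whisper-cli call for the build layout used in this article."""
    return [binary, "-m", model, "-f", wav]


def transcribe_with_whisper_cpp(src: str, model: str) -> str:
    """Convert src to WAV, run whisper-cli, and return whatever it prints."""
    source = Path(src)
    wav = str(source.with_name(source.stem + "_16k.wav"))
    subprocess.run(ffmpeg_to_wav_cmd(src, wav), check=True)
    result = subprocess.run(
        whisper_cli_cmd(wav, model), check=True, capture_output=True, text=True
    )
    return result.stdout


print(" ".join(whisper_cli_cmd("audio_16k.wav", "models/ggml-large-v3-turbo.bin")))
```

Passing argument lists rather than a shell string avoids quoting problems with file names that contain spaces.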

WhisperX

pip install whisperx

CUDA 12.8 is required for GPU acceleration. ffmpeg must be installed separately (brew install ffmpeg on macOS, apt install ffmpeg on Ubuntu). Speaker diarization requires accepting the pyannote.audio model license on Hugging Face and generating an access token.

Command-line transcription with diarization:

whisperx audio.mp3 --model large-v2 --diarize --hf_token YOUR_HF_TOKEN

Programmatic usage with word-level timestamps:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")
model = whisperx.load_model("large-v2", device, compute_type="float16")

result = model.transcribe(audio, batch_size=16)

# Align to get word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

print(result["word_segments"])  # [{word, start, end, score}, ...]

Accuracy and feature depth

The core ASR accuracy is essentially the same across all three. All use OpenAI’s Whisper model weights, so identical audio and model size yield near-identical transcript text, with small differences from quantization, decoding defaults, and implementation-specific edge cases. Where they diverge:

Voice activity detection. Whisper hallucinates when fed silence or non-speech audio. faster-whisper v1.2.1 includes Silero-VAD V6 for pre-filtering. WhisperX also applies VAD before the model sees the audio. Whisper.cpp has a basic --vad flag but less sophisticated preprocessing than either Python implementation. For clean speech recordings the difference is minor. For interview audio with long pauses or music beds, VAD makes a measurable difference in output cleanliness.
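In faster-whisper the VAD pre-filter is opt-in, switched on through the `vad_filter` and `vad_parameters` arguments of `transcribe()`. A minimal sketch follows; the wrapper name, the `RecordingModel` stand-in, and the 500 ms threshold are illustrative choices, and the stand-in exists only so the example runs without loading a model.

```python
def transcribe_with_vad(model, path: str, min_silence_ms: int = 500):
    """Call model.transcribe with faster-whisper's Silero VAD filter enabled."""
    return model.transcribe(
        path,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": min_silence_ms},
    )


class RecordingModel:
    """Stand-in for WhisperModel that records the arguments it receives."""

    def transcribe(self, path, **kwargs):
        self.last_call = (path, kwargs)
        return iter(()), None  # same (segments, info) shape, but empty


fake = RecordingModel()
segments, info = transcribe_with_vad(fake, "interview.mp3")
print(fake.last_call)
```

With a real `WhisperModel`, the same call drops stretches without detected speech before decoding, which is where the cleaner output on pause-heavy recordings comes from.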

Timestamp precision. All three produce segment-level timestamps (the start and end of each text chunk). faster-whisper can also return per-word times with word_timestamps=True, but those are estimated from the decoder's attention rather than aligned against the audio. Only WhisperX refines word-level timestamps via forced alignment with wav2vec2. The native Whisper segment timestamps have ~1-second drift; WhisperX’s word-level alignment is under 100ms. If you’re generating subtitle files, building searchable audio indexes, or need to clip audio to specific spoken phrases, WhisperX is the strongest option here.

Diarization. Whisper doesn’t know who is speaking. WhisperX integrates pyannote.audio to add speaker labels to segments. The diarization works well on clean, clearly-separated speech — podcast interviews, recorded meetings with distinct speakers. It degrades on overlapping speech and requires a Hugging Face token to pull the pyannote model.
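Once segments carry speaker labels, a common next step is to merge consecutive segments from the same speaker into turns. The sketch below assumes each diarized segment is a dict with `start`, `end`, `text`, and a `speaker` key such as `SPEAKER_00`; check that shape against the output of your WhisperX version. The `speaker_turns` helper and the sample data are made up for illustration.

```python
def speaker_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[dict] = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # some segments may be unlabelled
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


sample = [
    {"start": 0.0, "end": 2.1, "text": " Welcome back.", "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 4.0, "text": " Today we talk Whisper.", "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 5.0, "text": " Thanks for having me.", "speaker": "SPEAKER_01"},
]
turns = speaker_turns(sample)
for turn in turns:
    print(f'{turn["speaker"]} [{turn["start"]:.1f}-{turn["end"]:.1f}] {turn["text"]}')
```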

Multilingual support. All three support Whisper’s 99 languages. WhisperX requires language-specific wav2vec2 models for alignment, and not all languages have good coverage. Words containing digits, symbols, or non-Latin characters may not receive word-level timestamps in v3.8.5 — a known limitation in the alignment model.

Platform reach. This is where Whisper.cpp is in a different category. It compiles and runs on hardware the Python implementations can’t target at all: an iPhone app, a Raspberry Pi 4, a browser-based transcription demo in WebAssembly. That matters for an entire class of projects.

When NOT to use each

Don’t use faster-whisper if:

You’re on macOS or Apple Silicon and want GPU acceleration. faster-whisper doesn’t support Metal. Whisper.cpp uses it natively.
You’re building a non-Python application. The library dependency is mandatory.
You need speaker diarization or word-level timestamps. That’s WhisperX’s job.
Your GPU is AMD. No ROCm or Vulkan support.

Don’t use Whisper.cpp if:

You need word-level timestamps or diarization. The output is segment-level, the same as vanilla Whisper.
You’re working in a Python pipeline and don’t want to manage a compiled binary or subprocess calls.
You’re batching many large audio files on NVIDIA and care about throughput — faster-whisper’s batched pipeline is faster per-file on CUDA hardware.
You need the output as a Python data structure without writing subprocess wrappers.

Don’t use WhisperX if:

You’re on macOS or don’t have NVIDIA hardware. WhisperX is effectively Linux/Windows + CUDA.
You only need a transcript with no timestamps or speaker labels. The alignment pass adds latency for no benefit.
You’re processing high volumes of audio where raw throughput matters — the alignment step costs time compared to bare faster-whisper.
Your environment restricts Hugging Face model license acceptance, or you can’t use pyannote’s gated models.

Recommended use by project type

Content pipeline or podcast transcription: faster-whisper with INT8 quantization. Fast, installs in one command, handles long files cleanly. Upgrade to WhisperX if you need chapter markers or searchable transcript indexes.

Meeting transcription with speaker IDs: WhisperX with diarization. This is the use case it was built for. Set --min_speakers and --max_speakers to help the diarizer produce cleaner results.

macOS audio tool or CLI utility: Whisper.cpp. Build once, runs on any Mac including M-series, no Python dependency to manage.

Mobile app (iOS or Android): Whisper.cpp. There are no usable mobile Python runtimes for this workload. Whisper.cpp ships with iOS and Android integration examples.

Browser-based transcription: Whisper.cpp via WebAssembly. Smaller models (tiny, base) run acceptably in a browser. Larger models are impractical.

Subtitle generation: WhisperX. The word-level timestamps are the difference between “good enough” subtitles and accurate ones. The CLI can write SRT or VTT directly via --output_format, or you can run the JSON output through a subtitle library.
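For custom cue lengths, word-level output can be turned into SRT with a few lines of code. The sketch below assumes a list of word dicts with `word`, `start`, and `end` keys, and it tolerates words that came back without timing, since some tokens may not get aligned. The function names and the seven-words-per-cue default are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i : i + max_words]
        starts = [w["start"] for w in chunk if "start" in w]
        ends = [w["end"] for w in chunk if "end" in w]
        if not starts or not ends:
            continue  # no timed word to anchor this cue to
        text = " ".join(w["word"] for w in chunk)  # untimed words stay in the text
        begin, finish = srt_timestamp(starts[0]), srt_timestamp(ends[-1])
        cues.append(f"{len(cues) + 1}\n{begin} --> {finish}\n{text}\n")
    return "\n".join(cues)


demo_words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "2026"},  # no timing: alignment skipped it
    {"word": "world", "start": 0.9, "end": 1.3},
]
print(words_to_srt(demo_words))
```

Fixed word counts are the simplest grouping rule; splitting on punctuation or on pauses between words usually reads better on screen.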

Automated audio pipelines with downstream AI processing: faster-whisper or WhisperX feeding into a local RAG system. See AnythingLLM’s local RAG setup for how transcribed content can flow into a document index. For building multi-step automation around transcription, Flowise vs n8n vs LangGraph 2026 covers the workflow tooling options.

The verdict

faster-whisper v1.2.1 is the right default for Python developers on NVIDIA hardware who want fast local transcription without friction. WhisperX v3.8.5 is the answer when the output format matters — timestamps, speaker labels, subtitle-quality alignment. The two aren’t really competing; they form a natural progression.

Whisper.cpp v1.8.4 competes in a different dimension: portability. It’s not trying to be the fastest Python library. It’s the option when you need Whisper on hardware or platforms where Python is impractical — Apple Silicon with Metal, mobile, WebAssembly, or embedded systems. On macOS specifically, Whisper.cpp with Core ML is faster than anything the Python implementations offer today.

If you’re starting a new project on a CUDA Linux or Windows box, start with faster-whisper. Add WhisperX if you need timestamps or diarization. Reach for Whisper.cpp if you leave the Python/NVIDIA ecosystem.
