Jun 23, 2026

MOSS-TTS 1.5 Review 2026: Apache Voice Cloning on 8GB

By AIFoss · 10 min read

tts · voice-cloning · selfhosted · ai · apache

TL;DR: MOSS-TTS 1.5 is an 8B open TTS model that clones a voice from a short reference clip and — unlike XTTS v2 and F5-TTS — ships under Apache 2.0, so you can actually use it in a paid product. It fits on an 8GB GPU with the llama.cpp path and has MLX builds for Apple Silicon. The catch: cloning fidelity trails XTTS v2 slightly, and setup is rougher than a one-click app.

	MOSS-TTS 1.5	F5-TTS	XTTS v2
Best for	Commercial cloning + long-form	Personal cloning projects	Personal cloning, broad community
License	Apache 2.0 (commercial OK)	CC-BY-NC (non-commercial)	CPML (non-commercial, vendor defunct)
Zero-shot cloning	Yes, from a short clip	Yes, ~3s reference	Yes, ~6s reference
Min VRAM	~8GB (llama.cpp build)	~8–12GB	~6–8GB
The catch	Setup friction, newer ecosystem	Can’t ship commercially	No one left to sell a license

Honest take: If you need voice cloning inside something you’ll sell, MOSS-TTS 1.5 is the first open model that’s both good and legally clean — pick it over F5-TTS and XTTS v2 the moment money is involved.

What MOSS-TTS 1.5 actually is

MOSS-TTS is the speech-generation family from the OpenMOSS team (the group behind the MOSS LLM work) and MOSI.AI. Version 1.5 of the flagship model landed on May 26, 2026, alongside MOSS-SoundEffect-v2.0. It is an 8-billion-parameter model using an architecture the repo calls MossTTSDelay, and every model in the family is released under the Apache License 2.0.

That license line is the whole story for anyone building a product. Voice cloning in the open-source world has been a legal minefield: XTTS v2 is under Coqui’s CPML (non-commercial), and Coqui Inc. shut down in January 2024, so there is literally no one left to sell you a commercial license. F5-TTS ships its weights under CC-BY-NC-4.0 — also non-commercial. MOSS-TTS 1.5 is the rare zero-shot cloning model you can drop into a paid app, a client deliverable, or an internal tool at work without a lawyer flagging it.

The family is broader than one checkpoint:

MOSS-TTS-v1.5 — 8B, the main quality model.
MOSS-TTS-Local-Transformer-v1.5 — 4B, MossTTSLocal architecture, 48kHz stereo output, released June 18, 2026.
MOSS-TTS-Nano — ~100M params, runs on CPU, launched April 13, 2026.

This review focuses on the 8B v1.5 model, since that’s the one most of the r/LocalLLaMA discussion centers on.

What it does well

Cloning quality is genuinely competitive. On the standard Seed-TTS-eval benchmark, the 8B MossTTSDelay model reports an English word error rate of 1.84% and English speaker similarity of 70.86%, with Chinese CER of 1.37% and Chinese speaker similarity of 76.98%. The 4B local-transformer variant pushes similarity higher (73.28% English, 79.62% Chinese). For context, sub-2% WER means the model rarely mangles or skips words — the failure mode that makes most local TTS unusable for real narration.
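
To make the WER figure concrete, here is a minimal sketch of the standard word error rate definition: word-level edit distance divided by the number of reference words. It assumes you already have a transcript of the generated audio from some speech recognizer; the lowercase-and-split normalization is a simplification of ours, not the benchmark's exact pipeline, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplified normalization (an assumption): lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference word was skipped
                curr[j - 1] + 1,     # insertion: an extra word was spoken
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    target = "the quarterly numbers are in"
    heard = "the quarterly number are"
    print(f"WER: {word_error_rate(target, heard):.2f}")
```

In the example, one word is mangled and one is dropped out of five, so two errors count against five reference words. A sub-2% score means fewer than two such errors per hundred words.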

Long-form stability is the standout feature. The model card claims up to one hour of coherent audio in a single run while holding a consistent speaker identity. Most open TTS models drift, change timbre, or fall apart past a few minutes. If you’re producing audiobooks, podcasts, or long documentation read-throughs, that single-run stability matters more than a fractional similarity-score win.

31 languages. That is up from 20 in the 1.0 release, and the list covers Chinese, English, French, German, Spanish, Japanese, Korean, Arabic, Hindi, Thai, and Vietnamese, among others.

Control you don’t usually get. v1.5 adds reliable punctuation-driven pausing and explicit inline pause markers — you can write [pause 3.2s] directly in your text. There’s phoneme-level pronunciation control via mixed Pinyin/IPA input for names and jargon the model would otherwise butcher. The repo also gives a handy planning rule: 1 second of audio ≈ 12.5 tokens, so you can estimate generation length before you run anything.
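
The planning rule is easy to turn into a quick budget check. This is a small sketch that uses only the 12.5 tokens-per-second figure quoted above; the function names are ours, and it makes no assumption about how the model counts text tokens or pause markers.

```python
import math

TOKENS_PER_SECOND = 12.5  # planning rule quoted from the repo


def estimate_tokens(seconds: float) -> int:
    """Approximate audio tokens needed for a clip of the given length."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def estimate_seconds(tokens: int) -> float:
    """Approximate audio length produced by a given token budget."""
    return tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    for label, seconds in [("10-second line", 10), ("5-minute chapter", 300), ("1-hour run", 3600)]:
        print(f"{label}: ~{estimate_tokens(seconds)} tokens")
```

By this rule a full one-hour run works out to roughly 45,000 audio tokens, which is a useful number to have in mind when setting generation limits.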

It runs on hardware you own. After the llama.cpp optimization work, the OpenMOSS team states the 8B model now fits onto 8GB GPUs. That puts it within reach of an RTX 3060 12GB or even an 8GB card, instead of demanding a 24GB workstation GPU. For a model that clones voices at this quality, that’s the headline that makes it practical.
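
Some back-of-envelope arithmetic shows why the quantized path matters on small cards. The sketch below estimates the size of the weights alone for an 8B-parameter model at a few bit widths. The bit widths are illustrative assumptions (the sources here don't state which quantization the 8GB build uses), and the estimate ignores activations, caches, and the audio tokenizer.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Illustrative bit widths only; not a statement about the shipped builds.
    for bits in (16, 8, 4):
        print(f"8B params at {bits}-bit: ~{weight_memory_gb(8, bits):.1f} GB of weights")
```

At 16 bits per weight the weights alone come to about 16 GB, which already exceeds an 8GB card before any runtime overhead, so some form of reduced-precision build is what makes the 8GB claim plausible.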

Install and first run

There are two install paths. The standard PyTorch runtime:

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install -e ".[torch-runtime]"

Or the torch-free path, installed from the same checkout, which is what gets you onto an 8GB card and onto edge devices. It uses GGUF weights plus an ONNX audio tokenizer instead of dragging in the full PyTorch stack:

pip install -e ".[llama-cpp-onnx]"

A minimal zero-shot clone looks like this — point the model at a short reference clip and a transcript, then synthesize new text in that voice:

from moss_tts import MossTTS

tts = MossTTS.from_pretrained("OpenMOSS-Team/MOSS-TTS-v1.5")

audio = tts.generate(
    text="The quarterly numbers are in, and they look better than we feared. [pause 0.8s] Let's walk through them.",
    ref_audio="samples/narrator.wav",   # short reference clip of the target voice
    ref_text="This is the reference transcript.",
    language="en",
)
audio.save("out.wav")

Expect the first run to download several GB of weights, plus the ONNX audio tokenizer on the torch-free path. On a 12GB card the 8B model loads comfortably; on 8GB you’ll want the llama.cpp/GGUF build and should close other GPU apps first.
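
Before that first run, it can help to sanity-check the reference clip. The sketch below uses only Python's standard wave module to report a clip's length, sample rate, and channel count. The helper names and the 3-second default threshold are our assumptions, not documented MOSS-TTS requirements, and the wave module only reads uncompressed PCM WAV files.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic facts about a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "seconds": frames / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


def reference_looks_usable(path: str, min_seconds: float = 3.0) -> bool:
    """Check the clip against a minimum length.

    min_seconds is a hypothetical threshold chosen for illustration,
    not a limit published for MOSS-TTS.
    """
    return describe_wav(path)["seconds"] >= min_seconds


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(describe_wav(sys.argv[1]))
        print("usable:", reference_looks_usable(sys.argv[1]))
```

Run it with the path to your reference clip as the only argument. A clip that is too short is cheaper to catch here than after a multi-gigabyte model load.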

Apple Silicon and ComfyUI

Two integration points matter for this audience.

MLX on Apple Silicon. MOSS-TTS and the MOSS audio tokenizer support mlx-audio, and the community has published quantized builds such as mlx-community/MOSS-TTS-8B-8bit. On a Mac with unified memory this is the cleanest route — no CUDA, no driver wrangling. If you’re already running local models on a Mac, the same logic from our Ollama MLX backend setup guide applies: MLX builds trade a little quality headroom for a big jump in setup simplicity and memory efficiency on M-series chips.

ComfyUI. There’s a community extension, comfyui-moss-tts, that wires the model into ComfyUI’s node graph. If you already run an image pipeline, you can bolt TTS onto the same canvas — useful for generating narrated video assets in one workflow. If you’re new to ComfyUI nodes, our ComfyUI custom nodes guide covers how to install and manage third-party packs without breaking your install.

How it compares

The real decision is rarely “MOSS vs. everything.” It’s “which cloning model can I legally ship, and is it good enough?” Here’s the honest breakdown against the two models people actually reach for.

vs. XTTS v2 — XTTS v2 is still the community’s reference point for cloning fidelity from ~6 seconds of audio across 17 languages, and its tooling ecosystem is enormous. But the CPML license is non-commercial and, with Coqui gone, unfixable. MOSS-TTS 1.5 gets you most of the way on quality with a license you can build on and a wider 31-language footprint. If you’re doing a personal project and want the largest pile of tutorials, XTTS v2 still wins on ecosystem. For anything commercial, it’s disqualified.

vs. F5-TTS — F5-TTS clones from roughly 3 seconds of reference audio and is one of the fastest-moving local TTS projects, with excellent few-shot results. Same blocker: CC-BY-NC weights mean no commercial use. F5-TTS is arguably easier to get a quick demo running. MOSS-TTS 1.5 wins on long-form stability (the one-hour single-run claim) and, again, on licensing.

vs. Kokoro / Piper — worth naming because they come up constantly. Kokoro (Apache 2.0) and Piper (MIT) are both commercial-friendly and excellent, but neither clones voices — they ship fixed voice sets. If you don’t need cloning, they’re lighter and simpler. The moment you need a specific person’s voice, they’re out and MOSS-TTS is the commercial-friendly answer.

Licensing for the whole local-TTS field is messier than most people realize — if you’re choosing a stack to build on, our open-source LLM licensing guide explains why “open weights” and “open source” are not the same promise, and the same traps apply to audio models.

When NOT to use MOSS-TTS 1.5

You need a one-click GUI. This is a Python/llama.cpp install with reference clips and transcripts. If you want a desktop app where you paste text and hit play, MOSS-TTS isn’t there yet — the Nano model and community wrappers help, but the 8B flow is developer-facing.
You don’t need cloning. If fixed voices are fine, Kokoro or Piper give you better-documented, lighter deployments. Cloning is MOSS-TTS’s reason to exist; skip the complexity if you don’t use it.
You’re on an 8GB card and also running an LLM. “Fits on 8GB” assumes the GPU is mostly free. Co-hosting an LLM and MOSS-TTS on one 8GB card will thrash. Either get more VRAM or move TTS off the GPU by running the CPU-capable Nano model instead.
You’re cloning someone without consent. Apache 2.0 covers the code, not your ethics or the law. Voice cloning of real people without permission is a legal and reputational landmine. Use it on consenting speakers, licensed voice talent, or your own voice.

Scaling past a single GPU

If you’re batch-generating hours of audio — a back catalog of audiobooks, a localization run across all 31 languages — a single consumer GPU becomes the bottleneck fast. This is a clean case for renting GPU time by the hour instead of buying a second card: spin up an A100 or L40S on RunPod, run the batch, and shut it down. For picking the right card for sustained local generation versus a rental, our friends at runaihome.com cover consumer GPU choices for AI audio and inference workloads in depth.

Voice generation also slots neatly into a larger local setup. If you’re assembling a full self-hosted toolchain, see how the pieces fit in our open-source AI stack for 2026 — TTS is increasingly the last mile that turns a text pipeline into a finished product.

FAQ

Is MOSS-TTS 1.5 really free for commercial use? Yes. Every model in the MOSS-TTS family is licensed Apache 2.0, which permits commercial use, modification, and redistribution. That’s the main reason to choose it over XTTS v2 (non-commercial) or F5-TTS (non-commercial). As always, confirm the license file in the repo for your exact checkpoint before shipping.

How much reference audio do I need to clone a voice? MOSS-TTS does zero-shot cloning from a short reference clip plus its transcript. The model card doesn’t publish a hard minimum, and v1.5 specifically improved “long-reference, short-text” cloning stability. In practice, a few clean seconds of speech is the working range for models in this class; more clean reference audio generally improves similarity.

Can it run on a Mac? Yes. MOSS-TTS supports mlx-audio, and there are quantized MLX builds like mlx-community/MOSS-TTS-8B-8bit. On Apple Silicon with enough unified memory, MLX is the simplest path.

What sample rate does it output? The 8B MossTTSDelay model targets standard TTS quality; the newer MossTTSLocal-4B v1.5 variant outputs 48kHz stereo, while earlier local models used 24kHz mono. Pick the local-transformer variant if you specifically need high-fidelity stereo.

How does it compare to ElevenLabs? Open benchmarks (Seed-TTS-eval: 1.84% English WER, ~71% speaker similarity) put it in serious-contender territory, but commercial APIs still tend to lead on the most expressive, emotion-heavy delivery. The trade is control, privacy, zero per-character cost, and no vendor lock-in — at the price of running it yourself.

Sources

OpenMOSS/MOSS-TTS — GitHub repository (versions, license, install, MLX/llama.cpp/ComfyUI support, release dates)
MOSS-TTS model card (Seed-TTS-eval benchmarks, sample rates, 1-hour generation, token-rate metric)
OpenMOSS/MOSS-TTSD — GitHub repository (multi-speaker dialogue, zero-shot cloning details)
mlx-community/MOSS-TTS-8B-8bit — Hugging Face (Apple Silicon MLX quantization)
Local TTS & Voice Cloning Licenses 2026 — PromptQuorum (XTTS v2 CPML, F5-TTS CC-BY-NC, Kokoro/Piper permissive licensing)
Best Local TTS Models 2026 — Local AI Master (competitor capability and reference-audio context)
