DeepSeek V4 Pro Review 2026: MIT 1.6T MoE for Self-Hosters


TL;DR: DeepSeek V4 Pro is a 1.6T-parameter (49B active) MoE model released April 24, 2026 under the MIT license, with full open weights on HuggingFace. The benchmarks are frontier-class, but at ~862 GB of weights it is a datacenter model — no consumer rig runs it. For self-hosters, the open-source story that actually matters is V4-Flash (284B / 13B active) and the MIT license that lets you deploy either commercially.

| | V4 Pro (self-hosted) | V4 Pro (API) | V4-Flash (self-hosted) |
|---|---|---|---|
| Best for | Sovereign datacenter inference | Frontier quality, zero ops | Real single-node self-hosting |
| Min hardware | 8× H100 / 4× H200 (FP8) | API only | 1× 80 GB GPU (Q4 quant); ~175 GB total VRAM (FP8) |
| Weights size | ~862 GB | n/a | ~158 GB |
| License | MIT | Proprietary endpoint | MIT |
| Context | 1M tokens | 1M tokens | 1M tokens |
| Cost | Hardware + power | $0.435 input / $0.87 output per 1M tokens | Hardware + power |

Honest take: If you have a GPU cluster and a compliance reason, self-host V4 Pro. Everyone else should run V4-Flash locally or hit the V4 Pro API — at $0.87 per million output tokens, paying for Pro is cheaper than the electricity to fake it on quantized hardware.


What DeepSeek V4 Pro Is

DeepSeek released the V4 series on April 24, 2026. There are two open-weight checkpoints: V4-Pro (1.6T total parameters, ~49B activated per token) and V4-Flash (284B total, ~13B activated). Both are Mixture-of-Experts models, both ship under the MIT license, and both support a context window of up to 1 million tokens.

V4-Pro was pre-trained on 33T tokens. The headline architectural change is a hybrid attention scheme combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). The point of that combination is long-context efficiency: at a 1M-token context, DeepSeek reports V4-Pro needs roughly 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That is the difference between a 1M context that is a marketing number and one you can actually serve.

Key specs as of release:

  • Parameters: 1.6T total / ~49B active (MoE) for Pro; 284B / ~13B for Flash
  • Context: up to 1,000,000 tokens
  • License: MIT — commercial use, fine-tuning, redistribution, no revenue cap
  • Released: April 24, 2026 by DeepSeek
  • HuggingFace: deepseek-ai/DeepSeek-V4-Pro, deepseek-ai/DeepSeek-V4-Flash
  • Serving: requires vLLM ≥ 0.7.0 or SGLang ≥ 0.4.4

The MIT license is the load-bearing detail. A 1.6T model that beats most of the closed frontier on coding, released with weights you can legally deploy commercially and fine-tune, is a different category of thing than an API you rent. Whether you can run it is a separate question — and the honest answer for almost everyone is “not on your own hardware.”


The $10B Open-Source Bet, In Plain Terms

The reason this release got outsized attention isn’t just the benchmarks. In May 2026, DeepSeek’s first outside funding round started closing. Reporting varies on the exact figure: Bloomberg framed it as a 70 billion yuan ($10 billion) raise in valuation terms, while CNBC and The Information later pegged the actual money raised closer to $7–7.4 billion, with founder Liang Wenfeng reportedly committing around 20 billion yuan of his own capital. Don’t anchor on a single dollar figure — the reports genuinely disagree, and the round had not officially closed as of early June 2026.

What’s consistent across every report is the strategic stance: Liang told investors DeepSeek will keep releasing open-source models rather than pivot to short-term commercialization. For a self-hoster, that pledge is worth more than the exact size of the round. It signals that V4-Pro’s MIT weights are a deliberate strategy, not a one-off, which lowers the risk of building a private stack on the DeepSeek line and having the rug pulled in a future “open-weights but non-commercial” relicense.


Benchmark Reality Check

DeepSeek positions V4-Pro-Max (the maximum-reasoning-effort mode) as a frontier coding model. Independent and aggregator sources report the following for V4-Pro, and you should treat secondary-source benchmark numbers as directional rather than gospel:

| Benchmark | V4-Pro (reported) | What it measures |
|---|---|---|
| SWE-bench Verified | 80.6% | Real GitHub issue resolution |
| LiveCodeBench | 93.5% | Competitive coding, contamination-resistant |
| Codeforces (rating) | 3206 | Algorithmic problem solving |
| GPQA Diamond | 90.1% | Graduate-level science reasoning |
| MMLU-Pro | 87.5% | Broad knowledge, harder MMLU variant |

If those hold up, V4-Pro sits in the same conversation as the top closed models on coding — at open weights and a fraction of the API price. The number that matters most for the “should I pay or self-host” decision is SWE-bench Verified: 80.6% is genuinely strong, and it’s the kind of agentic coding workload where a 1M context plus cheap cached input changes how you’d structure a coding agent.

One caveat worth stating plainly: aggregator sites are not the model card, and “Max” reasoning mode inflates latency and token cost. Benchmark a representative slice of your own workload before you treat any of these as a procurement decision.


Can You Actually Self-Host It?

This is where the romance meets the spec sheet. V4-Pro’s weights are roughly 862 GB. Full BF16 is around 3.2 TB. The realistic deployment targets:

  • FP8: ~500 GB minimum — at least 4× H200 (141 GB each) or 8× H100 (80 GB each).
  • INT4: a 4× H100 cluster (320 GB) becomes viable, with measurable quality loss on reasoning and math.
  • Consumer GPUs: not happening for Pro. A single RTX 4090 holds 24 GB. You would need dozens, and the interconnect would be the bottleneck long before VRAM.
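
The card counts in that list come down to simple division. Here is a minimal sketch, assuming the ~500 GB FP8 footprint quoted in the list and treating every gigabyte of VRAM as usable. The helper `min_gpus` and its `usable_fraction` parameter are illustrative names, not part of any library.

```python
import math

def min_gpus(footprint_gb: float, gpu_vram_gb: float, usable_fraction: float = 1.0) -> int:
    """Smallest GPU count whose combined usable VRAM covers the footprint."""
    return math.ceil(footprint_gb / (gpu_vram_gb * usable_fraction))

FP8_FOOTPRINT_GB = 500  # figure quoted in the list above, not measured here

for name, vram_gb in [("H200", 141), ("H100", 80), ("RTX 4090", 24)]:
    print(f"{name}: at least {min_gpus(FP8_FOOTPRINT_GB, vram_gb)} cards")
```

The raw division gives 4 H200s, 7 H100s, and 21 RTX 4090s. In practice the H100 count is usually rounded up to 8, since the tensor-parallel degree typically has to divide the model's attention head count evenly, and none of these figures leave room for KV cache.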

For the cluster-class hardware Pro demands, renting is almost always the right first move. A few hours on rented H100/H200 nodes via RunPod costs less than the depreciation on a single owned card, and you can validate the deployment before committing capital. If you’re sizing a permanent local build for big MoE models, the GPU-server tradeoffs are covered in more depth at runaihome.com.

V4-Flash is the model self-hosters will actually run. Its FP8 instruct checkpoint is ~158 GB. Add ~10 GB for a full 1M-token KV cache plus runtime overhead, and you're budgeting roughly 170–175 GB. An 8×24 GB rig (192 GB) clears that budget. A 2× H100 node (160 GB) holds the weights but leaves almost no room for KV cache, so it needs a tight context cap or a lighter quant. A single 80 GB card cannot hold the FP8 checkpoint at all. Unsloth published GGUF quants for V4-Flash within about 48 hours of release; community reports put Q4_K_M as the sweet spot, fitting on 1× 80 GB or 2× 48 GB while staying close to FP8 quality. Aggressive INT4 (GGUF/AWQ/GPTQ) can squeeze Flash to ~80 GB, potentially within reach of 4× RTX 4090, but the quality drop on math and complex instruction-following is real, not theoretical.


A Minimal vLLM Deployment

Here’s a realistic single-node V4-Flash launch on an 8×80 GB box. The flags matter more than usual for a model this size: tensor parallelism across all GPUs, an explicit context cap, and the trust-remote-code flag for the custom attention.

# Requires vLLM >= 0.7.0
pip install "vllm>=0.7.0"

# Serve V4-Flash with an OpenAI-compatible endpoint
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code \
  --port 8000

Illustrative startup log (abbreviated; exact wording depends on the vLLM version):

INFO server_args.py: Using FP8 weights, 158.2 GiB total
INFO worker.py: TP=8, KV cache allocated for 131072 tokens
INFO api_server.py: Started server on http://0.0.0.0:8000

Once the server is up, any OpenAI-compatible client can talk to it. With curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Refactor this function for readability."}]
  }'
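
The same request can be built from Python using only the standard library. This is a sketch, assuming the server from the launch command above is listening on localhost:8000. The helper names `build_chat_request` and `send` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the vllm serve command above
MODEL = "deepseek-ai/DeepSeek-V4-Flash"

def build_chat_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
# print(send(build_chat_request("Refactor this function for readability.")))
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before anything touches the network.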

Start at --max-model-len 131072 even though the model supports 1M. The full-context KV cache is what eats your VRAM headroom, and most workloads never need a million tokens. Scale the context up only when a real task demands it. If you'd rather have a fuller production setup with Nginx, auth, and multiple models, that's covered in the vLLM production setup guide.
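
To see why the cap matters, here is a rough estimate of KV cache size as context and concurrency grow. It takes the ~10 GB per full 1M-token sequence figure quoted earlier for V4-Flash and assumes the cache scales linearly with cached tokens, which is a simplification. `kv_cache_gb` is a hypothetical helper, not a vLLM function.

```python
def kv_cache_gb(context_tokens: int, sequences: int = 1,
                gb_per_million_tokens: float = 10.0) -> float:
    """Estimate KV cache size, assuming it grows linearly with cached tokens."""
    return sequences * context_tokens / 1_000_000 * gb_per_million_tokens

for context in (131_072, 1_000_000):
    for sequences in (1, 8):
        size = kv_cache_gb(context, sequences)
        print(f"{context:>9} tokens x {sequences} sequences: ~{size:.1f} GB")
```

Under these assumptions, eight concurrent sequences at the 131072 cap need about 10.5 GB, while eight at the full 1M context would need about 80 GB.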


A Real Problem: The Custom Attention Tripwire

The most common early failure with V4 isn’t VRAM — it’s the serving stack version. The CSA + HCA attention is implemented in custom kernels, and a vLLM or SGLang build below the minimum will either refuse to load the config or, worse, load it and produce garbage logits.

What it looks like:

ValueError: Model architecture 'DeepseekV4ForCausalLM' is not supported.

or, on a build that partially recognizes it:

KeyError: 'compressed_sparse_attention'

The fix is unglamorous: confirm vllm>=0.7.0 (or sglang>=0.4.4), and pass --trust-remote-code so the custom attention module loads. If you pinned vLLM months ago for another model, upgrading can break that model’s flags — so test V4 in a fresh virtualenv rather than upgrading your existing serving environment in place. This is the single biggest source of “the weights downloaded but it won’t run” reports in the first weeks after release.
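
A quick pre-flight check can catch the version mismatch before a long model load. The sketch below compares installed package versions against the minimums stated in this article. It only looks at the leading numeric release components, so it treats pre-release and post-release suffixes as equal to the base release. The helper names are illustrative.

```python
import re
from importlib import metadata

MINIMUMS = {"vllm": "0.7.0", "sglang": "0.4.4"}  # minimums stated in this article

def parse_version(version: str) -> tuple:
    """Return leading numeric release parts, e.g. '0.7.0.post1' -> (0, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, padding with zeros so '0.7' equals '0.7.0'."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_package(name: str) -> str:
    """Report whether an installed package meets the minimum listed above."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    verdict = "ok" if meets_minimum(installed, MINIMUMS[name]) else "too old"
    return f"{name} {installed}: {verdict} (need >= {MINIMUMS[name]})"

# Usage, run inside the serving virtualenv:
# print(check_package("vllm"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.10.1" sorts before "0.7.0", which would wrongly flag a newer build as too old.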


Pricing: When Paying Beats Self-Hosting

DeepSeek made an aggressive API discount permanent for V4-Pro on May 22, 2026. The standing rates:

  • Input (cache miss): $0.435 per 1M tokens
  • Output: $0.87 per 1M tokens
  • Cached input: $0.003625 per 1M tokens

The cached-input price is the interesting one. For an agentic coding setup with a large, stable system prefix — repo context, tool definitions, instructions — context caching turns the dominant cost component into a rounding error. That’s a structural advantage of a 1M-context model with cheap cache reads, and it’s hard to replicate the economics on self-hosted hardware unless your GPUs are already paid for and otherwise idle.
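
A worked example makes the caching effect concrete. The prices are the ones listed above; the token counts describe a hypothetical agent turn and are assumptions chosen for illustration.

```python
# USD per 1M tokens, as listed above
PRICE_INPUT_MISS = 0.435
PRICE_INPUT_CACHED = 0.003625
PRICE_OUTPUT = 0.87

def request_cost(cached_in: int, uncached_in: int, out: int) -> float:
    """Cost in USD of one request, given token counts per pricing tier."""
    return (cached_in * PRICE_INPUT_CACHED
            + uncached_in * PRICE_INPUT_MISS
            + out * PRICE_OUTPUT) / 1_000_000

# Hypothetical agent turn: 200k-token stable prefix, 2k new input, 1k output
cold = request_cost(cached_in=0, uncached_in=202_000, out=1_000)
warm = request_cost(cached_in=200_000, uncached_in=2_000, out=1_000)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}  ratio: {cold / warm:.0f}x")
```

With these inputs, the same turn costs roughly 8.9 cents on a cold cache and about a quarter of a cent once the prefix is cached, a difference of about 36 times.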

Run the math honestly. A single 8×H100 node draws on the order of multiple kilowatts under load. At those output prices, you would have to be pushing very high sustained token volume before owned hardware beats the API on pure cost. The case for self-hosting V4-Pro is data sovereignty, air-gapped deployment, or fine-tuning — not saving money.
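
Here is one way to sketch that math. The 8 kW draw and $0.15 per kWh rate are assumed inputs, not measurements, and the calculation counts electricity only, ignoring hardware cost, cooling, staff time, and input-token charges.

```python
HOURS_PER_MONTH = 24 * 30

def monthly_power_cost(node_kw: float, usd_per_kwh: float) -> float:
    """Electricity cost in USD of running a node flat out for a 30-day month."""
    return node_kw * HOURS_PER_MONTH * usd_per_kwh

def breakeven_output_tokens(node_kw: float, usd_per_kwh: float,
                            output_price_per_m: float = 0.87) -> float:
    """Output tokens per month at which API spend equals the power bill alone."""
    return monthly_power_cost(node_kw, usd_per_kwh) / output_price_per_m * 1_000_000

# Assumed inputs, not measurements: 8 kW sustained draw, $0.15 per kWh
tokens = breakeven_output_tokens(node_kw=8.0, usd_per_kwh=0.15)
per_second = tokens / (HOURS_PER_MONTH * 3600)
print(f"~{tokens / 1e9:.2f}B output tokens/month (~{per_second:.0f} tokens/s sustained)")
```

Under those assumptions the power bill alone is about $864 a month, which buys roughly a billion output tokens from the API. Adding hardware depreciation pushes the real break-even volume considerably higher.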


When NOT to Use DeepSeek V4 Pro

  • You want to run it on consumer GPUs. Pro is a datacenter model. Run V4-Flash instead, or use the Pro API.
  • Your workload is short-context Q&A. You’re paying (in VRAM or money) for a 1M context and frontier coding you won’t use. A 7B–32B model is faster and cheaper. See the local LLM context window guide for right-sizing.
  • You need a hard data-residency guarantee and lack cluster hardware. The API is hosted by DeepSeek; the open weights are MIT but require the GPUs to run them. If you can’t self-host and can’t send data out, this isn’t your model.
  • You’re cost-optimizing a low-volume app. A smaller open model on a single card you already own will beat both the Pro API and a Pro cluster on total cost at low volume.
  • You require a model card with first-party, audited benchmarks before deploying. Much of the public benchmark data is aggregator-sourced. Verify against your own evals first.

FAQ

Is DeepSeek V4 Pro really open source? The weights are released under the MIT license — one of the most permissive options, with no revenue cap or non-commercial clause. You can deploy commercially, fine-tune, and redistribute. The training data and full pipeline are not open, so it’s “open-weight” in the strict sense, but the license on the weights themselves is genuinely permissive.

What’s the difference between V4-Pro and V4-Flash? Pro is 1.6T parameters (49B active) and targets frontier quality; Flash is 284B (13B active) and targets practical self-hosting. Both are MIT-licensed with 1M context. Flash is the one most people will actually run on their own hardware.

Can I run V4-Flash on RTX 4090s? At aggressive INT4 quantization (~80 GB), a 4× RTX 4090 setup can fit it, with measurable quality loss. The cleaner target is FP8 (~158 GB of weights, roughly 170–175 GB with KV cache and overhead) on an 8×24 GB rig or a node with comparable aggregate VRAM. Unsloth's Q4_K_M GGUF is the reported quality/size sweet spot.

Why won’t the model load even though the download finished? Almost always a version problem. You need vLLM ≥ 0.7.0 or SGLang ≥ 0.4.4 plus --trust-remote-code for the custom CSA/HCA attention. Older serving builds throw an unsupported-architecture or missing-key error.

Is self-hosting Pro cheaper than the API? Rarely. At $0.435/$0.87 per million tokens and very cheap cached input, the API undercuts the electricity and depreciation of a Pro-class cluster for all but the highest sustained volumes. Self-host for sovereignty and fine-tuning, not savings.


Verdict

DeepSeek V4 Pro is the most credible open-weight frontier model of 2026 so far, and the MIT license plus DeepSeek’s reaffirmed open-source commitment make it a safe foundation to build a private stack around. The catch is physical: Pro is a datacenter model, full stop. For self-hosters, the real product is V4-Flash — single-node deployable, same license, same 1M context — paired with the Pro API for the heavy coding tasks where 80.6% SWE-bench earns its keep. Build on Flash, rent Pro by the token, and revisit owning Pro hardware only when sustained volume and a sovereignty requirement both hold.

For more on the broader ecosystem, see the open-source AI stack in 2026 and the GGUF quantization guide.
