Jun 29, 2026

Kimi K2.7 Code Local Setup 2026: vLLM, SGLang, GGUF

By AIFoss · 10 min read

TL;DR: Kimi K2.7 Code is Moonshot AI’s June 12 2026 coding model — 1T total parameters, 32B active, 256K context, Modified MIT license. The native INT4 checkpoint needs 8×H200 to serve at full speed via vLLM or SGLang. There is exactly one way onto consumer hardware: Unsloth’s 1.8-bit GGUF, which runs on a single 24GB GPU by offloading every expert layer to system RAM — slowly.

What you’ll have running after this guide:

An OpenAI-compatible Kimi K2.7 Code endpoint on a multi-GPU node (vLLM or SGLang, INT4).
A single-GPU fallback using llama.cpp + Unsloth GGUF for anyone without a datacenter.
A clear-eyed sense of whether self-hosting this model is worth it versus the hosted API.

Honest take: If you have 8×H200, vLLM with expert parallelism is the move. If you don’t, run the GGUF for tinkering and point your editor at Moonshot’s API for real work — the math almost never favors buying the hardware.

What K2.7 Code actually is

K2.7 Code is a coding-first agentic model built on Kimi K2.6, released June 12 2026. The headline change is efficiency: Moonshot reports roughly 30% fewer thinking tokens than K2.6 for the same task quality, which directly cuts your inference cost and latency.

The architecture is a Mixture-of-Experts: 1 trillion total parameters, 384 experts, with 32 billion parameters activated per token. It uses Multi-head Latent Attention (MLA) and ships with a 256K-token context window. The weights are open and live on HuggingFace, ModelScope, and behind Moonshot’s API.

One detail that matters more than the parameter count: K2.7 Code ships as a native INT4 checkpoint. Moonshot quantized it themselves rather than releasing BF16 and leaving you to convert. At 4 bits per quantized weight instead of 16, that cuts the storage and VRAM footprint to well under half of what a BF16 release would need, and it is the format vLLM, SGLang, and KTransformers are all tuned for.
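To make the footprint claim concrete, here is a back-of-the-envelope sketch. It assumes every parameter is stored at the same width, which is a simplification, and the function name is made up for illustration.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 1e12  # 1T total parameters, from the model description

# Simplifying assumption: all weights share one precision.
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_footprint_gb(TOTAL_PARAMS, bits):.0f}GB")
```

Treat the INT4 figure as a floor: quantized checkpoints typically keep some tensors at higher precision and store scaling factors alongside the weights, so the published size comes out larger.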

The license, in plain terms

K2.7 Code uses Moonshot’s Modified MIT License. For practical purposes it behaves like standard MIT — use it commercially, modify it, deploy it, no fees. The single modification: if your product serves more than 100 million monthly active users or generates more than $20M/month in revenue, you must display “Kimi K2” visibly in your UI. No indie developer or normal company hits that ceiling, so for self-hosting it is effectively unrestricted. That’s a meaningfully cleaner license than the Llama Community License or Qwen’s attribution terms.

The honest hardware reality

This is the part most “setup guides” gloss over. K2.7 Code is a datacenter model. Here is what each path actually costs you.

	vLLM / SGLang (INT4)	Unsloth GGUF (consumer)	Moonshot API
Hardware	8×H200 (TP=8) or 4×MI300X	1×24GB GPU + 256GB+ RAM	Any machine
VRAM for weights	~640GB	24GB VRAM + RAM offload	None
Speed	Production (tens of tok/s)	Single-digit tok/s	Fast, hosted
Truly local?	Yes	Yes	No
Best for	Teams, SLA, air-gapped	Experiments, privacy tests	Most people

The INT4 checkpoint runs on 8×H200 with --tensor-parallel-size 8, or on NVIDIA B300 (8×, TP=8) and GB300 (4×, TP=4). On AMD, the verified configs are MI300X/MI325X and MI350X/MI355X at 4×, TP=4. The INT4 weights occupy roughly 640GB aggregate; if you instead run FP8 on 8×H200 SXM5 (~1128GB HBM3e total), the weights eat about 1TB, leaving ~128GB for KV cache at 256K context with small batches.
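The headroom arithmetic in that paragraph can be reproduced in a few lines. This is a rough sketch: the 141GB per H200 is what the ~1128GB total implies, the weight sizes are the approximate figures quoted above, and the helper name is hypothetical.

```python
def kv_cache_headroom_gb(num_gpus: int, gb_per_gpu: float, weights_gb: float) -> float:
    """VRAM left for KV cache once the weights are loaded (ignores activations and overhead)."""
    total_gb = num_gpus * gb_per_gpu
    return total_gb - weights_gb

# 8x H200 at 141GB each gives the ~1128GB total quoted above.
fp8_headroom = kv_cache_headroom_gb(8, 141, 1000)   # FP8 weights, ~1TB
int4_headroom = kv_cache_headroom_gb(8, 141, 640)   # INT4 weights, ~640GB

print(f"FP8 headroom:  ~{fp8_headroom}GB")
print(f"INT4 headroom: ~{int4_headroom}GB")
```

Real headroom is smaller than either number, because the serving engine reserves part of each GPU and activations also need space. The comparison still shows why the INT4 checkpoint is the comfortable choice for 256K context.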

If you’re renting rather than buying, an 8×H200 node on a provider like RunPod is the realistic way to test this without a six-figure hardware order. For the full hardware breakdown across Kimi releases, runaihome.com has a Kimi K2 local inference hardware guide.

Path 1: vLLM on a multi-GPU node

K2.7 Code is new enough that parser support may not be in a stable vLLM release yet. Pin a nightly build in your startup script rather than relying on pip install vllm:

pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly
# pin the exact nightly date once it works:
# pip install vllm==<nightly-build-date>

Serve the model. The two flags people forget are the parsers — without them, agentic tool calls and the reasoning trace come back as raw text instead of structured output:

# --enable-auto-tool-choice is required for --tool-call-parser to take effect
vllm serve moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --port 8000

--enable-expert-parallel is worth understanding. With 384 experts spread across an 8-GPU node, expert parallelism keeps each expert whole on a single GPU (48 per GPU) and routes tokens to whichever GPU holds the experts they need, instead of slicing every expert’s weight matrices eight ways as tensor parallelism alone does. The benefit is largest on long generation sequences, which is exactly what coding output is, so for this model it’s not optional tuning; it’s the default you want.
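As a mental model of what expert parallelism does with the weights, the sketch below maps each expert to one GPU. The contiguous, even split is an assumption made for illustration; the actual placement inside vLLM may differ.

```python
NUM_EXPERTS = 384  # from the model description
NUM_GPUS = 8       # one 8-GPU node

def expert_owner(expert_id: int, num_experts: int = NUM_EXPERTS, num_gpus: int = NUM_GPUS) -> int:
    """GPU rank holding an expert under a simple contiguous split (assumed layout)."""
    if num_experts % num_gpus != 0:
        raise ValueError("experts do not divide evenly across GPUs")
    if not 0 <= expert_id < num_experts:
        raise ValueError("expert_id out of range")
    per_gpu = num_experts // num_gpus
    return expert_id // per_gpu

print(NUM_EXPERTS // NUM_GPUS)  # experts per GPU: 48
print(expert_owner(0), expert_owner(47), expert_owner(48), expert_owner(383))  # 0 0 1 7
```

Each token is sent only to the GPUs that own the experts its router picked, which is where the all-to-all traffic in an expert-parallel deployment comes from.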

Once it’s up, it speaks the OpenAI API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2.7-Code",
    "messages": [{"role": "user", "content": "Write a Python LRU cache with a TTL."}]
  }'

Expected response shape:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "choices": [{"message": {"role": "assistant", "content": "import time\nfrom collections import OrderedDict\n..."}}],
  "usage": {"prompt_tokens": 18, "completion_tokens": 240, "total_tokens": 258}
}
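The same request can be made from Python with only the standard library. This is a minimal sketch that assumes the server from the command above is listening on localhost:8000; the helper names are illustrative and not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumes --port 8000 as in the serve command
MODEL = "moonshotai/Kimi-K2.7-Code"

def build_request(prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a response shaped like the example above."""
    return response["choices"][0]["message"]["content"]

def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send one prompt and return the assistant's text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, ask("Write a Python LRU cache with a TTL.") returns the generated code as a string. The generous timeout is deliberate, since long coding answers can take a while to generate.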

Path 2: SGLang

SGLang is the other engine Moonshot recommends, and the launch is a one-liner once installed:

pip install "sglang[all]"

# no --quantization flag: the checkpoint is already INT4 and carries its own quantization config
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2.7-Code \
  --tp 8 \
  --context-length 262144 \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --host 0.0.0.0 --port 8000

Both engines give you an OpenAI-compatible endpoint, so picking between them is about which you already run. For deeper vLLM tuning — auth, multiple models, Nginx — see the vLLM setup guide.
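Because both endpoints follow the OpenAI chat format, a tool-calling round trip looks the same against either engine. The sketch below builds a request with one tool and reads structured tool calls back. The run_tests tool is hypothetical, and the response layout assumed here is the standard OpenAI tool_calls shape.

```python
import json

def build_tool_request(model: str, prompt: str) -> dict:
    """Chat payload advertising a single (hypothetical) tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_tests",  # hypothetical tool, for illustration only
                "description": "Run the project's test suite",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
    }

def parse_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, decoded arguments) pairs; empty if the model answered in text."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]
```

If parse_tool_calls comes back empty while the text content contains something that looks like a tool invocation, the tool-call parser is probably not active on the server.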

A real problem you’ll hit on AMD

If you try --tp 8 on a 4-GPU AMD box, or copy an 8×H200 command onto MI300X hardware, the server fails to start. Head divisibility is not the culprit: K2.7 Code has 64 attention heads, so TP=4 gives each GPU 16 heads and TP=8 would give each GPU 8, and both split evenly. The problem is that TP=8 asks for eight GPUs while AMD’s verified configs have four. On the INT4 MoE path, expert parallelism is also rejected on these configs. The fix is concrete: on AMD, keep TP at 4 (--tp 4 in SGLang, --tensor-parallel-size 4 in vLLM) and drop --enable-expert-parallel for the INT4 MoE path. On NVIDIA 8-GPU nodes, TP=8 with expert parallel is correct.
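A quick pre-flight check can catch a bad tensor-parallel value before the server spends minutes loading weights. This sketch encodes two general constraints: TP cannot exceed the GPU count, and it must divide the head count. The function is illustrative and not part of vLLM or SGLang.

```python
def check_tensor_parallel(tp: int, num_gpus: int, num_heads: int = 64) -> list[str]:
    """Return a list of problems with a TP setting; an empty list means it looks fine."""
    if tp < 1:
        return ["TP must be at least 1"]
    problems = []
    if tp > num_gpus:
        problems.append(f"TP={tp} needs {tp} GPUs but the node has {num_gpus}")
    if num_heads % tp != 0:
        problems.append(f"{num_heads} heads do not split evenly across TP={tp}")
    return problems

print(check_tensor_parallel(4, num_gpus=4))  # 4-GPU AMD box, TP=4
print(check_tensor_parallel(8, num_gpus=4))  # 8xH200 command copied to a 4-GPU box
print(check_tensor_parallel(8, num_gpus=8))  # NVIDIA 8-GPU node
```

Only the middle case reports a problem, and it is the GPU count, since 64 heads split evenly across both 4 and 8.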

Path 3: the consumer GGUF (single 24GB GPU)

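This is the only way K2.7 Code touches a normal machine, and it works because of how Unsloth quantizes MoE models combined with llama.cpp’s per-tensor offloading. The trick: keep the layers every token passes through (attention and the other dense layers) in VRAM, and offload the expert weights, of which each token activates only a small subset, to system RAM or a fast SSD.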

Unsloth’s dynamic GGUF sizes for K2.7 Code:

Quant	Size	Runs on
Full precision	605GB	Datacenter only
UD-Q8_K_XL	595GB	Lossless, multi-node
UD-Q4_K_XL	~585GB	Multi-GPU server
UD-Q2_K_XL	345GB	Best size/quality balance
UD-TQ1_0 (1.8-bit)	~325GB	1×24GB GPU + 256GB RAM

The rule of thumb: your combined RAM + VRAM should be at least the quant size. It still runs if you’re short, just slower as it pages from disk. The 1.8-bit UD-TQ1_0 quant will run on a single 24GB GPU if you offload every MoE layer to system RAM or a fast SSD. At ~325GB it does not fully fit in 24GB + 256GB, so part of it pages from disk, which is why 256GB of RAM is the floor for tolerable rather than painful.
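The rule of thumb is easy to turn into a check before committing to a multi-hundred-gigabyte download. This sketch uses the sizes from the table above and ignores KV cache and operating system overhead, so treat the result as optimistic; the helper name is made up.

```python
def memory_shortfall_gb(quant_gb: float, vram_gb: float, ram_gb: float) -> float:
    """GB of the quant that will not fit in VRAM + RAM and must page from disk (0.0 if it fits)."""
    return max(0.0, float(quant_gb) - (vram_gb + ram_gb))

print(memory_shortfall_gb(325, 24, 256))  # 45.0 -> pages from disk
print(memory_shortfall_gb(325, 24, 384))  # 0.0  -> fits in memory
print(memory_shortfall_gb(345, 24, 256))  # 65.0 -> UD-Q2_K_XL on the same box
```

A nonzero shortfall does not stop the model from loading. It means that many gigabytes are read from disk on demand, so SSD speed starts to dominate token throughput.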

Download and run with llama.cpp:

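# grab the 1.8-bit quant
huggingface-cli download unsloth/Kimi-K2.7-Code-GGUF \
  --include "*UD-TQ1_0*" --local-dir kimi-k2.7

# a quant this large is split into shards; point llama.cpp at the first
# shard and it loads the rest (assumes the usual -00001-of-N file naming)
MODEL=$(find kimi-k2.7 -name "*UD-TQ1_0*-00001-of-*.gguf" | head -n 1)

# run, offloading experts to CPU/RAM
llama-cli \
  --model "$MODEL" \
  --n-gpu-layers 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  --ctx-size 16384 \
  --prompt "Refactor this function for readability:"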

Note the modest --ctx-size: you will not get 256K context on a 24GB card with everything offloaded — start at 8K–16K. Ollama can also load these GGUFs and auto-offloads what doesn’t fit, but for K2.7 Code at this scale, llama.cpp gives you the explicit tensor-override control you actually need. If GGUF quantization tiers are new to you, the GGUF quantization guide and the GPTQ vs AWQ vs GGUF comparison cover the tradeoffs.

Is it any good? What the benchmarks say

Be skeptical here, because the public numbers are almost entirely first-party. As of late June 2026 there are no independent SWE-bench Verified, SWE-bench Pro, Terminal-Bench, or Aider Polyglot scores for K2.7 Code. Moonshot’s own suites report Kimi Code Bench v2 up 21.8% (50.9 → 62.0), Program Bench up 11.0%, and MLS Bench Lite up 31.5% over K2.6.
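When reading those deltas, note that the 21.8% is a relative gain, not a jump in percentage points. A two-line check on the one pair of raw scores given makes the difference visible.

```python
def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100

# Kimi Code Bench v2 scores quoted above: 50.9 -> 62.0
print(round(relative_gain_pct(50.9, 62.0), 1))  # 21.8 (relative, percent)
print(round(62.0 - 50.9, 1))                    # 11.1 (absolute, points)
```

The other two suites are reported without raw scores, so it is not possible to tell from the text alone whether their figures are relative or absolute.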

The one external signal is tool use: an early third-party run put K2.7 Code at 81.1% on MCPMark Verified, ahead of Opus 4.8’s 76.4% on the same suite. On agentic benchmarks like Kimi Claw 24/7 and MCP Atlas it’s roughly 10% better than K2.6. Treat all of this as directional until the standard public suites re-run. For how K2.7 stacks up against the other open coding frontier, compare with DeepSeek V4 Pro, and for the original Kimi setup paths see the Kimi K2.6 setup guide. If you care about the editor-side workflow, aicoderscope.com tracks the coding-tool integration angle.

When NOT to self-host this

Skip local hosting if any of these is true:

You don’t have 8×H200 or 4×MI300X. The GGUF path is a science project, not a daily driver — single-digit tokens per second on a 1T model is fine for a demo and miserable for agentic coding loops that fire dozens of tool calls.
You need 256K context. Offloaded GGUF can’t hold it; you need the full INT4 deployment for that.
Your usage is bursty. A node sitting idle between sessions burns money. Moonshot’s API bills only for the tokens you use, and K2.7 Code’s roughly 30% cut in thinking tokens versus K2.6 shrinks that bill further, so for intermittent use it is cheaper than amortizing a GPU cluster.

Self-host when you have the hardware already, need an air-gapped or sovereign deployment, or are serving a team continuously enough to keep the GPUs busy.
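The "keep the GPUs busy" condition can be put into numbers with a break-even sketch. The prices below are placeholders chosen only to make the arithmetic visible, not real quotes from any provider, so substitute your own rental rate and API pricing.

```python
def breakeven_tokens_per_hour(node_cost_per_hour: float, api_price_per_million: float) -> float:
    """Tokens per hour at which a rented node costs the same as paying the API per token."""
    if api_price_per_million <= 0:
        raise ValueError("API price must be positive")
    return node_cost_per_hour / api_price_per_million * 1_000_000

# Placeholder numbers only (hypothetical): a node at 30.0 per hour, an API at 3.0 per million tokens.
needed = breakeven_tokens_per_hour(node_cost_per_hour=30.0, api_price_per_million=3.0)
print(f"{needed:,.0f} tokens/hour to break even")
```

If your sustained throughput, averaged over idle time as well, sits below the break-even figure, the API is the cheaper option. A single blended per-token price is a simplification, since APIs usually price input and output tokens differently.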

FAQ

Can I run Kimi K2.7 Code on a single RTX 4090? Only via Unsloth’s 1.8-bit UD-TQ1_0 GGUF with all expert layers offloaded to 256GB+ of system RAM, and expect single-digit tokens per second at short context. It runs; it isn’t pleasant.

Is the Modified MIT license safe for commercial use? Yes. It functions as standard MIT unless you exceed 100M monthly active users or $20M/month in revenue, at which point you must show “Kimi K2” in your UI. Below that, no restrictions.

vLLM or SGLang? Either — both expose an OpenAI-compatible endpoint and both are Moonshot-recommended for the INT4 checkpoint. Use whichever your stack already runs. On AMD, remember TP must be ≤ 4.

Why does it ship as INT4 instead of BF16? Moonshot quantized it natively to cut the storage and VRAM footprint and to match what vLLM, SGLang, and KTransformers are optimized for. You don’t convert anything — pull and serve.

Does K2.7 Code beat Claude Opus 4.8 at coding? On the one external tool-use benchmark (MCPMark Verified, 81.1% vs 76.4%), it leads. On general coding there are no independent head-to-head numbers yet, so wait for SWE-bench Verified re-runs before believing any ranking.
