Jun 28, 2026

Open-Source Vision Language Models 2026: Which to Self-Host

By AIFoss · 11 min read


TL;DR: A 7–8B vision language model now fits in 8GB of VRAM and reads charts, screenshots, and scanned PDFs well enough for real work. Qwen3-VL is the best generalist you can actually run at home; DeepSeek-OCR is the specialist when the job is purely documents. The frontier (GLM-4.6V at 106B) needs a server, not a gaming GPU.

	Qwen3-VL 8B	InternVL3.5 8B	Gemma 3 4B	DeepSeek-OCR 2
Best for	General VQA + OCR + UI	Reasoning over images	Multilingual, low VRAM	Pure document parsing
Min VRAM (Q4)	~8GB	~8GB	~6GB	~8–10GB
License	Apache 2.0	MIT (check backbone)	Gemma Terms	MIT
The catch	Newer, fewer guides	Backbone license varies	Not Apache-clean	OCR only, not chat

Honest take: Start with Qwen3-VL 8B on Ollama. It’s Apache 2.0, fits a single 8–12GB card, and covers OCR, charts, tables, and UI grounding in one model. Reach for DeepSeek-OCR only when you’re processing documents at volume.

The open-source vision language model (VLM) field moved fast over the last year. The models people still cite in old blog posts — LLaVA, the original Qwen2-VL, PaliGemma, Idefics — are 2024-vintage. As of June 2026 the practical shortlist for self-hosters is shorter and far more capable. This is the comparison for someone who owns a consumer GPU (or rents one) and wants to pick one model for image understanding, OCR, or multi-modal RAG.

What a VLM actually does (and what to ignore)

A vision language model takes images plus text and answers in text. The useful tasks split into four buckets, and which model wins depends entirely on which bucket you care about:

Visual question answering (VQA) — “what’s in this image,” “describe this chart.”
OCR and document parsing — turning a scanned invoice or PDF page into structured text or Markdown.
UI grounding — pointing at the right button in a screenshot, which agent frameworks lean on.
Multi-modal RAG — embedding images and pages so a retrieval pipeline can pull them back.

Ignore the leaderboard chasing. A model topping MMMU by two points means nothing if it doesn’t fit your GPU or its license blocks commercial use. The two questions that actually decide your choice are: does it run on the VRAM I have, and can I legally ship what I build.

The 2026 shortlist

Qwen3-VL — the default generalist

Alibaba’s Qwen3-VL family is the one to beat for self-hosters. It spans 2B, 4B, 8B, 30B-A3B (MoE), 32B, and 235B-A22B, and every size shares the same Apache 2.0 license, a 262,144-token context window, and the same core skills: document OCR, chart extraction, table parsing, UI grounding, and video understanding.

The Apache 2.0 license is the headline. Unlike Gemma or Llama Vision, there’s no use-case carve-out and no acceptable-use addendum to read — you can build a commercial product on it without a lawyer. The 8B at Q4_K_M is about 6.1GB on disk and loads on an 8GB card, though you’ll want 12–16GB to avoid memory pressure when you feed it large images. The 4B (~3.3GB at Q4_K_M) runs on practically anything with 6GB.

# Qwen3-VL 8B on Ollama (vision-capable tag)
ollama pull qwen3-vl:8b
# The ollama CLI has no --image flag; put the image path inside the prompt
ollama run qwen3-vl:8b "Extract the table in this image as Markdown. ./invoice.png"

Expected output is a clean Markdown table reproducing the line items — not a paragraph describing the image. That distinction (structured extraction vs. vague description) is where the 2026 models pulled ahead of the LLaVA generation.
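For scripted use, the same request can go through Ollama's local REST API instead of the CLI. The following is a minimal sketch, assuming an Ollama server on its default port 11434 with qwen3-vl:8b already pulled. The helper names build_payload and extract_table are illustrative and not part of Ollama.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate request body with one image attached."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects each image as a base64-encoded string.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def extract_table(image_path: str, model: str = "qwen3-vl:8b") -> str:
    """Send one image to the local Ollama server and return the model's text reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(
            model,
            "Extract the table in this image as Markdown.",
            f.read(),
        )
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling extract_table("invoice.png") should return the model's reply as a string. Whether that reply is a clean Markdown table still depends on the prompt and the image resolution.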

InternVL3.5 — the reasoning specialist

OpenGVLab’s InternVL3.5 (released August 2025) is the model to pick when the task is reasoning over an image rather than just reading it — math diagrams, multi-step chart questions, science figures. The 8B scores 73.4 on MMMU and the flagship 241B-A28B hits 77.7, with 82.7 on MathVista, which puts it at or near the top of the open-source field and within reach of closed commercial systems.

The license is the catch. InternVL’s own code is MIT, but each model variant pairs the InternViT vision encoder with a separate LLM backbone (Qwen2.5, etc.), and the weights inherit that backbone’s license. Most variants land on Apache 2.0 or MIT in practice, but you should read the specific HuggingFace model card before shipping — don’t assume.

GLM-4.6V — the frontier, if you have the hardware

Z.ai’s GLM-4.5V (106B total / 12B active MoE, MIT-licensed) posted state-of-the-art results across 42 benchmarks when it landed in August 2025, beating Qwen2.5-VL-72B and trading blows with Gemini 2.5 Flash. The follow-up GLM-4.6V (December 2025) added a 128K context window and native multimodal tool-calling, which is useful if you’re wiring the VLM into an agent that calls functions.

Here’s the honest part: 106B parameters do not fit a consumer GPU. Even heavily quantized, you’re looking at a multi-GPU server or a rented cloud instance. If you want to test GLM-4.6V’s frontier quality without buying an 8×GPU box, rent one by the hour on RunPod and tear it down when you’re done. For local-only deployment, the 9B GLM-4.6V-Flash is the variant that fits a single card and keeps the native tool-calling — that’s the one to pull for a home lab. Pair either with a real GPU build; runaihome.com has the hardware breakdowns.
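A back-of-the-envelope calculation shows why. The sketch below estimates the size of the weights alone from the parameter count and bits per weight. It assumes a flat bit width, whereas real quant formats such as Q4_K_M average somewhat more than 4 bits per weight, and it ignores the KV cache and vision encoder. The function name weight_gb is illustrative.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB.

    1 billion parameters at 8 bits is 1 GB, so scale by bits / 8.
    """
    return params_billion * bits_per_weight / 8


# A MoE model keeps every expert in memory, so the total parameter
# count (106B) is what matters here, not the 12B active per token.
for name, params in [("GLM-4.6V (106B)", 106), ("GLM-4.6V-Flash (9B)", 9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_gb(params, bits):.1f} GB")

# weight_gb(106, 4) is 53.0 and weight_gb(9, 4) is 4.5
```

Even at an idealized 4 bits per weight, the 106B model needs roughly 53GB for weights before any context is allocated, while the 9B Flash variant needs about 4.5GB.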

Gemma 3 — multilingual on a budget

Google’s Gemma 3 is multimodal across its 4B, 12B, and 27B sizes (the 270M and 1B are text-only), with a 128K context window and support for 140+ languages. The 4B runs at full precision on 8GB of VRAM; the 12B fits at Q4. Ollama supports it natively (ollama run gemma3:4b).

Gemma’s strength is multilingual OCR and broad language coverage. Its weakness for this audience is licensing: Gemma ships under Google’s Gemma Terms of Use, not Apache or MIT. It’s permissive enough for most uses and allows commercial deployment, but it carries a prohibited-use policy you’re agreeing to — which is why FOSS purists reach for Qwen3-VL first.

DeepSeek-OCR 2 — the document specialist

If your only job is documents — invoices, contracts, scanned archives, multilingual PDFs — a generalist VLM is the wrong tool. DeepSeek-OCR 2 (open-sourced January 27, 2026, MIT-licensed) is a 3B model built specifically for optical character recognition and layout parsing. It scored 91.09% on OmniDocBench v1.5, and its MoE decoder runs at roughly 570M active parameters per token, so a single A100-40G processes around 200,000 pages a day.

It runs on 8–10GB of VRAM in base mode. The trade-off is that it’s not a chatbot — you don’t have a conversation with it, you feed it pages and get structured text back. For a document-heavy local RAG pipeline, running DeepSeek-OCR for ingestion and a general LLM for the chat layer beats forcing one model to do both.
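To plan an ingestion job, it helps to turn the quoted throughput into per-second and per-job numbers. This is a rough sketch that takes the 200,000 pages/day figure for a single A100-40G at face value. A consumer card will be slower, so treat the default as an upper bound and substitute your own measured rate.

```python
PAGES_PER_DAY = 200_000          # figure quoted above for one A100-40G
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def pages_per_second(pages_per_day: float = PAGES_PER_DAY) -> float:
    """Convert a daily throughput figure into pages per second."""
    return pages_per_day / SECONDS_PER_DAY


def days_to_process(total_pages: int, pages_per_day: float = PAGES_PER_DAY) -> float:
    """Estimate wall-clock days to ingest an archive at a steady rate."""
    return total_pages / pages_per_day


print(f"{pages_per_second():.2f} pages/s")            # 2.31 pages/s
print(f"{days_to_process(1_000_000):.1f} days")       # 5.0 days for a million pages
```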

The decision table

Model	Sizes	License	Min VRAM (usable)	Ollama	Best at
Qwen3-VL	2B–235B	Apache 2.0	~6GB (4B) / ~8GB (8B)	Yes	Generalist OCR, charts, UI, video
InternVL3.5	1B–241B	MIT / backbone	~8GB (8B)	Partial	Reasoning over images, MMMU
GLM-4.6V	9B + 106B	MIT	~12GB (9B Flash)	Partial	Frontier quality, tool-calling
Gemma 3	4B–27B	Gemma Terms	~6GB (4B)	Yes	Multilingual, low-VRAM
DeepSeek-OCR 2	3B	MIT	~8–10GB	Via llama.cpp	High-volume document parsing

A note on “usable” VRAM: these are the figures to load a Q4 quant with a modest context. Push to long context or large input images and real serving VRAM climbs because of the KV cache and the vision encoder’s activations. Budget headroom. If you’re tight on memory, the GGUF quantization guide explains which quant level trades the least quality for the most savings.
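The KV cache part of that growth can be estimated with simple arithmetic. The sketch below uses the standard formula (one key and one value tensor per layer, per token) with a hypothetical model shape of 36 layers, 8 KV heads, and a head dimension of 128 in fp16. Those numbers are assumptions for illustration; read the real values from the model's config.json.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions.

    The leading 2 accounts for storing both a key and a value per layer.
    bytes_per_value=2 corresponds to an fp16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical 8B-class shape (assumption, not taken from a model card).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

for context in (8_192, 32_768, 262_144):
    gib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{context:>7} tokens: {gib:.3f} GiB")

# With this shape: 1.125 GiB at 8K, 4.5 GiB at 32K, 36 GiB at 262K
```

Image tokens occupy context positions like text tokens do, so a few large images can push a request into the longer-context rows faster than the prompt text suggests.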

A real problem you’ll hit: the image just gets described, not read

The most common failure when people first run a local VLM for OCR is asking the wrong way and getting a vague description instead of extracted text. You send a screenshot of a table and the model replies “This image shows a financial table with several rows of data” — useless.

Two fixes. First, the prompt: be explicit that you want structured output. “Transcribe every cell of this table as a Markdown table, preserving column order” works far better than “what does this show.” Second, the input resolution: most VLMs downsample large images, so a dense spreadsheet screenshot loses detail. Crop to the region you care about, or split a full page into sections and process each. Qwen3-VL handles dense documents better than the LLaVA generation precisely because its preprocessing keeps more visual tokens — but it’s not magic, and feeding it a 4000px screenshot scaled down to a thumbnail will still fail.
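The splitting step can be sketched as a function that computes overlapping crop boxes for a large page. The tile size of 1024px and overlap of 128px are assumptions chosen for illustration, not values documented for any of these models. The overlap exists so a table row that falls on a tile boundary appears whole in at least one tile.

```python
def tile_boxes(width: int, height: int, tile: int = 1024, overlap: int = 128):
    """Return (left, top, right, bottom) boxes that cover the whole image.

    Neighbouring tiles share `overlap` pixels, and the last tile on each
    axis is placed flush with the far edge so nothing is cut off.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap

    def starts(length: int) -> list:
        if length <= tile:
            return [0]  # the image fits in one tile along this axis
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # final tile flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


boxes = tile_boxes(4000, 3000)
print(len(boxes), boxes[0], boxes[-1])
# 20 (0, 0, 1024, 1024) (2976, 1976, 4000, 3000)
```

Each box uses the same (left, upper, right, lower) order that Pillow's Image.crop accepts, so the tiles can be cropped and sent to the model one at a time with the explicit transcription prompt.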

If accuracy still isn’t there for documents specifically, that’s your signal to switch from a generalist to DeepSeek-OCR rather than fight the prompt.

When NOT to self-host a VLM

Self-hosting a vision model is not always the right call, and pretending otherwise wastes your weekend:

You process a handful of images a week. A managed API is cheaper than the GPU’s electricity. Self-hosting pays off on volume, privacy requirements, or both.
You need the absolute frontier and have no GPU budget. GLM-4.6V at 106B beats the small models, but if you can’t run it locally and won’t rent cloud GPUs, a hosted API closes the quality gap for less hassle.
Your images contain regulated PII and you haven’t secured the box. Self-hosting enables privacy; it doesn’t guarantee it. An exposed Ollama port is a leak. Lock it down first — see the Ollama + Open WebUI setup guide for binding to localhost and adding auth.
You only need OCR but are reaching for a general-purpose chat VLM anyway. Don’t pay generalist VRAM costs for document parsing. DeepSeek-OCR does it cheaper and better.

How to choose in one paragraph

Got an 8–12GB consumer card and want one model for everything? Qwen3-VL 8B, Apache 2.0, on Ollama. Doing heavy reasoning over diagrams and charts? InternVL3.5 8B. Working across many languages on a tight VRAM budget? Gemma 3 4B. Parsing thousands of documents? DeepSeek-OCR 2. Chasing frontier quality and willing to rent server GPUs? GLM-4.6V. For most readers of this site, the first option is the answer and the rest are situational.

FAQ

Can a vision language model run on 8GB of VRAM? Yes. Qwen3-VL 4B (~3.3GB at Q4_K_M) and Gemma 3 4B both run comfortably on 8GB. The 7–8B class loads on 8GB but performs better with 12–16GB once you feed it large images or long context.

Which open-source VLM is best for OCR specifically? DeepSeek-OCR 2 (MIT, January 2026) is purpose-built for it and scored 91.09% on OmniDocBench v1.5. For OCR mixed with general chat, Qwen3-VL handles documents well in a single model.

Are these models actually free for commercial use? Qwen3-VL (Apache 2.0) and DeepSeek-OCR / GLM-4.5V/4.6V (MIT) are commercially clean. Gemma 3 uses Google’s Gemma Terms with a prohibited-use policy. InternVL3.5 weights inherit their LLM backbone’s license — read the specific HuggingFace card before shipping.

Does Ollama support vision models? Yes. Qwen3-VL and Gemma 3 have native Ollama tags and accept images directly. InternVL and GLM vision variants are better run through vLLM or llama.cpp depending on the build.

What about LLaVA and PaliGemma? They were the reference open VLMs in 2024 but have been clearly surpassed on accuracy and OCR by the 2026 generation. There’s no reason to start a new project on them today.
