Self-Hosted AI for Dev Teams 2026: No Subscriptions
TL;DR: A 10-person team spending $190/month on GitHub Copilot Business can replace it with Tabby on a single RTX 4090 workstation and break even in under two years — while also getting private LLM chat (LibreChat) and document RAG (AnythingLLM) that SaaS doesn’t give you for that price. The catch: someone owns the server, and someone maintains the stack.
| | Tabby | LibreChat | AnythingLLM |
|---|---|---|---|
| Replaces | GitHub Copilot Business | ChatGPT Enterprise / Teams | Notion AI / private doc chat |
| Multi-user | Yes — LDAP, SSO, API tokens | Yes — LDAP, OIDC, OAuth | Yes — Admin/Manager/Default roles |
| GPU required | 8 GB+ VRAM | No (UI layer, API-backed) | No (connects to Ollama or API) |
| License | Apache 2.0 | MIT | MIT |
| Latest version | v0.32.0 (Jan 2026) | v0.8.6 (May 31, 2026) | v1.13.0 (May 26, 2026) |
Honest take: If your team has even one developer willing to own the infrastructure, the math is not close. Three years of GitHub Copilot Business for 10 developers ($6,840) buys an RTX 4090 workstation and three years of electricity.
The Three AI Workloads Teams Actually Use
Most AI tool evaluation at the team level collapses into “what’s the one tool that does everything.” There isn’t one. What teams actually need breaks into three distinct workloads with different technical requirements:
- Code completion — inline autocomplete and chat in the IDE. Latency-sensitive. Every developer uses this every hour. The bottleneck is inference speed, not feature list.
- LLM chat — a shared ChatGPT-style interface for prompting, writing, debugging, and general AI work. No special hardware if you’re routing to an API; the important things here are multi-user access control and model flexibility.
- Document RAG — ingesting internal docs, code, runbooks, and wikis so the LLM can answer questions about your actual codebase and company knowledge. Embedding workloads are batch-able and not latency-sensitive.
The open-source options that actually work in a team context are Tabby for code completion, LibreChat for the chat interface, and AnythingLLM for document RAG. Each solves one workload well. All three run on the same server.
Code Completion: Tabby v0.32.0
Tabby is the closest open-source equivalent to GitHub Copilot’s backend. It ships as a single binary or Docker container, provides IDE extensions for VS Code, JetBrains, Vim, Neovim, and Eclipse, and adds team features that Copilot’s entry tier doesn’t have: named user accounts, per-user API tokens, LDAP/SSO authentication, usage analytics, and repository indexing for context-aware completions.
v0.32.0 (January 2026) introduced generic OAuth support and improved multi-branch codebase indexing. The repository indexing feature matters: it’s what lets Tabby complete code that references your internal libraries, not just public code patterns from training data.
Model selection by team size
Tabby runs quantized models. The model you pick determines inference speed and how many concurrent developers you can serve without queuing:
| Team size | Recommended model | VRAM needed | Concurrent users (approx.) |
|---|---|---|---|
| 3–5 devs | StarCoder2-7B or Qwen2.5-Coder-7B (Q8) | 8 GB | 3–5 |
| 5–15 devs | Qwen2.5-Coder-7B (Q8) | 8–10 GB | 8–12 |
| 5–15 devs (higher quality) | Qwen2.5-Coder-14B (Q4_K_M) | 10–12 GB | 6–10 |
| 15–25 devs | Qwen2.5-Coder-32B (Q4) | 20–24 GB | 15–20 |
A single RTX 4090 with 24 GB VRAM handles up to 20 concurrent developers on a 7B model, or 15 on the 32B model with Tabby’s built-in request queuing. The 7B sweet spot covers most teams under 15 people without quality tradeoffs that developers notice in daily use.
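The table above can be read as a simple lookup. The sketch below encodes it in Python; `recommend_model` is a hypothetical helper, not part of Tabby, and the way it resolves overlapping ranges (a boundary size such as 5 or 15 falls into the smaller tier) is an assumption.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    model: str    # model name and quantization, as written in the table
    vram_gb: str  # VRAM range quoted in the table


def recommend_model(team_size: int, prefer_quality: bool = False) -> Recommendation:
    """Map a team size to the table row that covers it (hypothetical helper)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if team_size <= 5:
        return Recommendation("StarCoder2-7B or Qwen2.5-Coder-7B (Q8)", "8")
    if team_size <= 15:
        if prefer_quality:
            return Recommendation("Qwen2.5-Coder-14B (Q4_K_M)", "10-12")
        return Recommendation("Qwen2.5-Coder-7B (Q8)", "8-10")
    if team_size <= 25:
        return Recommendation("Qwen2.5-Coder-32B (Q4)", "20-24")
    raise ValueError("the table stops at 25 developers")
```

For example, `recommend_model(10, prefer_quality=True)` returns the 14B row, while `recommend_model(10)` returns the 7B row.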
# Start Tabby with Qwen2.5-Coder-7B on CUDA (base model for completion, Instruct variant for chat)
docker run -it \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve \
  --model Qwen2.5-Coder-7B \
  --chat-model Qwen2.5-Coder-7B-Instruct \
  --device cuda
When ready:
INFO tabby::routes > listening on 0.0.0.0:8080
After the first-run admin setup at http://localhost:8080, create user accounts under Settings → Users, configure LDAP under Settings → Security → LDAP, and index your codebase under Settings → Indexing → Add Repository. Each developer gets their own API token; usage analytics break out per user in the admin panel.
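Once a developer has a token, a quick way to confirm the server answers is to send one completion request by hand. The sketch below assumes Tabby's `/v1/completions` endpoint with a bearer token and a `segments` body; check the field names against the API documentation for your Tabby version. The helper names and the example token are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url: str, token: str, prefix: str,
                             suffix: str = "", language: str = "python") -> urllib.request.Request:
    """Build (but do not send) a completion request for a Tabby server."""
    payload = {"language": language, "segments": {"prefix": prefix, "suffix": suffix}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # per-user API token from the admin panel
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_completion(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_completion(build_completion_request("http://localhost:8080", "<token>", "def fib(n):\n    "))` against a running instance should return a JSON object with the suggested continuation.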
Full setup walkthrough: Tabby Team Server Setup 2026.
LLM Chat: LibreChat v0.8.6
LibreChat is what you actually want when someone asks for “a self-hosted ChatGPT for the team.” It handles multi-user auth (LDAP, OIDC, OAuth, plain email/password), lets you configure multiple model providers simultaneously, and offers agent and tool support on your own infrastructure, which ChatGPT’s Team tier, being cloud-only, cannot provide.
v0.8.6 (May 31, 2026) added Agent Skills and Subagents — packaging reusable instructions, scripts, and tool permissions into portable capabilities that agents can invoke automatically. For a dev team, that means: a “write a JIRA ticket” agent, a “summarize this PR” agent, and a “query our runbooks” agent — all shared across the team, not rebuilt by each developer individually.
What LibreChat adds that matters for teams:
- Per-user model permissions — control which users or groups can access which models
- LDAP/OIDC auth — users log in with corporate credentials; no separate account management
- Multiple providers in one UI — route some tasks to local Ollama, others to OpenAI or Anthropic, from the same interface; developers don’t juggle separate tools
- Shared agents — build once, shared across the org
- No GPU required — LibreChat is a UI and orchestration layer; it calls your Ollama instance or a commercial API, so the GPU budget goes toward Tabby inference, not here
# docker-compose.yml — LibreChat + MongoDB + Meilisearch
version: '3.8'
services:
  librechat:
    image: ghcr.io/danny-avila/librechat:v0.8.6
    ports:
      - "3080:3080"
    env_file:
      - .env
    volumes:
      - ./librechat.yaml:/app/librechat.yaml
    depends_on:
      - mongodb
      - meilisearch
  mongodb:
    image: mongo:7.0
    volumes:
      - mongodb_data:/data/db
  meilisearch:
    image: getmeili/meilisearch:v1.6
    volumes:
      - meilisearch_data:/meili_data
volumes:
  mongodb_data:
  meilisearch_data:
Key .env fields for a team deployment:
# Disable self-registration — require LDAP or invite-only
ALLOW_REGISTRATION=false
# LDAP
LDAP_URL=ldap://your-ldap:389
LDAP_USER_SEARCH_BASE=ou=people,dc=example,dc=com
LDAP_SEARCH_FILTER=mail
# Model backends
OLLAMA_BASE_URL=http://your-ollama-host:11434
OPENAI_API_KEY=sk-... # optional — add if team uses cloud models alongside local
# Security
JWT_SECRET=<random-64-char-string>
SESSION_EXPIRY=604800000 # 7 days
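The two security values above are easy to get wrong by hand. A minimal sketch, using only the Python standard library, that generates a 64-character secret and converts a session length in days to milliseconds; the function names are illustrative.

```python
import secrets

MS_PER_DAY = 24 * 60 * 60 * 1000


def make_jwt_secret(num_chars: int = 64) -> str:
    """Return a random hex string of the requested (even) length."""
    if num_chars <= 0 or num_chars % 2:
        raise ValueError("num_chars must be a positive even number")
    # token_hex(n) yields n random bytes rendered as 2*n hex characters
    return secrets.token_hex(num_chars // 2)


def session_expiry_ms(days: int) -> int:
    """Convert a session lifetime in days to milliseconds."""
    return days * MS_PER_DAY
```

`session_expiry_ms(7)` reproduces the 604800000 value shown above, and `make_jwt_secret()` can be pasted into `JWT_SECRET`.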
One known gap: full role-based access control (group-level model permissions via GUI) is still in development on the 2026 roadmap. Per-user permissions exist and work; granular group policies require manual YAML config rather than an admin panel toggle. Fine for a 10-person team, annoying at 50+.
Full setup: LibreChat Setup Guide 2026.
Document RAG: AnythingLLM v1.13.0
AnythingLLM (v1.13.0, May 26, 2026) handles the “chat with your docs” workload. For a dev team that means: internal wikis, architecture decision records, runbooks, design docs, and the shared knowledge that currently lives in Notion or Confluence and takes 20 minutes to find.
The Docker version enables proper multi-user features: Admin/Manager/Default roles, per-workspace access controls, isolated document libraries per project or team, and embeddable chat widgets for internal tools. The Model Router introduced in v1.13.0 routes simple queries to a fast local model and complex analysis to a cloud model automatically — useful when you want to control API spend without asking developers to pick models manually.
# AnythingLLM — multi-user Docker mode
docker run -d \
  -p 3001:3001 \
  -v /path/to/anythingllm-storage:/app/server/storage \
  -e MULTI_USER_MODE=true \
  mintplexlabs/anythingllm:latest
After first run, navigate to http://localhost:3001 and complete admin setup. Point the LLM provider at your Ollama instance for fully local operation.
Practical workspace layout for a 10-person team:
- One workspace per major product area or project
- Assign only the relevant docs per workspace (don’t put the marketing wiki next to production runbooks)
- Give team leads Manager role, developers Default, and one person Admin
AnythingLLM doesn’t require GPU if you’re calling Ollama or an API; the embedding generation for new documents is the compute-heavy step, but it runs in the background as a batch job.
Full guide: AnythingLLM Local RAG Setup.
Infrastructure Requirements by Team Size
3–5 engineers
A developer workstation with a mid-range consumer GPU handles all three tools. No dedicated server needed.
| Component | Spec | Approximate cost |
|---|---|---|
| GPU | RTX 3090 24 GB (used) | $900–1,200 |
| RAM | 32 GB DDR5 | $120 |
| NVMe storage | 1 TB | $80 |
| Total hardware | | ~$1,100–1,400 |
| SaaS equivalent | Copilot Business × 5 | $95/mo |
| Break-even | | ~12–15 months |
Run Tabby on the GPU, LibreChat and AnythingLLM in Docker containers on the same machine. Ollama handles the backend LLM for LibreChat chat and AnythingLLM RAG. This entire stack fits on one machine with 32 GB RAM.
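The break-even row in the table is hardware cost divided by the monthly SaaS bill it replaces. A small sketch of that arithmetic follows; `break_even_months` is an illustrative helper, and the optional electricity term is an addition the table itself leaves out.

```python
import math


def break_even_months(hardware_cost: float, saas_per_month: float,
                      electricity_per_month: float = 0.0) -> float:
    """Months until avoided SaaS fees cover the one-time hardware cost."""
    net_saving = saas_per_month - electricity_per_month
    if net_saving <= 0:
        return math.inf  # running costs eat the whole saving; never breaks even
    return hardware_cost / net_saving


low = break_even_months(1100, 95)   # cheapest build vs Copilot Business x 5
high = break_even_months(1400, 95)  # most expensive build
print(f"{low:.1f} to {high:.1f} months")
```

This gives about 11.6 to 14.7 months, matching the table. Passing the $40/month electricity estimate from the cost comparison stretches the upper bound to roughly 25 months, which is why the Copilot-only comparison is the least favorable case for small teams.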
5–15 engineers
Move to a dedicated server. The GPU is doing real work; you don’t want it sleeping on someone’s desk or getting tied up by a developer’s local gaming session.
| Component | Spec | Approximate cost |
|---|---|---|
| GPU | RTX 4090 24 GB | $1,800–2,000 |
| CPU | AMD Ryzen 9 7950X | $450 |
| RAM | 64 GB DDR5 | $200 |
| NVMe storage | 2 TB | $150 |
| PSU + case | — | ~$300 |
| Total hardware | | ~$2,900–3,100 |
| SaaS equivalent | Copilot × 10 + ChatGPT Team × 10 | $490/mo |
| Break-even | | ~6–7 months |
The RTX 4090’s 24 GB VRAM is the key: it runs Qwen2.5-Coder-14B for Tabby alongside a separate LLM (e.g., Llama 3.1 8B) for LibreChat and AnythingLLM via Ollama. Context switching between models takes 3–5 seconds — acceptable for chat workloads where developers aren’t all prompting at the exact same moment.
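Whether two models can stay resident together comes down to adding up their footprints. The sketch below is a rough budget check only: the 12 GB figure is the upper bound quoted for the 14B model earlier, while the 6 GB footprint for Llama 3.1 8B and the 2 GB headroom for context caches are assumptions, not measurements.

```python
def fits_in_vram(model_vram_gb: dict[str, float], total_vram_gb: float,
                 headroom_gb: float = 2.0) -> bool:
    """True if all models plus a safety margin fit on the card at once."""
    return sum(model_vram_gb.values()) + headroom_gb <= total_vram_gb


resident = {
    "Qwen2.5-Coder-14B (Q4_K_M)": 12.0,  # upper bound from the model table
    "Llama 3.1 8B (quantized)": 6.0,     # assumed footprint, not from the article
}
print(fits_in_vram(resident, total_vram_gb=24.0))
```

With these numbers the pair fits in 24 GB; swapping the 14B model for the 32B one (20–24 GB) does not, which is where the second GPU or model switching comes in.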
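If buying hardware is off the table, RunPod offers dedicated RTX 4090 instances at roughly $700–800/month. That is more than the $490/month SaaS baseline for a 10-person team, so at that price renting only undercuts SaaS from roughly 15–17 seats upward (at $49 per seat per month). Below that size, the argument for renting is no capital expense and no hardware maintenance, not a lower bill.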
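To see where a flat monthly rental crosses the per-seat SaaS baseline, divide the rental by the per-seat price and round up. This sketch uses the $49 per seat figure (Copilot Business at $19 plus ChatGPT Team at $30); the helper name is illustrative.

```python
import math


def min_seats_for_rental(rental_per_month: float, saas_per_seat: float = 49.0) -> int:
    """Smallest team size at which SaaS seats cost at least as much as the rental."""
    if saas_per_seat <= 0:
        raise ValueError("saas_per_seat must be positive")
    return math.ceil(rental_per_month / saas_per_seat)


print(min_seats_for_rental(700), min_seats_for_rental(800))
```

At $700/month the crossover is 15 seats, and at $800/month it is 17.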
15+ engineers
At 15+ engineers with concurrent IDE usage, you want either a multi-GPU rig or separate servers per workload.
| Configuration | Approach | Notes |
|---|---|---|
| 2× RTX 4090 rig | Tabby on GPU 0 (32B model), Ollama on GPU 1 for LibreChat/AnythingLLM | One server, simpler ops |
| Separate servers | Inference server for Tabby + Ollama; CPU server for LibreChat + AnythingLLM | More resilient, easier to upgrade piecemeal |
| Cloud GPU (A100 80 GB) | Single high-VRAM instance handles all inference | RunPod ~$2/hr reserved; good for variable load |
LibreChat and AnythingLLM have minimal hardware requirements: any machine with 16 GB RAM runs both. The GPU budget goes entirely toward inference.
For help speccing out a GPU server, runaihome.com covers home lab GPU builds with current hardware pricing.
Cost Comparison: Self-Hosted vs SaaS
Three-year comparison, with hardware amortized over 36 months and $40/month estimated electricity per GPU workstation. SaaS baseline uses Copilot Business ($19/user) + ChatGPT Team ($30/user).
| Team size | SaaS cost/mo | Hardware (one-time) | Self-hosted all-in/mo | 3-year savings |
|---|---|---|---|---|
| 5 engineers | $245 | ~$1,400 | ~$79 | ~$6,000 |
| 10 engineers | $490 | ~$3,100 | ~$126 | ~$13,100 |
| 15 engineers | $735 | ~$6,500 | ~$221 | ~$18,500 |
| 20 engineers | $980 | ~$9,000 | ~$290 | ~$24,800 |
Self-hosted all-in/mo = amortized hardware + electricity. Does not include operator time. SaaS side does not include a separate RAG tool cost, which would increase the SaaS baseline further.
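The table rows can be reproduced from the stated assumptions: $49 per seat for SaaS, hardware amortized over 36 months, and $40/month electricity. A minimal sketch follows; the function and key names are illustrative, and operator time is excluded just as in the table.

```python
def cost_summary(seats: int, hardware_cost: float, saas_per_seat: float = 49.0,
                 electricity_per_month: float = 40.0, months: int = 36) -> dict[str, float]:
    """Compare SaaS and self-hosted spend over a fixed amortization window."""
    saas_monthly = seats * saas_per_seat
    self_hosted_monthly = hardware_cost / months + electricity_per_month
    # Total savings computed from totals to avoid accumulating rounding error
    savings = saas_monthly * months - hardware_cost - electricity_per_month * months
    return {
        "saas_monthly": saas_monthly,
        "self_hosted_monthly": round(self_hosted_monthly, 2),
        "savings": savings,
    }


print(cost_summary(seats=10, hardware_cost=3100))
```

For 10 engineers this yields $490/month for SaaS, about $126/month self-hosted, and $13,100 saved over three years. The other rows follow the same way, within rounding.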
The 4× cost difference for a 10-person team over three years is not marginal. The counter-argument: every hour someone spends maintaining the stack costs something. If no one on the team can manage a Docker Compose file and a GPU driver update, the savings erode. The question is whether you have that person — not whether the math works.
For a detailed single-developer breakdown: FOSS AI vs SaaS AI: Real 12-Month Cost for a Solo Developer in 2026.
Should Your Team Self-Host? Decision Checklist
Answer these honestly. Two or more “no” answers is a signal to stick with SaaS until the situation changes.
Self-host makes sense if:
- At least one developer can own the infra — Docker, container updates, GPU drivers
- Network or compliance policy restricts sending code or documents to external APIs
- 5+ team members are active daily AI tool users (below that, payback period stretches)
- You can tolerate planned maintenance windows (model swaps, container restarts, occasional GPU driver updates)
- You want model flexibility — different models for different tasks without per-seat pricing following you
Stick with SaaS if:
- No one on the team has Linux server experience — this is the single most common failure mode
- You’re in a regulated industry with strict audit trail requirements and no bandwidth to configure log shipping
- Team AI usage is unpredictable or bursty — usage-based SaaS scales to zero; your hardware doesn’t
- You depend on GitHub Copilot’s hosted integrations: Copilot Workspace, PR review generation, issue triage — those are GitHub-native and don’t transfer to a self-hosted Tabby instance
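The “two or more no answers” rule can be written down as a short check against the self-host criteria. This is a sketch of the rule of thumb only; the criterion strings paraphrase the list above and the function name is hypothetical.

```python
SELF_HOST_CRITERIA = (
    "someone can own the infra (Docker, container updates, GPU drivers)",
    "policy restricts sending code or documents to external APIs",
    "5+ team members use AI tools daily",
    "planned maintenance windows are tolerable",
    "model flexibility without per-seat pricing matters",
)


def recommend(answers: dict[str, bool]) -> str:
    """Apply the checklist rule: two or more 'no' answers means stay on SaaS."""
    missing = [c for c in SELF_HOST_CRITERIA if c not in answers]
    if missing:
        raise ValueError(f"unanswered criteria: {missing}")
    no_count = sum(not answers[c] for c in SELF_HOST_CRITERIA)
    return "stick with SaaS" if no_count >= 2 else "self-host"
```

A team that answers yes to everything except compliance restrictions still gets "self-host"; adding a second no, such as lacking an infra owner, flips the result.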
When This Stack Doesn’t Work
The Copilot feature gap is real. Tabby is a solid code completion tool with real team features, but GitHub Copilot has years of GitHub-specific integrations: Copilot Workspace, pull request summaries, issue triage, and IDE polish that Tabby doesn’t match. If your workflow depends on those features specifically, self-hosting removes them without a direct substitute.
Latency on CPU is a dealbreaker. If you can’t allocate GPU to Tabby, don’t bother. Code completion on CPU with a 7B model produces 4–8 second suggestion wait times. Developers turn it off within a week. The GPU is not optional for the code completion workload.
Model maintenance is real ongoing work. GGUF models go stale: Qwen2.5-Coder is strong today, but stronger options are already available. Keeping up with model quality requires someone to evaluate, download, and reconfigure, roughly 30–60 minutes per cycle, maybe quarterly. This isn’t a one-time setup cost.
Multi-GPU ops is a step change in complexity. Going from one RTX 4090 to two adds complexity disproportionate to the capacity gain — CUDA device selection, separate model routing, Docker GPU binding. If you hit 15+ engineers, the separate-servers approach (one inference machine, one CPU machine for UI services) is simpler to operate than a multi-GPU single host.
See also: Open-Source vs Proprietary AI Tools: Cost Breakdown 2026 and The Open-Source AI Stack in 2026: What Works Together.
FAQ
What’s the minimum GPU to run all three tools for a small team? 8 GB VRAM (RTX 3070, RTX 4060 Ti) runs Tabby with a 7B model. LibreChat and AnythingLLM don’t need GPU if you route to an API. For 3–5 people who are okay with slightly slower completions, an 8 GB card works. For 10+ people, you want 24 GB (RTX 3090 or RTX 4090) — the 7B models are fast enough, but you need the headroom for multiple concurrent requests.
Can I use cloud GPU instead of buying hardware? Yes, but check the math. RunPod dedicated RTX 4090 instances at ~$700–800/month cost more than the combined SaaS baseline for a 10-person team ($490/month for Copilot + ChatGPT Team); at $49 per seat, renting only comes out ahead from roughly 15–17 seats upward. You trade hardware ownership and maintenance for a flat monthly bill that you can cancel if the team shrinks.
Do Tabby and LibreChat share the same Ollama instance? They can. Tabby has its own bundled inference engine (recommended for code completion latency) but can also delegate to Ollama. LibreChat and AnythingLLM both connect to Ollama natively. One Ollama instance serving both chat tools is the most common configuration — you just need enough VRAM to keep both active models loaded.
How do I handle model updates across the stack? For Tabby: update the Docker image and change the --model flag in your start command, then restart the container. For Ollama: ollama pull <new-model> and update the model name in LibreChat’s and AnythingLLM’s settings. Ollama model swaps don’t require downtime; Tabby requires a restart (usually under 30 seconds).
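The Ollama half of an update can also be scripted over its REST API instead of the CLI. The sketch below assumes the `/api/pull` and `/api/tags` endpoints and their JSON shapes as described in Ollama's API documentation; verify them against your installed version. The helper names and the example model tag are illustrative.

```python
import json
import urllib.request


def build_pull_request(base_url: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a request asking Ollama to download a model."""
    body = json.dumps({"model": model, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def installed_models(tags_payload: dict) -> set[str]:
    """Extract model names from a decoded /api/tags response."""
    return {entry["name"] for entry in tags_payload.get("models", [])}


# Example with a hand-written payload in the assumed /api/tags shape
sample = {"models": [{"name": "llama3.1:8b"}, {"name": "qwen2.5-coder:7b"}]}
print("llama3.1:8b" in installed_models(sample))
```

Checking `installed_models` before and after a pull is a simple way to confirm the new model is present before changing the model name in LibreChat and AnythingLLM.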
Is this stack auditable for compliance? Partially. LibreChat v0.8.6 logs all conversations to MongoDB and supports Prometheus metrics and OpenTelemetry tracing. Tabby logs usage per user. AnythingLLM logs chat history per workspace. What’s not included out of the box: centralized SIEM integration or SOC 2-ready audit trails — those require custom log shipping (Loki, ELK stack) on top of the base setup.
Sources
- Tabby v0.32.0 release — TabbyML/tabby GitHub
- LibreChat v0.8.6 changelog — librechat.ai
- AnythingLLM v1.13.0 release — Mintplex-Labs/anything-llm GitHub
- GitHub Copilot plans and pricing 2026 — GitHub Docs
- Tabby hardware requirements discussion — GitHub TabbyML/tabby #2709
- LibreChat LDAP authentication docs — librechat.ai
- LibreChat 2026 roadmap — librechat.ai blog
- RTX 4090 cloud GPU pricing comparison — getdeploying.com