Fine-Tune Llama 3 with Unsloth 2026: Dataset to GGUF
TL;DR: Unsloth (v2026.5.8) cuts Llama 3.1 8B fine-tuning to 2–4 hours on a consumer GPU with 8GB+ VRAM, and its own benchmarks claim up to 70% less memory than a standard QLoRA setup. You get a GGUF you can drop straight into Ollama. The catch: output quality depends entirely on your dataset quality.
What you’ll have running after this guide:
- A domain-adapted Llama 3.1 8B model trained on your own dataset
- A Q4_K_M GGUF file ready to run in Ollama, LM Studio, or Jan.ai
- A repeatable training pipeline you can re-run when your data changes
Honest take: Unsloth is the right tool for single-GPU fine-tuning in 2026. axolotl has more knobs for complex pipelines; Unsloth is faster to get working and easier on VRAM.
Why fine-tune instead of just prompting?
Prompting a general model works until it doesn’t. If your use case involves a specific writing style the model keeps drifting from, domain vocabulary it consistently mangles (medical codes, legal terms, proprietary jargon), or a structured output format it forgets mid-conversation — fine-tuning fixes those problems permanently instead of requiring a 500-token system prompt every call.
The other option is RAG, which is the right answer when the knowledge lives in documents you want to retrieve. Fine-tuning is better when you want to change how the model behaves: its tone, its output structure, its fluency in a domain. These are different problems with different solutions.
Which fine-tuning framework to use
Before getting into the steps, here’s where Unsloth sits relative to the alternatives:
| | Unsloth | axolotl | HF TRL (stock) |
|---|---|---|---|
| Single-GPU speed | 2–5× faster | 1× baseline | 1× baseline |
| VRAM usage (8B QLoRA) | ~8–10 GB | ~12–14 GB | ~14–18 GB |
| Setup complexity | Low (pip install) | Medium (config YAML) | Low |
| Multi-GPU support | Limited | Strong | Strong |
| Custom training loops | Limited | Full | Full |
| Best for | Fast iteration, single GPU | Production pipelines, multi-GPU | Research, custom objectives |
Unsloth wins on a single consumer GPU. If you’re distributing across multiple cards or need fully custom training loops, axolotl or standard TRL give you more control. Unsloth does support preference methods such as DPO and GRPO, but they need setup beyond the SFT path covered here. For this guide, single-GPU fine-tuning with Unsloth is the path.
Hardware requirements
QLoRA makes 8B-parameter fine-tuning possible on cards most developers already own:
| Model | Method | Minimum VRAM | Training time (1k examples, 3 epochs) |
|---|---|---|---|
| Llama 3.2 3B | QLoRA | 6 GB | ~30 min |
| Llama 3.1 8B | QLoRA (4-bit) | 8 GB | ~2 hours |
| Llama 3.1 8B | LoRA (16-bit) | 18 GB | ~2.5 hours |
| Llama 3.1 70B | QLoRA (4-bit) | ~48 GB (4-bit weights alone are ~35 GB) | ~8–12 hours |
An RTX 3090 (24GB) handles the 8B run with room to spare. An RTX 4090 cuts training time roughly in half. If you’re on 8GB VRAM (RTX 4060 or similar), drop max_seq_length to 1024, and fall back to Llama 3.2 3B if the 8B run still hits out-of-memory errors.
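The VRAM figures above are driven largely by the size of the quantized weights. A back-of-envelope sketch, using nominal parameter counts and a helper name made up for illustration, shows the weights-only footprint; activations, LoRA adapters, and optimizer state come on top of it:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Nominal parameter counts (assumption), not exact checkpoint sizes.
for name, params in [("3B", 3e9), ("8B", 8e9), ("70B", 70e9)]:
    four_bit = weight_memory_gb(params, 4)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"{name}: ~{four_bit:.1f} GB in 4-bit, ~{sixteen_bit:.0f} GB in 16-bit")
```

Real usage is higher than these numbers because quantization metadata, activations, and optimizer state are not counted, so treat the result as a floor rather than a requirement.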
If you don’t have a suitable local GPU, RunPod rents RTX 4090 and A100 instances by the hour. A full 8B fine-tune run typically costs under $3.
OS: Linux is the primary target. Windows via WSL2 works. macOS with Apple Silicon is supported through Unsloth Studio (MLX-based). Native Windows training works but is less tested.
Python: 3.9–3.14. PyTorch 2.5+ recommended.
Step 1: Install Unsloth
pip install unsloth
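Current version: 2026.5.8 (released May 26, 2026). The version numbering is date-based, but the last number is not the day of the month: 2026.5.8 shipped on May 26, so read it as year, month, then what appears to be a release counter.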
Verify:
python -c "import unsloth; print(unsloth.__version__)"
Also install the training stack:
pip install trl transformers datasets accelerate
If you hit CUDA version mismatches, Unsloth’s docs at unsloth.ai/docs have conda environment files for the most common CUDA + PyTorch combinations. The conda path is more reliable when your system has multiple CUDA versions installed.
Step 2: Get access to Llama 3.1
Llama 3.1 is gated on Hugging Face. You need to request access once:
- Create an account at huggingface.co
- Visit meta-llama/Llama-3.1-8B-Instruct and accept the license
- Generate an access token at huggingface.co/settings/tokens
- Authenticate:
huggingface-cli login
License note: Llama 3.1 uses the Meta Llama 3.1 Community License, not Apache or MIT. Commercial use is allowed for most cases, but products or services with more than 700 million monthly active users must request a separate license from Meta, and any fine-tuned model you distribute must include “Llama” at the beginning of its name. Read the full terms at llama.com/llama3_1/license/ before shipping a product.
Alternatively, use Unsloth’s pre-uploaded mirror, which bypasses the individual HF approval process:
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"
Step 3: Prepare your dataset
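The SFTTrainer (from TRL, which Unsloth patches for speed) trains on plain text, so any dataset format works once it is rendered into a text field. Two common formats are covered here.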
Alpaca format (instruction/input/output):
{"instruction": "Convert this date to ISO 8601:", "input": "March 15th, 2026", "output": "2026-03-15"}
{"instruction": "Summarize this clause in plain English:", "input": "The party of the first part...", "output": "This clause means..."}
ShareGPT format (multi-turn conversations):
{"conversations": [
  {"from": "human", "value": "What does EBITDA stand for?"},
  {"from": "gpt", "value": "Earnings Before Interest, Taxes, Depreciation, and Amortization."}
]}
How much data?
- Under 300 examples: fine-tune the Instruct model (style and behavior shaping)
- 300–1,000 examples: Instruct or base model both work
- Over 1,000 examples: base model preferred for deeper behavior change
More data doesn’t reliably beat better data. If you have 10,000 mediocre examples and 500 carefully curated ones, the 500 will often produce a better model. Deduplicate, filter out short or malformed entries, and aim for consistent quality before you care about quantity.
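The cleanup described above can be sketched in a few lines of standard-library Python. This assumes Alpaca-style records; the function names and the `min_output_chars` threshold are hypothetical choices for illustration, not part of Unsloth or the datasets library:

```python
import json

def clean_records(records, min_output_chars=3):
    """Drop malformed, too-short, and duplicate Alpaca-style records."""
    seen = set()
    cleaned = []
    for rec in records:
        if not isinstance(rec, dict):
            continue  # malformed: not a JSON object
        instruction = str(rec.get("instruction", "")).strip()
        output = str(rec.get("output", "")).strip()
        if not instruction or len(output) < min_output_chars:
            continue  # missing instruction or output too short
        # Treat records with the same instruction + input as duplicates.
        key = (instruction.lower(), str(rec.get("input", "")).strip().lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

def clean_jsonl(src_path, dst_path, min_output_chars=3):
    """Read a JSON Lines file, clean it, and write the surviving records."""
    records = []
    with open(src_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
    cleaned = clean_records(records, min_output_chars)
    with open(dst_path, "w", encoding="utf-8") as f:
        for rec in cleaned:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records), len(cleaned)
```

Exact-match deduplication only catches the obvious repeats. Near-duplicates and low-quality but well-formed examples still need a manual pass or a fuzzier filter.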
Load your data:
from datasets import load_dataset
# Local JSON Lines file
dataset = load_dataset("json", data_files="your_data.jsonl", split="train")
# Or a public HF dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
Step 4: Load the model with QLoRA
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,  # drop to 1024 if you hit OOM on 8GB VRAM
    dtype=None,           # auto-detect: bfloat16 on Ampere+, float16 older
    load_in_4bit=True,    # QLoRA: model in 4-bit, adapters in 16-bit
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: higher = more capacity, more VRAM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimal for Unsloth's fused kernels
    bias="none",
    use_gradient_checkpointing="unsloth",  # long-context support, less VRAM
    random_state=3407,
)
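load_in_4bit=True is the QLoRA switch. The base model loads compressed to 4-bit; the LoRA adapters, which hold the actual trainable parameters, remain in 16-bit. At r=16 on all seven projection layers that is about 42 million trainable parameters, roughly 0.5% of the model (r=64 pushes it to about 2%). That small trainable set, on top of 4-bit base weights, is why 8GB is enough.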
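To see how small the trainable set is, the LoRA parameter count can be worked out from the layer shapes. The sketch below assumes the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128) and the seven target modules used above; the function name is made up for illustration:

```python
# Llama 3.1 8B shapes, taken from the published model config (assumption).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
LAYERS = 32            # num_hidden_layers

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) per layer: r * (in + out) params."""
    per_block = sum(r * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
    return per_block * LAYERS

for rank in (8, 16, 64):
    n = lora_param_count(rank)
    print(f"r={rank}: {n:,} trainable parameters ({n / 8.03e9:.2%} of ~8.03B)")
```

The count scales linearly with r, so doubling the rank doubles the adapter size and the optimizer state that goes with it.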
LoRA rank (r): r=16 is the standard starting point. Raise it to 32 or 64 if you’re doing style transfer or long-form generation and have the VRAM headroom. For simple format training, r=8 is sufficient.
Step 5: Apply the chat template
This step is easy to skip and costly when you do. The Llama 3.1 Instruct model expects a specific token format at both training and inference time. Training with the wrong template and running inference with the correct one (or vice versa) produces garbled, incoherent output that’s hard to diagnose.
from unsloth.chat_templates import get_chat_template
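tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",
    # ShareGPT uses from/value with human/gpt; map them to role/content
    mapping={"role": "from", "content": "value",
             "user": "human", "assistant": "gpt"},
)

def format_prompts(examples):
    """Render each conversation into one training string."""
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)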
If your dataset is in Alpaca format, build the text field from an Alpaca prompt template instead of the chat template. Unsloth’s example notebooks define one as an alpaca_prompt string with slots for the instruction, input, and output fields.
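For Alpaca-format data, a formatting function along these lines would produce the same "text" column. This is a minimal sketch: the prompt wording follows the common Alpaca layout and should be treated as an assumption, and `format_alpaca` is a hypothetical helper name, not a library function:

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def format_alpaca(examples, eos_token):
    """Build one training string per example from batched Alpaca columns."""
    texts = []
    for instruction, inp, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Append EOS so the model learns where a response ends.
        texts.append(ALPACA_PROMPT.format(instruction, inp, output) + eos_token)
    return {"text": texts}

# Hypothetical usage with the tokenizer and dataset from the earlier steps:
# dataset = dataset.map(
#     lambda batch: format_alpaca(batch, tokenizer.eos_token), batched=True
# )
```

Whichever layout you train on, use the same one at inference time. A model trained on this Alpaca layout should be prompted with it rather than with the chat template.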
Step 6: Train
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="./outputs",
        optim="adamw_8bit",  # saves ~2 GB VRAM vs standard Adam
        seed=3407,
    ),
)
trainer_stats = trainer.train()
Expected duration (8B QLoRA, 1,000 examples, 3 epochs, seq_len 2048):
- RTX 3090 (24GB): ~1.5–2 hours
- RTX 4090 (24GB): ~45–60 minutes
VRAM at peak: 8–12 GB depending on batch size and context length.
Watch the logged loss. It should drop steadily through the first epoch and flatten toward the end. If training loss goes below ~0.5 while a separate validation split shows rising loss, you’re overfitting — reduce epochs or add more data.
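The training call shown in this step does not set aside a validation split, so one has to be carved out first. A minimal standard-library sketch, with hypothetical helper names and an arbitrary 10% holdout and `patience` value:

```python
import random

def split_records(records, val_fraction=0.1, seed=3407):
    """Shuffle deterministically and hold out a validation slice."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def looks_overfit(train_losses, val_losses, patience=2):
    """True if validation loss rose `patience` times in a row while training loss fell."""
    if len(val_losses) <= patience or len(train_losses) != len(val_losses):
        return False
    recent_val = val_losses[-(patience + 1):]
    recent_train = train_losses[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    return val_rising and train_falling
```

If the data is already a Hugging Face dataset, its built-in train_test_split method does the same job, and the held-out part can be passed to the trainer as eval_dataset so validation loss is logged alongside training loss.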
Before committing to a full multi-hour run, use max_steps=60 in your SFTConfig to do a quick sanity check that the loss is moving and the code isn’t erroring out at step 1.
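To put max_steps=60 in context, the step count for the full run follows from the batch settings. A small sketch of the arithmetic, with a hypothetical helper name and the assumption that a partial final batch still counts as one step:

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# Settings from the training config: 1,000 examples, batch 2, accumulation 4, 3 epochs.
effective, per_epoch, total = training_steps(1000, 2, 4, 3)
print(effective, per_epoch, total)  # 8 125 375

# A 60-step sanity check therefore sees 60 * 8 = 480 examples,
# a little under half of one epoch.
print(60 * effective)  # 480
```

With these settings the sanity check costs about a sixth of the full run, which is usually enough to confirm that the loss is moving.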
Step 7: Export to GGUF
model.save_pretrained_gguf(
    "my-llama-3-finetuned",
    tokenizer,
    quantization_method="q4_k_m",
)
Unsloth handles the full export — internally it calls llama.cpp’s conversion and quantization tools so you don’t need to install or invoke them separately. The output is a .gguf file in the specified directory.
Quantization options:
| Method | File size (8B) | Quality retention | Use when |
|---|---|---|---|
| q4_k_m | ~4.5 GB | ~95% | Default: best size/quality balance |
| q5_k_m | ~5.5 GB | ~97% | More VRAM available, quality matters |
| q8_0 | ~8.5 GB | ~99% | Near-lossless, before further processing |
| f16 | ~16 GB | 100% | Full precision export |
For a deeper look at what Q4_K_M vs Q5_K_M means for actual output quality, the GGUF quantization guide covers the tradeoffs in detail.
Step 8: Import into Ollama
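Create a Modelfile in the directory that contains your export folder, so the relative path in FROM resolves (adjust the filename if your Unsloth version names the GGUF differently):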
FROM ./my-llama-3-finetuned/unsloth.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
Register and run the model:
ollama create my-finetuned-llama -f Modelfile
ollama run my-finetuned-llama
The same GGUF file works in LM Studio and Jan.ai without a Modelfile — just point the app at the .gguf via their file import dialogs.
Critical: use the Llama 3 chat template in the Modelfile, not a generic one. Template mismatch at inference time is the most common cause of low-quality output after a successful training run.
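One way to check for a mismatch is to render a short conversation by hand and compare it with what the tokenizer produced at training time. The sketch below writes out the Llama 3 instruct layout (a header, a blank line, the content, then an end-of-turn token); `render_llama3_prompt` is a hypothetical helper, and the stock Llama 3.1 template may additionally insert a date preamble into the system message:

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3 instruct layout."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn: role header, blank line, trimmed content, end-of-turn token.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does EBITDA stand for?"},
]
print(render_llama3_prompt(messages))
```

Comparing this output with tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) and with the Modelfile TEMPLATE makes differences in headers, blank lines, or end-of-turn tokens easy to spot.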
Once the model is registered in Ollama, it appears automatically in Open WebUI if you have that running — no extra configuration needed.
When NOT to use Unsloth
The task needs a bigger base model. A fine-tuned 8B handles style, format, and narrow domains well. For complex reasoning, such as multi-step medical diagnosis or deep legal analysis, a fine-tuned 8B will still lose to a general 70B. Unsloth supports 70B QLoRA, though the 4-bit weights alone take roughly 35 GB, so plan on a 48GB-class card rather than a 24GB one. If you’re questioning whether 8B is enough, the answer is probably no.
You need multi-GPU training. Unsloth’s custom CUDA kernels are optimized for single-GPU throughput. Two-card setups work but aren’t Unsloth’s strength. If your dataset and model size require distributing across GPUs, axolotl or HuggingFace Accelerate-native training are the better choice. The Unsloth vs axolotl comparison goes into where each framework wins.
You need structured experiment tracking. Unsloth’s default logging is minimal — loss per step, final stats. For reproducible multi-run experiments with full metric logging, add report_to="wandb" to your SFTConfig. Weights & Biases integrates cleanly with Unsloth’s trainer.
Your goal is RLHF or preference alignment. SFT (supervised fine-tuning) is the method here — showing the model examples and training it to match. DPO and GRPO (preference-based alignment methods) are supported in Unsloth but require additional setup beyond the SFT path.
Frequently Asked Questions
How long does fine-tuning Llama 3.1 8B actually take on a consumer GPU?
On an RTX 3090 with 1,000 training examples at 3 epochs: roughly 1.5–2 hours. An RTX 4090 cuts that to under an hour. Run with max_steps=60 first to verify your setup is working before committing to the full run.
Does my fine-tuned GGUF preserve the original model’s general knowledge?
Mostly yes, but some forgetting happens. QLoRA preserves base capability better than full fine-tuning because you’re only modifying a small fraction of the weights. Keep your dataset focused on one domain and limit epochs to 2–3 to minimize degradation. Mixing unrelated training topics in one dataset is a reliable way to degrade general performance.
Can I fine-tune on data I don’t want to share publicly?
Yes. The entire training process runs locally, so your dataset never leaves your machine. The model weights are a different matter: fine-tuned models can memorize and reproduce training examples, so treat the GGUF as sensitive if the training data is. Keeping the data on your own hardware is still one of the main reasons to fine-tune locally rather than through a cloud training API.
What’s the minimum dataset size that actually changes the model’s behavior?
50–100 high-quality, consistent examples will produce a measurable change. Below 50, the signal is too weak to overcome the base model’s priors. For reliable style or format changes, 300–500 examples is a more comfortable floor.
My loss curve looks flat after epoch 1. Is training stuck?
Not necessarily. If the initial loss drops from ~2.0 to ~0.8 in the first epoch and then stays flat, the model may have learned what it can from your dataset. Flat loss at a reasonable value is fine. Flat loss at 2.0+ usually means a data formatting problem: check that your chat template application produced correctly formatted training examples.