May 28, 2026

Fine-Tune Llama 3 with Unsloth 2026: Dataset to GGUF

By AIFoss · 12 min read

Tags: finetuning, ai, llm, gpu, python

TL;DR: Unsloth (v2026.5.8) cuts Llama 3.1 8B fine-tuning to 2–4 hours on a consumer GPU with 8GB+ VRAM, using up to 70% less memory than standard QLoRA by Unsloth’s own figures. You get a GGUF you can drop straight into Ollama. The catch: output quality depends entirely on your dataset quality.

What you’ll have running after this guide:

A domain-adapted Llama 3.1 8B model trained on your own dataset
A Q4_K_M GGUF file ready to run in Ollama, LM Studio, or Jan.ai
A repeatable training pipeline you can re-run when your data changes

Honest take: Unsloth is the right tool for single-GPU fine-tuning in 2026. axolotl has more knobs for complex pipelines; Unsloth is faster to get working and easier on VRAM.

Why fine-tune instead of just prompting?

Prompting a general model works until it doesn’t. If your use case involves a specific writing style the model keeps drifting from, domain vocabulary it consistently mangles (medical codes, legal terms, proprietary jargon), or a structured output format it forgets mid-conversation — fine-tuning fixes those problems permanently instead of requiring a 500-token system prompt every call.

The other option is RAG, which is the right answer when the knowledge lives in documents you want to retrieve. Fine-tuning is better when you want to change how the model behaves: its tone, its output structure, its fluency in a domain. These are different problems with different solutions.

Which fine-tuning framework to use

Before getting into the steps, here’s where Unsloth sits relative to the alternatives:

Feature	Unsloth	axolotl	HF TRL (stock)
Single-GPU speed	2–5× faster	1× baseline	1× baseline
VRAM usage (8B QLoRA)	~8–10 GB	~12–14 GB	~14–18 GB
Setup complexity	Low (pip install)	Medium (config YAML)	Low
Multi-GPU support	Limited	Strong	Strong
Custom training loops	Limited	Full	Full
Best for	Fast iteration, single GPU	Production pipelines, multi-GPU	Research, custom objectives

Unsloth wins on a single consumer GPU. If you’re distributing across multiple cards or need custom training objectives (DPO, PPO, GRPO), axolotl or standard TRL give you more control. For this guide, single-GPU fine-tuning with Unsloth is the path.

Hardware requirements

QLoRA makes 8B-parameter fine-tuning possible on cards most developers already own:

Model	Method	Minimum VRAM	Training time (1k examples, 3 epochs)
Llama 3.2 3B	QLoRA	6 GB	~30 min
Llama 3.1 8B	QLoRA (4-bit)	8 GB	~2 hours
Llama 3.1 8B	LoRA (16-bit)	18 GB	~2.5 hours
Llama 3.1 70B	QLoRA (4-bit)	~48 GB	~8–12 hours

An RTX 3090 (24GB) handles the 8B run with room to spare. An RTX 4090 cuts training time roughly in half. If you’re on 8GB VRAM (RTX 4060 or similar), drop max_seq_length to 1024 for the 8B model, or switch to Llama 3.2 3B if you still run out of memory.
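
If you want to sanity-check these numbers, the base weights alone need roughly parameters × bits ÷ 8 bytes; adapters, activations, optimizer state, and CUDA overhead come on top. Here is a back-of-the-envelope sketch. The parameter counts are rounded assumptions, and the result is arithmetic, not a measurement:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Approximate parameter counts in billions (rounded, for illustration only)
models = {"Llama 3.2 3B": 3.2, "Llama 3.1 8B": 8.0}

for name, size in models.items():
    four_bit = weight_memory_gb(size, 4)
    sixteen_bit = weight_memory_gb(size, 16)
    print(f"{name}: ~{four_bit:.1f} GB at 4-bit, ~{sixteen_bit:.1f} GB at 16-bit")
```

The 8B model’s 4-bit weights take about 4 GB, which leaves roughly half of an 8GB card for everything else. That is why sequence length is the first thing to cut when you hit OOM.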

If you don’t have a suitable local GPU, RunPod rents RTX 4090 and A100 instances by the hour. A full 8B fine-tune run typically costs under $3.

OS: Linux is the primary target. Windows via WSL2 works. macOS with Apple Silicon is supported through Unsloth Studio (MLX-based). Native Windows training works but is less tested.

Python: 3.9–3.14. PyTorch 2.5+ recommended.

Step 1: Install Unsloth

pip install unsloth

Current version: 2026.5.8 (released May 26, 2026). The version numbering is date-based: year, month, then a release counter within that month. The last field is not the day of the month.

Verify:

python -c "import unsloth; print(unsloth.__version__)"

Also install the training stack:

pip install trl transformers datasets accelerate

If you hit CUDA version mismatches, Unsloth’s docs at unsloth.ai/docs have conda environment files for the most common CUDA + PyTorch combinations. The conda path is more reliable when your system has multiple CUDA versions installed.

Step 2: Get access to Llama 3.1

Llama 3.1 is gated on Hugging Face. You need to request access once:

Create an account at huggingface.co
Visit meta-llama/Llama-3.1-8B-Instruct and accept the license
Generate an access token at huggingface.co/settings/tokens
Authenticate: huggingface-cli login

License note: Llama 3.1 uses the Meta Llama 3.1 Community License, not Apache or MIT. Commercial use is allowed in most cases, but products or services with more than 700 million monthly active users need a separate license from Meta, and any fine-tuned model you distribute must include “Llama” at the beginning of its name. Read the full terms at llama.com/llama3_1/license/ before shipping a product.

Alternatively, use Unsloth’s pre-uploaded mirror, which skips the individual HF approval step (the license terms still apply):

model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"

Step 3: Prepare your dataset

TRL’s SFTTrainer, which Unsloth patches for speed, works with several common dataset formats. Two cover most cases.

Alpaca format (instruction/input/output):

{"instruction": "Convert this date to ISO 8601:", "input": "March 15th, 2026", "output": "2026-03-15"}
{"instruction": "Summarize this clause in plain English:", "input": "The party of the first part...", "output": "This clause means..."}

ShareGPT format (multi-turn conversations):

{"conversations": [
  {"from": "human", "value": "What does EBITDA stand for?"},
  {"from": "gpt", "value": "Earnings Before Interest, Taxes, Depreciation, and Amortization."}
]}

How much data?

Under 300 examples: fine-tune the Instruct model (style and behavior shaping)
300–1,000 examples: Instruct or base model both work
Over 1,000 examples: base model preferred for deeper behavior change

More data doesn’t reliably beat better data. If you have 10,000 mediocre examples and 500 carefully curated ones, the 500 will often produce a better model. Deduplicate, filter out short or malformed entries, and aim for consistent quality before you care about quantity.
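
The cleanup steps above can be sketched in a few lines of plain Python before the data ever reaches load_dataset. This version assumes Alpaca-style rows; the function names and the min_output_chars threshold are illustrative choices, not part of Unsloth or datasets:

```python
import json

def parse_jsonl(lines):
    """Parse JSON Lines, skipping blank and malformed lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: drop it rather than crash
        if isinstance(row, dict):
            rows.append(row)
    return rows

def clean_alpaca_rows(rows, min_output_chars=1):
    """Drop rows with no instruction, too-short outputs, and duplicates."""
    seen = set()
    kept = []
    for row in rows:
        instruction = str(row.get("instruction", "")).strip()
        user_input = str(row.get("input", "")).strip()
        output = str(row.get("output", "")).strip()
        if not instruction or len(output) < min_output_chars:
            continue
        # Treat rows as duplicates when instruction + input match, ignoring case
        key = (instruction.lower(), user_input.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append({"instruction": instruction, "input": user_input, "output": output})
    return kept

sample = [
    '{"instruction": "Convert this date to ISO 8601:", "input": "March 15th, 2026", "output": "2026-03-15"}',
    '{"instruction": "convert this date to ISO 8601:", "input": "march 15th, 2026", "output": "2026-03-15"}',
    'this line is not JSON',
]
print(len(clean_alpaca_rows(parse_jsonl(sample))), "row(s) kept")
```

For a real file, pass an open file handle to parse_jsonl and write the cleaned rows back out as JSON Lines before loading them.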

Load your data:

from datasets import load_dataset

# Local JSON Lines file
dataset = load_dataset("json", data_files="your_data.jsonl", split="train")

# Or a public HF dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

Step 4: Load the model with QLoRA

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,   # drop to 1024 if you hit OOM on 8GB VRAM
    dtype=None,            # auto-detect: bfloat16 on Ampere+, float16 older
    load_in_4bit=True,     # QLoRA: model in 4-bit, adapters in 16-bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank — higher = more capacity, more VRAM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,        # 0 is optimal for Unsloth's fused kernels
    bias="none",
    use_gradient_checkpointing="unsloth",  # long-context support, less VRAM
    random_state=3407,
)

load_in_4bit=True is the QLoRA switch. The base model loads compressed to 4-bit; the LoRA adapters, which are the actual trainable parameters, remain in 16-bit. At r=16 on all seven projection modules you’re training about 42 million parameters, roughly 0.5% of the 8B total, which is why 8GB is enough.

LoRA rank (r): r=16 is the standard starting point. Raise it to 32 or 64 if you’re doing style transfer or long-form generation and have the VRAM headroom. For simple format training, r=8 is sufficient.
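
To see what rank costs you, count the adapter parameters directly. LoRA adds two small matrices per targeted module, so each module contributes r × (in_features + out_features) trainable weights. The sketch below assumes the layer shapes from Llama 3.1 8B’s published config (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads) and an approximate 8.03B total; it ignores embeddings, which the config above does not train:

```python
# Llama 3.1 8B shapes, taken from the published model config (assumed here)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
LAYERS = 32
TOTAL_PARAMS = 8.03e9  # approximate total parameter count

# (in_features, out_features) for each module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(rank: int) -> int:
    """Each module gets A (rank x in) and B (out x rank): rank * (in + out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * LAYERS

for r in (8, 16, 32, 64):
    n = lora_trainable_params(r)
    print(f"r={r}: {n:,} trainable params ({100 * n / TOTAL_PARAMS:.2f}% of the model)")
```

Doubling the rank doubles the adapter size, so r=64 trains four times as many parameters as r=16.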

Step 5: Apply the chat template

This step is easy to skip and costly when you do. The Llama 3.1 Instruct model expects a specific token format at both training and inference time. Training with the wrong template and running inference with the correct one (or vice versa) produces garbled, incoherent output that’s hard to diagnose.

from unsloth.chat_templates import get_chat_template

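tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",
    # Map the ShareGPT keys from Step 3 ("from"/"value", "human"/"gpt") onto the
    # role/content fields the template expects; omit if your data already uses them
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)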

def format_prompts(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(
        convo, tokenize=False, add_generation_prompt=False
    ) for convo in convos]
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

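If your dataset is in Alpaca format, Unsloth’s example notebooks use an alpaca_prompt string template that lays out the instruction/input/output fields. Format each row into the text column with it instead of the chat template, and use the same layout at inference time.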
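
For Alpaca rows, the formatting step can look like the sketch below. The prompt wording follows the common Alpaca layout (an assumption; match whatever template you use at inference), and EOS_TOKEN is a placeholder you would replace with tokenizer.eos_token in the real pipeline:

```python
# Common Alpaca-style layout; the exact wording is an assumption
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|eot_id|>"  # placeholder: use tokenizer.eos_token in the real pipeline

def format_alpaca(examples):
    """Batched formatter: builds the "text" column from Alpaca fields."""
    texts = []
    for instruction, user_input, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Append EOS so the model learns where a response ends
        texts.append(ALPACA_PROMPT.format(instruction, user_input, output) + EOS_TOKEN)
    return {"text": texts}

batch = {
    "instruction": ["Convert this date to ISO 8601:"],
    "input": ["March 15th, 2026"],
    "output": ["2026-03-15"],
}
print(format_alpaca(batch)["text"][0])
```

Apply it the same way as the chat formatter: dataset = dataset.map(format_alpaca, batched=True).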

Step 6: Train

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="./outputs",
        optim="adamw_8bit",              # saves ~2 GB VRAM vs standard Adam
        seed=3407,
    ),
)

trainer_stats = trainer.train()

Expected duration (8B QLoRA, 1,000 examples, 3 epochs, seq_len 2048):

RTX 3090 (24GB): ~1.5–2 hours
RTX 4090 (24GB): ~45–60 minutes

VRAM at peak: 8–12 GB depending on batch size and context length.

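Watch the logged loss. It should drop steadily through the first epoch and flatten toward the end. If training loss keeps falling (below ~0.5, say) while loss on a held-out validation split rises, you’re overfitting: reduce epochs or add more data. The trainer above only logs training loss, so hold out a split and pass it as eval_dataset if you want to see this.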
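
If you log validation loss alongside training loss, the overfitting pattern described above can be checked mechanically. A minimal sketch, assuming you have collected one loss value per evaluation point into two lists; the function name, the patience parameter, and the sample numbers are made up for illustration:

```python
def last_good_eval(train_losses, val_losses, patience=2):
    """Return the 0-based index of the last evaluation point before validation
    loss rose `patience` times in a row while training loss kept falling.
    Returns None if that pattern never shows up."""
    streak = 0
    for i in range(1, min(len(train_losses), len(val_losses))):
        diverging = (
            val_losses[i] > val_losses[i - 1]
            and train_losses[i] < train_losses[i - 1]
        )
        streak = streak + 1 if diverging else 0
        if streak == patience:
            return i - patience
    return None

# Illustrative numbers only, not measurements from a real run
train = [1.90, 1.20, 0.80, 0.50, 0.35, 0.25]
val = [1.95, 1.30, 1.00, 0.95, 1.02, 1.10]

best = last_good_eval(train, val)
if best is None:
    print("No overfitting pattern detected")
else:
    print(f"Validation loss bottomed out at evaluation point {best}")
```

The checkpoint saved closest to that evaluation point is usually the one worth exporting.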

Before committing to a full multi-hour run, use max_steps=60 in your SFTConfig to do a quick sanity check that the loss is moving and the code isn’t erroring out at step 1.
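
To put max_steps=60 in context, it helps to know how many optimizer steps the full run takes. A rough calculation, assuming a single GPU and the batch settings from the SFTConfig above (the trainer may round partial batches slightly differently):

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps for a full run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

full_run = total_optimizer_steps(num_examples=1000, per_device_batch=2, grad_accum=4, epochs=3)
sanity_steps = 60

print(f"Full run: {full_run} steps")
print(f"Sanity check: {sanity_steps} steps, {100 * sanity_steps / full_run:.0f}% of the run")
print(f"Examples seen in sanity check: {sanity_steps * 2 * 4}")
```

With the settings in this guide that is 375 steps in total, so a 60-step check covers about 16% of the run, or 480 training examples.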

Step 7: Export to GGUF

model.save_pretrained_gguf(
    "my-llama-3-finetuned",
    tokenizer,
    quantization_method="q4_k_m",
)

Unsloth handles the full export — internally it calls llama.cpp’s conversion and quantization tools so you don’t need to install or invoke them separately. The output is a .gguf file in the specified directory.
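
A quick way to confirm the export produced a real GGUF file (and not a truncated or failed write) is to read its header. GGUF files start with the four ASCII bytes GGUF followed by a 32-bit format version, little-endian in the usual case. The helper below is a sketch; the path is the one assumed by this guide, so adjust it to whatever filename your Unsloth version actually wrote:

```python
import os
import struct

def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version

# Assumed output path; check the export directory for the real filename
gguf_path = "my-llama-3-finetuned/unsloth.Q4_K_M.gguf"

if os.path.exists(gguf_path):
    size_gb = os.path.getsize(gguf_path) / 1e9
    print(f"GGUF v{read_gguf_version(gguf_path)}, {size_gb:.2f} GB")
else:
    print(f"No file at {gguf_path}")
```

A file size far below what the quantization table suggests is another sign that the export was interrupted.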

Quantization options:

Method	File size (8B)	Quality retention	Use when
q4_k_m	~4.9 GB	~95%	Default, best size/quality balance
q5_k_m	~5.7 GB	~97%	More VRAM available, quality matters
q8_0	~8.5 GB	~99%	Near-lossless, before further processing
f16	~16 GB	100%	Full precision export

For a deeper look at what Q4_K_M vs Q5_K_M means for actual output quality, the GGUF quantization guide covers the tradeoffs in detail.

Step 8: Import into Ollama

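Create a Modelfile in the directory that contains your export folder; the FROM path below is relative to the Modelfile. Check the actual .gguf filename first, since it can vary between Unsloth versions: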

FROM ./my-llama-3-finetuned/unsloth.Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

Then register and run the model:

ollama create my-finetuned-llama -f Modelfile
ollama run my-finetuned-llama
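
Once the model is registered, you can also call it from code through Ollama’s local HTTP API. The sketch below uses only the standard library and assumes Ollama is listening on its default port (11434) and that the model name matches the one passed to ollama create:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def build_generate_request(model, prompt):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling generate("my-finetuned-llama", "Convert this date to ISO 8601: March 15th, 2026") returns the model’s reply as a string. Ollama applies the Modelfile template for you, so send the plain prompt, not a pre-formatted one.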

The same GGUF file works in LM Studio and Jan.ai without a Modelfile — just point the app at the .gguf via their file import dialogs.

Critical: use the Llama 3 chat template in the Modelfile, not a generic one. Template mismatch at inference time is the most common cause of low-quality output after a successful training run.
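
When you suspect a template mismatch, it helps to see exactly what string the model should receive. The function below is a hand-rolled sketch of the Llama 3 chat layout, matching the Modelfile template above. It is for eyeballing and comparison only; the tokenizer’s own apply_chat_template remains the source of truth (it may, for example, add a default system block), and runtimes such as Ollama normally add the begin-of-text token themselves:

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render role/content messages in the Llama 3 chat layout (inspection only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You answer in one sentence."},
    {"role": "user", "content": "What does EBITDA stand for?"},
]
print(render_llama3_prompt(messages))
```

If the text your training pipeline produced looks structurally different from this (other role markers, no end-of-turn token), the training and inference templates do not match.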

Once the model is registered in Ollama, it appears automatically in Open WebUI if you have that running — no extra configuration needed.

When NOT to use Unsloth

The task needs a bigger base model. A fine-tuned 8B handles style, format, and narrow domains well. For complex reasoning (multi-step medical diagnosis, deep legal analysis) a fine-tuned 8B will still lose to a general 70B. Unsloth supports 70B QLoRA, but the 4-bit weights alone take about 35 GB, so that means a 48GB-class card or a rented A100 rather than a 24GB consumer GPU. And if you’re questioning whether 8B is enough, the answer is probably no.

You need multi-GPU training. Unsloth’s custom CUDA kernels are optimized for single-GPU throughput. Two-card setups work but aren’t Unsloth’s strength. If your dataset and model size require distributing across GPUs, axolotl or HuggingFace Accelerate-native training are the better choice. The Unsloth vs axolotl comparison goes into where each framework wins.

You need structured experiment tracking. Unsloth’s default logging is minimal — loss per step, final stats. For reproducible multi-run experiments with full metric logging, add report_to="wandb" to your SFTConfig. Weights & Biases integrates cleanly with Unsloth’s trainer.

Your goal is RLHF or preference alignment. SFT (supervised fine-tuning) is the method here — showing the model examples and training it to match. DPO and GRPO (preference-based alignment methods) are supported in Unsloth but require additional setup beyond the SFT path.

Frequently Asked Questions

How long does fine-tuning Llama 3.1 8B actually take on a consumer GPU? On an RTX 3090 with 1,000 training examples at 3 epochs: roughly 1.5–2 hours. An RTX 4090 cuts that to under an hour. Run with max_steps=60 first to verify your setup is working before committing to the full run.

Does my fine-tuned GGUF preserve the original model’s general knowledge? Mostly yes, but some forgetting happens. QLoRA preserves base capability better than full fine-tuning because you’re only modifying a small fraction of the weights. Keep your dataset focused on one domain and limit epochs to 2–3 to minimize degradation. Mixing unrelated training topics in one dataset is a reliable way to degrade general performance.

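Can I fine-tune on data I don’t want to share publicly? Yes, as long as you train on your own hardware: the entire process runs locally and your dataset never leaves your machine (a rented cloud GPU is a different story). The GGUF does not store the training text as files, but a fine-tuned model can memorize and reproduce examples it saw often, so treat the weights as sensitive if the data is. Local control is one of the main reasons to fine-tune locally instead of using a cloud training API.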

What’s the minimum dataset size that actually changes the model’s behavior? 50–100 high-quality, consistent examples will produce a measurable change. Below 50, the signal is too weak to overcome the base model’s priors. For reliable style or format changes, 300–500 examples is a more comfortable floor.

My loss curve looks flat after epoch 1 — is training stuck? Not necessarily. If the initial loss drops from ~2.0 to ~0.8 in the first epoch and then stays flat, the model may have learned what it can from your dataset. Flat loss at a reasonable value is fine. Flat loss at 2.0+ usually means a data formatting problem — check that your chat template application produced correctly formatted training examples.
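
One way to run that check, as a sketch: scan the formatted text column for the Llama 3 header and end-of-turn markers and list any rows that lack them. The marker list assumes the llama-3 chat template used in Step 5, and the function name is illustrative:

```python
REQUIRED_MARKERS = (
    "<|start_header_id|>user<|end_header_id|>",
    "<|start_header_id|>assistant<|end_header_id|>",
    "<|eot_id|>",
)

def find_badly_formatted(texts):
    """Return indices of training strings missing any expected Llama 3 marker."""
    return [
        i for i, text in enumerate(texts)
        if not all(marker in text for marker in REQUIRED_MARKERS)
    ]

texts = [
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello<|eot_id|>",
    "### Instruction:\nHi\n\n### Response:\nHello",  # wrong template for this model
]
print("Rows to inspect:", find_badly_formatted(texts))
```

In the real pipeline you would pass dataset["text"] and print a few of the flagged rows to see what went wrong.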
