Kamil Józwik

Fine-tuning LLMs

Fine-tuning lets you adapt generalist models into specialists, but is it always the best approach?

llm

Your e-commerce company's customer support team is drowning in tickets. The generic AI chatbot you deployed handles basic queries, but it doesn't understand your product catalog, uses the wrong brand voice, and escalates far too many issues. Meanwhile, your finance team spent $80,000 trying to teach a foundation model about your company's specific policies and procedures - only to discover the model still can't reliably retrieve that information when needed.

Both teams are asking: should we fine-tune our LLM? The answer isn't obvious, because fine-tuning is one of the most misunderstood techniques in the AI toolkit. This article builds a mental model for understanding what fine-tuning actually does, when it's the right choice, and - critically - when it's not.

What Fine-Tuning Actually Is

Fine-tuning means continuing the training process on a pre-trained model using new, task-specific data. While that sounds straightforward, understanding what happens inside the model makes the difference between using this technique effectively and wasting resources.

When you fine-tune a model, you're not adding knowledge to a database or teaching it new facts in any reliable way. You're adjusting the numerical weights in its neural network through a process called gradient descent. The model processes your training examples, compares its predictions to the correct outputs, calculates the error, and updates billions of parameters by tiny amounts. These parameter updates accumulate over multiple passes through your data, gradually shifting the model's behavior.

Here's what makes fine-tuning different from training a model from scratch: you start with weights that already encode vast linguistic knowledge from pre-training. A model like GPT-3 was trained on hundreds of billions of tokens - essentially much of the internet - at an estimated compute cost of roughly $4.6 million. Fine-tuning leverages all that existing knowledge and adapts it to your specific needs with hundreds or thousands of examples instead of billions, taking days or weeks instead of months, and costing hundreds or thousands of dollars instead of millions.

The mathematical process is the same as training from scratch - forward pass through the network, calculate loss, backpropagate gradients, update weights - but the learning rate is much lower and you're typically working with far less data. Think of it as making targeted adjustments to an already-educated expert rather than teaching someone from scratch.
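
To make that concrete, here is a minimal sketch of a single fine-tuning step with PyTorch and Hugging Face Transformers. The model name, training example, and learning rate are illustrative placeholders, not recommendations:

```python
# One gradient update on a pre-trained causal LM: forward pass, loss,
# backpropagation, small weight update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whichever base model you fine-tune
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny learning rate: we nudge already-trained weights rather than retrain them.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

example = "Customer: Where is my order?\nAgent: Let me check that for you."
batch = tokenizer(example, return_tensors="pt")

model.train()
outputs = model(**batch, labels=batch["input_ids"])  # forward pass, loss computed internally
outputs.loss.backward()                              # backpropagate gradients
optimizer.step()                                     # update the weights by tiny amounts
optimizer.zero_grad()
```

Real fine-tuning wraps this step in a data loader and several epochs, but every framework discussed later is ultimately running this loop.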

What Fine-Tuning Changes (and What It Doesn't)

Understanding the distinction between what fine-tuning does and doesn't do is critical for making good decisions.

Fine-tuning excels at behavioral adaptation. It can teach a model to respond in a specific tone, follow particular formatting rules, use domain-specific terminology consistently, or structure outputs in specific ways. When GPT-3 became ChatGPT, that transformation happened largely through fine-tuning with reinforcement learning from human feedback - the model learned to be conversational rather than just completing text.

What fine-tuning does poorly, despite common belief, is knowledge injection. Recent research has revealed something counterintuitive: instruction fine-tuning primarily teaches models "response initiation and style tokens" rather than actual knowledge. If you fine-tune a model on 10,000 Q&A pairs about your company's products, you're not reliably adding that information to the model's "memory." You're adjusting billions of parameters by minuscule amounts, and single examples have an extremely low probability of creating persistent, retrievable knowledge.

This matters because it fundamentally changes when fine-tuning makes sense. If you need a model to know current product specifications, recent policy changes, or any information that updates regularly - fine-tuning is the wrong tool. If you need it to consistently use medical terminology, respond in a specific brand voice, or generate outputs in a particular JSON format - fine-tuning is often the right tool.

Why Fine-Tuning Exists

Fine-tuning solves several specific problems that alternatives can't address as effectively.

Specialized behavior at scale: Foundation models are generalists. They can write in many styles, handle many domains, and perform many tasks - but they don't excel at any particular one by default. A legal AI needs to understand case law conventions, use precise legal terminology, and format citations correctly. A medical diagnosis assistant needs to process clinical terminology, reason through symptom patterns, and structure differential diagnoses appropriately. Fine-tuning allows these specialized behaviors to become the model's default mode rather than something you prompt for every time.

Cost efficiency for high-volume applications: Running a large model costs money with every inference. If you're processing 100,000+ requests per month, fine-tuning a smaller model (like GPT-3.5) to replicate the quality of a larger one (like GPT-4) for your specific task can reduce costs by 10-50x. The upfront training investment amortizes quickly at that scale.

Consistent output formatting: Getting foundation models to reliably produce structured outputs - JSON with specific fields, YAML with certain schemas, standardized report formats - is challenging with prompting alone. Fine-tuning on examples of correct outputs teaches the model these patterns as habitual behavior.
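
To illustrate, the training data for this kind of fine-tune is usually just a file of input/output pairs. The sketch below writes a few records in the chat-style JSONL format used by OpenAI's fine-tuning API and accepted by most open-source trainers; the order-status schema is hypothetical:

```python
# Build chat-format JSONL training examples that demonstrate the desired
# structured output. The JSON schema shown is a made-up example.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract order details as JSON."},
            {"role": "user", "content": "Order #1042 shipped yesterday via DHL."},
            {"role": "assistant", "content": json.dumps(
                {"order_id": 1042, "status": "shipped", "carrier": "DHL"})},
        ]
    },
    # ...hundreds to thousands more examples in the same shape
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```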

Domain-specific reasoning patterns: While fine-tuning doesn't add factual knowledge reliably, it can teach reasoning patterns. A model fine-tuned on medical diagnostic workflows learns to consider differential diagnoses systematically. One fine-tuned on financial analysis learns to weigh risk factors in ways specific to that domain.

The key insight is that fine-tuning changes how a model thinks and responds, not what it knows. That distinction determines when it's the right approach.

The Spectrum of Fine-Tuning Approaches

Not all fine-tuning is created equal. The approaches differ dramatically in their resource requirements, use cases, and trade-offs.

Full Fine-Tuning

Full fine-tuning updates every parameter in the model during training. For a 7 billion parameter model, that means adjusting 7 billion numbers based on your training data.

The memory requirements are substantial - you need space not just for the model weights but for gradients, optimizer states, and activation values during training. A 7B model requires roughly 28GB of GPU memory for full fine-tuning. The training is slower because you're updating every parameter, and there's significant risk of catastrophic forgetting - the model losing its general capabilities as it specializes.

Full fine-tuning makes sense when you have abundant high-quality data (10,000+ examples), adequate GPU resources, and need maximum performance on a task that differs significantly from pre-training. In practice, it's increasingly rare outside of research labs and large enterprises, because parameter-efficient methods have closed most of the performance gap at a fraction of the cost.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods fine-tune only a small subset of model parameters while keeping most of the pre-trained weights frozen. This fundamentally changes the economics and accessibility of fine-tuning.

LoRA (Low-Rank Adaptation) is the most widely adopted PEFT method. Instead of updating the full weight matrix during training, LoRA inserts small, trainable matrices alongside the frozen weights. Mathematically, it decomposes weight updates into low-rank matrices - if you have a weight matrix W that's 4096x4096, instead of updating all ~16 million parameters, LoRA might use two matrices of rank 16, giving you only ~130,000 trainable parameters.
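
In practice you rarely wire this up by hand. Here is a minimal sketch using the Hugging Face PEFT library; the base model name and target modules are placeholders that should match your architecture:

```python
# Attach LoRA adapters to a frozen base model with PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder

config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # which frozen weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```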

The practical impact is dramatic: a 7B parameter model might have only 50 million trainable parameters with LoRA - a 95%+ reduction. Memory requirements drop from 28GB to around 15GB. Training is faster, costs are lower, and you can store multiple task-specific adapters (often under 100MB each) and swap them on a single base model.

LoRA achieves comparable performance to full fine-tuning on many tasks. The trade-off is that for tasks requiring fundamental model restructuring or very different capabilities from pre-training, it may underperform full fine-tuning. But for most practical applications - domain adaptation, style adjustment, structured outputs - LoRA delivers 95%+ of full fine-tuning's performance at a fraction of the cost.

QLoRA (Quantized LoRA) takes efficiency further by combining LoRA with 4-bit quantization of the base model. Instead of storing model weights in standard 16-bit precision, QLoRA uses 4-bit representations with special techniques to minimize accuracy loss. This enables fine-tuning a 65B parameter model on a single consumer GPU with 48GB of memory - something that would normally require multiple high-end GPUs.

The memory reduction is roughly 18x overall. QLoRA introduces minimal additional accuracy loss beyond LoRA (typically 1-2%) while democratizing access to large model fine-tuning. If you're memory-constrained or working with models larger than 13B parameters, QLoRA is often the practical choice.
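
A QLoRA-style setup is a small extension of the LoRA sketch above: load the base model in 4-bit and attach adapters on top. It assumes the bitsandbytes package is installed; the model name and hyperparameters are illustrative:

```python
# QLoRA sketch: 4-bit quantized base model + LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 quantization from the QLoRA paper
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",         # placeholder; any supported causal LM
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
```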

Other PEFT approaches exist - adapter layers, prefix tuning, prompt tuning - but LoRA and QLoRA have emerged as the de facto standards due to their balance of efficiency and performance.

When to Fine-Tune (and When Not To)

The decision to fine-tune should follow a clear hierarchy: try prompt engineering first, then RAG (Retrieval-Augmented Generation), then fine-tuning. Each step up the ladder involves more cost, complexity, and commitment.

Strong Signals for Fine-Tuning

Fine-tuning makes sense when you need consistent, specialized behavior that general prompting can't reliably achieve:

Domain-specific terminology and conventions: A financial services company fine-tuning for named entity recognition in financial news achieved 93.4% accuracy at $0.10/hour inference cost compared to a general model's 95% accuracy at $8/hour - roughly an 80x cost reduction at 98% of the quality. The fine-tuned model understood financial entity types and the context-specific meaning of terms like "bank" or "security" without needing that context re-explained in every prompt.

Specialized output formats and structures: Legal document processing at Anzen achieved 99.9% accuracy in document classification through fine-tuning on specific legal document types and clause structures. The model learned these patterns as default behavior rather than requiring explicit formatting instructions each time.

Consistent brand voice and style: An e-commerce company fine-tuned GPT-3.5 on their internal FAQs and support transcripts, achieving 85% autonomous query resolution with responses that matched their brand voice consistently. This eliminated the prompt engineering overhead of defining tone and style in every request.

High-volume cost reduction: When processing 100,000+ requests monthly, fine-tuning a smaller model to replicate a larger model's quality for your specific task creates substantial savings. One team fine-tuned GPT-3.5 to match GPT-4's performance on their evaluation task, reducing per-request costs by an order of magnitude.

Regulatory and compliance requirements: Healthcare and financial applications often require model ownership and auditability that closed-source API models can't provide. A medical documentation system achieved 85% error reduction through fine-tuning with full control over model behavior and outputs.

Strong Signals Against Fine-Tuning

Fine-tuning is the wrong choice when your needs align with these patterns:

Need for current or external knowledge: That $80,000 spent trying to teach a model about company policies and procedures? It would have been better spent on a RAG system. Fine-tuning doesn't reliably inject facts - it adjusts patterns. When information changes frequently (product specifications, current policies, market data), RAG with a vector database provides current, source-cited answers for $70-1,000 monthly instead of expensive periodic retraining.

Limited training data: Fine-tuning with fewer than 1,000 examples carries a high risk of overfitting and is unlikely to produce meaningful behavioral change. The model may memorize your small dataset rather than learning generalizable patterns. With limited data, few-shot prompting or RAG are safer bets.

Simple tasks solvable with better prompts: Many apparent fine-tuning needs are actually prompt engineering problems. Before committing months and thousands of dollars to fine-tuning, invest days in prompt refinement. The growing context windows of modern models (up to 1 million tokens in some cases) enable extensive examples and documentation directly in prompts.

Rapidly changing domains: A model fine-tuned on January's data is frozen at that point in time. If your domain involves news, current events, market trends, or regulatory changes, fine-tuning creates a maintenance burden. You'll need expensive retraining cycles to stay current, while RAG lets you update knowledge sources without retraining.

Resource constraints: Fine-tuning requires GPU infrastructure, ML expertise, quality training data, and time. A minimum deployment for a 7B model costs roughly $950/month. The development process spans 2-6 months. If you lack these resources, managed API services or RAG solutions are more practical.

The Hybrid Approach

The binary choice between fine-tuning and RAG is often a false one. Increasingly, production systems combine both to leverage their complementary strengths.

Fine-tuning teaches domain-specific reasoning patterns, terminology usage, and output formatting. RAG provides current knowledge and source attribution. Together, they enable specialized models with access to up-to-date information.

A Microsoft agriculture study demonstrated this synergy: fine-tuning improved answer similarity by 6%, RAG improved it by 5%, but combining both yielded an 11% cumulative improvement. The fine-tuned model understood agricultural terminology and reasoning patterns, while RAG provided access to current research and region-specific data.

Another common hybrid pattern involves fine-tuning for cost reduction while using RAG for knowledge currency. A team might fine-tune GPT-3.5 to replicate GPT-4's quality for their specific domain, then use RAG to keep that efficient model grounded in current information.

The key insight is that fine-tuning changes how the model processes and responds, while RAG changes what information it has access to. These are orthogonal improvements that stack.
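
A simplified sketch of the hybrid pattern, assuming a fine-tuned OpenAI model and a retrieval function you supply (both the fine-tuned model ID and the retriever are placeholders):

```python
# Hybrid pattern: RAG supplies current facts, the fine-tuned model supplies
# domain voice and formatting.
from typing import Callable, List
from openai import OpenAI

client = OpenAI()

def answer(question: str, retrieve: Callable[[str], List[str]]) -> str:
    # retrieve() is your RAG layer, e.g. top-k chunks from a vector store
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0125:acme::example",  # hypothetical fine-tune ID
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```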

Common Pitfalls and Misconceptions

Misconception: Fine-Tuning Adds Knowledge

This is the most expensive mistake. Fine-tuning adjusts behavioral patterns, not factual knowledge. When you fine-tune on Q&A pairs about your products, you're not creating a reliable knowledge store - you're adjusting billions of parameters by infinitesimal amounts. Recent research shows instruction fine-tuning primarily teaches response style and formatting, not actual facts.

At high performance levels, model neurons are densely packed with information. Fine-tuning can overwrite critical patterns, leading to catastrophic forgetting where the model loses previously-learned capabilities. A model fine-tuned for medical diagnosis might lose basic arithmetic abilities. One aligned for safety might have that safety compromised by just 10 adversarial examples costing $0.20.

For knowledge needs, use RAG. For behavioral needs, consider fine-tuning.

Misconception: More Data Is Always Better

Quality matters more than quantity. A clean, diverse, representative dataset of 2,000 examples outperforms a noisy, imbalanced set of 20,000. Public datasets like databricks-dolly-15k were found to contain ambiguous prompts, errors, and problematic content - using them wholesale often degrades rather than improves performance.

Beyond some threshold (typically 5,000-10,000 quality examples), additional data shows diminishing returns. Focus on data quality, diversity, and relevance over raw volume.
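
Minimal curation can be as simple as filtering and deduplicating before training. Here is a sketch with the Hugging Face datasets library, assuming the chat-format JSONL shown earlier (field names and thresholds are assumptions about your data):

```python
# Drop weak examples and duplicates before fine-tuning.
from datasets import load_dataset

ds = load_dataset("json", data_files="train.jsonl", split="train")

def assistant_text(ex):
    return ex["messages"][-1]["content"]

# Remove records whose target output is empty or suspiciously short.
ds = ds.filter(lambda ex: len(assistant_text(ex).strip()) > 20)

# Deduplicate on the normalized user prompt (single-process filter, so the
# shared `seen` set is safe here).
seen = set()
def is_unique(ex):
    key = " ".join(ex["messages"][-2]["content"].lower().split())
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(is_unique)
print(f"{len(ds)} examples after curation")
```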

Misconception: Fine-Tuning Always Improves Performance

Fine-tuning can degrade performance. Studies found that fine-tuning on large domain datasets harmed RAG pipeline accuracy - the model couldn't effectively extract and integrate retrieved context after specialization. Safety alignment is easily compromised. General conversational abilities may be lost when over-specializing.

Evaluation must cover diverse scenarios, not just your target task. What you optimize for may come at the cost of capabilities you took for granted.

Gotcha: Overfitting Happens Fast

Small datasets enable rapid overfitting. The model memorizes training examples rather than learning patterns. Signs include training loss dropping while validation loss increases. This can happen in just a few epochs with insufficient data.

Prevention requires rigorous train/validation/test splits, early stopping based on validation performance, regularization techniques, and careful monitoring. Many teams skip proper evaluation and deploy overfit models that fail on real-world variations.
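
A sketch of those guardrails with the Hugging Face Trainer, assuming you have already built separate train and validation datasets (the hyperparameters are illustrative):

```python
# Early stopping on validation loss, rolling back to the best checkpoint.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=10,
    eval_strategy="epoch",            # "evaluation_strategy" in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,      # restore the best checkpoint, not the last
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                      # your base or PEFT-wrapped model
    args=args,
    train_dataset=train_ds,           # held-out test data stays untouched
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```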

Gotcha: Infrastructure Complexity

Memory requirements extend beyond model size to include gradients, optimizer states, and activations. Getting CUDA, cuDNN, and PyTorch versions aligned is brittle. Gated models on Hugging Face require accepting terms in the browser and authenticating with an access token before you can download weights. GPU spot instances save 30-40% on cost but require handling interruptions.

Teams often underestimate infrastructure setup time - it can take as long as the fine-tuning itself. Start small: validate your pipeline on a 7B model before attempting a 70B one.

Current Landscape and Maturity

The fine-tuning ecosystem has consolidated around clear winners while still evolving in important areas.

Tooling Landscape

Open-source frameworks offer varying levels of abstraction. Hugging Face Transformers is the industry standard with the broadest model support and a mature Trainer API. Axolotl provides LLM-specific workflows with beginner-friendly YAML configs. Unsloth focuses on speed and memory optimization (2-5x faster, 70% less memory). Torchtune gives PyTorch-native, low-level control for those needing extensibility.

Cloud providers bundle fine-tuning with their ML platforms: AWS SageMaker offers full MLOps integration, Google Vertex AI provides native TPU support, Azure ML emphasizes enterprise governance, Databricks unifies data and ML workflows. The trade-off is integration ease versus vendor lock-in.

Full-service platforms like OpenAI's fine-tuning API, Hugging Face AutoTrain, and Predibase prioritize simplicity over control. They work well for standard use cases but limit customization.

Serverless GPU providers (Modal, Replicate, RunPod, Vast.ai) offer flexible, cost-competitive compute. RunPod and Vast.ai can run $1-2/hour versus $3+ on major clouds, though reliability varies.

Maturity Assessment

Core supervised fine-tuning techniques are mature and proven. PEFT methods like LoRA and QLoRA are established best practices. Production deployment patterns are well-understood. The seven-stage pipeline (data preparation, model initialization, training setup, execution, evaluation, deployment, monitoring) is standard.

Areas still actively evolving include optimal data quality/quantity standards, RLHF/RLAIF alignment methods (complex, expensive, active research), multi-modal fine-tuning (images, text, audio), hybrid RAG + fine-tuning patterns (best practices emerging), and continual learning (catastrophic forgetting remains largely unsolved).

Adoption Patterns

Hobbyists and students use free tiers and cheap GPU marketplaces, enabled by PEFT methods that make 65B models accessible on consumer hardware. Startups use cloud GPU providers and often fine-tune open models to reduce per-token costs below API pricing at scale. Enterprises deploy hybrid architectures with sensitive data on-premises and experiments in the cloud, prioritizing governance and auditability.

By industry maturity: technology/software leads with code generation and internal tools in production; e-commerce achieves 85% autonomous customer support resolution; financial services deploy document analysis and compliance tools; legal services reach 99.9% accuracy in document classification. Healthcare shows high potential but slower adoption due to regulation. The pattern is clear: less-regulated, content-heavy industries moved first, highly-regulated industries test carefully before scaling.

Decision Framework

Use this hierarchy to evaluate approaches:

Tier 1: Prompt Engineering (hours to days, minimal cost)

  • Fastest iteration
  • No training infrastructure needed
  • Suitable for clear instructions, basic customization, general tasks
  • Try first for any new problem

Tier 2: RAG (days to weeks, $70-1,000/month)

  • Dynamic knowledge access
  • Source attribution and transparency
  • Real-time information updates
  • Choose when external/current knowledge needed

Tier 3: Fine-Tuning (weeks to months, $5,000-100,000+)

  • Deep behavioral specialization
  • Consistent domain-specific outputs
  • Cost efficiency at high volume
  • Choose when specialized behavior required at scale

The right choice depends on your specific needs:

  • Need current facts? → RAG
  • Need consistent style/format? → Fine-tuning
  • Need both? → Hybrid approach
  • Uncertain? → Start with prompting

Takeaways

Fine-tuning adjusts model behavior through parameter updates, not knowledge injection. It excels at teaching consistent terminology usage, output formatting, tone and style, and domain-specific reasoning patterns. It's not suitable for adding factual knowledge, rapidly changing information, or simple tasks solvable with better prompts.

The PEFT revolution (LoRA, QLoRA) has democratized fine-tuning with 95%+ parameter reduction and minimal accuracy loss. QLoRA enables 65B model fine-tuning on consumer GPUs. These are now the default approaches for most use cases.

Data quality dominates data quantity. Expect to spend 20-40% of your effort on data curation. You typically need 1,000+ quality examples, and 10,000 low-quality ones won't compensate for poor curation.

Try alternatives first: prompt engineering → RAG → fine-tuning. Most problems don't require fine-tuning. Those that do often benefit from hybrid approaches combining fine-tuning with RAG.

Cost and timeline expectations: 2-6 months, $5,000-100,000+ including hidden costs. Total cost of ownership runs 3-5x initial estimates. Infrastructure requires GPU access, ML expertise, and ongoing maintenance.

When fine-tuning makes sense - specialized behavior at scale, consistent domain performance, high-volume cost reduction, regulatory control requirements - it delivers substantial value. When applied to the wrong problems, it wastes resources that simpler approaches would have solved better.

Next Steps

For implementation, current resources include:

  • Open-source frameworks: Hugging Face Transformers, Axolotl, Unsloth
  • Cloud platforms: AWS SageMaker, Google Vertex AI, Azure ML, Databricks
  • Managed services: OpenAI Fine-tuning API, Hugging Face AutoTrain, Predibase
  • GPU providers: RunPod, Modal, Replicate, Vast.ai

Documentation hubs:

  • Hugging Face documentation for transformers and PEFT
  • Anthropic's model card and documentation pages
  • OpenAI fine-tuning guides
  • Cloud provider ML platform docs

Start small. Experiment with a 7B model and PEFT methods before committing to larger models or full fine-tuning. Validate that fine-tuning is actually needed for your use case - prompt engineering or RAG may suffice. When you do fine-tune, prioritize data quality over quantity, set up proper evaluation from the start, and plan for the full pipeline including deployment and monitoring.

The field continues evolving, but the core principles - understanding what fine-tuning changes versus what it doesn't, choosing the right tool for each problem, and combining approaches strategically - will remain relevant regardless of which specific tools and models dominate in the future.