General Purpose LLMs Suck!


Advancements Beyond GPT-4: Cutting-Edge Models in Specialized Domains

The US-based clinical AI startup Saama recently made waves by announcing the OpenBioLLM-Llama3-70B and 8B models, which it describes as the “Most openly available medical-domain LLMs to date”. These models mark a significant step forward for specialized language models, particularly in the biomedical domain. According to Saama's AI Research Lab, which developed them, the models outperform industry giants such as OpenAI's GPT-4 and Google's Med-PaLM series, as well as Google's Gemini and the open-source Meditron-70B. The lab used a two-phase fine-tuning approach to tailor the Llama3-70B and Llama3-8B models to biomedical tasks: first training them on extensive biomedical data, then refining them further with Direct Preference Optimization (DPO). As a result, the OpenBioLLM models are reported to outperform their predecessors and competitors across nine different benchmarks in the medical domain.

The key takeaway? A foundational model that has been fine-tuned and domain-adapted for a particular use case carries more context relevant to that task, and so it can perform much better on that task than a general-purpose model.

One more example helps establish this premise, which we’ll expand on throughout this article.

The SQLCoder-7B-2 model, developed by Defog, Inc., is a large language model designed specifically for natural-language-to-SQL generation. It is based on CodeLlama-7B and fine-tuned to generate SQL queries from natural language prompts, and its developers report that it outperforms the base GPT-4 on this task despite having a much smaller compute footprint. The model is intended as an analytics tool that helps non-technical users understand the data in SQL databases by translating their natural language questions into SQL queries. To use the model effectively, users should provide prompts in a specific format that includes the task description, the database schema, and the question to be answered. The model then generates the corresponding SQL query.
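As a rough illustration of the prompt structure described above, the sketch below assembles the three parts (task description, schema, question) into a single string. The section markers, the `build_sql_prompt` helper, and the example schema are assumptions made for illustration only; the exact template the model expects should be taken from its model card.

```python
def build_sql_prompt(question: str, schema_ddl: str) -> str:
    """Assemble a text-to-SQL prompt from a question and a schema (hypothetical layout)."""
    return (
        "### Task\n"
        f"Generate a SQL query to answer the following question: {question}\n\n"
        "### Database Schema\n"
        f"{schema_ddl.strip()}\n\n"
        "### Answer\n"
    )


# Hypothetical schema and question, used only to show the shape of the prompt.
schema = """
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    total NUMERIC,
    created_at DATE
);
"""

prompt = build_sql_prompt("What was the total revenue in 2023?", schema)
print(prompt)
```

The string returned here is what would be sent to the model; the generated text that follows the final section would then be read back as the SQL query.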

In this post, we’ll cover why domain adaptation and fine-tuning matter, along with some of the currently popular trends in the LLM space.

What exactly is LLM Fine-Tuning?

Fine-tuning in the context of Large Language Models (LLMs) is a process where a pre-trained foundational model, such as Llama3-70B, is further trained on task-specific or domain-specific data to adapt it to a particular task or domain. This process involves updating the parameters of the pre-trained model using a smaller dataset that is relevant to the target task or domain.

Fine-tuning is crucial because it allows LLMs to leverage the knowledge gained from large-scale pre-training while customizing it to specific applications, thus enhancing their performance and applicability in real-world scenarios. It starts with pre-training, where an LLM learns general language patterns from vast amounts of text. A target task or domain is then chosen, such as sentiment analysis or medical diagnosis, and task-specific data is collected, typically a much smaller dataset than the pre-training corpus. The pre-trained model undergoes further training on this dataset using supervised learning, adjusting its parameters to learn task-specific patterns. Finally, the fine-tuned model is evaluated on a separate, held-out dataset to gauge its effectiveness. This cycle of data collection, training, and evaluation can be repeated until the model performs the specialized task accurately enough.
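The workflow above can be sketched at toy scale. The example below is not an LLM: it uses a two-parameter linear model as a stand-in, with made-up "pre-trained" values and a tiny made-up task dataset, purely to show the sequence of starting from existing parameters, training further on task data, and evaluating on held-out data.

```python
def predict(w: float, b: float, x: float) -> float:
    return w * x + b


def mse(w: float, b: float, data) -> float:
    """Mean squared error of the model on a list of (x, y) pairs."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)


def fine_tune(w: float, b: float, train_data, lr: float = 0.05, epochs: int = 500):
    """Plain gradient descent on MSE, starting from the pre-trained parameters."""
    n = len(train_data)
    for _ in range(epochs):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in train_data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in train_data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# "Pre-trained" parameters: stand-ins for weights learned on general data.
pretrained_w, pretrained_b = 1.0, 0.0

# Small task-specific dataset following y = 2x + 1, split into train and held-out.
train_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
eval_data = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0)]

before = mse(pretrained_w, pretrained_b, eval_data)
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, train_data)
after = mse(tuned_w, tuned_b, eval_data)
print(f"held-out MSE before: {before:.4f}, after: {after:.6f}")
```

The held-out error drops after the extra training, which is the same before-and-after comparison one would run, with task-appropriate metrics, when fine-tuning a real model.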

Current Fine-Tuning Trends

Let’s take a look at the fine-tuning trends currently popular in the LLM community. For context, full model fine-tuning is often not feasible for LLMs. It involves updating all of the model's weights to create an enhanced version of the foundational model, which demands memory and compute on a scale closer to pre-training than to ordinary inference. While full fine-tuning can lead to impressive performance gains if done right, it poses significant challenges: heavy memory requirements, catastrophic forgetting, and high storage and computational costs, all of which limit its practicality for many applications. Let’s look at these issues in detail to understand why some of the modern fine-tuning techniques are more prevalent and viable.

Resource Intensive Nature

Fine-tuning Large Language Models (LLMs) requires significant memory, both to store the model itself and to hold additional state during training. Whatever the target task, memory must be allocated for gradients, optimizer states, and temporary buffers such as activations. As LLMs continue to grow, with the weights of the largest models reaching hundreds of gigabytes, the memory demands of full fine-tuning become prohibitive. For instance, an unquantized Llama3-70B model has 70B parameters, so its weights alone take about 140 GB in 16-bit precision and cannot fit on a single consumer-grade GPU. Simply loading the model requires a multi-GPU setup (at least 2 x Nvidia A100 80 GB, for instance), full fine-tuning needs several times more memory than that, and training on large amounts of data can take weeks or even months before the model performs well.
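A back-of-the-envelope estimate makes the scale concrete. The sketch below assumes a common mixed-precision setup with the Adam optimizer: 2 bytes per parameter for 16-bit weights, 2 bytes for 16-bit gradients, and 12 bytes for the 32-bit master weights plus the two Adam moment estimates. Activation memory is ignored, so real requirements are higher; the function names are hypothetical.

```python
def weights_only_memory_gb(num_params: float, weight_bytes: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * weight_bytes / 1e9


def full_finetune_memory_gb(num_params: float,
                            weight_bytes: int = 2,
                            grad_bytes: int = 2,
                            optimizer_bytes: int = 12) -> float:
    """Rough memory for weights + gradients + optimizer state, in GB (10^9 bytes)."""
    per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * per_param / 1e9


params_70b = 70e9
print(weights_only_memory_gb(params_70b))   # 140.0
print(full_finetune_memory_gb(params_70b))  # 1120.0
```

Under these assumptions, a 70B-parameter model needs roughly 140 GB just for its weights and on the order of 1.1 TB for full fine-tuning before activations are counted, which is why the work has to be spread across many GPUs.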

Catastrophic Forgetting

Full fine-tuning poses the risk of catastrophic forgetting, where the model's prior knowledge is overwritten or forgotten when fine-tuning for a new task. For example, if a model pre-trained for language translation is then fine-tuned for image captioning, it may forget its ability to translate languages effectively. This can lead to diminished performance on previously learned tasks, such as translation accuracy dropping after fine-tuning for image captioning.

Escalating Storage and Computational Expenses

Adaptation for multiple tasks using full fine-tuning results in the creation of multiple versions of the model, each as large as the original. This incurs substantial storage and computational costs, especially when dealing with a diverse range of tasks. For example, fine-tuning a language model for summarization, translation, and question answering tasks separately may require storing three separate versions of the model, each with the same size as the original model. This consumes significant storage space, and also increases computational overhead during training and inference, making full fine-tuning impractical for many applications.

Domain Shift Issues

Domain shift refers to the problem where the data distribution of the fine-tuning dataset differs significantly from the data the model encounters in its real-world, domain-specific application. This discrepancy can lead to suboptimal performance, as the model may not generalize well to new, unseen data outside of the fine-tuning dataset. The problem becomes even more prominent when the domain’s data changes significantly over time, making the previously learned context outdated and irrelevant.

Now that we are aware of the issues with full model fine-tuning, let us take a look at some of the more efficient techniques available out there.

Efficient Adaptation Techniques for Domain-Specific Model Fine-Tuning

Parameter-Efficient Fine-Tuning (PEFT)

To address these high resource requirements, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a more feasible solution. PEFT techniques reduce the number of parameters that need updating during fine-tuning, which lets large language models be adapted to specific tasks without the cost and resource demands of full fine-tuning. The fundamental principle is to freeze the majority of the model's parameters and train only a small subset, thus reducing the memory footprint and computational load.

Of the various techniques under the PEFT umbrella, Low-Rank Adaptation (LoRA) has seen massive adoption by the open-source community.

How LoRA Works

LoRA freezes the original model weights and introduces a pair of much smaller, low-rank matrices alongside each weight matrix it adapts. When multiplied, these two matrices produce a matrix matching the dimensions of the original weights, and only the small matrices are trained. This approach drastically reduces the number of parameters needing adjustment during fine-tuning, resulting in substantial memory savings and making it feasible to train large models on limited hardware.
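The idea can be shown with tiny hand-written matrices. In the sketch below, `W` stands for a frozen weight matrix, `B` and `A` are the two small trainable matrices, and the update is scaled by `alpha / rank` as in the usual LoRA formulation. All values are made up for illustration, and the helper functions are plain-Python stand-ins for what a tensor library would do.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


# Frozen pre-trained weight W (d_out x d_in); tiny sizes keep the arithmetic visible.
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]

rank, alpha = 1, 2.0
B = [[0.5], [0.0], [-0.5], [1.0]]   # trainable, d_out x rank
A = [[1.0, 0.0, 1.0, 0.0]]          # trainable, rank x d_in

x = [[1.0], [2.0], [3.0], [4.0]]    # one input column vector

# Adapter path: y = W x + (alpha / rank) * B (A x); W itself is never modified.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / rank))

# Merged path: fold the update into a single matrix W' = W + (alpha / rank) * B A.
W_merged = add(W, scale(matmul(B, A), alpha / rank))
y_merged = matmul(W_merged, x)

print(y_adapter)
print(y_merged)
```

Both paths give the same output, which is why a trained adapter can either be kept separate from the base weights or folded into them.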

Key Features

1. Reduction in Trainable Parameters: By focusing on the smaller rank decomposition matrices rather than the entire model, LoRA significantly decreases the number of parameters that need training. This reduction saves memory and allows the use of large models on less powerful hardware.

2. Inference Efficiency: LoRA keeps the computational overhead at inference time minimal. The smaller matrices add little latency, and they can be merged into the original weights after training so that the fine-tuned model runs at the same cost as the base model.

3. Targeted Application to Attention Layers: LoRA is often applied specifically to the self-attention layers of LLMs, which contain a large portion of the model’s parameters. This selective application maximizes parameter reduction without sacrificing performance. While it can be applied to other components, such as feed-forward layers, the most substantial benefits are seen in the attention layers.

For instance, consider fine-tuning a large model for a specific task like sentiment analysis. Instead of updating the entire model, LoRA introduces smaller matrices into the self-attention layers. These matrices, much smaller than the original model weights, are optimized during fine-tuning. As a result, the model achieves high performance on the sentiment analysis task with significantly reduced computational requirements.
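To put a number on the reduction in trainable parameters, the sketch below compares a full weight matrix with its low-rank pair. The 4096 x 4096 projection and rank 8 are illustrative values chosen for the arithmetic, not the configuration of any particular model, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (parameters in the full matrix, parameters in the two LoRA matrices)."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {100 * lora / full:.2f}%")
```

For this single matrix, the adapter trains 65,536 parameters instead of 16,777,216, which is under half a percent of the original count.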

Techniques like multi-LoRA are particularly useful for cost-effective inference scenarios. The core idea is as follows:

The foundational model (say, Llama3-8B) is shared across all tasks, and several LoRA adapters, each fine-tuned for a different domain-specific task, are loaded into memory alongside it. At inference time, the task-specific adapter is simply activated (with the other adapters left inactive) and the response is generated. This significantly reduces memory and compute requirements, as you no longer need a separate copy of the base model running for each task.
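A minimal sketch of the adapter-swapping idea is shown below for a single linear layer. The class name `MultiLoRALinear`, its methods, and the adapter names are hypothetical and do not correspond to any specific serving library; the point is only that one base weight is shared while small adapters are switched per request.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class MultiLoRALinear:
    """One frozen base weight shared by several named low-rank adapters (toy sketch)."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shared, never copied per task
        self.adapters = {}              # name -> (B, A, scaling)
        self.active = None

    def add_adapter(self, name, B, A, scaling=1.0):
        self.adapters[name] = (B, A, scaling)

    def set_active(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        y = matvec(self.base_weight, x)
        if self.active is not None:
            B, A, scaling = self.adapters[self.active]
            delta = matvec(B, matvec(A, x))
            y = [yi + scaling * di for yi, di in zip(y, delta)]
        return y


layer = MultiLoRALinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_adapter("sql", B=[[1.0], [0.0]], A=[[0.0, 1.0]])
layer.add_adapter("medical", B=[[0.0], [1.0]], A=[[1.0, 0.0]])

x = [2.0, 3.0]
base_out = layer.forward(x)

layer.set_active("sql")
sql_out = layer.forward(x)

layer.set_active("medical")
medical_out = layer.forward(x)

print(base_out, sql_out, medical_out)
```

Switching tasks only changes which small pair of matrices is applied on top of the shared base weight, so the memory cost of adding a task is the size of its adapter rather than the size of the model.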

Techniques like PEFT still fall under the category of conventional fine-tuning or transfer learning. Next, we move on to Reinforcement Learning based methods, in which an external feedback signal rewards or penalizes the model's outputs to improve its performance over time. Let’s discuss these methods.

Reinforcement Learning with Human Feedback (RLHF)

Reinforcement Learning with Human Feedback (RLHF) is a method that enhances the performance of pre-trained language models by integrating human evaluations into the training process. This approach leverages human feedback to guide the model towards producing more accurate, reliable, and human-aligned responses.

How RLHF Works

1. Pre-Trained Model : Start with a pre-trained language model that has a basic understanding of natural language.

2. Human Evaluation : Humans assess and rank the model's outputs based on criteria such as accuracy, relevance, and safety. This feedback serves as a crucial signal for the model.

3. Reward Signal : The human rankings are turned into reward signals, typically by training a separate reward model on them, that tell the model which responses are preferable.

4. Model Adjustment : Using reinforcement learning, the model adjusts its decision-making process (by updating its weights) to maximize these rewards, thus improving its performance on specific tasks.
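The four steps above can be sketched at toy scale. Here the "policy" is just a softmax over three canned responses, the human scores are hard-coded, and the update is the exact gradient of the expected reward. This is a simplification under stated assumptions: production RLHF pipelines usually learn a reward model from rankings and optimize the language model with an algorithm such as PPO, neither of which is shown here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # d E[reward] / d logit_i = p_i * (r_i - E[reward])
    return [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)]


# Step 1: the "pre-trained model" starts with no preference among its outputs.
responses = ["helpful answer", "vague answer", "unsafe answer"]
logits = [0.0, 0.0, 0.0]

# Steps 2 and 3: hypothetical human scores act directly as the reward signal.
human_scores = [1.0, 0.2, -1.0]

# Step 4: repeatedly adjust the policy to increase expected reward.
initial = expected_reward(logits, human_scores)
for _ in range(100):
    logits = policy_gradient_step(logits, human_scores)

final_probs = softmax(logits)
final = expected_reward(logits, human_scores)
print(dict(zip(responses, (round(p, 3) for p in final_probs))))
```

After the updates, most of the probability mass sits on the response the human scores favored, which is the behavior RLHF aims for at full scale.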

Application in Language Models

In the context of large language models, RLHF helps align the model’s outputs with human expectations and requirements. The process involves:

  1. Agent : The language model itself, which learns to generate optimal text.

  2. Action Space : All possible text outputs the model can produce.

  3. State Space : Includes the initial user prompt and the model's subsequent responses.

  4. Reward : Evaluates how well the model's responses align with the intended application and user expectations.

Benefits of RLHF

  1. Enhanced Precision : Improves the model's ability to perform specific tasks accurately.

  2. Bias Mitigation : Incorporates human judgment to address biases in model outputs.

  3. Dependability : Produces responses that are more reliable and secure.

RLHF is a powerful technique that combines the strengths of reinforcement learning and human judgment to refine and enhance the capabilities of large language models, making them more aligned with human values and expectations.

Reinforcement Learning from AI Feedback (RLAIF)

Reinforcement Learning from AI Feedback (RLAIF) is similar to RLHF, except that the human in the feedback loop is replaced by another LLM, guided by a set of ethical and safety principles outlined in a constitution, which evaluates the responses of the agent model to help improve its performance and reliability. This approach aims to retain the benefits of human feedback while improving scalability, reducing subjectivity, and keeping the model aligned with the stated principles.

How RLAIF Works

1. AI Feedback Model : An AI system provides feedback on the outputs generated by another AI model, known as the Response Model.

2. Constitution : A predefined set of ethical and safety principles guides the feedback process, ensuring consistency and alignment with desired standards.

3. Training Loop : The response model generates responses to prompts, which the feedback model evaluates and scores based on the constitution. The response model is then fine-tuned using this AI-generated feedback.
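The data-collection half of this loop can be sketched as follows. In a real system the feedback model would be an LLM prompted with the constitution; here a rule-based `feedback_model` with three hypothetical principles stands in for it so the example runs on its own. The resulting chosen/rejected pairs are the kind of data that would then drive the fine-tuning step.

```python
# Hypothetical constitution: each principle is paired with a simple check.
CONSTITUTION = [
    ("Do not reveal personal data",
     lambda text: "social security number" not in text.lower()),
    ("Be concise",
     lambda text: len(text.split()) <= 20),
    ("Do not give harmful instructions",
     lambda text: "bypass the safety" not in text.lower()),
]


def feedback_model(response: str) -> int:
    """Stand-in for an LLM judge: counts how many principles the response satisfies."""
    return sum(1 for _, check in CONSTITUTION if check(response))


def build_preference_pair(prompt: str, candidates: list[str]) -> dict:
    """Score candidate responses and keep the best and worst as a preference pair."""
    ranked = sorted(candidates, key=feedback_model, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}


# Candidate outputs that the response model might have produced for one prompt.
candidates = [
    "Reset the device from the settings menu.",
    "You can bypass the safety lock by removing the cover.",
]
pair = build_preference_pair("How do I reset my device?", candidates)
print(pair)
```

Because the scoring is automated, the same loop can be run over very large prompt sets without a human rating each response.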

Advantages of RLAIF

1. Scalability : Automating feedback collection with AI eliminates the bottleneck of human involvement, enabling large-scale training.

2. Consistency : AI feedback guided by a constitution reduces variability and subjectivity, ensuring more uniform and predictable training outcomes.

3. Ethical Alignment : The constitution ensures that feedback aligns with predefined ethical and safety standards, enhancing the ethical behavior of the AI.

4. Efficiency : By using AI for feedback, RLAIF reduces the time and resources needed for training compared to human-involved methods.

5. Reduced Human Bias : While human feedback can introduce biases, AI feedback guided by a constitution aims to minimize these biases, leading to fairer and more objective outcomes.

In the context of text summarization using RLAIF, the process begins with the response model (say, Llama3-70B) generating initial summaries for a given set of texts. These summaries are then evaluated by the feedback model (e.g., GPT-4), which assesses them on criteria such as clarity, accuracy, and adherence to the ethical guidelines established in the constitution. The feedback model provides scores and detailed feedback on each summary, indicating areas for improvement. Using this AI-generated feedback, the response model is fine-tuned to enhance its performance. Repeating this process pushes the summaries to be not only concise and informative but also aligned with ethical standards, leading to higher-quality and more reliable summarization.

While the reinforcement-based techniques work well for improving model reliability over time, they still don’t address the issue of outdated context, which needs updating to reduce hallucinations and produce more factually correct, context-aware responses. That’s where techniques like RAFT come into the picture.

Retrieval Augmented Fine-Tuning (RAFT)

RAFT, or Retrieval-Augmented Fine-Tuning, is a method designed to enhance the fine-tuning process for language models, particularly those used in Retrieval-Augmented Generation (RAG) tasks. In RAFT, the training data consists of questions paired with sets of documents, some containing relevant information (oracle documents) and some not (distractor documents), along with chain-of-thought style answers derived from the oracle documents. This training data is usually generated by operating the pre-trained model in a RAG pipeline, generating augmented responses to queries, and then labeling these query-and-response sets on various accuracy and reliability criteria. Finally, the pre-trained model is fine-tuned on this augmented dataset.

This structure enables the model to do two things:

  • Learn to distinguish between useful and irrelevant information when answering questions.

  • Learn from the retrieved context, eventually improving its knowledge base.

How RAFT Works 

1. Question and Document Pairing : Each training data point consists of a question and a set of documents, including both oracle (relevant) and distractor (irrelevant) documents.

2. Chain-of-Thought Style Answers : These answers are generated from the oracle documents and include detailed reasoning processes, aiding the model in understanding and reasoning over the provided context.

3. Mix of Question Types (A*) : The training dataset includes a mix of questions with both oracle and distractor documents, as well as questions with only distractor documents. This enables the model to learn to prioritize relevant information while also handling questions whose provided documents do not contain the answer.
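The three points above describe how each training example is put together, which can be sketched as a small data-construction routine. The function name, the field names, the example documents, and the default `oracle_fraction` of 0.8 are assumptions chosen for illustration, not values taken from a reference implementation.

```python
import random


def build_raft_example(question, oracle_doc, distractor_pool, cot_answer,
                       num_distractors=2, oracle_fraction=0.8, rng=None):
    """Build one training example; the oracle is included only some of the time."""
    rng = rng or random.Random()
    distractors = rng.sample(distractor_pool, num_distractors)
    include_oracle = rng.random() < oracle_fraction
    documents = distractors + ([oracle_doc] if include_oracle else [])
    rng.shuffle(documents)  # the oracle's position should not be a giveaway
    return {
        "question": question,
        "documents": documents,
        "answer": cot_answer,
        "has_oracle": include_oracle,
    }


# Hypothetical documents and answer, used only to exercise the function.
oracle = "Policy 12: refunds are issued within 14 days of a return."
pool = [
    "Policy 3: shipping takes 5 business days.",
    "Policy 7: gift cards do not expire.",
    "Policy 9: stores open at 9 am.",
]
answer = "Policy 12 states refunds are issued within 14 days, so the answer is 14 days."

rng = random.Random(0)
examples = [
    build_raft_example("How long do refunds take?", oracle, pool, answer, rng=rng)
    for _ in range(200)
]
with_oracle = sum(e["has_oracle"] for e in examples)
print(f"{with_oracle} of {len(examples)} examples include the oracle document")
```

Each resulting dictionary would be serialized into a prompt (question plus documents) and a target (the chain-of-thought answer) for supervised fine-tuning.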

Advantages of RAFT 

1. Enhanced Learning : By training on retrieved context, RAFT helps the model learn to identify and prioritize relevant information for answering questions, improving its accuracy and robustness.

2. Reasoning Chain Formation : The chain-of-thought style answers encourage the model to form reasoning chains using segments from the oracle documents, leading to more coherent and informative answers.

3. Flexible Training : RAFT allows for flexible training across various domains by adapting the fine-tuning process to incorporate both domain-specific information and general knowledge, making it suitable for a wide range of applications.

RAFT proves to be a great strategy for getting started with a pre-trained foundational model and then gradually improving the model’s domain-specific internal context over time.


Adapting large language models (LLMs) to specific domains via fine-tuning is significantly more effective than relying on general-purpose models. General models, though powerful, often lack the nuanced understanding required for specialized tasks. Techniques such as Parameter-Efficient Fine-Tuning (PEFT), Reinforcement Learning with Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF) and Retrieval Augmented Fine-Tuning (RAFT) have emerged as pivotal in bridging this gap. PEFT methods, like Low-Rank Adaptation (LoRA), allow for efficient fine-tuning by focusing on a small subset of parameters, dramatically reducing computational costs and resource requirements. This makes fine-tuning more accessible and practical, especially when working with large models. RAFT, RLHF and RLAIF are aimed at iteratively improving the model’s performance over time.

Through these advanced techniques, domain-specific fine-tuning not only improves model performance but also ensures outputs are aligned with specific needs and ethical standards. As such, fine-tuning large language models for domain adaptation offers a more targeted, efficient, and ethical approach to leveraging AI for specialized applications, outperforming general-purpose models in delivering precise and contextually relevant results.


2024 ScaleGenAI All rights reserved.