Boosting LLM Performance: Low-Effort Strategies That Beat Fine-Tuning
What is Fine-tuning?
If you're working with Large Language Models (LLMs), chances are that you'll have heard of fine-tuning as a technique to improve model quality. Even OpenAI's zero-shot GPT models have been fine-tuned to reach such high levels of performance. But what exactly is fine-tuning?
Fine-tuning is the process of training a model (in this case a foundational LLM) on domain or task specific data to improve its performance on a downstream task.
When might you want to fine-tune?
Generally speaking, you might want to fine-tune in the following cases:
- Knowledge injection (your foundation model doesn’t know things it needs to know)
- Output forming (I need the outputs of the model in a certain format)
- Tone (I want my model to ‘talk’ in a certain way)
- Task fine-tuning (I want my model to chat rather than fill in the gaps)
Despite the generally good performance of many open-source foundational LLMs, these models may not perform as well in specific tasks. Fine-tuning often pops up as the first solution to these situations.
Difficulties of Fine-tuning
While fine-tuning can be very useful, it presents significant challenges:
- Requires significant GPU resources (alongside associated cost)
- Requires collecting and labelling high quality fine-tuning data
- Requires specialist skills and infrastructure
- Needs to be done often If the training data changes frequently
We know how challenging fine-tuning can be, therefore, fine-tuning your language model should be a last resort, rather than the first thing you should try. So in this article I’m going to explore some alternatives that you can try instead of fine-tuning.
Using RAG for Knowledge Injection
One of the key reasons why people decide to fine-tune is they want their model to reason about things that the base model doesn’t know - so you want to teach the model extra pieces of information.
One alternative to fine-tuning for the purpose of knowledge injection is RAG (Retrieval Augmented Generated). This is when you give your model the ability to ‘search’ a knowledge store where you keep all the relevant information - the result from this search is then passed into the model as ‘context’.
This makes the model significantly more accurate and less likely to hallucinate and make things up. Another advantage of using RAG over fine-tuning is that it allows you to reason about constantly changing information - just by updating the vector database the model will now ‘know’ about the new information.
Why try it?
- Less likely to hallucinate (make things up)
- Provides references to sources
- Allows you to update the information as often as required through the connected vector database
- Might still not be accurate enough in which case fine-tuning might be needed - but it's a good first pass (or to be used in combination with fine-tuning)
From our experience at TitanML - RAG performs astonishingly well, especially for enterprise use cases where hallucination is very damaging. The Titan Takeoff RAG Engine (currently in Beta with development partners) is our way of making RAG better for users who want to self-host their language models. The Titan Takeoff RAG Engine is a plug and play way to create a RAG application entirely through self-hosted components so you can build and deploy your RAG application with total privacy and transparency.
Using Constrained Output for Output Forming
We often see people wanting to use fine-tuning for extractive workloads, i.e. when they want to extract information from a document. Typically they want the language model response to be in a predictable JSON format.
Currently there are two options on how to do this; either you can try to use prompting or you can fine-tune. However, neither of these are ideal since in neither case does it guarantee that the response is in your desired format.
For this fine-tuning use case we always prefer using constrained output generation instead of fine-tuning.
Why try it?
- Much easier - all you need to do is write JSON
- Guaranteed to adhere to the JSON schema every time rather than just increasing the probabilities
- You can change the schema whenever you want with no extra training
- Requires more specific prompting including context
- Still an active area of research
We have built JSON and Regex controlled generation into our Titan Takeoff Inference Server, so all of our clients can do this kind of controlled generation in a foolproof and easy way. As you can see in the GIF above, all that needs to be done is to specify a regex string. This is perfect for extractive workloads which our clients love!
Using a better model & Prompt Engineering for Tone and Task-fine tuning
As a general rule of thumb, the bigger your model is, the better it is at following instructions. Therefore, you might be able to go a long way just with prompt engineering and using a better model. For example, if I want my model to speak in a pirate voice, it might be much easier to get GPT-4 to do this than using a LLaMA-2 7B model.
However, in this case it should be considered whether you have any deployment requirements - for example, if you require inference on CPU then going to the extra effort of fine-tuning might be worth it.
We’ve tried to make this process of trying out different models and prompts as easy as possible in the Takeoff Server - you can get to inference in just a single line of code - allowing you to try out dozens of different models and prompts easily.
Why try it?
- Very easy to try out (especially with Titan Takeoff Server) - worth a shot!
- The model you deploy is bigger and more expensive than the model you would have deployed alternatively
- Might not work well enough
So, in turns out that there are some alternatives to fine-tuning! Not all of them are guaranteed to work every time, however they are usually simpler and should always be tried as a first pass before trying to collect all of that data needed for fine-tuning and set up all of that infrastructure! Happy building!