What is finetuning?
If you're working with Large Language Models (LLMs), chances are that you'll have heard of finetuning as a technique to improve model quality. Even OpenAI's zero shot GPT models have been finetuned to reach such high levels of performance. But what exactly is finetuning?
Finetuning is the process of training a model (in this case a foundational LLM) on domain or task specific data to improve its performance on a downstream task.
When might you want to finetune?
Generally speaking, you might want to finetune in the following cases:
- Knowledge injection (your foundation model doesn’t know things it needs to know)
- Output forming (I need the outputs of the model in a certain format)
- Tone (I want my model to ‘talk’ in a certain way)
- Task fine-tuning (I want my model to chat rather than fill in the gaps)
Despite the generally good performance of many open source foundational LLMs, these models may not perform as well in specific tasks. Finetuning often pops up as the first solution to these situations.
Difficulties of finetuning
While finetuning can be very useful, it presents significant challenges:
- Requires significant GPU resources (alongside associated cost)
- Requires collecting and labelling high quality finetuning data
- Requires specialist skills and infrastructure
- Needs to be done often If the training data changes frequently
We know how challenging finetuning can be, therefore, finetuning your language model should be a last resort, rather than the first thing you should try. So in this article I’m going to explore some alternatives that you can try instead of finetuning.
Using RAG for knowledge injection
One of the key reasons why people decide to finetune is they want their model to reason about things that the base model doesn’t know - so you want to teach the model extra pieces of information.
One alternative to finetuning for the purpose of knowledge injection is RAG (retrieval augmented generated). This is when you give your model the ability to ‘search’ a knowledge store where you keep all the relevant information - the result from this search is then passed into the model as ‘context’.
This makes the model significantly more accurate and less likely to hallucinate and make things up. Another advantage of using RAG over finetuning is that it allows you to reason about constantly changing information - just by updating the vector database the model will now ‘know’ about the new information.
Why try it?
- Less likely to hallucinate (make things up)
- Provides references to sources
- Allows you to update the information as often as required through the connected vector database
- Might still not be accurate enough in which case finetuning might be needed - but it's a good first pass (or to be used in combination with finetuning)
From our experience at TitanML - RAG performs astonishingly well, especially for enterprise use cases where hallucination is very damaging. The Titan Takeoff RAG Engine (currently in Beta with development partners) is our way of making RAG better for users who want to self-host their language models. The Titan Takeoff RAG Engine is a plug and play way to create a RAG application entirely through self-hosted components so you can build and deploy your RAG application with total privacy and transparency.
Using constrained output for output forming
We often see people wanting to use finetuning for extractive workloads, i.e. when they want to extract information from a document. Typically they want the language model response to be in a predictable JSON format.
Currently there are two options on how to do this; either you can try to use prompting or you can finetune. However, neither of these are ideal since in neither case does it guarantee that the response is in your desired format.
For this finetuning use case we always prefer using constrained output generation instead of finetuning.
Why try it?
- Much easier - all you need to do is write JSON
- Guaranteed to adhere to the JSON schema every time rather than just increasing the probabilities
- You can change the schema whenever you want with no extra training
- Requires more specific prompting including context
- Still an active area of research
We have built JSON and Regex controlled generation into our Titan Takeoff Inference Server, so all of our clients can do this kind of controlled generation in a foolproof and easy way. As you can see in the GIF above, all that needs to be done is to specify a regex string. This is perfect for extractive workloads which our clients love!
Using a better model and prompt engineering for tone and task finetuning
As a general rule of thumb, the bigger your model is, the better it is at following instructions. Therefore, you might be able to go a long way just with prompt engineering and using a better model. For example, if I want my model to speak in a pirate voice, it might be much easier to get GPT-4 to do this than using a LLaMA-2 7B model.
However, in this case it should be considered whether you have any deployment requirements - for example, if you require inference on CPU then going to the extra effort of finetuning might be worth it.
We’ve tried to make this process of trying out different models and prompts as easy as possible in the Titan Takeoff Inference Server - you can get to inference in just a single line of code - allowing you to try out dozens of different models and prompts easily.
Why try it?
- Very easy to try out (especially with theTitan Takeoff Inference Server) - worth a shot!
- The model you deploy is bigger and more expensive than the model you would have deployed alternatively
- Might not work well enough
So, in turns out that there are some alternatives to finetuning! Not all of them are guaranteed to work every time, however they are usually simpler and should always be tried as a first pass before trying to collect all of that data needed for finetuning and set up all of that infrastructure! Happy building!