This so-called AI revolution has been anticipated for years now, but for the first time it’s felt truly on the horizon. I’m no longer asking people to ‘imagine a world in which…’ but rather, I’m just asking them to look at the latest product releases coming out of Microsoft and Google.
This AI revolution will touch everything that we do: we won't be able to go more than five minutes on our devices without encountering some kind of AI or NLP system.
However, for this change to be as revolutionary as we expect, we need to face what I think is one of the biggest roadblocks: the availability of compute, specifically GPUs.
What are GPUs? And why do they matter?
AI is essentially just maths — really really really big maths problems. And to run any maths problem you need a calculator, and to run AI you need a really really really big calculator — otherwise known as compute. No compute, no AI.
Now you can run AI on different types of compute, but most of the most powerful AI runs on GPUs (Graphics Processing Units). So for businesses running powerful AI models (like language models), or with tight latency requirements, inference on a GPU is necessary.
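To make "really big maths" concrete, here's a back-of-envelope sketch using a common rule of thumb (roughly 2 floating-point operations per model parameter per generated token; the figures are rough estimates, not measurements):

```python
# Rough rule of thumb: a transformer forward pass costs about 2 FLOPs
# per parameter per generated token. This is a back-of-envelope estimate.
def inference_flops_per_token(num_params: float) -> float:
    return 2 * num_params

gpt3_params = 175e9  # GPT-3 has roughly 175 billion parameters
print(f"{inference_flops_per_token(gpt3_params):.1e} FLOPs per token")
# → 3.5e+11 FLOPs per token
```

Hundreds of billions of operations for every single token generated: that is the "really really really big calculator" problem, and it's why massively parallel hardware like GPUs matters.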
How is GPU demand evolving? *
*In the AI space (clearly GPUs are used for lots of other things too!)
Training a large language model (LLM) is famously one of the most compute-intensive things you can do. Training GPT-3, the base model behind the original ChatGPT, is estimated to have required 1,287 megawatt-hours of electricity, or about as much as 120 US homes consume in a year.
This does require an enormous cluster of GPUs, and a lot of the discussion around GPU demand focuses on training. This significant GPU requirement also has an impact on how 'open' AI is, since only a few very well-capitalised companies have the infrastructure or capital to train these models. However, in terms of mass adoption, I think the focus on training GPU demand is somewhat misplaced.
Training an LLM is the kind of thing that happens a few times a year in a handful of companies, and that model can then be used millions of times across different applications. So the training cost, amortised over the number of uses, tends towards zero. For example, if it takes a few thousand GPUs to train GPT-3, but the businesses who want to host that model number in the tens or hundreds of thousands, then per business you are looking at a pretty small fraction of a GPU. Not nothing, but also not the biggest deal.
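The amortisation argument above can be sketched in two lines (the GPU and business counts are illustrative assumptions, not measured figures):

```python
# Back-of-envelope amortisation of training compute. The numbers here are
# illustrative assumptions only, matching the "few thousand GPUs, tens of
# thousands of adopters" scenario in the text.
def amortised_gpus_per_business(training_gpus: int, num_businesses: int) -> float:
    """Training GPUs divided across every business reusing the model."""
    return training_gpus / num_businesses

share = amortised_gpus_per_business(training_gpus=5_000, num_businesses=50_000)
print(f"{share:.2f} GPUs per business")  # → 0.10 GPUs per business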
Commercial fine-tuning & Inferencing LLMs
Commercial fine-tuning and inference of LLMs is where I see GPU demand growing the fastest. Now that we have powerful AI and LLMs that are genuinely useful (all that pre-training has done its job!), every man and his dog wants to start integrating AI applications into their business, as the rapid growth of OpenAI wrapper companies after ChatGPT's release easily demonstrates.
However, if the future that all of these companies are building towards comes to fruition, we won't be able to go more than about five minutes without interacting with an LLM in some form, whether through predictive text or auto-transcription. Handling that kind of adoption will require an enormous amount of compute.
We are already starting to see this enormous demand. OpenAI's paid version of ChatGPT, which promised consistent uptime, has been 'unavailable' a number of times in the last few weeks because of demand and, presumably, lack of compute. If this is happening to OpenAI, with Azure's backing, right at the beginning of this AI S-curve, imagine what it will look like over the coming months and years as usage only increases.
The impact this will have
The demand for GPUs is growing exponentially, but supply, while growing, is nowhere near keeping pace. This is causing a good-ole-fashioned supply-and-demand problem.
But why am I concerned about this?
- Makes AI 'exclusive': whenever supply can't meet demand, we inevitably see large price increases. If this happens, AI will become exclusive in two senses. Firstly, it will be limited to high-value use-cases, ie those where the benefit outstrips the cost; this isn't the biggest deal, but it does stifle innovation. Secondly, it becomes exclusive in a more traditional sense: rich people and rich companies get the benefit of this new technology, while poorer people don't.
- AI models get slower and more expensive, and AI applications get buggier: this is the impact we are already starting to see with OpenAI's services, where the models and requests are too much for the hardware allocated to them.
- Concentrates AI power in the hands of a few (those with lots of GPUs, ie those with relationships with Azure, GCP, and AWS). This is part of the reason we have seen very large deals between foundation model providers (eg OpenAI) and cloud providers (eg Azure). If only a few players have access to the compute required to run this powerful AI, then all of the power of this new AI era is concentrated in the hands of a few. I don't like that at all.
What can we do about this?
Luckily, there is lots that we can do to reduce the reliance on super expensive, powerful GPUs.
Use the right models for your task
The most powerful AI models (ie GPT-4-style models) are only really required in a handful of use-cases. For the majority of business use-cases, you'll get equally good (sometimes better) performance from a much, much smaller, more resource-efficient model that is fine-tuned on good-quality data.
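One way to put this into practice is a simple model router that defaults to the small model and escalates only when the task demands it. The model names and task taxonomy below are hypothetical, purely for illustration:

```python
# Hypothetical model router: send routine tasks to a small, cheap,
# fine-tuned model and reserve the large GPU-hungry model for the few
# tasks that genuinely need it. Names here are illustrative, not real APIs.
SMALL_MODEL = "distilled-classifier"  # fine-tuned, runs on modest hardware
LARGE_MODEL = "frontier-llm"          # GPT-4-style, expensive to serve

SIMPLE_TASKS = {"sentiment", "topic-classification", "entity-extraction"}

def pick_model(task: str) -> str:
    """Default to the efficient model; escalate only when necessary."""
    return SMALL_MODEL if task in SIMPLE_TASKS else LARGE_MODEL
```

With routing like this, the bulk of traffic never touches the scarce, expensive hardware at all.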
Practice good utilisation of compute resources
There are good hygiene practices you can follow to ensure you are using only the compute resources your particular task actually needs; this avoids wasting any of that precious GPU time!
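One common hygiene practice is request batching: instead of sending prompts to the GPU one at a time (leaving it mostly idle), group them so each forward pass does more useful work. A minimal sketch, with an assumed batching policy and sizes:

```python
# Minimal request-batching sketch: group queued prompts so the accelerator
# processes several per call instead of one at a time. The batch size and
# queue contents are illustrative assumptions.
from collections import deque

def drain_batches(queue: deque, max_batch_size: int = 8):
    """Yield batches of requests, keeping each GPU call well utilised."""
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch_size, len(queue)))]
        yield batch

requests = deque(f"prompt-{i}" for i in range(20))
print([len(b) for b in drain_batches(requests)])  # → [8, 8, 4]
```

Production serving stacks do this dynamically (with timeouts so latency doesn't suffer), but the principle is the same: fewer, fuller GPU calls.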
Compress your models and optimise for the hardware
There are loads of ways to reduce the computational complexity of your models through compression, optimisation, and acceleration techniques. When you throw the kitchen sink at this you can get 20–50x speed-ups! However, most of these techniques are locked away in research labs and rarely get used in commercial settings, largely because they are so difficult to apply. TitanML automates this entire process so businesses can benefit from these kinds of techniques; previous clients have achieved 96% cost reductions by moving from very expensive GPUs to legacy GPUs. Result!
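To give a flavour of one such compression technique, here is a toy sketch of post-training quantisation: mapping float32 weights to int8, which shrinks memory roughly 4x. Real toolchains do this per-tensor with calibration data; this pure-Python version just shows the idea:

```python
# Toy post-training quantisation: represent float weights as int8 values
# plus one shared scale factor, so weights ≈ q * scale. Illustrative only;
# production quantisers work per-tensor/per-channel with calibration.
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantisation into the range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantise(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 1.0]
q, s = quantise_int8(w)
approx = dequantise(q, s)  # close to the originals, at roughly 1/4 the size
```

Smaller weights mean less memory traffic per token, which is often what lets a model move from a top-end GPU to a cheaper, older one.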
Availability of GPUs is a real issue, and I think it'll get worse before it gets better. However, there are lots of best practices we can apply to reduce compute consumption when running AI, improving latencies and reducing costs.
TitanML enables machine learning teams to effortlessly and efficiently deploy large language models (LLMs). Their flagship product, the Takeoff Inference Server, is already supercharging the deployments of a number of ML teams.
Founded by Dr. James Dborin, Dr. Fergus Finn and Meryem Arik, and backed by key industry partners including AWS and Intel, TitanML is a team of dedicated deep learning engineers on a mission to supercharge the adoption of enterprise AI.
Written by Meryem Arik, Co-founder of TitanML.