Deploy Large Language Models on smaller, cheaper hardware with the Takeoff Inference Server

Posted on: August 23, 2023

Introduction

Almost every tech team has been playing with LLMs this year, but deploying them efficiently, affordably, and on readily available GPUs remains a huge challenge. Enter the Takeoff Inference Server: revolutionizing the deployment of LLMs on smaller hardware instances without compromising performance.

The Current Challenge

Deploying LLMs typically demands high-end GPU instances, significant know-how, and time. This not only translates to higher costs, it also constrains time to deployment and scalability. Deploying an LLM (like a decent-sized Llama) at scale requires a huge number of incredibly expensive GPUs, something that is out of reach for most businesses (even when those GPUs are available)!

The Takeoff Inference Server: LLM Performance on Smaller & Cheaper Hardware

The Takeoff Inference Server brings cutting-edge techniques to the table to make deploying LLMs the easiest part of the development process:

Diving Deep

  1. Broader deployment options: Deploy your models on cheaper and more readily available hardware instances (even CPUs!), realizing a compute cost reduction of 4–20x.
  2. Improved model latency: Achieve up to a 4x latency reduction, ensuring real-time inference and an enhanced user experience.
  3. Ultimate scalability: Boosted throughput, thanks to a hyper-efficient Rust server, ensures that you can handle more queries, faster, whether it is 10 or 10 million.
  4. Super-fast experimentation: Developers can prototype, test, and deploy their models locally within minutes without getting bogged down in complex configurations.

Deploy Your LLMs to Smaller and Cheaper Hardware

Thanks to the memory compression that is part of the Takeoff Server, we can deploy LLMs to much smaller, cheaper, and more readily available GPU instances. Below you can see some benchmarks of the hardware that we can deploy LLMs to, resulting in 4–20x cost reductions (and making applications far more scalable!)
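To put the memory savings in context, here is a rough, back-of-the-envelope sketch (not an official benchmark) of how much memory the weights of a 7B-parameter model need at different precisions. The numbers cover weights only and ignore activations and the KV cache.

# Rough, illustrative arithmetic: weight memory for a 7B-parameter model at
# different precisions. Weights only; activations and the KV cache add more.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for precision, bytes_per_param in [("fp32", 4.0), ("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"7B model @ {precision}: ~{weight_memory_gb(7, bytes_per_param):.1f} GB of weights")

# fp16 weights (~14 GB) barely fit on a 16 GB card once activations are added,
# while int4 weights (~3.5 GB) leave headroom even on small GPU or CPU instances.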

Try It Yourself

The community edition of the Takeoff Inference Server is open-source and available for everyone to try just by running the following commands:

# Install the Iris CLI, which is used to launch the Takeoff Server
pip install titan-iris
# Start the server with a model from the Hugging Face Hub, here running on CPU
iris takeoff --model tiiuae/falcon-7b-instruct --device cpu

You can check out the docs here and start running inference on your LLM with a few lines of code to see the difference for yourself!
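For example, once the server is running you can query it from Python. The snippet below is a minimal sketch that assumes the default local port (8000) and a /generate endpoint; check the docs linked below for the exact route and payload used by your version.

# Minimal sketch: querying a locally running Takeoff Server from Python.
# Assumes the server started by `iris takeoff` is listening on localhost:8000
# and exposes a /generate endpoint; verify the route and payload in the docs.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={"text": "List three reasons to compress large language models."},
)
response.raise_for_status()
print(response.json())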

The pro edition of the Takeoff Server is loved by businesses that want to deploy efficiently at scale; reach out to us to get started with a trial!

Docs: https://docs.titanml.co/docs/titan-takeoff/getting-started

Discord: https://discord.gg/83RmHTjZgf


About TitanML

TitanML enables machine learning teams to effortlessly and efficiently deploy large language models (LLMs). Its flagship product, the Takeoff Inference Server, is already supercharging the deployments of a number of ML teams.

Founded by Dr. James Dborin, Dr. Fergus Finn and Meryem Arik, and backed by key industry partners including AWS and Intel, TitanML is a team of dedicated deep learning engineers on a mission to supercharge the adoption of enterprise AI.

