NEW RELEASE: Deploy the Llama 3.1 herd in your private environment

Best-in-breed technology powers the Takeoff Engine.

The Titan Takeoff Engine powers the Titan Takeoff Inference Server, so our clients can take for granted that they are always using the best inference techniques.

Accelerate model inference by 3-12x
Response Caching

Titan Takeoff gives models access to their previous outputs, so they can fast-forward through work they have already done and respond quickly to similar requests, even if the requests aren't identical.
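The text above does not say how cached outputs are matched to new requests. One common approach, assumed here purely for illustration, is to reuse the work already done for the longest token prefix shared with an earlier request. `PrefixCache` and its methods are hypothetical names, not part of the Takeoff API.

```python
# Toy sketch of prefix reuse; PrefixCache is a hypothetical name, not Takeoff's API.

class PrefixCache:
    def __init__(self):
        self._entries = []  # token sequences seen so far

    def longest_prefix(self, tokens):
        """Length of the longest prefix shared with any cached sequence."""
        best = 0
        for entry in self._entries:
            n = 0
            for a, b in zip(entry, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    def process(self, tokens):
        """Return how many tokens need fresh computation, then cache the request."""
        reused = self.longest_prefix(tokens)
        self._entries.append(list(tokens))
        return len(tokens) - reused


cache = PrefixCache()
system = "You are a helpful assistant .".split()
first = cache.process(system + "Summarise this report".split())
second = cache.process(system + "Translate this report".split())
print(first, second)  # prints: 9 3
```

The second request differs from the first, yet only its last three tokens need new work because the shared system prompt is reused.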

Speculative Decoding

Titan Takeoff natively uses speculative decoding: a smaller model drafts responses, and a larger model validates and corrects them. Use up to a 10x bigger model with no extra compute resources, blending efficiency with accuracy. Get the best of both worlds: speed and reliability.
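The draft-then-verify loop can be sketched with two toy "models" that map a token sequence to the next token. This is a minimal greedy variant under stated assumptions, not Takeoff's implementation: `target_next`, `draft_next` and `speculative_decode` are hypothetical names, and a real system would score all drafted positions in a single forward pass of the large model.

```python
# Greedy speculative decoding sketch; all names here are illustrative.

def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Return the same tokens as target-only greedy decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The small draft model proposes up to k tokens.
        proposal = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2. The large target model checks each proposed position.
        for i, tok in enumerate(proposal):
            expected = target_next(out + proposal[:i])
            if tok != expected:
                out.extend(proposal[:i])  # keep the accepted prefix
                out.append(expected)      # correct the first wrong token
                break
        else:
            out.extend(proposal)          # every drafted token was accepted
    return out


def greedy_decode(target_next, prompt, max_new):
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def target_next(seq):   # toy large model: count upwards modulo 10
    return (seq[-1] + 1) % 10

def draft_next(seq):    # toy small model: same, but wrong after a 4
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


spec = speculative_decode(target_next, draft_next, [0], max_new=8)
ref = greedy_decode(target_next, [0], max_new=8)
print(spec == ref)  # prints: True
```

Because the large model has the final say on every token, the draft model only affects speed, never the output.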

Flash attention

Flash attention dramatically improves transformer inference speeds, especially for long input sequences. Looking for quick and efficient model performance? We’ve got you covered. 
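The core trick behind flash attention is to process keys and values in tiles while keeping a running softmax, so the full score matrix never has to be stored. The pure-Python sketch below shows that idea for a single query and checks it against the naive computation. It illustrates the arithmetic only and is not a GPU kernel; the function names are made up for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Softmax attention that materialises every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Same result, computed one block at a time with a running softmax."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.9, -0.3], [0.4, 0.4], [-0.5, 1.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))  # prints: True
```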

CUDA graphs

Massively increase language model inference speeds by capturing the model's GPU operations once and replaying them as a single graph, so the CPU no longer has to launch each operation individually.

Fused Triton Kernels

Many operations common in LLMs can be combined into a single GPU kernel to make them many times faster. Takeoff uses custom kernels written in the Triton DSL, which also support accelerated inference on non-NVIDIA hardware out of the box.

Throughput Optimisations
Confidently scale applications for production
Continuous batching

Queued requests can be inserted into running batches, minimising the time your requests spend waiting to be processed. This is crucial for keeping GPU utilisation high and making the most of every dollar spent on GPUs.
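A small simulation shows why this matters. It assumes each request needs a fixed number of decode steps and that the GPU runs one step for the whole batch at a time. The function names and the workload are illustrative assumptions, not Takeoff's scheduler.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every step."""
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Requests on their final step leave the batch; the rest count down.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 3, 1, 4, 2]  # decode steps needed per request
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
# prints: 15 10
```

In this toy workload the continuous scheduler keeps both slots busy on every step, which is why it finishes in 10 steps instead of 15.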

Model Sharding For Multi-GPU

Models are sharded across GPUs using tensor parallelism. This is perfect if you want to run large models that don't fit on a single GPU while maximising per-token speed.
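As a rough picture of tensor sharding, the sketch below splits a weight matrix by columns, lets each "device" compute its own slice of the output, and concatenates the slices. Plain Python lists stand in for GPU tensors, and the helper names are assumptions made for this example.

```python
def matvec(x, weight):
    """Compute x @ weight, where weight has len(x) rows."""
    cols = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(cols)]

def shard_columns(weight, num_shards):
    """Split the columns evenly (assumes the column count divides evenly)."""
    size = len(weight[0]) // num_shards
    return [[row[s * size:(s + 1) * size] for row in weight] for s in range(num_shards)]

def sharded_matvec(x, shards):
    out = []
    for shard in shards:  # each iteration stands in for one GPU
        out.extend(matvec(x, shard))
    return out


x = [1, 2, 3]
weight = [[1, 0, 2, 1],
          [0, 1, 1, 3],
          [4, 2, 0, 1]]
shards = shard_columns(weight, num_shards=2)
print(sharded_matvec(x, shards) == matvec(x, weight))  # prints: True
```

Each shard holds only half of the weights, so a layer too large for one device can still be evaluated, and the slices can be computed in parallel.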

Multi-threaded Rust server

You never want the server to get in the way of lightning-fast model inference. The lightweight Takeoff Server adds minimal overhead and maintains high performance even under heavy load.

GPU Utilisation
Improve GPU Utilisation when deploying multiple models
Multi model serving

Use the same GPU to serve multiple models, provided they fit together in GPU memory. Perfect for multi-model applications like RAG where models are used asynchronously. Never leave your GPUs idle!

Batched LoRA

Takeoff lets you serve hundreds of fine-tuned models for the cost of just one by deploying the fine-tuned LoRA adapters onto a single Takeoff Server. This significantly reduces infrastructure requirements, especially in centrally managed deployments.
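The reason many fine-tunes can share one server is that a LoRA adapter is a small low-rank update added on top of shared base weights. The sketch below applies a different adapter to each request in a batch while reusing one base matrix. The adapter names and the `batched_lora` helper are hypothetical, and plain lists stand in for tensors.

```python
def matvec(x, weight):
    """Compute x @ weight, where weight has len(x) rows."""
    cols = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(cols)]

def batched_lora(batch, base, adapters):
    """batch: list of (input vector, adapter id). Output is x @ base + (x @ A) @ B."""
    outputs = []
    for x, adapter_id in batch:
        y = matvec(x, base)              # shared base weights, stored once
        a, b = adapters[adapter_id]      # small per-fine-tune matrices
        delta = matvec(matvec(x, a), b)  # rank-1 update in this toy example
        outputs.append([yi + di for yi, di in zip(y, delta)])
    return outputs


base = [[1, 0], [0, 1]]
adapters = {
    "legal":   ([[1], [0]], [[0, 1]]),
    "medical": ([[0], [1]], [[2, 0]]),
}
batch = [([1, 2], "legal"), ([1, 2], "medical")]
print(batched_lora(batch, base, adapters))  # prints: [[1, 3], [5, 2]]
```

The same input produces different outputs depending on the adapter, while the large base matrix is held in memory only once.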

Minimise unpredictable model outputs with model controllers
JSON and Regex controller

Constrain the model's output to match a given JSON schema or regular expression. Confidently build pipelines around your language model without impacting model latency. Perfect for document extraction workflows.
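Controllers of this kind typically work by masking out, at every step, any token that would break the target format. The sketch below is a simplification under stated assumptions: the vocabulary is single characters, the pattern is fixed-length, and the allowed characters per position are written by hand rather than compiled from the regex. `constrained_decode` and `toy_scores` are hypothetical names.

```python
import re

PATTERN = r"\d{4}-\d{2}"  # target format, for example 2024-07
DIGITS = "0123456789"
# Hand-written table of allowed characters per output position (an assumption;
# a real controller would derive this automatically from the pattern).
ALLOWED = [DIGITS] * 4 + ["-"] + [DIGITS] * 2

def constrained_decode(score_fn, vocab):
    out = ""
    for allowed in ALLOWED:
        candidates = [t for t in vocab if t in allowed]  # mask invalid tokens
        out += max(candidates, key=lambda t: score_fn(out, t))
    return out

def toy_scores(prefix, token):
    """Toy model that 'wants' to write 2024/07, which breaks the pattern."""
    wanted = "2024/07"
    pos = len(prefix)
    return 1.0 if pos < len(wanted) and token == wanted[pos] else 0.0


vocab = list(DIGITS + "-/ab")
result = constrained_decode(toy_scores, vocab)
print(result, bool(re.fullmatch(PATTERN, result)))  # prints: 2024-07 True
```

The model's preferred `/` is never emitted because it is masked out at that position, so the output is valid by construction.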

Deploy models to smaller and cheaper GPUs with up to 8x model compression

Compress models using accuracy-preserving quantization (AWQ). Deploy the same model to significantly smaller and cheaper GPUs, or even CPUs.
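To give a feel for where the savings come from, the sketch below stores a group of weights as 4-bit codes plus a scale and offset. It is plain round-to-nearest quantization, not AWQ itself, which additionally uses activation statistics to protect the most important weights. The function names are illustrative. As simple arithmetic, 4-bit codes take an eighth of the space of 32-bit floats and a quarter of 16-bit floats, before counting the small per-group metadata.

```python
def quantize_group(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] with a shared scale and offset."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]


weights = [0.12, -0.40, 0.33, 0.05, -0.18, 0.27, -0.02, 0.41]
codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(max_error <= scale / 2 + 1e-9)  # prints: True
```

The reconstruction error per weight is bounded by half a quantization step, which is the trade-off being made for the smaller footprint.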