We are delighted to announce Titan Takeoff 0.7.0 to our clients. This release comes with a number of features that allow our users to build more scalable, higher-throughput systems.
Continuous batching is an algorithm that increases the throughput of LLM serving. It allows the batch your ML model is working on to grow and shrink dynamically over time: finished requests leave the batch immediately and waiting requests take their place, so responses are served to users more quickly under high load, dramatically improving throughput. More info here (https://docs.titanml.co/docs/next/titan-takeoff/pro-features/batching).
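To make the idea concrete, here is a minimal sketch of the scheduling pattern behind continuous batching. This is an illustration of the general technique, not Takeoff's actual implementation: sequences that finish generating leave the batch mid-flight, and queued requests are admitted into the freed slots instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching(requests, max_batch_size):
    """Simulate continuous batching.

    requests: list of (request_id, tokens_to_generate) tuples.
    Returns the batch composition at each decode step.
    """
    waiting = deque(requests)
    active = {}   # request_id -> tokens still to generate
    steps = []    # which requests were in the batch at each step

    while waiting or active:
        # Admit waiting requests into any free batch slots.
        while waiting and len(active) < max_batch_size:
            rid, n_tokens = waiting.popleft()
            active[rid] = n_tokens
        steps.append(sorted(active))

        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # finished: slot is freed immediately

    return steps

# "a" finishes after 1 token, so "c" joins the batch at step 2
# rather than waiting for "b" to finish as well.
steps = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)
```

With static batching, "c" would have had to wait until both "a" and "b" completed; here it starts decoding as soon as "a"'s slot frees up, which is where the throughput gain comes from.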
Previously, the batching implementation in the Titan Takeoff Inference Server meant that only requests with the same generation parameters (including JSON schemas and regex strings) could be batched together. This release removes that restriction, making it easier to deploy multiple applications at scale.
This release allows requests to be cancelled in flight. No more waiting for the playground to finish processing a request you don't care about. This was a much-requested feature!
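From a client's point of view, in-flight cancellation usually means abandoning a streaming response part-way through. The sketch below shows the general pattern using `asyncio`; `stream_tokens` is a stand-in for any async token stream and is not part of the Takeoff API.

```python
import asyncio

async def stream_tokens():
    # Stand-in for a streaming generation endpoint (hypothetical,
    # not the Takeoff API): yields one token at a time.
    for i in range(100):
        await asyncio.sleep(0.01)  # simulated per-token latency
        yield f"token-{i}"

async def collect(out):
    async for token in stream_tokens():
        out.append(token)

async def main():
    received = []
    task = asyncio.ensure_future(collect(received))
    await asyncio.sleep(0.05)
    task.cancel()  # abandon the request in flight
    try:
        await task
    except asyncio.CancelledError:
        pass  # cancellation is expected here
    return received

tokens = asyncio.run(main())
```

The server-side half of the feature is the interesting part: once the client disconnects, the server can evict the sequence from the batch so no further compute is wasted on a response nobody is waiting for.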
We have made a number of changes to our multi-GPU offering, which is required when deploying larger models. This release improves the performance of that backend.
- Licence keys - the Titan Takeoff Inference Server now has a new way of distributing licence keys
- Better error handling in the frontend
We are excited to get this release into our customers' hands, so our clients always have access to the best technology and can move forward with confidence. We have lots of exciting features and releases under development at the moment, so stay tuned for Takeoff 0.8.0!