TitanTakeoff x LangChain: Supercharged Local Inference for LLMs
With the release of many open source large language models over the past year, developers are increasingly keen to jump on the bandwagon and deploy their own LLMs. However, without specialised knowledge, developers who are experimenting with deploying LLMs on their own hardware may face unexpected technical difficulties. The recent scramble for powerful GPUs has also made it significantly harder to secure sufficient GPU allocation to deploy the best model at the desired latency and scale.
Developers are faced with an unappealing choice between suboptimal applications due to compromises on model size and quality, or costly deployments because of manual optimisations and reliance on expensive GPUs, not to mention wasted time dealing with boring and one-off technical eccentricities.
Titan Takeoff Server
That being said, deploying your own models locally doesn’t have to be difficult or frustrating. The Titan Takeoff Server offers a simple solution for the local deployment of open-source Large Language Models (LLMs,) even on memory-constrained CPUs. With it, users gain the benefits of on-premises inferencing — reduced latency, enhanced data security, cost savings in the long run, and unparalleled flexibility in model customization and integration without additional complexity, not to mention the ability to deploy larger and more powerful models on memory-constrained hardware.
With its lightning fast inference speeds and support on low cost, readily available devices, the Titan Takeoff Server is suitable for developers who need to constantly deploy, test and refine their LLMs. Through the use of state of the art memory compression techniques, the Titan Takeoff Server offers a 10x improvement in throughput, a 3-4x improvement in latency and a 4–20x cost saving through the use of smaller GPUs, in comparison with the base model implementation. In an era where control and efficiency are paramount, the Titan Takeoff Server stands out as an optimal solution for deploying and inferencing LLMs.
Seamless Integration with LangChain
With the recent integration of Titan Takeoff into LangChain, users will be able to inference their LLMs with minimal setup and coding overhead. You can view a short demonstration of how to use the LangChain integration with Titan Takeoff:
You can start deploying and inferencing your LLMs with these simple steps below:
1. Install the Iris CLI, which will allow you to run the Titan Takeoff Server
2. pip install titan-iris
3. Start the Takeoff Server, specifying the model name on HuggingFace, as well as the device if you’re using a GPU. This will pull the model from the HuggingFace server, allowing you to inference the model locally.
4. iris takeoff --model tiiuae/falcon-7b-instruct --device cuda
5. The Takeoff server is now ready. You can then initialise the LLM object by providing it with a custom port (if not running the Takeoff server on the default port 8000) or other generation parameters such as temperature. There is also an option to specify a streaming flag.
6. llm = TitanTakeoff(port=5000, temperature=0.8, streaming=True)output = llm("What is the weather in London in August?")print(output)# Output: The weather in London in August can vary, with some sunny days and occasional rain showers. The average temperature is around 20-25°C (68-77°F).
You have now made your first inference call to an LLM with the Titan Takeoff Server running right on your local machine!
For more examples demonstrating use of the Takeoff x LangChain integration, view our guide here.
The integration of Titan’s Takeoff Server with LangChain marks a transformative phase in the development and deployment of language model-powered applications. As developers and enterprises seek faster, more efficient, and cost-effective ways to leverage the capabilities of LLMs, solutions like this pave the way for a smarter, seamless, and supercharged future.
TitanML is an NLP development platform and service focusing on the deployability of LLMs. Our Takeoff Server is a hyper-optimised LLM inference server that ‘just works’, making it the fastest and simplest way to experiment with and deploy LLMs locally.