From Proprietary to Open Source to Fleets of Custom LLMs

Large language models (LLMs) have proven to be powerful tools for me over the last year. They can be used to build a wide range of applications, from chatbots and content generators to coding assistants and question answering systems. After discussing this journey with friends who are also building applications, I want to share the typical progression most LLM enthusiasts go through as they dive into this rapidly changing and fascinating world.

Stage 1: Start with proprietary models available through a public API.

The most common way to start building applications with LLMs is to use a proprietary model available through a public API. In my case this meant the OpenAI models GPT-4 and GPT-3.5. This is the most straightforward approach, as it requires no expertise in training or deploying LLMs. At this stage, you play around with LLM orchestration tools (my current favorite is Haystack) and build your first retrieval-augmented generation (RAG) application.
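
As a rough illustration, here is a minimal sketch of that first step using the OpenAI Python SDK; orchestration tools like Haystack essentially wrap calls like this inside retrieval pipelines. The model name and prompt are placeholders.

```python
# Minimal sketch: calling a proprietary model through a public API.
# Assumes the openai package (v1.x) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # or "gpt-3.5-turbo"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize retrieval-augmented generation in two sentences."},
    ],
)
print(response.choices[0].message.content)
```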

Stage 2: Explore open source LLM alternatives.

While proprietary LLMs are easy to use, they can be slow, unstable, and expensive, especially as you scale your usage. For these reasons, many developers are now turning to open source LLMs. Open source LLMs are freely available to use and modify, and they can be deployed on your own infrastructure. This gives you more control over the cost, latency, and security of your applications.

My most recent experience with the Llama family of LLMs came through Anyscale endpoints. Despite their compact size, these models can hold their ground. Llama models demand more explicit instructions, but that was a minor trade-off for their speed, stability, and remarkably affordable price: Anyscale charges just $1 per million tokens.
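
Because Anyscale endpoints are consumed through the familiar OpenAI SDK (more on that in the fine-tuning discussion below), moving on from Stage 1 is largely a matter of pointing the same client at a different base URL and model. The URL and model id in this sketch are assumptions; check Anyscale’s documentation for current values.

```python
# Sketch: the same OpenAI client, pointed at an open-source Llama model on Anyscale endpoints.
# The base URL and model name are assumptions for illustration -- consult Anyscale's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed endpoint URL
    api_key="ANYSCALE_API_KEY_HERE",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",  # example Llama model id
    messages=[{"role": "user", "content": "Explain what a vector database is, briefly."}],
)
print(response.choices[0].message.content)
```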

Since these models are open source, why not just host them yourself? Hosting your own version of Llama may seem tempting, but it’s not necessarily cost-efficient. Anyscale, for instance, has invested significant engineering effort and intellectual property in ensuring optimal cost efficiency. Managing costs for LLM inference isn’t straightforward; it permeates every layer of the tech stack. It demands mastery of advanced GPU and model optimization techniques, such as PagedAttention and tensor parallelism. You’d also need to harness strategies like continuous batching, scale model replicas to match fluctuating loads, and select the most budget-friendly hardware and GPUs across diverse regions or even different cloud providers. Given these complexities, using Anyscale endpoints strikes me as a practical and efficient choice.
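
To make the trade-off concrete, here is a minimal self-hosting sketch using vLLM, one open-source inference engine that implements PagedAttention and continuous batching (the choice of engine is mine for illustration; the model name and GPU count are placeholders). Even this small example leaves out the hard operational parts: provisioning GPUs, autoscaling replicas, and picking hardware and regions.

```python
# Sketch of self-hosted inference with vLLM, which implements PagedAttention
# and continuous batching. Model name and GPU count are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # example model; needs access and enough GPU memory
    tensor_parallel_size=2,                  # shard the model across 2 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain continuous batching in one short paragraph."], params)
print(outputs[0].outputs[0].text)
```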

Currently, my toolkit comprises a mix of open-source LLMs via Anyscale endpoints, supplemented by occasional proprietary models from OpenAI.

Stage 3: Create a custom LLM.

Custom LLMs offer several advantages over general-purpose models, especially when you have specific requirements for your application, such as enhanced accuracy or performance for a particular task or domain. Teams often opt for custom LLMs due to concerns related to privacy, security, and the desire for more control. Additionally, custom LLMs can be more compact than general models, offering advantages in cost, speed, and precision.

Expanding the notion of “custom LLM” to encompass retrieval-augmented generation can result in applications that are sharper, faster, and more cost-effective. However, these models demand careful calibration to reach their potential. Though smaller models may not perform well out of the box, they can be improved through fine-tuning and by grounding them in domain-specific data via RAG. As I noted in my previous post, navigating the many techniques for customizing LLMs can be challenging, but the goal is the same: refine models for specific functions.

Anyscale endpoints, leveraging the OpenAI SDK, offer a streamlined process for this endeavor. They simplify the complexities of fine-tuning, from scheduling jobs to optimizing hyperparameters. Once the tuning is complete, Anyscale sends an email notification and your model becomes available for serving. For those venturing into RAG, a robust foundation like Ray is crucial to conducting experiments that produce an optimal RAG setup.
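
Assuming the fine-tuning workflow mirrors the OpenAI SDK’s file and job interfaces (as described above), the submission step might look roughly like this; the base URL, model id, and training file are placeholders for illustration.

```python
# Sketch: submitting a fine-tuning job through the OpenAI SDK pointed at Anyscale endpoints.
# Base URL, model id, and the JSONL training file are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed endpoint URL
    api_key="ANYSCALE_API_KEY_HERE",
)

# Upload chat-formatted training data (one JSON object per line).
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# Kick off the fine-tuning job; Anyscale emails you when the tuned model is ready to serve.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="meta-llama/Llama-2-7b-chat-hf",  # example base model to fine-tune
)
print(job.id, job.status)
```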

Stage 4: Deploy, use, and manage multiple custom LLMs.

A single custom LLM will likely be insufficient. Different tasks or projects often require their own distinct LLMs. This is driven by a variety of factors, including cost considerations, the pursuit of speed, the need for greater control, and the desire for pinpoint accuracy.

An overview of Ray Docs AI, a component of Anyscale Doctor.

Indeed, some applications require multiple custom LLMs to maximize performance and enhance the user experience. A single LLM cannot always meet all of an application’s diverse needs. The Anyscale Doctor, a coding assistant discussed at the Ray Summit, epitomizes this. To address the knowledge gaps inherent in LLMs, a retrieval-augmented generation application was essential. Additionally, using a suite of LLMs allowed the developers to segment the work, optimizing each step and selecting the ideal LLM for it. An intelligent routing mechanism ensured that the correct LLM was chosen for each task. This multifaceted approach, integrating several LLMs and RAG applications, demanded a robust serving infrastructure, which was achieved through Anyscale endpoints.
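
As a simplified illustration of that routing idea (not the actual Anyscale Doctor implementation), a router can map each incoming task to the model best suited for it and dispatch through the same OpenAI-compatible client; the task categories and model ids here are hypothetical.

```python
# Hypothetical sketch of routing requests across a fleet of custom LLMs.
# Model ids, task categories, and base URL are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.endpoints.anyscale.com/v1", api_key="ANYSCALE_API_KEY_HERE")

# Each task type gets the model tuned (and sized) for it.
MODEL_ROUTES = {
    "code_completion": "my-org/llama-7b-code-ft",    # small, fast, fine-tuned for code
    "doc_qa": "my-org/llama-13b-rag-ft",              # answers questions over retrieved docs
    "general": "meta-llama/Llama-2-70b-chat-hf",      # fallback for everything else
}

def route(task_type: str, prompt: str) -> str:
    """Pick the model registered for this task type and return its response."""
    model = MODEL_ROUTES.get(task_type, MODEL_ROUTES["general"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(route("code_completion", "Write a Ray remote function that squares a number."))
```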

Closing Thoughts: Security and governance considerations.

Building applications with LLMs can be a complex and challenging process. The journey is as much about exploration as it is about optimization and evolution. You have to balance understanding the tools available, pushing their boundaries, and crafting solutions that are both effective and efficient.

Regardless of which stage you resonate with, an overarching concern permeates them all: the imperative of securing data and intellectual property. In some cases, a standard API may not meet your needs, especially when you’re dealing with sensitive data or you’re in need of enhanced customization.

In response to these challenges, Anyscale’s introduction of private endpoints presents a compelling option. These endpoints empower developers to deploy the Anyscale backend, encompassing various LLMs, within dedicated cloud accounts. Not only does this bolster privacy, but it also enhances flexibility and provides unmatched control over the infrastructure. While numerous LLM APIs exist in the market, none present such a unique blend of advantages. With Anyscale private endpoints, users enjoy the ease of a recognizable API interface, supported by a formidable and scalable backend. Moreover, it affords an exceptional degree of customizability and control, mirroring the experience of operating a personal infrastructure.


If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.
