Introduction
This lesson will delve into integrating several LLMs with LangChain. We will examine the platforms that support these models and compare their features. LangChain has built-in support for some of the most popular publicly available pre-trained models. In previous lessons, we already discussed several options, like ChatGPT, GPT-4, GPT-3, and GPT4ALL.
This framework provides close to 30 integrations with well-known AI platforms like OpenAI, Cohere, Writer, and Replicate, to name a few. Most notably, it provides access to the Hugging Face Hub API, with more than 120K available models that can be easily incorporated into your applications. These organizations offer different ways to access their services.
It is common practice to pay for API access. The prices are usually determined by factors such as the number of processed tokens, as with OpenAI, or the duration of processing measured in hours of GPU usage, as with Hugging Face Inference Endpoints or Amazon SageMaker. These options are generally easy and fast to set up. However, it is worth noting that you do not own the models, even if they were fine-tuned on your valuable datasets; these providers only offer access to the API under a pay-as-you-go plan.
On the other side of the spectrum, it is possible to host the models locally on your own servers. This gives you full and exclusive control over the network and your dataset. However, it is important to be aware of the hardware costs (a high-end GPU for low latency) and maintenance costs (the expertise to deploy and fine-tune models) associated with this approach. Additionally, a number of publicly available models are not licensed for commercial use, like LLaMA.
The right approach is different for each use case and depends on details like budget, model capability, expertise, and trade secrets. It is straightforward to create a custom fine-tuned model by feeding your data to OpenAI’s API. On the other hand, you might consider doing fine-tuning in-house if the dataset is part of your intellectual property and cannot be shared.
The different models' characteristics are another consideration. A model's size and the quality of its training data directly impact its language understanding capability. At the same time, a larger model is not always the best answer. GPT-3's Ada variant is the smallest model in the collection, making it the fastest and most cost-effective option with low latency, but it suits more straightforward tasks like parsing text or classification. Conversely, the latest GPT-4 version is the largest model in the lineup and generates high-quality results across tasks, but its large number of parameters makes it the slowest and most expensive option. Therefore, selecting a model based on its capabilities is also necessary. It might be cheaper to use Ada to implement a conversational application, but that is not the model's objective, and it will produce disappointing responses. (You can read this article for a comparison between a number of well-known LLMs.)
We will introduce a number of LangChain integrations in the rest of this lesson to help you choose the right approach.
Popular LLM models accessible to LangChain via API
Cohere Command
The Cohere service provides a variety of models such as Command (command) for dialogue-like interactions, Generation (base) for generative tasks, Summarize (summarize-xlarge) for generating summaries, and more. You can get free, rate-limited usage for learning and prototyping. This means that usage is free until you go into production; however, some of the models may be a bit more expensive than the OpenAI APIs when you do, for example, $2.5 for generating 1K tokens. Since Cohere offers models customized for each task, a more use-case-specific model can lead to improved outcomes in downstream tasks. LangChain's Cohere class makes it easy to access these models: Cohere(model="<MODEL_NAME>", cohere_api_key="<API_KEY>"), where the model name could be, for example, command-xlarge-20221108. Please refer to the Cohere documentation for the latest naming convention.
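As a minimal sketch (the API key is a placeholder and the model name is only an example; check Cohere's documentation for the ids available on your plan), the class can be wired into an LLMChain like this:

```python
from langchain.llms import Cohere
from langchain import PromptTemplate, LLMChain

# Placeholder API key and example model name; see Cohere's docs for current ids.
llm = Cohere(model="command", cohere_api_key="<API_KEY>")

prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest a name for a company that makes {product}.",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("eco-friendly water bottles"))
```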
GPT-3.5
GPT-3.5 is a language model developed by OpenAI. Its turbo version (recommended by OpenAI over other variants) offers a more affordable option for generating human-like text through an API accessible via OpenAI endpoints. The model is optimized for chat applications while remaining powerful on other generative tasks and can process 96 languages. GPT-3.5-turbo has a context length of up to 16K tokens and is the most cost-effective option in the OpenAI collection, at only $0.002 per 1,000 tokens. You can access this model's API by passing the gpt-3.5-turbo key while initializing either the ChatOpenAI or OpenAI class.
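A quick sketch of both options (assuming the OPENAI_API_KEY environment variable is set); swapping in a GPT-4 model name works the same way once you have access:

```python
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Completion-style interface (routes gpt-3.5-turbo to the chat backend).
llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0)
print(llm("Summarize LangChain in one sentence."))

# Chat-style interface; "gpt-4" can be used here as well if you have access.
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
print(chat([HumanMessage(content="Summarize LangChain in one sentence.")]).content)
```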
GPT-4
OpenAI's GPT-4 is a competent multimodal model with an undisclosed number of parameters and training procedure. It is the latest and most powerful model published by OpenAI, and its multi-modality enables it to process both text and images as input. It is not generally available yet; however, it can be accessed by submitting an early-access request through the OpenAI platform. The two variants of the model are gpt-4 and gpt-4-32k, with context lengths of 8,192 and 32,768 tokens, respectively.
Jurassic-2
AI21's Jurassic-2 is a language model that comes in three sizes with different price points: Jumbo, Grande, and Large. The exact model sizes are not publicly available, but their documentation marks the Jumbo version as the most powerful. They describe the models as general-purpose with excellent capability on every generative task. The J2 models understand seven languages and can be fine-tuned on custom datasets. You can access these models by getting an API key from the AI21 platform and using the AI21 class.
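A minimal sketch, assuming an AI21 API key; the model id shown is an assumption, so check AI21's documentation for the exact Jurassic-2 names available to your account:

```python
from langchain.llms import AI21

# Placeholder API key; the model id is assumed and may differ on your account.
llm = AI21(ai21_api_key="<AI21_API_KEY>", model="j2-jumbo-instruct")
print(llm("Write a tagline for an ice cream shop."))
```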
StableLM
StableLM Alpha is a language model developed by Stability AI. It can be accessed via the Hugging Face Hub (with the id stabilityai/stablelm-tuned-alpha-3b) to host locally, or via the Replicate API at a rate from $0.0002 to $0.0023 per second. So far, it comes in two sizes, 3 billion and 7 billion parameters. The weights for StableLM Alpha are available under the CC BY-SA 4.0 license, which permits commercial use. The context length of StableLM is 4096 tokens.
Dolly-v2-12B
Dolly-v2-12B is a language model created by Databricks. It can be accessed via the Hugging Face Hub (with the id databricks/dolly-v2-12b) to host locally, or via the Replicate API within the same price range mentioned in the previous subsection. It has 12 billion parameters and is available under an open-source license for commercial use. The base model for Dolly-v2-12B is Pythia-12B.
GPT4ALL
GPT4ALL is based on Meta's LLaMA model with 7B parameters. It is a language model developed by Nomic AI that can be accessed through GPT4ALL and Hugging Face Local Pipelines. The model is published under a GPL-3.0 open-source license; however, it is not free to use for commercial applications. It is available for researchers to use in their projects and experiments. We covered this model's capabilities and usage in the previous lesson.
LLM Platforms that can integrate into LangChain
Cohere is a Canada-based startup specializing in natural language processing models that help companies enhance human-machine interactions. Cohere provides API access to their Cohere xlarge model, which has 52 billion parameters. Their API pricing is based on embeddings: $1 for every 1,000 embeddings. Cohere provides an easy-to-follow installation process for their package, which is required to access their API. Using LangChain, developers can easily interact with Cohere models by creating prompts that incorporate input variables, which can then be passed to the Cohere API to generate responses.
OpenAI is one of the biggest companies focused on large language models. By introducing ChatGPT, their conversational model, they were the first service to catch mainstream media attention on the potency of LLMs. They also provide a large variety of API endpoints for different NLP tasks at different price points. The LangChain library provides multiple classes for convenient access, examples of which we saw in previous lessons, such as the ChatOpenAI and OpenAI classes.
Hugging Face is a company that develops natural language processing (NLP) technologies, including pre-trained language models, and offers a platform for developing and deploying NLP models. The platform hosts over 120K models and 20K datasets. They offer the Spaces service for researchers and developers to quickly create a demo and showcase their model's capabilities. The platform hosts large-scale models such as StableLM by Stability AI, Dolly by Databricks, or Camel by Writer. The HuggingFaceHub class takes care of downloading and initializing the models.
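For example, a minimal sketch that queries one of the hosted models mentioned above (assuming the HUGGINGFACEHUB_API_TOKEN environment variable is set; the repo id is only an example):

```python
from langchain import HuggingFaceHub, PromptTemplate, LLMChain

# Any hosted model repo works here; this id is taken from the StableLM section.
llm = HuggingFaceHub(
    repo_id="stabilityai/stablelm-tuned-alpha-3b",
    model_kwargs={"temperature": 0.7, "max_new_tokens": 64},
)

prompt = PromptTemplate(
    input_variables=["question"],
    template="Question: {question}\nAnswer:",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What is the capital of France?"))
```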
This integration provides access to many models that can be optimized for Intel CPUs using the Intel® Extension for PyTorch library. The package can be applied to models with minimal code change. It enables networks to take advantage of Intel®'s advanced architectural designs to significantly enhance performance on Intel® CPU and GPU lines. For example, reports show a 3.8x speedup while running the BLOOMZ model on a 4th-generation Intel® Xeon® CPU compared to the previous generation, with no changes in architecture or weights. When the optimization library was used alongside the 4th-generation Intel® Xeon® CPU, the speedup grew further, to nearly 6.5 times the original value. (online demo) Whisper and GPT-J are two other examples of widely recognized models that leverage these efficiency gains.
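A minimal sketch of how the optimization is typically applied; the model id and dtype are illustrative, and the exact calls depend on your model and hardware:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any Hugging Face causal LM works here; gpt2 is used purely as a small example.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

# One-line optimization: IPEX rewrites the model for Intel hardware,
# here requesting bfloat16 kernels where supported.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Intel optimizations make inference", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```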
The Amazon SageMaker infrastructure enables users to train and host their machine learning models easily. It is a high-performance, low-cost environment for experimenting with and serving large-scale models. The LangChain library provides a simple-to-use interface that simplifies the process of querying the deployed models, so there is no need to write API code to access the model. A model can be loaded using its endpoint_name, which is the model's unique name from SageMaker, followed by credentials_profile_name, which is the name of the AWS profile you want to use for authentication.
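A minimal sketch, assuming you already have a text-generation endpoint deployed in SageMaker; the endpoint name, profile name, and request/response JSON format are placeholders that depend on your deployment and serving container:

```python
import json
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    # The request/response format depends on the container serving your model.
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        return json.dumps({"inputs": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        return json.loads(output.read().decode("utf-8"))[0]["generated_text"]

llm = SagemakerEndpoint(
    endpoint_name="my-llm-endpoint",            # placeholder endpoint name
    credentials_profile_name="my-aws-profile",  # placeholder AWS profile
    region_name="us-east-1",
    model_kwargs={"temperature": 0.7},
    content_handler=ContentHandler(),
)
print(llm("Tell me a joke."))
```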
Hugging Face Local Pipelines is a powerful tool that allows users to run Hugging Face models locally using the HuggingFacePipeline class. The Hugging Face Model Hub is home to an impressive collection of more than 120,000 models, 20,000 datasets, and 50,000 demo apps (Spaces) that are all publicly available and open source, making it easy for individuals to collaborate and build machine learning models together. To access these models, users can either utilize the local pipeline wrapper or call the hosted inference endpoints via the HuggingFaceHub class. Before getting started, the Transformers Python package must be installed. Once installed, users can load the desired model using its model_id and task, along with any additional model arguments. Finally, the model can be integrated into an LLMChain by creating a PromptTemplate and LLMChain object and running the input through it.
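Sketching that flow end to end; the model id is only an example, and any text-generation model small enough for your machine will work:

```python
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

# Downloads the model weights locally on first use; requires the
# transformers (and torch) packages to be installed.
llm = HuggingFacePipeline.from_model_id(
    model_id="databricks/dolly-v2-3b",   # example model id
    task="text-generation",
    model_kwargs={"temperature": 0.1, "max_length": 128},
)

prompt = PromptTemplate(
    input_variables=["question"],
    template="Question: {question}\nAnswer:",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What is electroencephalography?"))
```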
OpenAI’s models can also be accessed via Microsoft’s Azure platform.
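A minimal sketch, assuming you have created an Azure OpenAI resource and a model deployment; the resource URL, key, API version, and deployment name below are all placeholders:

```python
import os
from langchain.llms import AzureOpenAI

# These values come from your Azure OpenAI resource; all are placeholders.
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://<your-resource>.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "<AZURE_OPENAI_KEY>"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"

llm = AzureOpenAI(deployment_name="<your-deployment-name>")
print(llm("Explain LangChain in one sentence."))
```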
AI21 is a company that offers access to their powerful Jurassic-2 large language models through their API. The API provides access to their Jurassic-2 model, which has an impressive 178 billion parameters, at quite a reasonable cost of only $0.01 for every 1K tokens. Developers can easily interact with the AI21 models by creating prompts with LangChain that incorporate input variables. With this simple process, developers can take advantage of their powerful language processing capabilities.
Aleph Alpha is a company that offers a family of large language models known as the Luminous series. The Luminous family includes three models, namely Luminous-base, Luminous-extended, and Luminous-supreme, which vary in terms of complexity and capabilities. Aleph Alpha's pricing is token-based; the base prices per 1,000 input tokens are 0.03€ for Luminous-base, 0.045€ for Luminous-extended, 0.175€ for Luminous-supreme, and 0.21875€ for Luminous-supreme-control.
Banana is a machine learning infrastructure-focused company that provides developers with the tools to build machine learning models. Using LangChain, one can interact with Banana models by installing the Banana package, which includes an SDK for Python. Next, two keys are required: the BANANA_API_KEY and the YOUR_MODEL_KEY, both of which can be obtained from their platform. After setting the keys, we can create a Banana object by providing the YOUR_MODEL_KEY. It is then possible to integrate the Banana model into an LLMChain by creating a PromptTemplate and LLMChain object and running the desired input through it.
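A minimal sketch of that flow; both keys are placeholders obtained from the Banana dashboard:

```python
import os
from langchain.llms import Banana
from langchain import PromptTemplate, LLMChain

os.environ["BANANA_API_KEY"] = "<BANANA_API_KEY>"   # placeholder

# model_key identifies your deployed model on Banana's platform (placeholder).
llm = Banana(model_key="<YOUR_MODEL_KEY>")

prompt = PromptTemplate(
    input_variables=["question"],
    template="Question: {question}\nAnswer:",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What is a GPU?"))
```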
CerebriumAI is an excellent alternative to AWS SageMaker, providing access to several LLM models through its API. The available pre-trained LLM models include Whisper, MT0, FlanT5, GPT-Neo, Roberta, Pygmalion, Tortoise, and GPT4All. Developers create an instance of CerebriumAI by providing the endpoint URL and other relevant parameters such as max_length, temperature, etc.
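For instance, a minimal sketch; the API key and endpoint URL are placeholders from your CerebriumAI dashboard, and the generation parameters are illustrative:

```python
import os
from langchain.llms import CerebriumAI

os.environ["CEREBRIUMAI_API_KEY"] = "<CEREBRIUMAI_API_KEY>"  # placeholder

# endpoint_url points at your deployed model (placeholder URL).
llm = CerebriumAI(
    endpoint_url="https://run.cerebrium.ai/<your-endpoint>/predict",
    max_length=100,
    temperature=0.5,
)
print(llm("Write a haiku about clouds."))
```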
DeepInfra is a unique API that offers a range of LLMs, such as distilbert-base-multilingual-cased, bert-base, whisper-large, gpt2, dolly-v2-12b, and more. It is connected to LangChain via API and runs on A100 GPUs that are optimized for inference performance and low latency. Compared to Replicate, DeepInfra's pricing is much more affordable, at $0.0005 per second ($0.03 per minute). With DeepInfra, we are given a 1-hour free trial of serverless GPU compute to experiment with different models.
ForefrontAI is a platform that allows users to fine-tune and utilize various open-source large language models like GPT-J, GPT-NeoX, T5, and more. The platform offers different pricing plans, including the Starter plan for $29/month, which comes with 5 million serverless tokens, 5 fine-tuned models, 1 user, and Discord support. With ForefrontAI, developers have access to various models that can be fine-tuned to suit their specific needs.
GooseAI is a fully managed NLP-as-a-Service platform that offers access to various models, including GPT-Neo, Fairseq, and GPT-J. Pricing is based on model size and usage. For the 125M model, the base price for up to 25 tokens is $0.000035 per request, with an additional fee of $0.000001. To use GooseAI with LangChain, you need to install the openai package and set the API key as an environment variable, which can be obtained from GooseAI. Once you have the API key, you can create a GooseAI instance, define a PromptTemplate for question answering, initiate an LLMChain, and run it with a question.
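Putting those steps together in a minimal sketch; the API key is a placeholder, and GooseAI reuses the openai client package under the hood:

```python
import os
from langchain.llms import GooseAI
from langchain import PromptTemplate, LLMChain

os.environ["GOOSEAI_API_KEY"] = "<GOOSEAI_API_KEY>"  # placeholder

llm = GooseAI()  # uses GooseAI's default hosted model unless one is specified

template = "Question: {question}\nAnswer: Let's think step by step."
prompt = PromptTemplate(template=template, input_variables=["question"])

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("Which planet in our solar system is known as the Red Planet?"))
```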
Llama-cpp, a Python binding for llama.cpp, has been seamlessly integrated into the LangChain framework. This integration allows users to run a variety of LLMs supported by llama.cpp, including LLaMA 🦙, Alpaca, GPT4All, Chinese LLaMA / Alpaca, Vigogne (French), Vicuna, Koala, OpenBuddy 🐶 (multilingual), Pygmalion 7B, and Metharme 7B. With this integration, users have a wide range of options to choose from based on their specific language processing needs and can generate humanistic, step-by-step responses to their input questions.
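A minimal sketch; the model path is a placeholder pointing at a llama.cpp-compatible weights file you have downloaded separately:

```python
from langchain.llms import LlamaCpp

# model_path must point at a local llama.cpp-compatible weights file;
# the filename below is only a placeholder.
llm = LlamaCpp(
    model_path="./models/ggml-model-q4_0.bin",
    temperature=0.75,
    max_tokens=256,
)
print(llm("Explain, step by step, how to make a cup of tea."))
```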
Manifest is an integration tool that enhances the capabilities of LangChain, making it more powerful and user-friendly for language processing tasks. It acts as a bridge between LangChain and local Hugging Face models, allowing users to access and utilize these models within LangChain easily. To utilize Manifest within LangChain, users install the manifest-ml package and configure the connection settings. Once integrated, users can leverage Manifest's functionalities alongside LangChain for a comprehensive language processing experience.
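A minimal sketch, assuming a local Manifest connection (here, a Hugging Face model already being served at a local address); the client name and URL are placeholders:

```python
from manifest import Manifest
from langchain.llms.manifest import ManifestWrapper

# Assumes a local Manifest server is already running a Hugging Face model
# at this address; both values below are placeholders.
manifest = Manifest(
    client_name="huggingface",
    client_connection="http://127.0.0.1:5000",
)

llm = ManifestWrapper(client=manifest, llm_kwargs={"temperature": 0.0, "max_tokens": 256})
print(llm("What is the capital of Japan?"))
```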
Modal is seamlessly integrated into LangChain, adding powerful cloud computing capabilities to the language processing workflow. While Modal does not provide any specific language models (LLMs), it serves as the infrastructure enabling LangChain to leverage serverless cloud computing. By integrating Modal into LangChain, users can directly harness the benefits of on-demand access to cloud resources from their Python scripts on their local computers. By installing the Modal client library and generating a new token, users can authenticate and establish a connection to the Modal server. In the LangChain example, a Modal LLM is instantiated using the endpoint URL, and a PromptTemplate is defined to structure the input. LangChain then executes the LLMChain with the specified prompt and runs a language processing task, such as answering a question.
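Sketching that example; the endpoint URL is a placeholder for a web endpoint you have already deployed on Modal:

```python
from langchain.llms import Modal
from langchain import PromptTemplate, LLMChain

# endpoint_url is a placeholder for your own deployed Modal web endpoint.
llm = Modal(endpoint_url="https://<your-workspace>--<your-app>.modal.run")

prompt = PromptTemplate(
    input_variables=["question"],
    template="Question: {question}\nAnswer: Let's think step by step.",
)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What is the largest ocean on Earth?"))
```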
NLP Cloud seamlessly integrates with LangChain, providing a comprehensive suite of high-performance pre-trained and custom models for a wide range of natural language processing (NLP) tasks. These models are designed for production use and can be accessed through a REST API. By executing the LLMChain with the specified prompt, users can seamlessly perform NLP tasks like answering questions.
Petals is seamlessly integrated into LangChain, enabling the use of language models with over 100 billion parameters within a decentralized architecture similar to BitTorrent. Petals offers a diverse range of language models, and its integration with LangChain enhances natural language understanding and generation capabilities. Petals operates under a decentralized model, providing users with powerful language processing capabilities in a distributed environment.
PipelineAI is seamlessly integrated into LangChain, allowing users to scale their machine-learning models in the cloud. Additionally, PipelineAI offers API access to a range of LLM (Large Language Model) models. It includes GPT-J, Stable Diffusion, ESRGAN, DALL·E, GPT-2, and GPT-Neo, each with its own specific model parameters and capabilities. PipelineAI empowers users to leverage the scalability and power of the cloud for their machine-learning workflows within the LangChain ecosystem.
PredictionGuard is seamlessly integrated into LangChain, providing users with a powerful wrapper for their language model usage. To begin using PredictionGuard within the LangChain framework, the predictionguard and LangChain libraries need to be installed. PredictionGuard can also be seamlessly integrated into LangChain's LLMChain for more advanced tasks. PredictionGuard enhances the LangChain experience by providing an additional layer of control and safety to language model outputs.
PromptLayer is seamlessly integrated into LangChain, offering users enhanced control and management of their GPT prompt engineering. PromptLayer acts as middleware between users' code and OpenAI's Python library, enabling the recording, tracking, and exploration of OpenAI API requests through the PromptLayer dashboard. To utilize PromptLayer with OpenAI, the promptlayer package needs to be installed. Users can attach templates to requests, enabling the evaluation of different templates and models within the PromptLayer dashboard.
Replicate is seamlessly integrated into LangChain, providing a wide range of LLM models for various applications. Some of the LLM models offered by Replicate include vicuna-13b, bark, speaker-transcription, stablelm-tuned-alpha-7b, Kandinsky-2, and stable-diffusion. These models cover diverse areas such as language generation, generative audio, speaker transcription, language modeling, and text-to-image generation. Each model has specific parameters and capabilities, enabling users to choose the most suitable model for their needs. Replicate provides flexible pricing options based on the computational resources required for running the models. Replicate simplifies the deployment of custom machine-learning models at scale, and users can integrate it into LangChain to interact with these models effectively.
Runhouse is seamlessly integrated into LangChain, providing powerful remote compute and data management capabilities across different environments and users. Runhouse offers the flexibility to host models on your own GPU infrastructure or leverage on-demand GPUs from cloud providers such as AWS, GCP, and Azure. Runhouse provides several LLM models that can be utilized within LangChain, such as gpt2 and google/flan-t5-small. Users can specify the desired hardware configuration. By combining Runhouse and LangChain, users can easily create advanced language model workflows, enabling efficient model execution and collaboration across different environments and users.
StochasticAI aims to simplify the workflow of deep learning models within LangChain, providing users with an efficient and user-friendly environment for model interaction and deployment. It provides a streamlined process for the lifecycle management of Deep Learning models. StochasticAI's Acceleration Platform simplifies tasks such as model uploading, versioning, training, compression, and acceleration, ultimately facilitating the deployment of models into production. Within LangChain, users can interact with StochasticAI models effortlessly. The available LLM models from StochasticAI include FLAN-T5, GPT-J, Stable Diffusion 1, and Stable Diffusion 2. These models offer diverse capabilities for various language-related tasks.
Writer is seamlessly integrated into LangChain, providing users with a powerful platform for generating diverse language content. With the Writer integration, LangChain users can effortlessly interact with a range of LLM models to meet their language generation needs. The available LLM models provided by Writer include Palmyra Small (128m), Palmyra 3B (3B), Palmyra Base (5B), Camel 🐪 (5B), Palmyra Large (20B), InstructPalmyra (30B), Palmyra-R (30B), Palmyra-E (30B), and Silk Road. These models offer different capacities for improving language understanding, generative pre-training, following instructions, and retrieval-augmented generation.
Conclusion
It's understandable to feel overwhelmed by the number of choices when integrating the mentioned foundational models. That's why, in this lesson, we have explained the different paths one can take. This information can be a valuable reference tool to help make an informed decision. Depending on your requirements, you may host the model locally or opt for a pay-as-you-go service. The former will enable you to have complete control over the model's implementation, while the latter can be more cost-effective for those with limited resources. Whatever your preference, choosing the option that best suits your needs and budget is essential.
Good job on completing all the lessons of the first module of the course! Head up to the module quiz to test your new knowledge. After that, the following module focuses on prompting, i.e. the best way of giving instructions to LLMs.
Intel, the Intel logo and Xeon are trademarks of Intel Corporation or its subsidiaries.