Deploying an LLM on a Cloud CPU

Training a language model is costly, and the expenses of deploying it accumulate quickly over time. Applying optimization techniques that make the inference process more efficient is crucial for keeping hosting costs down. In this lesson, we will discuss how to use the Intel® Neural Compressor library to apply quantization, making models cheaper and faster to run on CPU instances (the library also supports AMD CPUs, ARM CPUs, and NVIDIA GPUs through ONNX Runtime, but with limited testing).

Various techniques can be employed to optimize a network. Pruning reduces the parameter count by removing the least important weights, while knowledge distillation transfers knowledge from a larger model to a smaller one. Quantization, finally, lowers the precision of the weights, for example from 32 bits to 8 bits, significantly reducing the memory needed to load the model and generate responses, with minimal loss of accuracy.

Credit: Deci.ai
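
To build intuition for the memory savings, here is a toy sketch using PyTorch's built-in dynamic quantization (a different tool than the Intel® Neural Compressor workflow used later in this lesson, but the same underlying idea): converting a linear layer's weights from 32-bit floats to 8-bit integers shrinks its serialized size roughly 4x.

import io
import torch

def serialized_size_mb(module):
    # Serialize the module's weights to an in-memory buffer and report the size in MB.
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

fp32_layer = torch.nn.Linear(4096, 4096)
int8_model = torch.quantization.quantize_dynamic(
    torch.nn.Sequential(fp32_layer), {torch.nn.Linear}, dtype=torch.qint8
)

print(f"fp32 weights: {serialized_size_mb(fp32_layer):.1f} MB")
print(f"int8 weights: {serialized_size_mb(int8_model):.1f} MB")  # roughly 4x smaller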

The primary focus of this lesson is the quantization technique. We will apply it to an LLM and demonstrate how to perform inference using the quantized model. Ultimately, we will execute several experiments to assess the resulting acceleration.

We'll begin by setting up the necessary libraries. Install the optimum-intel package directly from its GitHub repository.

pip install git+https://github.com/huggingface/optimum-intel.git@v1.10.1
pip install neural_compressor==2.2.1
pip install onnx==1.14.1
The sample code.

Quantization

You can use the optimum-cli command in the terminal to run dynamic quantization, the recommended approach for transformer-based neural networks. You can either pass the path to your custom model or pick a model from the Hugging Face Hub, specified with the --model parameter. The --output parameter sets the name of the resulting model. We are conducting tests on Facebook's OPT model with 1.3 billion parameters.

optimum-cli inc quantize --model facebook/opt-1.3b --output opt1.3b-quantized
The sample code.

If the script fails to recognize your model, you can set the --task parameter; for language models, use --task text-generation. Check the source code for a complete list of supported tasks.
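
For example, the earlier command with the task passed explicitly (combining the two flags this way follows the pattern above):

optimum-cli inc quantize --model facebook/opt-1.3b --task text-generation --output opt1.3b-quantized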

The library also provides a constrained quantization method, which lets you define a specific target: for example, you can supply an evaluation function and ask for a quantized model that loses no more than 5% accuracy. For further details on constrained quantization, please refer to the library documentation.
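
As a rough sketch of what accuracy-constrained quantization can look like in code, based on the optimum-intel INCQuantizer API (exact argument names may vary between library versions, and evaluate_on_validation_set is a hypothetical stand-in for your own evaluation function):

from transformers import AutoModelForCausalLM
from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion
from optimum.intel import INCQuantizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

def eval_fn(model):
    # Placeholder: evaluate the candidate model on your validation set and
    # return a single accuracy-like score (evaluate_on_validation_set is hypothetical).
    return evaluate_on_validation_set(model)

# Tolerate at most a 5% relative drop in the metric returned by eval_fn.
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.05)
tuning_criterion = TuningCriterion(max_trials=10)
quantization_config = PostTrainingQuantConfig(
    approach="dynamic",
    accuracy_criterion=accuracy_criterion,
    tuning_criterion=tuning_criterion,
)

quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
quantizer.quantize(quantization_config=quantization_config,
                   save_directory="opt1.3b-quantized-constrained")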

Inference

Now, the model is ready for inference. In this section, we look at how to load the quantized model and present the outcomes of our benchmark tests, highlighting the impact of quantization on generation speed. Before running inference, we need to load the pre-trained tokenizer using the AutoTokenizer class. Since quantization does not alter the model's vocabulary, we use the same tokenizer as the base model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
The sample code.

For loading the model, we use the INCModelForCausalLM class provided by the Optimum package. The package also offers loaders tailored to other tasks, such as INCModelForSequenceClassification for classification and INCModelForQuestionAnswering for question answering. The .from_pretrained() method takes the path to the quantized model from the previous section.

from optimum.intel import INCModelForCausalLM

model = INCModelForCausalLM.from_pretrained("./opt1.3b-quantized")
The sample code.

Finally, we can use the familiar .generate() method from the Transformers library to feed the prompt to the model and get the response.

inputs = tokenizer("<PROMPT>", return_tensors="pt")

generation_output = model.generate(**inputs,
                                   return_dict_in_generate=True,
                                   output_scores=True,
                                   min_length=512,
                                   max_length=512,
                                   num_beams=1,
                                   do_sample=True,
                                   repetition_penalty=1.5)
The sample code.
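
Because return_dict_in_generate=True is set, the call returns an object whose sequences field holds the generated token IDs; the tokenizer converts them back to text:

generated_text = tokenizer.decode(generation_output.sequences[0], skip_special_tokens=True)
print(generated_text)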

We force the model to generate 512 tokens by setting both the minimum and maximum length parameters. This keeps the token count identical between the base model and the quantized version, allowing a fair comparison of their generation times. We also experimented with batching the requests and with an alternative decoding strategy.

Decoding Strategy      Batch Size    Base Model (s)    Quantized Model (s)
Greedy                 1             58.09             26.847
Greedy                 4             127.86            52.46
Beam Search (K=4)      1             144.77            40.73
Beam Search (K=4)      4             354.50            199.72

The above table shows a substantial improvement from quantization. The largest gain comes from beam search with a batch size of 1, where inference is roughly 3.5x faster. All the experiments mentioned were conducted on a server instance equipped with a 4th Generation Intel® Xeon® Scalable processor and 64GB of memory. This highlights the feasibility of performing inference on CPU instances to reduce costs and latency effectively.
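
For reference, a timing run along these lines can reproduce the kind of comparison shown in the table (an illustrative sketch, not the exact benchmark script; the prompt, batch size, and decoding settings are the knobs varied above):

import time
from transformers import AutoTokenizer
from optimum.intel import INCModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
tokenizer.padding_side = "left"  # pad on the left for batched decoder-only generation
model = INCModelForCausalLM.from_pretrained("./opt1.3b-quantized")

batch_size = 4
prompts = ["<PROMPT>"] * batch_size
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

start = time.perf_counter()
model.generate(**inputs, min_length=512, max_length=512, num_beams=4, do_sample=False)
print(f"Generation time: {time.perf_counter() - start:.2f} s")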

Deployment Frameworks

Deploying large language models into production is the final stage in harnessing their capabilities for a diverse array of applications. Creating an API is the most efficient and flexible approach among the various methods available. APIs allow developers to seamlessly integrate these models into their code, enabling real-time interactions with web or mobile applications. There are several ways to create such APIs, each with its advantages and trade-offs.

There are specialized libraries, such as vLLM and TorchServe, designed for serving models. They can load models from various sources and expose endpoints for convenient access, and in most cases they also offer optimizations that speed up inference, batch incoming requests, and manage memory efficiently. On the other hand, there are general-purpose backend frameworks such as FastAPI that make it easy to create any kind of endpoint. While FastAPI is not specifically designed for serving AI models, it integrates easily into your development process, letting you build whatever APIs you need.
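
As a minimal illustration, a FastAPI endpoint wrapping the quantized model might look like this (the route name, request schema, and generation settings are placeholder choices, not a prescribed design):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from optimum.intel import INCModelForCausalLM

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = INCModelForCausalLM.from_pretrained("./opt1.3b-quantized")

class GenerationRequest(BaseModel):
    prompt: str
    max_length: int = 256

@app.post("/generate")
def generate(request: GenerationRequest):
    # Tokenize the incoming prompt, generate a continuation, and return it as JSON.
    inputs = tokenizer(request.prompt, return_tensors="pt")
    output = model.generate(**inputs, max_length=request.max_length, do_sample=True)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}

# Start the server with, e.g.: uvicorn main:app --host 0.0.0.0 --port 8000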

Regardless of the chosen method, a well-designed API ensures that large language models can be deployed robustly, enabling organizations to leverage their capabilities in chatbots, content generation, language translation, and many other applications.

Deploying a model on CPU using Compute Engine with GCP

Follow these steps to deploy a language model on Intel CPUs using Compute Engine with Google Cloud Platform (GCP):

  1. Google Cloud Setup: Sign in to your Google Cloud account. If you don't have one, create it and set up a new project.
  2. Enable Compute Engine API: Navigate to APIs & Services > Library. Search for "Compute Engine API" and enable it.
  3. Create a Compute Engine instance: Go to the Compute Engine dashboard and click on “Create Instance”. Choose a CPU-based machine type; GCP offers several machine types with Intel CPUs (see the example command below).
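
For example, creating such an instance from the gcloud CLI could look roughly like this (instance name, zone, image, and disk size are placeholder values; the C3 machine series uses 4th Generation Intel® Xeon® Scalable processors):

gcloud compute instances create llm-inference \
    --zone=us-central1-a \
    --machine-type=c3-standard-8 \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=200GB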

Once the instance is up and running:

  1. Deploy the model: SSH into your instance, install the necessary libraries and dependencies, and copy your server code (FastAPI, vLLM, etc.) to the machine.
  2. Run the model: Once the setup is complete, run your language model. If you are serving it through a web API, start your server.

Remember, Google Cloud charges based on the resources used, so make sure to stop your instance when not in use.

A similar process works on AWS using EC2. You can find AWS machine types here.

Conclusion

In this lesson, we explored the potential of harnessing 4th Generation Intel® Xeon® Scalable Processors for the inference process and the array of optimization techniques available that make it a practical choice. Our focus was on the quantization approach aimed at enhancing the speed of text generation while conserving resources. The results demonstrate the advantages of applying this technique across various configurations. It is worth noting that there are additional techniques available to optimize the models further. The upcoming chapter will discuss advanced topics within language models, including aspects like multi-modality and emerging challenges.

>> Notebook.

For more information on Intel® Accelerator Engines, visit this resource page. Learn more about Intel® Extension for Transformers, an Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere here.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries.