Introduction
In this lesson, we'll guide you through the step-by-step process of training a large language model from the ground up. Our primary focus will be on conducting the pre-training process in the cloud. Nevertheless, everything covered here transfers to training a model locally, provided your machine has enough resources (realistic only for small language models).
When embarking on model training, three key components must be taken into account. The process begins with selecting an appropriate dataset that aligns with your specific use case. Next, configure the architecture of the model, making adjustments based on the resources at your disposal. Finally, execute the training loop, bringing everything together to train the model effectively.
We integrate well-known libraries like Deep Lake Datasets and Transformers into our implementation to build a smooth pipeline. The first step is selecting the dataset.
GPU Cloud - Lambda
In this lesson, we’ll leverage Lambda, the GPU cloud designed by ML engineers for training LLMs & Generative AI. We can create an account, link a billing method, and then rent a GPU instance. Please follow the instructions in the course logistics section to open a Lambda account. The cost of your instance is based on how long it runs, not just the time spent training your model, so remember to turn your instance off when you are done. For this lesson, we rented an 8x NVIDIA A100 instance (40GB of memory per GPU) at $8.80/h. If you're using the Lambda-provided cloud credit for the course, be aware that you still need to register a credit card. The credit covers costs up to $75, but you must have a card on file; if you spend more than the allocated $75, you will have to cover the difference yourself.
You can find the code of this lesson in this Notebook.
Training Monitoring - Weights and Biases
Since we’re going to spend a lot of money training our LLM, we want to make sure everything is progressing smoothly. To do so, we’ll log the training metrics to Weights and Biases, which lets us watch them in real time on a convenient dashboard.
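For the metrics to land in your own Weights and Biases account, the wandb package must be installed and an API key configured before launching training. A minimal setup sketch (the project name below is just an example; the Transformers integration typically reads it from the WANDB_PROJECT environment variable):
import os
import wandb

# Authenticate once; this prompts for the API key from your W&B account.
wandb.login()

# Optional: group all runs of this lesson under a single W&B project.
os.environ["WANDB_PROJECT"] = "gpt2-scratch-openwebtext"
With this in place, setting report_to="wandb" in the training arguments later in the lesson is enough for the metrics to be streamed automatically.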
Load the Dataset
During the pre-training process, we utilize Activeloop datasets to stream the samples seamlessly, batch by batch. This approach is beneficial because the entire dataset never needs to be loaded into memory at once, which greatly reduces resource usage. You can quickly load the dataset, and the streaming is handled automatically without requiring any special configuration.
You can load a dataset in just one line of code and visualize its content for analysis. The library integrates seamlessly with PyTorch and TensorFlow, two of the most widely used frameworks for implementing AI applications. You can head over to datasets.activeloop.ai to see the complete list of available datasets. Porting your own datasets to the hub is also achievable with minimal effort.
Let’s start by loading the openwebtext dataset, a collection of web pages extracted from URLs shared in Reddit posts with at least three upvotes. This dataset is well-suited for acquiring the broad knowledge needed to build a foundational model for general purposes.
import deeplake
ds = deeplake.load('hub://activeloop/openwebtext-train')
ds_val = deeplake.load('hub://activeloop/openwebtext-val')
print(ds)
print(ds[0].text.text())
Dataset(path='hub://activeloop/openwebtext-train', read_only=True, tensors=['text', 'tokens'])
"An in-browser module loader configured to get external dependencies directly from CDN. Includes babel/typescript. For quick prototyping, code sharing, teaching/learning - a super simple web dev environment without node/webpack/etc.\n\nAll front-end libraries\n\nAngular, React, Vue, Bootstrap, Handlebars, and jQuery are included. Plus all packages from cdnjs.com and all of NPM (via unpkg.com). Most front-end libraries should work out of the box - just use import / require() . If a popular library does not load, tell us and we’ll try to solve it with some library-specific config.\n\nWrite modern javascript (or typescript)\n\nUse latest language features or JSX and the code will be transpiled in-browser via babel or typescript (if required). To make it fast the transpiler will start in a worker thread and only process the modified code. Unless you change many files at once or open the project for the first time, the transpiling should be barely noticeable as it runs in parallel with loading a..."
The provided code instantiates dataset objects capable of retrieving the data points for both the training and validation sets. Afterward, we can print the variable to examine the dataset's characteristics. It consists of two tensors: text, containing the textual input, and tokens, representing the tokenized version of the content (which we won't be utilizing). We can also index into the dataset, access the text column via .text, and convert a row to a string by calling the .text() method.
The next step involves crafting a PyTorch Dataset class that leverages the loader object and ensures compatibility with the framework. The Dataset class handles both dataset formatting and any desired preprocessing steps to be applied. In this instance, our objective is to tokenize the samples. We will load the GPT-2 tokenizer model from the Transformers library to achieve this.
For this specific model, we need to set a padding token (which may not be required for other models). For this purpose, we use the end-of-sequence token, eos_token, as the tokenizer’s pad_token.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
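As a quick sanity check (not part of the original pipeline), we can run the tokenizer on a short string with the same settings we will use in the transform below and confirm that padding behaves as expected:
# Tokenize a short string with truncation and padding to a fixed length.
sample = tokenizer("Hello world", truncation=True, max_length=512,
                   padding="max_length", return_tensors="pt")
print(sample["input_ids"].shape)  # torch.Size([1, 512])
print(tokenizer.pad_token_id)     # 50256, the eos token reused for padding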
Next, we create dataloaders from the Deep Lake datasets. In doing so, we also specify a transform that tokenizes the texts of the dataset on the fly.
# define transform to tokenize texts
def get_tokens_transform(tokenizer):
    def tokens_transform(sample_in):
        tokenized_text = tokenizer(
            sample_in["text"],
            truncation=True,
            max_length=512,
            padding='max_length',
            return_tensors="pt"
        )
        tokenized_text = tokenized_text["input_ids"][0]
        return {
            "input_ids": tokenized_text,
            "labels": tokenized_text
        }
    return tokens_transform

# create data loaders
ds_train_loader = ds.dataloader()\
    .batch(32)\
    .transform(get_tokens_transform(tokenizer))\
    .pytorch()

ds_eval_train_loader = ds_val.dataloader()\
    .batch(32)\
    .transform(get_tokens_transform(tokenizer))\
    .pytorch()
Please note that we have formatted the dataset so that each sample comprises two components: input_ids and labels. input_ids are the tokens the model will use as inputs, while labels are the tokens the model will try to predict. Currently, both keys contain the same tokenized text; however, the Trainer object from the Transformers library will automatically shift the labels by one token, preparing them for training.
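Before committing to an expensive run, it can be worth pulling a single batch to confirm that the loaders yield what we expect. A small sketch (assuming the default collation stacks the transform's dictionaries into batched tensors):
# Fetch one batch from the training loader and inspect it.
batch = next(iter(ds_train_loader))
print(batch.keys())              # expected: dict_keys(['input_ids', 'labels'])
print(batch["input_ids"].shape)  # expected: torch.Size([32, 512])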
Initialize the Model
As the scope of this course does not include building the architecture from scratch, we won't be implementing it. We have already covered the details of the Transformer architecture in a previous lesson and provided additional resources for those who are interested in a more in-depth implementation.
To accelerate the process, we will leverage an existing publicly available implementation of the transformer architecture. This approach allows us to scale the model quickly using available hyperparameters, including the number of layers, embedding dimension, and attention heads. Additionally, we will capitalize on the success of established architectures while maintaining the flexibility to modify the model size to accommodate our available resources.
We opted to use the GPT-2 architecture. Nonetheless, you can pick any other available model from the Hugging Face Hub; the approach presented here can be easily adapted to work with various architectures.
Initially, we examine the default hyperparameters by loading the configuration file and reviewing the choices made in the architecture design.
from transformers import AutoConfig
config = AutoConfig.from_pretrained("gpt2")
print(config)
GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.30.2",
  "use_cache": true,
  "vocab_size": 50257
}
It is apparent that we can control almost every aspect of the network by manipulating the configuration settings. Specifically, we focus on the following parameters: n_layer, the number of stacked decoder blocks; n_embd, the hidden dimension of the embeddings; n_positions and n_ctx, the maximum number of input tokens; and n_head, the number of attention heads in each attention component. You can read the documentation to gain a more comprehensive understanding of the remaining parameters.
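These hyperparameters are exposed as plain attributes on the configuration object, so we can inspect them (and later override them) directly. For example:
# Read the default values before changing anything.
print(config.n_layer, config.n_embd, config.n_head, config.n_positions)
# 12 768 12 1024
print(config.n_embd // config.n_head)  # per-head dimension: 64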
We can start by initializing the model with the default configuration and counting the number of parameters it contains, which will serve as a baseline. To achieve this, we use the GPT2LMHeadModel class, which takes the config variable as input, and then loop through the parameters, summing them up.
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT-2 size: {model_size/1e6:.1f}M parameters")
GPT-2 size: 124.4M parameters
As shown, the GPT-2 model is relatively small (124M parameters) compared to current state-of-the-art large language models. We’re going to pre-train a 124-million-parameter model, which we refer to as GPT2-scratch-openwebtext. We chose this size so that part of its training can be easily replicated by any reader at a reasonable price (~$100).
If you wanted to train a larger model, you could scale the architecture up. Using the parameters described above, we can create a network with 32 layers and an embedding size of 1600. It is worth noting that, if not specified, the hidden dimensionality of the inner feed-forward layers defaults to 4 × n_embd.
config.n_layer = 32
config.n_embd = 1600
config.n_positions = 512
config.n_ctx = 512
config.n_head = 32
Now, we proceed to load the model with the updated hyperparameters.
model_1b = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model_1b.parameters())
print(f"GPT2-1B size: {model_size/1e6:.1f}M parameters")
GPT2-1B size: 1065.8M parameters
The modifications led to a model with 1 billion parameters. It is possible to scale the network further to be more in line with the newest state-of-the-art models, which often have more than 80 layers.
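If you want to explore other sizes before committing, a small helper (written for this lesson, not part of any library) can build a configuration and report its parameter count without launching training:
from transformers import AutoConfig, GPT2LMHeadModel

def count_params_millions(n_layer, n_embd, n_head, n_positions=512):
    # Instantiate a GPT-2 style model on CPU and return its size in millions.
    cfg = AutoConfig.from_pretrained("gpt2")
    cfg.n_layer = n_layer
    cfg.n_embd = n_embd          # must be divisible by n_head
    cfg.n_head = n_head
    cfg.n_positions = n_positions
    cfg.n_ctx = n_positions
    model = GPT2LMHeadModel(cfg)
    return sum(p.numel() for p in model.parameters()) / 1e6

for layers, dim, heads in [(12, 768, 12), (24, 1024, 16), (32, 1600, 32)]:
    print(f"{layers} layers, d={dim}: {count_params_millions(layers, dim, heads):.0f}M parameters")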
In any case, let’s continue with the 124M-parameter model for this lesson.
Training Loop
The final step involves setting up the training loop. We utilize the Transformers library's Trainer class, which takes the necessary parameters for training the model. Before proceeding, however, we need to create a TrainingArguments object that defines all the essential arguments.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="GPT2-scratch-openwebtext",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=2,
    logging_steps=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    weight_decay=0.1,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    bf16=True,
    ddp_find_unused_parameters=False,
    run_name="GPT2-scratch-openwebtext",
    report_to="wandb"
)
Note that we set the per_device_train_batch_size and per_device_eval_batch_size variables to 1 because the batch size is already specified by the dataloaders we created earlier. There are over 90 parameters available for adjustment; you can find a comprehensive list with explanations in the documentation. If you run into an "out of memory" error while training, you can reduce the batch size in the dataloaders. Additionally, the bf16 flag, which trains the model using lower-precision floating-point numbers, is only available on high-end GPU devices; if it is unavailable, it can be substituted with the argument fp16=True.
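A minimal sketch of how you might pick between the two precision flags at runtime (this check is an addition for convenience, not part of the original lesson code):
import torch

# bf16 requires an Ampere-or-newer GPU such as the A100; fall back to fp16 otherwise.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = TrainingArguments(
    output_dir="GPT2-scratch-openwebtext",
    bf16=use_bf16,
    fp16=not use_bf16,
    # ... same remaining arguments as above ...
)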
Notice also that we set the report_to parameter to wandb; that is, we send the training metrics to Weights and Biases so that we can see a real-time report of how the training is going.
Next, we define the TrainerWithDataLoaders class, a subclass of Trainer in which we override the get_train_dataloader and get_eval_dataloader methods to return our previously defined data loaders.
from transformers import Trainer

class TrainerWithDataLoaders(Trainer):
    def __init__(self, *args, train_dataloader=None, eval_dataloader=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.train_dataloader = train_dataloader
        self.eval_dataloader = eval_dataloader

    def get_train_dataloader(self):
        return self.train_dataloader

    def get_eval_dataloader(self, dummy):
        return self.eval_dataloader
Training starts with a call to the .train() method.
trainer = TrainerWithDataLoaders(
    model=model,
    args=args,
    train_dataloader=ds_train_loader,
    eval_dataloader=ds_eval_train_loader,
)
trainer.train()
The Trainer object will evaluate the model during training at the interval specified by the eval_steps argument and save checkpoints at the interval defined by save_steps.
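If a run is interrupted, for example because the cloud instance is shut down, the Trainer can pick up from the most recent checkpoint in the output directory instead of starting over. A brief sketch:
# Resume from the latest checkpoint found in output_dir.
trainer.train(resume_from_checkpoint=True)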
Here’s the final trained model after about 45 hours of training on an 8x NVIDIA A100 instance on Lambda Labs. At the hourly rate of $8.80, the total cost comes to roughly $400 (45 h × $8.80/h ≈ $396). You can stop your pre-training earlier if you want to spend less.
Here’s the training report on Weights and Biases. It shows that the training loss decreased relatively smoothly as iterations passed.
Inference
Once the pre-training process is complete, we proceed to the inference stage to observe our model in action and evaluate its capabilities. As specified, the Trainer stores the intermediate checkpoints in a designated directory called ./GPT2-scratch-openwebtext. The most efficient way to use the model is through the Transformers pipeline functionality, which automatically loads both the model and the tokenizer, making them ready for text generation.
Below is the code snippet that establishes a pipeline object utilizing the pre-trained model alongside the tokenizer we defined in the preceding section. This pipeline enables text generation.
from transformers import pipeline

pipe = pipeline("text-generation",
                model="./GPT2-scratch-openwebtext",
                tokenizer=tokenizer,
                device="cuda:0")
The pipeline object uses the powerful Transformers .generate() method internally, offering great flexibility in managing the text generation process (see the documentation). We can use arguments like min_length to set a minimum number of generated tokens, max_length to cap the length of the generated sequence, temperature to control the balance between randomness and picking the most likely tokens, and do_sample to switch between greedy decoding, which always selects the most probable token, and sampling-based generation (other decoding strategies, such as beam search, are controlled by separate arguments like num_beams). Here, we only set num_return_sequences to limit the number of generated sequences.
txt = "The house prices dropped down"
completion = pipe(txt, num_return_sequences=1)
print(completion)
[{'generated_text': 'The house prices dropped down to 3.02% last year. While it was still in development, the housing market was still down. The recession hit on 3 years between 1998 and 2011. In fact, it slowed the amount of housing from 2013 to 2013'}]
The code will attempt to generate a completion for the given input sequence, The house prices dropped down, using the knowledge it has acquired from the training dataset, aiming to stay relevant and contextually appropriate. Even with a brief training period, the model exhibits a good grasp of the language, generating grammatically correct and contextually coherent sentences.
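As a quick illustration of the other generation arguments mentioned above (the values here are arbitrary, not tuned), the same pipeline call can forward them directly to .generate():
# Sample a longer, slightly more conservative completion.
completion = pipe(
    txt,
    num_return_sequences=1,
    do_sample=True,        # sample instead of always taking the top token
    temperature=0.8,       # lower values make the output more deterministic
    max_new_tokens=64,     # cap the number of newly generated tokens
)
print(completion[0]["generated_text"])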
Conclusion
Throughout this lesson, we gained an understanding of the fundamental steps required to train your own language model. The steps involve loading the relevant training data, defining the architecture, scaling it up as per your requirements, and, finally, commencing the training process. As previously discussed, there is no need to train a language model from scratch in many cases. In the upcoming module, we will cover the fine-tuning process in greater detail, enabling you to harness the capabilities of existing powerful models for specific use cases.