Introduction to the LangChain Course
Activeloop, Towards AI, and the Intel Disruptor Initiative are excited to collaborate to bring the Gen AI 360: Foundational Model Certification Course to tomorrow’s Generative AI professionals, executives, and enthusiasts.
Hello and welcome to our “LangChain & Vector Databases In Production” course, an integral part of a three-course series aimed at introducing Large Language Models, Deep Lake, and LangChain. This specific course focuses on LangChain - a popular framework for quickly and easily building applications powered by large language models like GPT-3.5 Turbo, GPT-4, and GPT4All.
Why This Course?
This LangChain course will equip you with the knowledge and practical skills to build products and apps using Large Language Models (LLMs). We place heavy emphasis on hands-on application, guiding you through a deep, practical introduction to leveraging the power of LLMs through LangChain.
One of the tools we cover extensively in this course is Activeloop's Deep Lake. It combines the best features of data lakes and vector databases, enabling companies to create their own data flywheels for refining their Large Language Models. Combined with LangChain, Deep Lake can seamlessly connect datasets to foundational models for various applications, from understanding GitHub repositories to analyzing financial statements.
Who Should Take This Course?
Whether you're a machine learning enthusiast or transitioning into the AI field from another domain in software, this course is for you. There are no prerequisites other than familiarity with Python and coding (to complete the projects). We aim to provide you with the tools to apply Large Language Models across a wide range of industries, making AI more accessible and practical.
How Long Will This Course Take You?
On average, students complete this course in around 40 hours (the equivalent of 5 full days of learning) if they follow along with all code examples and read through the material. Our quickest course takers have completed it in as little as 2.5 days! The course is designed as a self-paced journey - so don't feel any rush to complete it. However, people who complete it within the first two weeks of signing up will get free access to the Deep Lake Growth Plan for a month!
What Will You Learn?
By taking this comprehensive course, students will gain a deep understanding of Large Language Models and how to use them effectively with LangChain. They will be introduced to various concepts, including prompting, managing outputs, and giving memory to LLMs. They will explore the integration of LLMs with external tools and even how to use them as reasoning engines with agents.
Students will learn through a hands-on approach, engaging in multiple real-world projects such as building a news articles summarizer and creating a customer support question-answering chatbot. This course ensures students understand the theory and the practical application of LLMs.
A critical aspect of this course centers on understanding the current limitations of LLMs, specifically hallucinations and limited memory. However, solutions to these limitations exist, and one of the most potent is the use of Vector Stores. Throughout this course, we will delve into the usage of Activeloop’s Deep Lake vector store as an effective remedy.
Is the Course Free?
Yes, the course is entirely free for everybody.
Certification
By participating in this course and completing the quizzes at the end of each chapter, you will have the opportunity to earn a certification in using Deep Lake - a valuable addition to your professional credentials. This certification program, offered at no cost, forms part of the Deep Lake Foundational Model Certification program in collaboration with Intel Disruptor Initiative and Towards AI.
You can skip the quizzes as you read through the lessons and chapters, but please remember to complete them at the end to receive your certificate!
AI Tutor
The AI Tutor Bot is designed to be a real-time, precise query-answering companion. The AI Tutor is the official course chatbot companion, so naturally, it knows everything from the course. Furthermore, it has access to thousands of technical articles, Wikipedia, and technical documentation from Activeloop, Hugging Face, LangChain, and OpenAI. This ensures each response is precise and up-to-date with the latest AI and coding insights. As you work through the course, you can use the chatbot to assist your learning with relevant resource recommendations and sourced responses.
Course Impact
"Reaching over 385,000 AI developers monthly, we're passionate about educating and upskilling engineers in this rapidly growing field. That is why we designed a practical course engineers can take to implement AI into their company processes or use LLMs to build entirely new products," said Louie Peters, CEO of Towards AI.
Adding to this, Davit Buniatyan, CEO of Activeloop, emphasized, "Every company will be adding foundational models and vector databases to their day-to-day operations and the products they build very soon. Upon course completion, Deep Lake Certified developers will be able to harness the full potential of Foundational Models and advanced technologies like Deep Lake and LangChain."
This course serves as a pathway to stay ahead in this rapidly advancing field, arming you with the skills necessary to use these frameworks in your toolset, thereby providing a competitive advantage.
We're looking forward to having you on this journey. Join us, and let's build the future of AI together!
Modules Covered
This course has been structured into several modules, each examining, in increasing order of complexity, various facets of Large Language Models (LLMs), LangChain, and Deep Lake. Here is an overview of the modules you'll engage with.
1. From Zero to Hero
This introductory module serves as a quick guide, swiftly bringing you up to speed with all the fundamental concepts. It includes hands-on code snippets covering library installation, OpenAI credentials, deriving predictions from LLMs, and more. You'll also take a peek at Deep Lake and its applications.
2. Large Language Models and LangChain
This module provides a comprehensive overview of Large Language Models, including their capabilities, limitations, and use cases. You'll dive deep into LLMs like ChatGPT and GPT-4, explore these models' emergent abilities and scaling laws, and gain insights into phenomena like hallucinations and bias. This module also introduces LangChain and its role in integrating LLMs with other data sources and tools. You will also undertake a project to build a News Articles Summarizer.
3. Learning How to Prompt
Learning how to craft effective prompts is a key skill in working with LLMs. This module delves into the nuances of prompt engineering and teaches you to develop prompts that are easy to maintain. You'll learn techniques such as role prompting, few-shot prompting, and chain of thought. Towards the end of this module, you'll take your learning further by enhancing the News Articles Summarizer built in the previous module and undertaking a project to extract a knowledge graph from news articles.
4. Keeping Knowledge Organized with Indexes
This module focuses on how to effectively leverage documents as a base for LLMs using LangChain's indexes and retrievers. You'll learn about data ingestion through various loaders, the importance of text splitters, and delve into the concept of embeddings and vector stores. The module ends with a project where you'll build a Customer Support Question Answering Chatbot using ChatGPT, Deep Lake, and LangChain.
5. Combining Components Together with Chains
In this module, you will get a handle on LangChain's chains - a concept that enables the creation of a single, coherent application. You will understand why chains are used and have the opportunity to work on multiple projects. These include creating a YouTube Video Summarizer, building a Jarvis for your Knowledge Base, and exploring code understanding with GPT-4 and LangChain. You will also learn about the Self-Critique Chain and how to guard against undesirable outputs.
6. Giving Memory to LLMs
This module emphasizes the importance of memory in maintaining context over a conversation. You will master the different types of memory in LangChain, including ConversationBufferMemory, ConversationBufferWindowMemory, ConversationSummaryMemory, and ConversationChain. Various exciting projects await you, such as creating a chatbot that interacts with a GitHub repo, building a question-answering chatbot, and working with financial data.
7. Making LLMs Interact with the World Using Tools
In this module, you'll explore LangChain's tools and their diverse applications, including Google Search, requests, Python REPL, Wikipedia, and Wolfram-Alpha. Projects in this module revolve around enhancing blog posts with LangChain and Google Search, recreating the Bing chatbot, and leveraging multiple tools simultaneously. You'll also learn how to define custom tools for your specific needs.
8. Using Language Models as Reasoning Engines with Agents
The final module introduces you to the concept of agents in LangChain, with a particular emphasis on using a language model as a reasoning engine. You'll explore autonomous agents, their projects, and the application of AutoGPT with LangChain. The module culminates with a project on building autonomous agents to create comprehensive analysis reports.
Each module has been thoughtfully designed to provide you with a solid understanding of LLMs, LangChain, and Deep Lake. By the end of this course, you'll have a firm grasp of these advanced tools and frameworks, ready to harness their potential to solve real-world problems.
Course Logistics
Here's everything you need to know about how the course will work.
Course Hosting and Pace
This course is hosted by Activeloop. It is designed as a self-paced learning journey, allowing you to proceed at your own pace. The online format provides flexibility, enabling you to engage with the lessons whenever suits you best. On average, the course takes 40 hours, with some participants completing it in as little as 25 hours if they skip the projects.
At the end of each module, you can test your new knowledge with multiple-choice quizzes, which must be passed to continue the course. Once you have passed all the quizzes and completed the whole course, you'll receive your course certification.
Community Support
Got questions about this course or specific lessons? Want to exchange ideas with fellow learners? We encourage active interaction in the dedicated forum for this course in Towards AI’s Learn AI Together Discord Community (gen-ai-360 channel), where you can pose questions and share insights. This vibrant community comprises over 50,000 AI enthusiasts.
For queries specifically related to Deep Lake, please join the Deep Lake Slack community, where experts and users alike will be ready to assist.
Required API Tokens
The course involves practical projects and exercises that require various API keys; each lesson guides you through their setup. However, the two main API tokens that you will use throughout the course are:
- The OpenAI API token: This will be used to query LLMs like ChatGPT and GPT-4.
- The Deep Lake API token: Essential for creating Deep Lake datasets as vector stores for the projects we’ll build during the course.
These are the steps you should take to get the OpenAI API token:
1. If you don't have an account yet, create one by going to https://platform.openai.com/. If you already have an account, skip to step 5.
2. Fill out the registration form with your name, email address, and desired password.
3. OpenAI will send you a confirmation email with a link. Click on the link to confirm your account.
4. Please note that you'll need to verify your email address and provide a phone number for verification.
5. Log in to https://platform.openai.com/.
6. Navigate to the API key section at https://platform.openai.com/account/api-keys.
7. Click "Create new secret key" and give the key a recognizable name or ID.
You should take these steps to get the Deep Lake API token:
1. Sign up for an account on Activeloop's platform. You can sign up at Activeloop's website. After specifying your username, click on the “Sign up” button. You should then see your homepage.
2. You should now see a “Create API token” button at the top of your homepage. Click on it, and you’ll be redirected to the “API tokens” page. This is where you can generate, manage, and revoke your API keys for accessing Deep Lake.
3. Click on the “Create API token” button. You should see a popup asking for a token name and an expiration date. By default, the token expires one day after its creation, but you can set the expiration date further in the future if you want to keep using the same token for the whole duration of the course. Once you’ve set the token name and its expiration date, click the “Create API token” button.
4. You should now see a green banner saying that the token has been successfully generated, along with your new API token, on the “API tokens” page. To copy the token to your clipboard, click the square icon on its right.
Environment variables play an important role in storing sensitive information, such as API keys. Be careful not to share your API tokens with anyone!
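As a sketch of the pattern used throughout the lessons, you can read the tokens from environment variables instead of hardcoding them. The helper below is hypothetical (not part of the course code) and assumes the token names OPENAI_API_KEY and ACTIVELOOP_TOKEN:

```python
import os

def check_api_tokens():
    """Return the names of required API tokens that are not yet set.

    Assumes the OPENAI_API_KEY and ACTIVELOOP_TOKEN environment
    variables used throughout the course.
    """
    required = ["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"]
    return [name for name in required if not os.environ.get(name)]
```

Running a check like this before a lesson's code quickly tells you which tokens still need to be exported.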
Expected Cost of OpenAI Usage
By running the code samples from this course, you'll make requests to the OpenAI API, incurring costs. We expect the total cost of running all the lessons in this course, along with some experimentation, to be under $3.
If you’re eager to explore and experiment without worrying about costs - don’t worry! Go to the lesson called “Using the Open-Source GPT4All Model Locally” in the “Large Language Models and LangChain” module. This lesson teaches you how to use the open-source LLM GPT4All on your own computer, so you can enjoy the benefits LLMs provide without having to pay for the OpenAI models’ API. With GPT4All, you can replace the OpenAI models in every lesson and continue your journey without needing to pay. Happy experimenting!
Coding Environment and Packages
Before embarking on this course, ensure that you have the appropriate coding environment ready. Please make sure to use a Python version equal to or later than 3.8.1, the minimum requirement for the LangChain library. You can set up your environment by choosing one of the following options:
- Having a code editor installed on your computer. A popular coding environment is Visual Studio Code.
- Using Python virtual environments to manage Python libraries.
- Alternatively, you could use Google Colab notebooks.
You will need the following packages to successfully execute the sample code provided in each lesson. They can be installed using the pip package manager.
langchain==0.0.208
deeplake==3.6.5
openai==0.27.8
tiktoken==0.4.0
selenium==4.15.2
While we generally recommend installing the latest versions of packages, please note that the code has been tested with the versions pinned above. As the langchain library is still evolving rapidly, we suggest installing the pinned version of it while installing the latest versions of the other libraries. You can do that with the following command: pip install langchain==0.0.208 deeplake openai==0.27.8 tiktoken. Moreover, specific lessons may require the installation of additional packages, which will be explicitly mentioned. The following code demonstrates how to install a package using pip.
pip install deeplake
# Or: (to install a specific version)
# pip install deeplake==3.6.5
Google Colab
Google Colaboratory, popularly known as Google Colab, is a free cloud-based Jupyter notebook environment. Data scientists and engineers widely use it to train machine learning and deep learning models using CPUs, GPUs, and TPUs. Google Colab comes with an array of features such as:
- Free access to GPUs and TPUs for accelerated model training.
- A web-based interface for a service running on a virtual machine, eliminating the need for local software installation.
- Seamless integration with Google Drive and GitHub.
To use Google Colab, all you need is a Google account. You can run terminal commands directly in notebook cells by appending an exclamation mark (!) before the command. Every notebook created in Google Colab gets stored in your Google Drive for easy access.
A convenient way of using API keys in Colab involves:
- Saving them in a file named .env on your Google Drive. Here’s how the file should be formatted for saving the Activeloop token and the OpenAI API key:
ACTIVELOOP_TOKEN=your_activeloop_token
OPENAI_API_KEY=your_openai_key
- Mounting your Google Drive on your Colab instance.
- Loading them as environment variables using the dotenv library, like in the following code:
from dotenv import load_dotenv
load_dotenv('/content/drive/MyDrive/path/to/.env')
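If you'd rather avoid an extra dependency, a minimal stdlib-only loader for this file format might look like the following sketch (an assumption on our part, handling only simple KEY=value lines like the example above):

```python
import os

def load_env_file(path):
    """Parse simple KEY=value lines from a .env-style file and
    export them as environment variables (a minimal dotenv stand-in).
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()
```

This is purely illustrative; the python-dotenv library used in the lessons handles quoting and other edge cases for you.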
Creating Python Virtual Environments
Python virtual environments offer an excellent solution for managing Python libraries and avoiding package conflicts. They create isolated environments for installing packages, ensuring that your packages and their dependencies are contained within that environment. This setup provides clean and isolated environments for your Python projects.
Begin by executing the python command in your terminal to confirm that the Python version is equal to or greater than 3.8.1. Then follow these steps to create a virtual environment:
- Create a virtual environment using the command python -m venv my_venv_name.
- Activate the virtual environment by executing source my_venv_name/bin/activate.
- Install the required libraries and run the code snippets from the lessons within the virtual environment.
- To deactivate the virtual environment, simply run deactivate.
The LLM and vector store toolkit
Building applications around Large Language Models like ChatGPT, GPT-4, or PaLM 2 presents unique challenges. Understanding these challenges, and how Deep Lake overcomes them, is essential to developing advanced AI applications.
The Power and Limitations of Large Language Models
LLMs are trained on huge amounts of text with the aim of learning the conditional distribution of words in a language. Doing so allows them to generalize and generate meaningful text without directly memorizing the training data. This means they can accurately recall widely disseminated information, such as historical events or popular cultural facts.
However, the LLM's knowledge is restricted to its training set. So, suppose the model was trained on data up to 2021 and is asked about a company founded in 2023. In that case, it may generate a plausible but entirely fabricated description - a phenomenon known as "hallucination.” Managing hallucinations is tricky, especially in applications where accuracy and reliability are paramount, such as customer-service chatbots, knowledge-base assistants, or AI tutors.
One promising strategy to mitigate hallucination is the use of retrievers in tandem with LLMs. A retriever fetches relevant information from a trusted knowledge base (like a search engine), and the LLM is then specifically prompted to rearrange the information without inventing additional details.
LLMs' large context window sizes facilitate the inclusion of multiple documents in a single prompt. Models like GPT-4 and Claude can handle context windows of up to 32k and 100k tokens, respectively, equating to approximately 20k words or 40 pages of text. However, the cost of execution rises with the number of tokens used, hence the need for an efficient retriever to find the most relevant documents.
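The retriever-plus-LLM strategy described above can be sketched as simple prompt assembly. The helper below is hypothetical (the course lessons use LangChain's built-in chains for this):

```python
def build_grounded_prompt(question, documents):
    """Assemble a prompt that asks the LLM to answer only from the
    retrieved documents, a common way to reduce hallucinations."""
    context = "\n\n".join(
        f"Document {i + 1}:\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using ONLY the documents below. "
        "If the answer is not in them, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The LLM is thus constrained to rearrange trusted information rather than invent details, at the cost of spending tokens on the retrieved context.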
Building Efficient Retrievers with Deep Lake
Efficient retrievers are built using embedding models that map texts to vectors. These vectors are then stored in specialized databases called vector stores.
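To make the idea concrete, here is a toy in-memory "vector store" using cosine similarity (purely illustrative; Deep Lake handles storage and similarity search at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    """store is a list of (text, embedding) pairs; return the k texts
    whose embeddings are closest to the query embedding."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

In practice, the embeddings come from an embedding model rather than being hand-written, and the store holds millions of vectors.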
This is where Deep Lake comes in. As a data lake that doubles as a vector store for multiple data types, Deep Lake provides several advantages:
- Multimodal: Deep Lake can store items of diverse modalities - text, images, audio, and video - along with their vector representations.
- Serverless: The serverless nature of Deep Lake allows for the creation and management of cloud datasets without the need for a dedicated database instance. This streamlines the setup process and accelerates project development.
- Data Loader: Deep Lake makes creating a streaming data loader from the loaded dataset easy, which is particularly useful for fine-tuning machine learning models using frameworks like PyTorch and TensorFlow.
- Querying and Visualization: Data can be queried and visualized easily from the web.
In the context of LLM applications, Deep Lake provides a seamless way to store embeddings and their corresponding metadata. It enables hybrid searches on these embeddings and their attributes for efficient data retrieval. Moreover, its LangChain integration facilitates the development and deployment of LLM-based applications.
As a result, Deep Lake serves as a convenient serverless memory solution for LLM chains and agents, whether for storing relevant documents for question-answering tasks or storing images for guided image-generation tasks.
In summary, Deep Lake equips developers with a powerful tool to tackle the challenges of creating LLM-based applications and enhance the capabilities of these transformative models.
Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.