Introduction

Welcome to the last module and congrats on making it this far!  🎉

You have now seen a lot of great new RAG techniques, so the natural next step is to try them out in a real-world project or a fun experiment. Some of you may already be doing that. You have launched your IDE of choice, started copying some code from llamaindex or langchain, and now you are wondering which technique to start with. You know, which one gives the most bang for your buck and which one is just icing on the cake? Should you start with query expansion, graph RAG or RAFT? And if you pick, say, multi-query, how many individual sub-queries should you generate? 🤔

Well, you are in luck! This is what is known as hyperparameter optimization in “traditional” ML (not statistics, I am not THAT old). You can think of individual RAG techniques as parameters. Will you pick multi-query, HyDE or step-back prompting? Will you create chunks of length 200, 300 or 1000? Or will you go a different way entirely and split by sentences? There are so many decisions to make, and if you just make them intuitively based on the latest hot LinkedIn article, I can guarantee your system will be suboptimal.
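To make this a bit more concrete, here is a minimal sketch of what such a “search space” could look like in Python. The parameter names and values are purely illustrative assumptions on my part, not a recommended setup:

```python
from itertools import product

# Hypothetical RAG "hyperparameter" space -- names and values are
# illustrative examples only, not a recommendation.
search_space = {
    "query_transform": ["none", "multi_query", "hyde", "step_back"],
    "num_sub_queries": [2, 3, 5],  # only matters when multi_query is used
    "chunking": ["fixed_200", "fixed_300", "fixed_1000", "by_sentence"],
    "top_k": [3, 5, 10],
}

# Full grid: every combination of the parameter values above.
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
print(f"{len(configs)} configurations to evaluate")
```

Even this tiny toy space already contains 144 combinations, which is exactly why you want an automated way of scoring each one instead of eyeballing it.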

For this to work, we need something to optimize. In “traditional” ML this would be a validation dataset with data points and corresponding ground-truth labels. It turns out we can do pretty much the same thing with text! We can create a dataset of Q&A pairs (ideally 100+) with the correct answers we want the system to produce. Then we let the system answer those questions under different parameters (different chunking, multi-query, HyDE, graph RAG, etc.) and evaluate how close its answers come to the correct ones. There is some magic involved here: you can actually generate the eval dataset with an LLM and then use “another” LLM to do the evaluation. This makes the whole thing 100% automatic and scalable. You can even include these evals in your CI/CD pipeline!
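Here is an equally minimal sketch of that evaluation loop. Assume `eval_set` is your list of Q&A pairs, `answer_question(question, config)` is your RAG pipeline run under a given configuration, and `llm_judge(question, reference, candidate)` is an LLM call that returns a score between 0 and 1; all three are hypothetical placeholders, not functions from any particular library (the `configs` list comes from the previous sketch):

```python
def evaluate_config(config, eval_set):
    """Average judge score of one RAG configuration over the eval dataset."""
    scores = []
    for item in eval_set:  # item = {"question": ..., "answer": ...}
        candidate = answer_question(item["question"], config)  # placeholder: your RAG pipeline
        scores.append(llm_judge(item["question"], item["answer"], candidate))  # placeholder: LLM-as-judge
    return sum(scores) / len(scores)

# Score every candidate configuration and keep the best-performing one.
results = {str(config): evaluate_config(config, eval_set) for config in configs}
best_config = max(results, key=results.get)
```

The same function can run in a CI/CD job, so a regression in answer quality shows up just like a failing unit test.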

If this is too much to take in, do not worry! In this module, we will go step by step through this process and describe in detail how to approach different problems. I am absolutely sure that after completing this module you will think about optimizing LLM/RAG systems differently.

Let me conclude this chapter with a small rant. Coming from “traditional” ML, I am still baffled by how poorly these evaluations are done. A lot of people keep 10 Q&A pairs in a spreadsheet, and whenever they change the system, they just copy-paste the new answers and manually judge whether the system got better (trust me, I was one of those people 😬). This is not scalable, and it is a crime against statistics. Worse, people use this same eval process to decide which cool new techniques to adopt. Imagine you were training an xgboost model and set it up like this: “Oh, I set the number of trees to 47 because I saw an article where they did that.” Then you look at 4 data points and declare that it works fine. You would (hopefully) be fired immediately. This is not how machine learning is done. It honestly feels like we forgot how to do ML/SW engineering when we started designing LLM/RAG systems. We can do so much better, and in this chapter I will show you how!