Hopefully, your mind is blown right now and you are seeing the Matrix. 🤯 If not, let me help you imagine the possibilities. We have learned that if you have a good target (evaluation dataset), you can choose the best possible setup of your RAG system. This is already so much better than your average “10 QA pairs in Excel” type of eval. But why stop there? As Douwe Kiela asked in his amazing lecture: why not backprop all the way? When you have a good target, you can dynamically optimize different parts of your RAG setup (and you can think big here - hundreds of experiments are absolutely doable). But we can go even further. You can fine-tune your embedding model for your specific RAG use-case, or even fine-tune your LLM. And this whole thing can be mostly automatic. Are you seeing the Matrix now?
Jokes aside, I really believe this is the way to go. Every week there is a LinkedIn article or a blog post about a new cool RAG technique. But without a robust, scalable evaluation pipeline, you have no way of deciding whether that particular technique is worth your time. However, once you have this pipeline in place, deciding on new techniques becomes much simpler - you add a new experiment, run it on a use-case specific benchmark, and BOOM, you know what to do (the sketch below shows the idea). Rely on hard data when making decisions, not on your intuition or expert advice. Every use-case is different.
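To make that concrete, here is a minimal sketch of benchmark-driven tuning, assuming you already have an evaluation dataset of question/reference-answer pairs: loop over candidate configurations, score each one on the same benchmark, and keep the winner. The helpers `build_rag_pipeline()` and `judge_correctness()` are hypothetical placeholders for your own retrieval setup and scoring logic, and the parameter grid is purely illustrative.

```python
# A minimal sketch of benchmark-driven RAG tuning. build_rag_pipeline() and
# judge_correctness() are hypothetical placeholders for your own stack.
from itertools import product

eval_dataset = [
    {"question": "What is ARAGOG?", "reference": "A RAG evaluation study."},
    # ... the rest of your use-case specific benchmark
]

# The "experiments": every combination of parameters you want to compare.
param_grid = {
    "chunk_size": [256, 512, 1024],
    "top_k": [3, 5, 10],
    "embedding_model": ["small-embedder", "large-embedder"],
}

def run_experiment(config: dict) -> float:
    """Build a pipeline with the given config and score it on the benchmark."""
    pipeline = build_rag_pipeline(**config)  # hypothetical: your RAG setup
    scores = []
    for row in eval_dataset:
        answer = pipeline.query(row["question"])
        # hypothetical: e.g. an LLM-based judge returning a score in [0, 1]
        scores.append(judge_correctness(answer, row["reference"]))
    return sum(scores) / len(scores)

results = {}
for values in product(*param_grid.values()):
    config = dict(zip(param_grid.keys(), values))
    results[tuple(config.items())] = run_experiment(config)

# Pick the winning configuration based on hard data, not intuition.
best_config, best_score = max(results.items(), key=lambda kv: kv[1])
print(best_config, best_score)
```

The same loop scales from a handful of configurations to hundreds of experiments; the only thing that stays fixed is the evaluation dataset you are optimizing against.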
In this module, we have covered:
- Creating an evaluation dataset - This is easily the most crucial part, since even good optimization aimed at the wrong target leads to bad results.
- Different evaluation metrics - TL;DR: LLM-based evals with a human in the loop are most often the way to go (a minimal judge sketch follows this list).
- RAG evaluation tools - If you do not want to reinvent the wheel, there are plenty of solutions that can help with different parts of the evaluation pipeline.
- End-to-end example - We have created a working evaluation pipeline, completely from scratch (well, with a little help from llamaindex). Feel free to reuse the code or find a more complex pipeline in the ARAGOG repo.
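As promised above, here is a minimal sketch of the LLM-as-judge idea, assuming an OpenAI-compatible client. The prompt wording, the 1-5 scale, and the model name are only illustrative, and in practice you would keep a human in the loop to spot-check the judge's scores.

```python
# A minimal LLM-as-judge sketch, assuming an OpenAI-compatible client.
# Prompt, scale, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Reference answer: {reference}
Generated answer: {answer}

On a scale of 1 to 5, how factually consistent is the generated answer
with the reference answer? Reply with a single number."""

def judge_answer(question: str, reference: str, answer: str) -> int:
    """Ask an LLM to grade one generated answer against the reference."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, answer=answer
            ),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```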
Hopefully, the module made sense to you. Do not worry if you did not get a part of the code or the theory behind something. The main takeaway here should be a new way of thinking about evaluating LLM/RAG systems. Because we can do so much better. Cheers!