The role of benchmarks
From the course: Complete Guide to Evaluating Large Language Models (LLMs)
- We have spent the last several sessions talking about all kinds of LLM tasks and the metrics that generally go along with evaluating them: multiple choice, free text response, embeddings, and classification. We've seen code to do classification. We've looked at accuracy measurements. We've talked about calibration, and we got a little bit more into that as well. We talked about cosine similarities of embeddings and rubrics and all this stuff. But when we talk about our own specific tasks, we're generally talking about our own specific data, right? When we did our case study in the last lesson, we used the app reviews dataset, which, yes, was a public dataset, but I was using it as a hypothetical, as if I were building such a model for my own use case, whether it be professional or personal. In this lesson, we want to talk more about benchmarks, which are standardized testing sets. So not necessarily ones that matter to me, but ones that, in theory, matter to everyone…
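To make the idea concrete, here is a minimal sketch of what "scoring a model on a standardized benchmark" looks like in code. It assumes a hypothetical `ask_model` function standing in for whatever LLM call you use, and three made-up multiple-choice items rather than a real benchmark; the point is only the loop-and-accuracy pattern that benchmark harnesses automate for you.

```python
# Minimal sketch: score a model on a standardized multiple-choice benchmark.
# `ask_model` is a hypothetical placeholder for a real LLM call, and the
# sample items below are illustrative, not drawn from an actual benchmark.

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: return the model's chosen option letter (A, B, C, ...)."""
    return "A"  # replace with a real LLM call

benchmark = [
    {"question": "2 + 2 = ?", "choices": ["4", "5", "3", "22"], "answer": "A"},
    {"question": "Capital of France?", "choices": ["Berlin", "Paris", "Rome", "Madrid"], "answer": "B"},
    {"question": "H2O is commonly called?", "choices": ["Salt", "Sugar", "Water", "Air"], "answer": "C"},
]

correct = 0
for item in benchmark:
    prediction = ask_model(item["question"], item["choices"])
    correct += prediction == item["answer"]

accuracy = correct / len(benchmark)
print(f"Benchmark accuracy: {accuracy:.2%}")
```

Because the items and scoring rule are fixed for everyone, any model run through the same loop produces a directly comparable number, which is exactly what makes benchmarks useful beyond your own data.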