GitHub - huggingface/evaluation-guidebook: Sharing Both Practical Insights and Theoretical Knowledge
If you've ever wondered how to make sure an LLM performs well on your specific task, this guide is for you! It covers the different ways you can evaluate a model, guides on designing your own evaluations, and tips and tricks from practical experience. Whether you're working with production models, doing research, or experimenting as a hobbyist, I hope you'll find what you need; and if not, open an issue (to suggest improvements or missing resources) and I'll complete the guide! In the text, links prefixed by ⭐ are links I really enjoyed and recommend reading. If you want an intro on the topic, you can read this blog on how and why we do evaluation!
New: Hugging Face LLM evaluation guidebook! This guide was created to share both practical insights and theoretical knowledge that the 🤗 evaluation team gathered while managing the Open LLM Leaderboard and designing lighteval! ➡️ Whether you're a beginner in LLMs, or an advanced user working on production-side models, you should find something to help you! https://lnkd.in/eammwcz3 Some contents:
- how to create your own evaluation for your specific use case
- insights on current methods' pros and cons
- troubleshooting advice
- lots of tips and...
With Nathan HABIB, we'll also add applied notebooks to show you how to do evaluation experiments fast and follow good practices! If you want more knowledge or see a reference missing, feel free to open an issue!
The creation of this guide was inspired by Stas Bekman's great ML engineering book, and will similarly be updated regularly :) Thanks to all who influenced this guide through discussions, among which Kyle Lo,... One reader noted that the diagram in the "Tips and Tricks" section will be a lifesaver: some programmers complain about the use of whitespace in Python (tabs and spaces are part of the code), but in prompting it's much harder to debug, and a space in the wrong position can lower your model's IQ!
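That whitespace point is worth taking literally: a stray space changes how the expected continuation is tokenized, which can shift the log-likelihoods an evaluation compares. Here is a minimal, hypothetical sketch; the GPT-2 tokenizer is used purely for illustration and is not prescribed by the guidebook.

```python
# Hypothetical illustration: a trailing space in the prompt changes how the
# expected answer is tokenized, which changes the scores a harness compares.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer shows the effect

# Formulation 1: the prompt ends without a space, so the answer carries a leading space.
print(tokenizer.tokenize(" Paris"))   # token(s) with the space baked in

# Formulation 2: the prompt already ends with a space, so the answer has no leading space.
print(tokenizer.tokenize("Paris"))    # different token(s), hence different log-probabilities
```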
This page provides concrete, hands-on examples demonstrating how to implement the LLM evaluation concepts discussed throughout the guidebook. By working through these examples, you'll gain practical experience with different evaluation approaches, understand how to structure evaluations, and learn to analyze results effectively. For information about specific evaluation approaches and their theoretical foundations, see the corresponding pages on Automated Benchmarks, Human Evaluation, and LLM-as-a-Judge. The practical examples in this section demonstrate real-world applications of evaluation techniques, complete with code implementations, results analysis, and key takeaways. Each example is designed to illustrate specific aspects of LLM evaluation while providing reusable patterns for your own evaluation needs.
Sources: README.md 17-53, contents/examples/comparing_task_formulations.ipynb 38-41. The first practical example demonstrates how different prompt formulations for the same task can significantly impact model performance. This example uses the AI2 ARC Challenge dataset to compare multiple approaches to question answering evaluation. The experiment is conducted on a small language model (SmolLM-1.7B) to clearly illustrate the effects of different prompt designs.
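The notebook itself walks through this with lighteval; as a rough, self-contained approximation of the idea, the sketch below scores ARC-Challenge answers under two prompt formulations using plain transformers log-likelihoods. The prompt templates and the 20-question subset are illustrative choices, not the notebook's exact setup.

```python
# Minimal sketch of "same task, different prompt formulations": score each ARC
# question under two prompts and compare accuracy. Not the notebook's exact code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-1.7B"  # any causal LM works; a small one shows the gap clearly
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities of the continuation tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    return token_lp[-cont_len:].sum().item()

def formulation_a(example):
    """Score each full answer text directly after the question."""
    prompt = f"Question: {example['question']}\nAnswer:"
    return [continuation_logprob(prompt, " " + text) for text in example["choices"]["text"]]

def formulation_b(example):
    """List the options in the prompt and score only the option label."""
    options = "\n".join(
        f"{label}. {text}"
        for label, text in zip(example["choices"]["label"], example["choices"]["text"])
    )
    prompt = f"Question: {example['question']}\n{options}\nAnswer:"
    return [continuation_logprob(prompt, f" {label}") for label in example["choices"]["label"]]

dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="validation").select(range(20))
for name, formulation in [("answer text", formulation_a), ("option letter", formulation_b)]:
    correct = 0
    for ex in dataset:
        scores = formulation(ex)
        pred = ex["choices"]["label"][scores.index(max(scores))]
        correct += int(pred == ex["answerKey"])
    print(f"{name}: {correct / len(dataset):.2%} accuracy on {len(dataset)} questions")
```

Comparing the two accuracies on the same subset is exactly the kind of side-by-side the example is about: the task is unchanged, only the prompt formulation differs.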
Before you can create a new metric, make sure you have all the necessary dependencies installed, and make sure your Hugging Face token is registered so you can connect to the Hugging Face Hub. All evaluation modules, be they metrics, comparisons, or measurements, live on the 🤗 Hub in a Space (see for example Accuracy). In principle, you could set up a new Space and add a new module following the same structure. However, we added a CLI that makes creating a new evaluation module much easier:
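The setup steps above look roughly like this; the extras name and CLI invocation follow the evaluate library's documented workflow as I recall it, and "Awesome Metric" is just a placeholder module name.

```bash
# Install evaluate with the template extras used for module creation
pip install "evaluate[template]"

# Register your Hugging Face token so the CLI can create and push to a Space
huggingface-cli login

# Scaffold a new evaluation module ("Awesome Metric" is a placeholder name)
evaluate-cli create "Awesome Metric" --module_type "metric"
```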
This will create a new Space on the 🤗 Hub, clone it locally, and populate it with a template. Instructions on how to fill the template will be displayed in the terminal, but are also explained here in more detail.
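To give a sense of what the template asks you to fill in, here is a hedged sketch of a minimal metric module, not the generated file verbatim: the class name, description, and metric logic are placeholders.

```python
# Sketch of the two methods the template expects you to complete:
# _info() describes the module, _compute() does the actual scoring.
import datasets
import evaluate


class ExactMatchCount(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description="Toy metric: fraction of predictions equal to their reference.",
            citation="",
            inputs_description="Lists of prediction and reference strings.",
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string"),
                    "references": datasets.Value("string"),
                }
            ),
        )

    def _compute(self, predictions, references):
        matches = sum(p == r for p, r in zip(predictions, references))
        return {"exact_match": matches / len(predictions)}
```

Once the Space is pushed, users would load the module by its Hub id with evaluate.load, just like the built-in Accuracy module mentioned above.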