Topic: "lm-evaluation"
IAAR-Shanghai/xFinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Language: Python - Size: 1.36 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 169 - Forks: 7
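xFinder fine-tunes small extractor models so that the final answer can be pulled reliably from an LLM's free-form response, rather than relying on brittle regex matching. Below is a minimal sketch of that extraction step via transformers; the checkpoint ID and the prompt template are assumptions (the repo defines its own model IDs and input format), so consult its README before use.

```python
# Sketch of xFinder-style key-answer extraction with transformers.
# The checkpoint ID and prompt layout below are assumptions, not the
# repo's confirmed interface; check the README for the real ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "IAAR-Shanghai/xFinder-qwen1505"  # hypothetical ID, verify in the repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "What is 7 * 8? A) 54 B) 56 C) 58 D) 64"
llm_output = "Multiplying 7 by 8 gives 56, so the answer should be B."

# Ask the extractor to isolate the key answer from the free-form response.
prompt = (
    f"Question: {question}\n"
    f"Model response: {llm_output}\n"
    f"Answer range: [A, B, C, D]\n"
    f"Key extracted answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```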

bethgelab/CiteME
CiteME is a benchmark designed to test the ability of language models to find the papers cited in scientific texts.
Language: Python - Size: 283 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 35 - Forks: 3
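Each benchmark item is a text excerpt with its citation masked; the model must identify the cited paper. A minimal loading sketch follows, but both the Hugging Face dataset ID and the field names are assumptions, so check the repo for the actual distribution format.

```python
# Sketch of iterating over CiteME-style examples. The dataset ID and the
# field names are assumptions, not the repo's confirmed schema.
from datasets import load_dataset

dataset = load_dataset("bethgelab/CiteME", split="test")  # hypothetical ID

for example in dataset.select(range(3)):
    # Each example pairs an excerpt (citation masked) with the cited paper.
    print(example["excerpt"])             # assumed field name
    print("->", example["target_paper"])  # assumed field name
```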

hitz-zentroa/latxa
Latxa: An Open Language Model and Evaluation Suite for Basque
Language: Shell - Size: 27.4 MB - Last synced at: 3 days ago - Pushed at: 12 months ago - Stars: 28 - Forks: 0
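The Latxa checkpoints are released on the Hugging Face Hub under the HiTZ organization. A minimal generation sketch, assuming a checkpoint ID like HiTZ/latxa-7b-v1 (verify the exact ID on the Hub):

```python
# Minimal Basque text-generation sketch with a Latxa checkpoint.
# The model ID is an assumption; check the HiTZ organization on the Hub.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HiTZ/latxa-7b-v1",  # hypothetical ID, verify on the Hub
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(generator("Euskal Herria", max_new_tokens=40)[0]["generated_text"])
```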

RulinShao/RAG-evaluation-harnesses
An evaluation suite for Retrieval-Augmented Generation (RAG).
Language: Python - Size: 1.78 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 2
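A RAG harness scores a generator on how well it answers questions given retrieved passages. As a generic illustration only (not this repo's API), the self-contained sketch below shows the evaluation pattern such a suite automates: prepend retrieved context to the question, generate, and score against the reference.

```python
# Generic RAG evaluation loop, shown only to illustrate the pattern this
# kind of harness automates; it does not reproduce this repo's API.
from typing import Callable

def evaluate_rag(
    examples: list[dict],              # each: {"question", "passages", "answer"}
    generate: Callable[[str], str],    # any text-generation callable
) -> float:
    """Return exact-match accuracy of answers grounded in retrieved passages."""
    correct = 0
    for ex in examples:
        context = "\n".join(ex["passages"])
        prompt = f"Context:\n{context}\n\nQuestion: {ex['question']}\nAnswer:"
        prediction = generate(prompt).strip().lower()
        correct += prediction == ex["answer"].strip().lower()
    return correct / len(examples)

# Usage with a stub generator standing in for a real model:
examples = [{
    "question": "Capital of France?",
    "passages": ["Paris is the capital of France."],
    "answer": "paris",
}]
print(evaluate_rag(examples, lambda p: "Paris"))
```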

SprykAI/lm-evaluation-harness Fork of huggingface/lm-evaluation-harness
Fork of lm-evaluation-harness that includes a fix for the MATH benchmark.
Language: Python - Size: 22.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
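Since the fork's stated purpose is a MATH benchmark fix, here is a sketch of scoring a model on a MATH task through the upstream harness's Python entry point. It assumes the v0.4-style simple_evaluate API; the task name "minerva_math" is an assumption and varies across harness versions, so list the tasks available in your installed version first.

```python
# Sketch of running a MATH task via the lm-evaluation-harness Python API.
# Assumes the upstream v0.4-style interface; the task name is an assumption.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # small model for illustration
    tasks=["minerva_math"],  # assumed task name, verify against your version
    limit=10,                # evaluate only a few examples for a quick check
)
print(results["results"])
```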
