An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-as-a-judge

IAAR-Shanghai/xVerify

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Language: Python - Size: 826 KB - Last synced at: about 11 hours ago - Pushed at: 4 days ago - Stars: 71 - Forks: 5

Agenta-AI/agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Language: TypeScript - Size: 163 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,606 - Forks: 305

root-signals/root-signals-mcp

MCP for Root Signals Evaluation Platform

Language: Python - Size: 201 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 0

Non-NeutralZero/LLM-EvalSys

Automated evaluation of LLM-generated responses on AWS

Language: Python - Size: 36.1 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

docling-project/docling-sdg

A set of tools for generating synthetic data from documents (illustrative sketch below)

Language: Python - Size: 3.29 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 3
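
As an illustration of the synthetic-data-generation idea behind docling-sdg (not its actual API; the `openai` client, model id, and prompt are assumptions), generating Q&A pairs from a document chunk might look like:

```python
# Sketch: synthetic Q&A generation from a document chunk.
# Illustrative only; docling-sdg's real interface may differ.
# Assumes OPENAI_API_KEY is set; model id and prompt are assumptions.
import json

from openai import OpenAI

client = OpenAI()

def synth_qa(chunk: str, n: int = 3) -> list[dict]:
    """Ask a model for n question/answer pairs grounded in the chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model id
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} question/answer pairs answerable only from this text. "
                'Return a JSON list: [{"question": "...", "answer": "..."}].\n\n'
                + chunk
            ),
        }],
        temperature=0.7,
    )
    # Parsing assumes the model returns bare JSON, which a sketch can't guarantee.
    return json.loads(resp.choices[0].message.content)
```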

KID-22/LLM-IR-Bias-Fairness-Survey

The repository for the survey of bias and fairness in information retrieval (IR) with LLMs.

Size: 919 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 52 - Forks: 3

prometheus-eval/prometheus-eval

Evaluate your LLM's responses with Prometheus and GPT-4 💯

Language: Python - Size: 15 MB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 898 - Forks: 55

IAAR-Shanghai/xFinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Language: Python - Size: 1.36 MB - Last synced at: 15 days ago - Pushed at: about 2 months ago - Stars: 162 - Forks: 7

root-signals/rs-python-sdk

Root Signals Python SDK

Language: Python - Size: 331 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 11 - Forks: 0

OussamaSghaier/CuREV

Harnessing Large Language Models for Curated Code Reviews

Language: Python - Size: 592 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 12 - Forks: 1

metauto-ai/agent-as-a-judge

🤠 Agent-as-a-Judge and DevAI dataset

Language: Python - Size: 5.49 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 331 - Forks: 44

MJ-Bench/MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

Language: Jupyter Notebook - Size: 218 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 43 - Forks: 5

PKU-ONELab/Themis

The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.

Language: Python - Size: 354 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 18 - Forks: 0

PKU-ONELab/LLM-evaluator-reliability

The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?

Language: Python - Size: 7.95 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 1

djokester/groqeval

Use Groq for evaluations (see the sketch below)

Language: Python - Size: 98.6 KB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 2 - Forks: 0
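
A minimal sketch of the pattern groqeval wraps, grading an answer with the `groq` Python client; the model id and rubric are assumptions, not groqeval's own interface:

```python
# Sketch: LLM-as-a-judge scoring via the Groq chat API.
# This shows the underlying pattern, not groqeval's interface.
# Assumes GROQ_API_KEY is set; model id and rubric are assumptions.
from groq import Groq

client = Groq()

def judge(question: str, answer: str) -> str:
    """Ask a Groq-hosted model to grade an answer 1-5 with a short rationale."""
    prompt = (
        "You are a strict evaluator. Rate the answer to the question on a "
        "1-5 scale for correctness, then justify your score in one sentence.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return resp.choices[0].message.content

print(judge("What is 2 + 2?", "4"))
```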

HillPhelmuth/LlmAsJudgeEvalPlugins

LLM-as-a-judge evals as Semantic Kernel plugins

Language: C# - Size: 890 KB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 6 - Forks: 1

UMass-Meta-LLM-Eval/llm_eval

A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup, revealing new results about its strengths and weaknesses.

Language: Python - Size: 1.88 MB - Last synced at: 6 months ago - Pushed at: 7 months ago - Stars: 8 - Forks: 1

aws-samples/genai-system-evaluation

A set of examples demonstrating how to evaluate generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques (see the sketch below).

Language: Jupyter Notebook - Size: 548 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0
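
A sketch of the hybrid evaluation this entry describes, pairing a traditional IR metric with an LLM-as-a-judge check; `llm_judge` is a hypothetical callable standing in for any chat API:

```python
# Sketch: pair a classic IR metric with an LLM-as-a-judge faithfulness check.
# `llm_judge` is a hypothetical callable (prompt -> text); plug in any chat API.
from typing import Callable, Sequence

def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant doc ids that appear in the top-k retrieved."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

def judged_faithfulness(llm_judge: Callable[[str], str],
                        context: str, answer: str) -> bool:
    """Ask the judge whether the answer is supported by the retrieved context."""
    verdict = llm_judge(
        "Does the answer follow from the context? Reply YES or NO.\n"
        f"Context: {context}\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("YES")

# Usage: score retrieval and generation separately, then report both numbers.
# recall = recall_at_k(retrieved_ids, relevant_ids, k=5)
# faithful = judged_faithfulness(my_chat_client, context_text, answer_text)
```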

martin-wey/CodeUltraFeedback

CodeUltraFeedback: aligning large language models to coding preferences

Language: Python - Size: 12.4 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 54 - Forks: 2

rafaelsandroni/antibodies

Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)

Language: Python - Size: 3.91 KB - Last synced at: 14 days ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

aws-samples/model-as-a-judge-eval

Notebooks for evaluating LLM-based applications using the model-as-a-judge (LLM-as-a-judge) pattern.

Language: Jupyter Notebook - Size: 45.9 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

minnesotanlp/cobbler

Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 1