llm-as-evaluator | Topic | Ecosyste.ms: Repos

Topic: "llm-as-evaluator"

prometheus-eval/prometheus-eval

Evaluate your LLM's response with Prometheus and GPT4 💯

Language: Python - Size: 15.1 MB - Last synced at: 23 days ago - Pushed at: about 2 months ago - Stars: 946 - Forks: 54

JohnSnowLabs/langtest

Deliver safe & effective language models

Language: Python - Size: 157 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 523 - Forks: 46

IAAR-Shanghai/xFinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Language: Python - Size: 1.36 MB - Last synced at: 26 days ago - Pushed at: 4 months ago - Stars: 169 - Forks: 7

KID-22/LLM-IR-Bias-Fairness-Survey

This is the repo for the survey of Bias and Fairness in IR with LLMs.

Size: 919 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 52 - Forks: 3

minnesotanlp/cobbler

Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 1

HillPhelmuth/LlmAsJudgeEvalPlugins

LLM-as-judge evals as Semantic Kernel Plugins

Language: C# - Size: 2.04 MB - Last synced at: about 6 hours ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 1

djokester/groqeval

Use Groq for evaluations

Language: Python - Size: 98.6 KB - Last synced at: 6 days ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

trustyai-explainability/vllm_judge

A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

Language: Python - Size: 765 KB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 1 - Forks: 1

Kakz/prometheus-llm

PrometheusLLM is a unique transformer architecture inspired by dignity and recursion. This project aims to explore new frontiers in AI research and welcomes contributions from the community. 🐙🌟

Language: Python - Size: 257 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

Non-NeutralZero/LLM-EvalSys

Automated evaluation of LLM-generated responses on AWS

Language: Python - Size: 36.1 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

rafaelsandroni/antibodies

Antibodies for LLM hallucinations (grouping LLM-as-a-judge, NLI, reward models)

Language: Python - Size: 3.91 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Related Topics

llm-as-a-judge (9), llm (7), llmops (4), evaluation (4), llm-evaluation (3), large-language-models (3), python (2), litellm (2), gpt4 (2), nlp (2), bias (2), llms (2), responsible-ai (1), model-assessment (1), mlops (1), ml-testing (1), ml-safety (1), llm-testing (1), llm-test (1), llm-evaluation-toolkit (1), bias-detection (1), ai-safety (1), ethics-in-ai (1), benchmarks (1), ai-testing (1), benchmark-framework (1), artificial-intelligence (1), xfinder (1), reliable-evaluation (1), reliability (1), regex (1), qwen (1), phi (1), open-compass (1), lm-evaluation (1), key-answer-extraction (1), judge-model (1), gpt (1), dataset (1), chatglm (1), cc-by-nc-nd-4 (1), benchmark (1), semantickernel (1), trustworthy-ai (1), llm-as-judge (1), llms-benchmarking (1), autopoietic-systems (1), cognitive-architecture (1), deep-learning (1), hermeneutics (1), language-model (1), mcp (1), ollama (1), philosophy-of-mind (1), pipelines (1), prompt-logging (1), self-organization (1), tracing (1), hallucination-detection (1), hallucinations (1), nli (1), generative-ai (1), groq (1), llama3 (1), mixtral (1), evaluation-metrics (1), chatgpt (1), fairness (1), information-retrieval (1), ir (1), llm4ir (1), llm4rec (1), llm4rs (1), recommender-systems (1), vllm (1)
