GitHub topics: llm-evaluation-toolkit
zli12321/qa_metrics
An easy-to-use Python package for running quick, basic QA evaluations. It provides standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs. A minimal sketch of the classic exact-match and token-F1 metrics follows below.
Language: Python - Size: 16.6 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 50 - Forks: 4
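For reference, the exact-match and token-level F1 metrics named in the description (the standard SQuAD-style definitions) can be computed in a few lines. This is a generic, self-contained sketch of those two metrics, not the qa_metrics API; all function names here are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(reference: str, candidate: str) -> bool:
    """True if the normalized answer strings are identical."""
    return normalize(reference) == normalize(candidate)

def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    ref_tokens = normalize(reference).split()
    cand_tokens = normalize(candidate).split()
    common = Counter(ref_tokens) & Counter(cand_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(cand_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(token_f1("Paris, France", "in Paris"), 2))   # 0.5
```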

JohnSnowLabs/langtest
Deliver safe & effective language models
Language: Python - Size: 158 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 526 - Forks: 47

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 78 - Forks: 7

ronniross/llm-confidence-scorer
A set of auxiliary systems that estimate a confidence score for outputs generated by Large Language Models; a sketch of the general idea follows below.
Language: Python - Size: 143 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
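The description does not pin down a method, but one common way to derive such a confidence measure is to average the token log-probabilities the model reports for its own output. The sketch below assumes the OpenAI Python client (openai>=1.0) with logprobs enabled; the model name and prompt are placeholders, and this illustrates the general idea only, not this repository's actual scorer.

```python
import math
from openai import OpenAI  # assumption: openai>=1.0 Python client installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_confidence(question: str, model: str = "gpt-4o-mini"):
    """Return (answer, confidence), where confidence is the geometric-mean token probability."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = response.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric mean of per-token probabilities as a crude confidence proxy.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return choice.message.content, confidence

answer, confidence = answer_with_confidence("In what year did Apollo 11 land on the Moon?")
print(f"{answer!r} (confidence ~ {confidence:.2f})")
```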

nhsengland/evalsense
Tools for systematic large language model evaluations
Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

Fbxfax/llm-confidence-scorer
A set of auxiliary systems that estimate a confidence score for outputs generated by Large Language Models.
Language: Python - Size: 96.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Agenta-AI/job_extractor_template
Template for an AI application that extracts job information from a job description using OpenAI function calling and LangChain; a hedged sketch of the pattern follows below.
Language: Python - Size: 15.6 KB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 1
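The template's own code is not shown here, but the pattern it names, structured extraction via OpenAI function calling, looks roughly like the following. This sketch uses the OpenAI Python client directly rather than LangChain; the schema fields and model name are assumptions for illustration, not the template's actual definitions.

```python
import json
from openai import OpenAI  # assumption: openai>=1.0 Python client installed

client = OpenAI()

# Hypothetical schema; the template's real fields may differ.
JOB_SCHEMA = {
    "name": "extract_job_information",
    "description": "Extract structured job information from a job description.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "company": {"type": "string"},
            "location": {"type": "string"},
            "skills": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title"],
    },
}

def extract_job_info(job_description: str) -> dict:
    """Force a tool call that fills the schema, then parse its JSON arguments."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": job_description}],
        tools=[{"type": "function", "function": JOB_SCHEMA}],
        tool_choice={"type": "function", "function": {"name": "extract_job_information"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)

print(extract_job_info("Senior Python Engineer at Acme Corp, remote, Django and AWS required."))
```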

CodeEval-Pro/CodeEval-Pro
Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"
Language: Python - Size: 4.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 26 - Forks: 2

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6
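Multi-aspect, GPT-based evaluation generally means asking a judge model to score a response along several named dimensions and justify each score. The sketch below is a generic LLM-as-judge loop assuming the OpenAI Python client; the aspect names, prompt wording, and model are assumptions for illustration, not just-eval's actual rubric.

```python
import json
from openai import OpenAI  # assumption: openai>=1.0 Python client installed

client = OpenAI()

# Hypothetical aspects; just-eval defines its own rubric.
ASPECTS = ["helpfulness", "factuality", "clarity", "depth"]

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}

For each aspect in {aspects}, give a score from 1 to 5 and a one-sentence reason.
Respond with a JSON object mapping each aspect to {{"score": int, "reason": str}}."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Score an answer on each aspect with a judge model and return the parsed JSON."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, aspects=ASPECTS)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge("Why is the sky blue?", "Because of Rayleigh scattering of sunlight."))
```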

zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Language: Python - Size: 10.6 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 36 - Forks: 2

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 2.94 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 4 - Forks: 1

scalexi/scalexi
scalexi is a versatile open-source Python library, optimized for Python 3.11+, that focuses on facilitating low-code development and fine-tuning of diverse Large Language Models (LLMs).
Language: Python - Size: 31.2 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 1
