An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-evaluation-toolkit

zli12321/qa_metrics

An easy-to-use Python package for running quick, basic QA evaluations. The package includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. It also supports prompting the OpenAI and Anthropic APIs.

Language: Python - Size: 16.6 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 50 - Forks: 4
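Exact match and F1 are the standard QA scores named in the entry above. The following is a minimal, self-contained sketch of what those two metrics compute; it is not the qa_metrics package's own code, and its module and function names may differ, so consult the package README for the actual API.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(reference: str, candidate: str) -> bool:
    """True when the normalized answers are identical."""
    return normalize(reference) == normalize(candidate)

def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1 between a reference answer and a candidate answer."""
    ref_tokens = normalize(reference).split()
    cand_tokens = normalize(candidate).split()
    common = Counter(ref_tokens) & Counter(cand_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # True
print(round(token_f1("Paris, France", "Paris"), 2))      # 0.67
```

Semantic metrics such as PEDANT or transformer match go beyond this lexical overlap and judge whether the candidate answers the question, which is what the package layers on top.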

JohnSnowLabs/langtest

Deliver safe & effective language models

Language: Python - Size: 158 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 526 - Forks: 47

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: Python - Size: 5.48 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 78 - Forks: 7

ronniross/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 143 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
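As an illustration of the confidence-estimation idea described in the entry above (not necessarily the approach this repository implements), one common heuristic derives a confidence score from the per-token log-probabilities an LLM API can return alongside its output:

```python
import math

def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability as a simple proxy for output confidence.

    `token_logprobs` is assumed to come from an API that exposes per-token
    log-probabilities (e.g. an OpenAI-style `logprobs` option). Values near 1.0
    suggest the model was consistently confident; values near 0.0 suggest it
    was guessing.
    """
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Example: a fairly confident completion vs. an uncertain one.
print(round(confidence_from_logprobs([-0.05, -0.10, -0.02]), 3))  # ~0.945
print(round(confidence_from_logprobs([-1.2, -2.5, -0.9]), 3))     # ~0.216
```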

nhsengland/evalsense

Tools for systematic large language model evaluations

Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals

Python SDK for running evaluations on LLM generated responses

Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

Fbxfax/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 96.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Agenta-AI/job_extractor_template

Template for an AI application that extracts job information from a job description using OpenAI functions and LangChain

Language: Python - Size: 15.6 KB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 1
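The template above pairs OpenAI function calling with LangChain for structured extraction. Below is a minimal sketch of that general pattern, not the template's actual code; the JobPosting schema, its field names, and the model choice are assumptions made for illustration.

```python
# pip install langchain-openai pydantic
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class JobPosting(BaseModel):
    """Hypothetical schema for the fields to extract; the template's own schema may differ."""
    title: str = Field(description="Job title")
    company: str = Field(description="Hiring company")
    location: str = Field(description="Job location, if stated")
    skills: list[str] = Field(default_factory=list, description="Required skills")

# with_structured_output uses OpenAI function/tool calling under the hood,
# so the model's reply is parsed directly into the Pydantic schema.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
extractor = llm.with_structured_output(JobPosting)

description = (
    "Acme Corp is hiring a Senior Python Engineer in Berlin. "
    "Requirements: Python, FastAPI, PostgreSQL."
)
job = extractor.invoke(f"Extract the job details from this posting:\n\n{description}")
print(job.title, "-", job.company)
```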

CodeEval-Pro/CodeEval-Pro

Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"

Language: Python - Size: 4.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 26 - Forks: 2

Re-Align/just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

Language: Python - Size: 17.7 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6

zhuohaoyu/KIEval

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Language: Python - Size: 10.6 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 36 - Forks: 2

parea-ai/parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: TypeScript - Size: 2.94 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 4 - Forks: 1

scalexi/scalexi

scalexi is a versatile open-source Python library, optimized for Python 3.11+, that focuses on facilitating low-code development and fine-tuning of diverse Large Language Models (LLMs).

Language: Python - Size: 31.2 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 1