GitHub topics: llm-evaluation-toolkit
zli12321/qa_metrics
An easy-to-use Python package for running quick, basic QA evaluations. It provides standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs. A minimal sketch of the classic exact-match and token-F1 metrics follows below.
Language: Python - Size: 16.6 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 50 - Forks: 4
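For reference, the exact-match and token-level F1 metrics named in the description (the standard SQuAD-style definitions) can be computed in a few lines. This is a generic, self-contained sketch of those two metrics, not the qa_metrics API; all function names here are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(reference: str, candidate: str) -> bool:
    """True if the normalized answer strings are identical."""
    return normalize(reference) == normalize(candidate)

def token_f1(reference: str, candidate: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    ref_tokens = normalize(reference).split()
    cand_tokens = normalize(candidate).split()
    common = Counter(ref_tokens) & Counter(cand_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(cand_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # True
print(round(token_f1("Paris, France", "in Paris"), 2))   # 0.5
```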

JohnSnowLabs/langtest
Deliver safe & effective language models
Language: Python - Size: 158 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 526 - Forks: 47

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 78 - Forks: 7

ronniross/llm-confidence-scorer
A set of auxiliary systems that estimate a confidence score for outputs generated by Large Language Models; a sketch of the general idea follows below.
Language: Python - Size: 143 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
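The description does not pin down a method, but one common way to derive such a confidence measure is to average the token log-probabilities the model reports for its own output. The sketch below assumes the OpenAI Python client (openai>=1.0) with logprobs enabled; the model name and prompt are placeholders, and this illustrates the general idea only, not this repository's actual scorer.

```python
import math
from openai import OpenAI  # assumption: openai>=1.0 Python client installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_confidence(question: str, model: str = "gpt-4o-mini"):
    """Return (answer, confidence), where confidence is the geometric-mean token probability."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = response.choices[0]
    token_logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric mean of per-token probabilities as a crude confidence proxy.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return choice.message.content, confidence

answer, confidence = answer_with_confidence("In what year did Apollo 11 land on the Moon?")
print(f"{answer!r} (confidence ~ {confidence:.2f})")
```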

nhsengland/evalsense
Tools for systematic large language model evaluations
Language: Python - Size: 877 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 280 - Forks: 19

Fbxfax/llm-confidence-scorer
A set of auxiliary systems that estimate a confidence score for outputs generated by Large Language Models.
Language: Python - Size: 96.7 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Agenta-AI/job_extractor_template
Template for an AI application that extracts job information from a job description using OpenAI function calling and LangChain; a hedged sketch of the pattern follows below.
Language: Python - Size: 15.6 KB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 1
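The template's own code is not shown here, but the pattern it names, structured extraction via OpenAI function calling, looks roughly like the following. This sketch uses the OpenAI Python client directly rather than LangChain; the schema fields and model name are assumptions for illustration, not the template's actual definitions.

```python
import json
from openai import OpenAI  # assumption: openai>=1.0 Python client installed

client = OpenAI()

# Hypothetical schema; the template's real fields may differ.
JOB_SCHEMA = {
    "name": "extract_job_information",
    "description": "Extract structured job information from a job description.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "company": {"type": "string"},
            "location": {"type": "string"},
            "skills": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title"],
    },
}

def extract_job_info(job_description: str) -> dict:
    """Force a tool call that fills the schema, then parse its JSON arguments."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": job_description}],
        tools=[{"type": "function", "function": JOB_SCHEMA}],
        tool_choice={"type": "function", "function": {"name": "extract_job_information"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)

print(extract_job_info("Senior Python Engineer at Acme Corp, remote, Django and AWS required."))
```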

CodeEval-Pro/CodeEval-Pro
Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"
Language: Python - Size: 4.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 26 - Forks: 2

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 85 - Forks: 6
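Multi-aspect, GPT-based evaluation generally means asking a judge model to score a response along several named dimensions and justify each score. The sketch below is a generic LLM-as-judge loop assuming the OpenAI Python client; the aspect names, prompt wording, and model are assumptions for illustration, not just-eval's actual rubric.

```python
import json
from openai import OpenAI  # assumption: openai>=1.0 Python client installed

client = OpenAI()

# Hypothetical aspects; just-eval defines its own rubric.
ASPECTS = ["helpfulness", "factuality", "clarity", "depth"]

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}

For each aspect in {aspects}, give a score from 1 to 5 and a one-sentence reason.
Respond with a JSON object mapping each aspect to {{"score": int, "reason": str}}."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Score an answer on each aspect with a judge model and return the parsed JSON."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, aspects=ASPECTS)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge("Why is the sky blue?", "Because of Rayleigh scattering of sunlight."))
```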

zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Language: Python - Size: 10.6 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 36 - Forks: 2

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 2.94 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 4 - Forks: 1

scalexi/scalexi
scalexi is a versatile open-source Python library, optimized for Python 3.11+, that focuses on facilitating low-code development and fine-tuning of diverse Large Language Models (LLMs).
Language: Python - Size: 31.2 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 1
