Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-eval

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: Python - Size: 5.11 MB - Last synced: about 21 hours ago - Pushed: about 21 hours ago - Stars: 42 - Forks: 4

uptrain-ai/uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root-cause analysis on failure cases, and give insights on how to resolve them.

Language: Python - Size: 35.7 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 2,045 - Forks: 171
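
As a rough illustration of the "preconfigured checks" workflow UpTrain describes, here is a minimal sketch following the pattern in its README; the exact check names and the `openai_api_key` argument are assumptions and may differ across versions.

```python
# Sketch of UpTrain's preconfigured-checks workflow (names may vary by version).
from uptrain import EvalLLM, Evals

data = [{
    "question": "What is the capital of France?",
    "context": "France is a country in Western Europe. Its capital is Paris.",
    "response": "The capital of France is Paris.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # assumption: key passed directly
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
)
print(results)  # per-row scores and explanations for each check
```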

Arize-ai/phoenix

AI Observability & Evaluation

Language: Jupyter Notebook - Size: 74.1 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 2,589 - Forks: 182
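
For a sense of what "AI Observability & Evaluation" looks like in practice, a minimal sketch of launching Phoenix's local UI is below; instrumenting an application and sending traces is assumed to be configured separately.

```python
# Minimal sketch: start the local Phoenix app to inspect traces and evals.
import phoenix as px

session = px.launch_app()   # spins up the Phoenix UI in-process
print(session.url)          # open this URL in a browser to explore traces
```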

promptfoo/promptfoo

Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

Language: TypeScript - Size: 14.7 MB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 3,018 - Forks: 201
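
Promptfoo is driven by a YAML config plus a CLI rather than a Python API; purely as an illustration, the sketch below writes a minimal config and shells out to the CLI. The provider id, assertion type, and CLI invocation follow promptfoo's documented conventions but should be treated as assumptions.

```python
# Illustrative only: generate a minimal promptfoo config and run the CLI eval.
import pathlib
import subprocess

config = """\
prompts:
  - "Summarize in one sentence: {{text}}"
providers:
  - openai:gpt-4o-mini        # assumed provider id format
tests:
  - vars:
      text: "Promptfoo runs assertions against model outputs to catch regressions."
    assert:
      - type: contains        # assumed assertion type
        value: "regressions"
"""

pathlib.Path("promptfooconfig.yaml").write_text(config)
subprocess.run(["npx", "promptfoo@latest", "eval", "-c", "promptfooconfig.yaml"], check=True)
```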

parea-ai/parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: TypeScript - Size: 450 KB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 4 - Forks: 1

prompt-foundry/typescript-sdk

The TypeScript SDK for the prompt engineering, prompt management, and prompt testing tool Prompt Foundry

Language: TypeScript - Size: 327 KB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 0 - Forks: 0

athina-ai/athina-evals

Python SDK for running evaluations on LLM-generated responses

Language: Python - Size: 1.01 MB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 139 - Forks: 11

Auto-Playground/ragrank

🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

Language: Python - Size: 569 KB - Last synced: 21 days ago - Pushed: 27 days ago - Stars: 20 - Forks: 8

Giskard-AI/giskard

🐢 Open-Source Evaluation & Testing framework for LLMs and ML models

Language: Python - Size: 176 MB - Last synced: 29 days ago - Pushed: about 1 month ago - Stars: 3,163 - Forks: 199
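
To give a sense of the "scan" workflow Giskard advertises, here is a hedged sketch of wrapping a text-generation callable and scanning it; the argument names follow Giskard's documented Model/Dataset/scan pattern but may vary by version.

```python
# Hedged sketch of Giskard's scan workflow (argument names may vary by version).
import pandas as pd
import giskard

def answer(df: pd.DataFrame) -> list[str]:
    # Placeholder model: echo the question; a real app would call an LLM here.
    return [f"You asked: {q}" for q in df["question"]]

model = giskard.Model(
    model=answer,
    model_type="text_generation",
    name="demo-qa-bot",
    description="Toy question-answering bot used only for illustration.",
    feature_names=["question"],
)
dataset = giskard.Dataset(pd.DataFrame({"question": ["What does Giskard do?"]}))

report = giskard.scan(model, dataset)   # runs the built-in LLM/ML detectors
report.to_html("giskard_scan.html")     # assumption: HTML export helper
```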

kdcyberdude/punjabi-llm-eval Fork of gordicaleksa/serbian-llm-eval

First Punjabi LLM Eval.

Language: Python - Size: 12.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

Re-Align/just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

Language: Python - Size: 17.7 MB - Last synced: about 2 months ago - Pushed: 4 months ago - Stars: 62 - Forks: 4
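
The repo's own prompt templates and scoring rubric are not reproduced here; as a generic stand-in, the sketch below shows the usual GPT-based, multi-aspect judging pattern with the OpenAI client. The aspect list, rubric wording, and model name are assumptions, not just-eval's actual interface.

```python
# Generic LLM-as-judge sketch (NOT just-eval's actual API); aspects are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
aspects = ["helpfulness", "factuality", "clarity"]

def judge(instruction: str, response: str) -> dict:
    prompt = (
        "Rate the response to the instruction on each aspect from 1 to 5 "
        f"({', '.join(aspects)}) and reply with a JSON object of scores.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(out.choices[0].message.content)
```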

Networks-Learning/prediction-powered-ranking

Code for the paper "Prediction-Powered Ranking of Large Language Models" (arXiv, 2024).

Language: Python - Size: 21.5 KB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 3 - Forks: 1
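
For intuition: prediction-powered inference combines a small set of human comparisons with many LLM-judge comparisons and corrects the judge's bias on the labeled subset. The toy estimator below illustrates that general idea for a single win-rate; it is not the paper's full ranking procedure, and the data are simulated.

```python
# Toy prediction-powered estimate of a win-rate (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
judge_unlabeled = rng.binomial(1, 0.62, size=5000)   # LLM-judge votes, no human labels
judge_labeled   = rng.binomial(1, 0.62, size=200)    # judge votes on the labeled pairs
human_labeled   = rng.binomial(1, 0.55, size=200)    # human votes on the same pairs
# (In practice judge_labeled and human_labeled come from the same comparisons.)

# PPI point estimate: judge mean on unlabeled data + bias correction from labeled data.
theta_hat = judge_unlabeled.mean() + (human_labeled - judge_labeled).mean()
print(f"prediction-powered win-rate estimate: {theta_hat:.3f}")
```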

kuk/rulm-sbs2

A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, Gigachat

Language: Jupyter Notebook - Size: 19.1 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 22 - Forks: 0

awesome-software/evals Fork of openai/evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Size: 1.94 MB - Last synced: 9 months ago - Pushed: 12 months ago - Stars: 0 - Forks: 0

harshagrawal523/GenerativeAgents

Generative agents: computational software agents that simulate believable human behavior using OpenAI LLM models. Our main focus was developing a game, "Werewolves of Miller's Hollow", that aims to replicate human-like behavior.

Language: Python - Size: 249 MB - Last synced: 9 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0