Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: llm-eval
parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.11 MB - Last synced: about 21 hours ago - Pushed: about 21 hours ago - Stars: 42 - Forks: 4
uptrain-ai/uptrain
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use-cases), perform root cause analysis on failure cases, and give insights on how to resolve them.
Language: Python - Size: 35.7 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 2,045 - Forks: 171
Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 74.1 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 2,589 - Forks: 182
promptfoo/promptfoo
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
Language: TypeScript - Size: 14.7 MB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 3,018 - Forks: 201
parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 450 KB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 4 - Forks: 1
prompt-foundry/typescript-sdk
The TypeScript SDK for Prompt Foundry, a prompt engineering, prompt management, and prompt testing tool
Language: TypeScript - Size: 327 KB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 0 - Forks: 0
athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.01 MB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 139 - Forks: 11
Auto-Playground/ragrank
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.
Language: Python - Size: 569 KB - Last synced: 21 days ago - Pushed: 27 days ago - Stars: 20 - Forks: 8
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing framework for LLMs and ML models
Language: Python - Size: 176 MB - Last synced: 29 days ago - Pushed: about 1 month ago - Stars: 3,163 - Forks: 199
kdcyberdude/punjabi-llm-eval Fork of gordicaleksa/serbian-llm-eval
First Punjabi LLM Eval.
Language: Python - Size: 12.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0
Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced: about 2 months ago - Pushed: 4 months ago - Stars: 62 - Forks: 4
Networks-Learning/prediction-powered-ranking
Code for the paper Prediction-Powered Ranking of Large Language Models, Arxiv 2024.
Language: Python - Size: 21.5 KB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 3 - Forks: 1
kuk/rulm-sbs2
A benchmark comparing Russian ChatGPT analogues: Saiga, YandexGPT, Gigachat
Language: Jupyter Notebook - Size: 19.1 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 22 - Forks: 0
awesome-software/evals Fork of openai/evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Size: 1.94 MB - Last synced: 9 months ago - Pushed: 12 months ago - Stars: 0 - Forks: 0
harshagrawal523/GenerativeAgents
Generative agents: computational software agents that use OpenAI LLMs to simulate believable human behavior. Our main focus was developing a game, “Werewolves of Miller’s Hollow”, aiming to replicate human-like behavior.
Language: Python - Size: 249 MB - Last synced: 9 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0