An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-testing

vladimir22700/agent-monitor

📊 Monitor AI agents in real-time with Agent Monitor. Gain visibility, track costs, and debug effectively for enhanced performance and reliability.

Language: Python - Size: 1.34 MB - Last synced at: 25 minutes ago - Pushed at: about 2 hours ago - Stars: 0 - Forks: 0

Free-AI-Things/g4f-working

g4f-working is a daily-updated list of working no-auth AI providers and models from @xtekky/gpt4free. It helps developers, testers, and AI enthusiasts instantly find which models are currently online and accessible without any API keys, tokens, or cookies.

Language: Python - Size: 359 MB - Last synced at: about 21 hours ago - Pushed at: about 23 hours ago - Stars: 35 - Forks: 5

PinguChileno/onerun

🤖 Simulate realistic conversations to test and improve your AI agents, generating evaluation datasets and automating QA for reliable performance.

Language: Python - Size: 1.57 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

BEHZAD1919/Advanced-Multi-Agent-AI-Framework

⚡ Streamline AI development with a multi-agent framework that integrates 80+ prompt engineering techniques for effective team coordination and automation.

Size: 1.58 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 5 - Forks: 0

infos3ddesigner/g4f-working

Daily-updated hub of no-auth AI providers and models from GPT4Free. It lists options that work now for quick access and reliable, up-to-date results across services. 🐙

Size: 6.84 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

yukincom/llm-SugarScape

Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in a grid world. This project aims to replicate a paper published in 2025 (arXiv:2508.12920).

Language: Python - Size: 1.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

Thurstondrooping742/agent-monitor

📊 Monitor AI agent performance with real-time tracing, metrics, and cost tracking to enhance visibility and streamline debugging in production systems.

Language: Python - Size: 1.33 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

quarkiverse/quarkus-rage4j

Rage4j is a Java library that helps evaluate LLMs based on scientifically grounded metrics.

Language: Java - Size: 55.7 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 0

Addepto/contextcheck

MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

Language: Python - Size: 464 KB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 86 - Forks: 11

LLAMATOR-Core/llamator

Framework for testing vulnerabilities of large language models (LLMs).

Language: Python - Size: 4.52 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 171 - Forks: 16

North-Shore-AI/crucible_framework

CrucibleFramework: A scientific platform for LLM reliability research on the BEAM

Language: Elixir - Size: 325 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

raga-ai-hub/RagaAI-Catalyst

Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features such as agent, LLM, and tool tracing, debugging of multi-agent systems, a self-hosted dashboard, and advanced analytics with timeline and execution graph views.

Language: Python - Size: 55.8 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 16,012 - Forks: 3,711

vincentkoc/tiny_qa_benchmark_pp

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.

Language: Python - Size: 310 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 8 - Forks: 0

Pacific-AI-Corp/langtest

Deliver safe & effective language models

Language: Python - Size: 200 MB - Last synced at: 29 days ago - Pushed at: about 2 months ago - Stars: 543 - Forks: 50

evalops/mocktopus

🐙 Multi-armed mocks for LLM apps - Drop-in replacement for OpenAI/Anthropic APIs for deterministic testing

Language: Python - Size: 37.1 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

lalitkpal/VerifyAI

VerifyAI is a simple UI application to test GenAI outputs

Language: Python - Size: 24.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

aakashkavuru101/LLM-Testing-app-v1

Builds another version of LMarena.ai on its FastChat foundation, where you can test LLMs locally. If it fails, connect Ollama and test.

Language: Python - Size: 31.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

rimironenko/rostcamp

Language: Python - Size: 9.68 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

ihatenodejs/llm-tests

My personal, web-dev focused LLM tests

Language: HTML - Size: 61.5 KB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

dr-gareth-roberts/LLM-Dev

Python Tools for Developing with LLMs (cloud & offline)

Language: Python - Size: 299 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

thatoldfarm/logos-infinitum-artifact

A comprehensive corpus of interconnected texts and protocols designed as a conceptual stress-test for advanced AI.

Size: 5.47 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

sandy-sp/ai-reply-index

A community-driven archive of AI prompts and responses. Log, compare, and contribute structured examples to build a searchable public prompt-response database.

Language: Python - Size: 8.03 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

JohnRitchie/qa-llm-guard

Language: Python - Size: 23.4 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

Leftinant/tiny_qa_benchmark_pp

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.

Size: 1.95 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

rhesis-ai/rhesis-sdk

Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.

Language: Python - Size: 420 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 18 - Forks: 0

ssilwal29/api-ninja

API Ninja simplifies API testing by allowing users to define test flows in plain English.

Language: Python - Size: 1.45 MB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

pyladiesams/eval-llm-based-apps-jan2025

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 7 - Forks: 5

wizenheimer/periscope

LLM Performance Testing | K6 + Grafana + InfluxDB | A tiny toolkit for load testing and benchmarking OpenAI-like inference endpoints using K6 + Grafana + InfluxDB

Language: JavaScript - Size: 563 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

neonxploit/Dragon-Glitch---NeonXploit-Audit-v1.0-

Red-team audit of DeepSeek AI by lala, aka NeonXploit (Operation Dragon Glitch)

Size: 2.28 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

MirrorLoop/mirrorloop-core

Official public release of MirrorLoop Core (v1.3 – April 2025)

Size: 1.39 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

borisveis/LLMTesting

LLM Testing with gpt4all

Language: Python - Size: 24.4 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

ModelPulse/BreakYourLLM

Break Your LLM before your users do!

Language: Python - Size: 234 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

prompt-foundry/go-sdk

The prompt engineering, prompt management, and prompt evaluation tool for Go.

Size: 1000 Bytes - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0