GitHub topics: llm-testing
vladimir22700/agent-monitor
📊 Monitor AI agents in real-time with Agent Monitor. Gain visibility, track costs, and debug effectively for enhanced performance and reliability.
Language: Python - Size: 1.34 MB - Last synced at: 25 minutes ago - Pushed at: about 2 hours ago - Stars: 0 - Forks: 0
Free-AI-Things/g4f-working
g4f-working is a daily-updated list of working no-auth AI providers and models from @xtekky/gpt4free. It helps developers, testers, and AI enthusiasts instantly find which models are currently online and accessible without any API keys, tokens, or cookies.
Language: Python - Size: 359 MB - Last synced at: about 21 hours ago - Pushed at: about 23 hours ago - Stars: 35 - Forks: 5
PinguChileno/onerun
🤖 Simulate realistic conversations to test and improve your AI agents, generating evaluation datasets and automating QA for reliable performance.
Language: Python - Size: 1.57 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0
BEHZAD1919/Advanced-Multi-Agent-AI-Framework
⚡ Streamline AI development with a multi-agent framework that integrates 80+ prompt engineering techniques for effective team coordination and automation.
Size: 1.58 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 5 - Forks: 0
infos3ddesigner/g4f-working
Daily-updated hub of no-auth AI providers and models from GPT4Free. It lists options that work now for quick access and reliable, up-to-date results across services. 🐙
Size: 6.84 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0
yukincom/llm-SugarScape
Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in a grid world. This project aims to replicate a paper published in 2025 (arXiv:2508.12920).
Language: Python - Size: 1.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0
Thurstondrooping742/agent-monitor
📊 Monitor AI agent performance with real-time tracing, metrics, and cost tracking to enhance visibility and streamline debugging in production systems.
Language: Python - Size: 1.33 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
quarkiverse/quarkus-rage4j
Rage4j is a Java library that helps evaluate LLMs based on scientifically grounded metrics.
Language: Java - Size: 55.7 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4 - Forks: 0
Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
Language: Python - Size: 464 KB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 86 - Forks: 11
LLAMATOR-Core/llamator
Framework for testing vulnerabilities of large language models (LLM).
Language: Python - Size: 4.52 MB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 171 - Forks: 16
North-Shore-AI/crucible_framework
CrucibleFramework: A scientific platform for LLM reliability research on the BEAM
Language: Elixir - Size: 325 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0
raga-ai-hub/RagaAI-Catalyst
Python SDK for an agent AI observability, monitoring, and evaluation framework. Features include agent, LLM, and tool tracing; debugging of multi-agent systems; a self-hosted dashboard; and advanced analytics with timeline and execution graph views.
Language: Python - Size: 55.8 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 16,012 - Forks: 3,711
vincentkoc/tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
Language: Python - Size: 310 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 8 - Forks: 0
Pacific-AI-Corp/langtest
Deliver safe & effective language models
Language: Python - Size: 200 MB - Last synced at: 29 days ago - Pushed at: about 2 months ago - Stars: 543 - Forks: 50
evalops/mocktopus
🐙 Multi-armed mocks for LLM apps - Drop-in replacement for OpenAI/Anthropic APIs for deterministic testing
Language: Python - Size: 37.1 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0
lalitkpal/VerifyAI
VerifyAI is a simple UI application to test GenAI outputs
Language: Python - Size: 24.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0
aakashkavuru101/LLM-Testing-app-v1
A variant of LMArena.ai built on its FastChat foundation that lets you test LLMs locally. If it fails, connect Ollama and test.
Language: Python - Size: 31.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
rimironenko/rostcamp
Language: Python - Size: 9.68 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
ihatenodejs/llm-tests
My personal, web-dev-focused LLM tests
Language: HTML - Size: 61.5 KB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0
dr-gareth-roberts/LLM-Dev
Python Tools for Developing with LLMs (cloud & offline)
Language: Python - Size: 299 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0
thatoldfarm/logos-infinitum-artifact
A comprehensive corpus of interconnected texts and protocols designed as a conceptual stress-test for advanced AI.
Size: 5.47 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
sandy-sp/ai-reply-index
A community-driven archive of AI prompts and responses. Log, compare, and contribute structured examples to build a searchable public prompt-response database.
Language: Python - Size: 8.03 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
JohnRitchie/qa-llm-guard
Language: Python - Size: 23.4 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0
Leftinant/tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
Size: 1.95 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0
rhesis-ai/rhesis-sdk
Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.
Language: Python - Size: 420 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 18 - Forks: 0
ssilwal29/api-ninja
API Ninja simplifies API testing by allowing users to define test flows in plain English.
Language: Python - Size: 1.45 MB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0
pyladiesams/eval-llm-based-apps-jan2025
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
Language: Jupyter Notebook - Size: 11.6 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 7 - Forks: 5
wizenheimer/periscope
LLM Performance Testing | K6 + Grafana + InfluxDB | A tiny toolkit for load testing and benchmarking OpenAI-like inference endpoints using K6 + Grafana + InfluxDB
Language: JavaScript - Size: 563 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0
neonxploit/Dragon-Glitch---NeonXploit-Audit-v1.0-
Red-team audit of DeepSeek AI by lala, aka NeonXploit (Operation Dragon Glitch)
Size: 2.28 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0
MirrorLoop/mirrorloop-core
Official public release of MirrorLoop Core (v1.3 – April 2025)
Size: 1.39 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0
borisveis/LLMTesting
LLM Testing with gpt4all
Language: Python - Size: 24.4 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0
ModelPulse/BreakYourLLM
Break Your LLM before your users do!
Language: Python - Size: 234 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0
prompt-foundry/go-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Go.
Size: 1000 Bytes - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0