ai-benchmark | Topic | Ecosyste.ms: Repos

Topic: "ai-benchmark"

microsoft/WindowsAgentArena

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.

Language: Python - Size: 191 MB - Last synced at: about 9 hours ago - Pushed at: 8 days ago - Stars: 689 - Forks: 67

TheAgentCompany/TheAgentCompany

An agent benchmark with tasks in a simulated software company.

Language: Python - Size: 6.58 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 279 - Forks: 34

kaykycampos/gta-benchmark

GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

Size: 1.95 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 8 - Forks: 0

playsaurus-inc/play-bench

PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.

Language: Blade - Size: 901 KB - Last synced at: about 9 hours ago - Pushed at: about 9 hours ago - Stars: 1 - Forks: 0

petmal/MindTrial

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek), custom tasks in YAML, and HTML/CSV reports.

Language: Go - Size: 143 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

Related Topics

ai (3), benchmark (2), ai-research (2), ai-evaluation-tools (1), algorithmic-reasoning (1), binary-analysis (1), computational-thinking (1), ctf (1), docker (1), educational (1), flask (1), machine-learning (1), pattern-recognition (1), puzzle (1), python (1), reverse-engineering (1), agentic (1), ai-agent (1), computer (1), computer-use (1), desktop-agent (1), windows (1), agent (1), llm (1), ai-model-comparison (1), ai-tool (1), anthropic (1), csv-reports (1), customizable (1), deepseek (1), golang-cli (1), google-gemini-ai (1), html-reports (1), language-models-ai (1), llm-benchmarking (1), llm-comparison (1), llm-evaluation-framework (1), mozilla-public-license (1), nlp (1), openai (1), opensource (1), yaml-configuration (1), chess (1), rock-paper-scissors (1), svg (1), algorithm-analysis (1)
