Topic: "llms-benchmarking"
steel-dev/awesome-web-agents
🔥 A list of tools, frameworks, and resources for building AI web agents
Size: 228 KB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 378 - Forks: 31

karthikv792/LLMs-Planning
An extensible benchmark for evaluating large language models on planning
Language: PDDL - Size: 52 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 351 - Forks: 36

JonathanChavezTamales/llm-leaderboard
A comprehensive set of LLM benchmark scores and provider prices.
Language: JavaScript - Size: 319 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 206 - Forks: 16

bboylyg/BackdoorLLM
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
Language: Python - Size: 273 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 146 - Forks: 13

lerogo/MMGenBench
Official repository of MMGenBench
Language: Python - Size: 19.6 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 119 - Forks: 5

ChemFoundationModels/ChemLLMBench
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks
Language: Jupyter Notebook - Size: 4.28 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 106 - Forks: 5

lechmazur/nyt-connections
Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words
Language: Python - Size: 6.37 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 85 - Forks: 5

lamalab-org/chembench
How good are LLMs at chemistry?
Language: Python - Size: 261 MB - Last synced at: 4 days ago - Pushed at: 19 days ago - Stars: 83 - Forks: 9

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 76 - Forks: 7

lechmazur/generalization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
Size: 30 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 52 - Forks: 2

RaptorMai/MLLM-CompBench
[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes
Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 38 - Forks: 2

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.12 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 36 - Forks: 1

FSoft-AI4Code/XMainframe
Language Model for Mainframe Modernization
Language: Python - Size: 11.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 35 - Forks: 3

epfl-dlab/cc_flows
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Language: Python - Size: 17.6 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 30 - Forks: 1

declare-lab/resta
Restore safety in fine-tuned language models through task arithmetic
Language: Python - Size: 75.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 28 - Forks: 2

SuperBruceJia/Awesome-Mixture-of-Experts
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
Size: 438 KB - Last synced at: 26 days ago - Pushed at: 5 months ago - Stars: 27 - Forks: 3

rajpurkarlab/craft-md
Language: Python - Size: 318 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 24 - Forks: 6

Laoyu84/4onebench
A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.
Language: Python - Size: 1.34 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 17 - Forks: 1

Paulescu/text-embedding-evaluation
Language: Python - Size: 615 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 16 - Forks: 2

gautierdag/plancraft
Plancraft is a Minecraft environment and agent suite to test planning capabilities in LLMs
Language: Python - Size: 124 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 13 - Forks: 1

minnesotanlp/cobbler
Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 1

logikon-ai/cot-eval
A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
Language: Jupyter Notebook - Size: 2.41 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 12 - Forks: 2

RUC-GSAI/YuLan-SwarmIntell
🐝 SwarmBench: Benchmarking LLMs' Swarm Intelligence
Language: Python - Size: 84.4 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 10 - Forks: 2

microsoft/private-benchmarking
A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.
Language: Python - Size: 4.32 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 10 - Forks: 2

SergioV3005/llm-belief-bias
Belief-Bias evaluation of local LLMs
Language: Python - Size: 1.46 MB - Last synced at: about 5 hours ago - Pushed at: about 6 hours ago - Stars: 9 - Forks: 0

Kartik-3004/facexbench
FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Language: Python - Size: 409 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 9 - Forks: 1

microsoft/MEGAVERSE
Official codebase for MEGAVERSE (published at NAACL 2024)
Language: Python - Size: 51.3 MB - Last synced at: 6 days ago - Pushed at: 28 days ago - Stars: 8 - Forks: 0

cosmaadrian/romath
Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"
Language: Python - Size: 276 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 8 - Forks: 0

amazon-science/llm-code-preference
Training and Benchmarking LLMs for Code Preference.
Language: Python - Size: 156 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 8 - Forks: 0

whisk/leetgptsolver
Benchmarking LLMs (large language models) on leetcode algorithmic challenges
Language: Go - Size: 289 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 7 - Forks: 0

desboisGIT/OpenPrompt
OpenPromptBank is an AI prompt library platform where users can explore, rank, and contribute AI prompts categorized by various topics. This platform features a searchable library, community-driven rankings, prompt performance benchmarks, and user profiles.
Language: Python - Size: 18.9 MB - Last synced at: about 15 hours ago - Pushed at: about 16 hours ago - Stars: 5 - Forks: 0

ronniross/asi-core-protocol
A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.
Size: 122 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 5 - Forks: 1

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 2.94 MB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 4 - Forks: 1

nachoDRT/MERIT-Dataset
The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area the authors are actively working on. This repository is actively maintained, and new features are continuously being added.
Language: Python - Size: 495 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 4 - Forks: 0

matus-pikuliak/genderbench
GenderBench - Evaluation suite for gender biases in LLMs
Language: Python - Size: 14.3 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 3 - Forks: 2

stair-lab/melt
Multilingual Evaluation Toolkits
Language: Python - Size: 204 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 3 - Forks: 3

ronniross/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 143 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

lwachowiak/LLMs-for-Social-Robotics
Code and data for our IROS paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"
Language: Jupyter Notebook - Size: 6.33 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 3

melvinebenezer/Liah-Lie_in_a_haystack
Needle-in-a-haystack benchmark for LLMs
Language: Python - Size: 2.42 MB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

jwallat/temporalrobustness
A Study Into Temporal Robustness of LLMs
Language: Jupyter Notebook - Size: 3.66 MB - Last synced at: 10 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

unable12/LLM_Side_By_Side_Comparison
Language: CSS - Size: 675 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

i-partalas/industrial-rag-qna-benchmark
Benchmarking the performance of proprietary vs open-source LLMs in industrial QnA tasks using various RAG-based implementations and evaluation metrics.
Language: Python - Size: 1.27 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

erdemormann/kanarya-and-trendyol-classification-tests
Test results of Kanarya and Trendyol models with and without fine-tuning techniques on the Turkish tweet hate speech detection dataset.
Language: Jupyter Notebook - Size: 293 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

Chotom/guardrails-dss-ml-2024
Demo showcase highlighting the capabilities of Guardrails in LLMs.
Language: Python - Size: 95.7 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

s2e-lab/RegexEval
Source code for the accepted paper in ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.
Language: Python - Size: 24 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 1

aflah02/Humans-v-s-LLM-Benchmarks
LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs). However, it is essential to recognize that these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.
Language: Python - Size: 40.9 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

bgonzalezbustamante/TextClass-Benchmark
TextClass Benchmark Leaderboards
Language: Jupyter Notebook - Size: 147 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

686f6c61/LLM-Psyche-Modelo-Multidimensional-Personalidad-LLM
LLM-Psyche is a theoretical and methodological framework for evaluating the "personality" of Large Language Models (LLMs) through a dual evaluation system. We combine the principles of the well-known 16PF-5 test with LLM-specific dimensions, creating a multidimensional model that captures the behavioral tendencies of these advanced systems.
Language: HTML - Size: 2.58 MB - Last synced at: 19 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

Fbxfax/llm-confidence-scorer
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
Language: Python - Size: 96.7 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

dinessa-ga/LLM-benchmark-frontend
This study underscores the importance of combining AI-driven automation with sound software engineering principles. Although LLMs offer a significant productivity advantage, their use must be guided by good practices to avoid hidden costs associated with technical debt.
Language: JavaScript - Size: 8.87 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Vignesh010101/Intelligent-Health-LLM-System
An Intelligent Health LLM System for Personalized Medication Guidance and Support.
Language: Jupyter Notebook - Size: 615 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

realPJL/ExplainAI.dev
Keep track of all the AI tools
Language: TypeScript - Size: 120 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

andrewimpellitteri/creativity_bench
A benchmark for the creativity of LLMs based on Gwern's post
Language: Python - Size: 1.6 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

isaac-berlin/LLM_Model_Bias
Exploring Political Bias in LLMs Through Debate: A Multi-Agent Framework
Language: Jupyter Notebook - Size: 5.04 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Santhoshi-Ravi/minerva
Evaluating and enhancing Large Language Models (LLMs) using mathematical datasets through innovative Multi-Agent Debate Architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.
Language: Jupyter Notebook - Size: 129 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash
A framework using the game Balderdash to evaluate creativity and logical reasoning in Large Language Models (LLMs). Multiple LLMs generate fictitious definitions to deceive others and identify correct ones, analyzing creativity, deception, and performance.
Language: Jupyter Notebook - Size: 489 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

Paraskevi-KIvroglou/Hackathon-LlamaEval
LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.
Language: Python - Size: 66.8 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 0 - Forks: 1

rsinghal757/babyARC
BabyARC is a tiny abstraction and reasoning dataset inspired by the original Abstraction and Reasoning Corpus by Francois Chollet.
Language: Jupyter Notebook - Size: 544 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

saqib727/Medical-Analyst
Vital Image Analytics is an AI-powered application designed to assist healthcare professionals in analyzing medical images for diagnostic purposes.
Language: Python - Size: 334 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

saqib727/Blog-assistant
BlogCraft is a web application built with Streamlit that leverages AI to assist in crafting blog posts effortlessly.
Language: Python - Size: 5.86 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

SharathHebbar/eval_llms
Language: Jupyter Notebook - Size: 131 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

dinesh-kumar-mr/MediVQA
Part of our final year project work involving complex NLP tasks along with experimentation on various datasets and different LLMs
Language: HTML - Size: 1.98 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

PrincySinghal/Html-code-generation-from-LLM
Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.
Language: Jupyter Notebook - Size: 294 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0
