An open API service providing repository metadata for many open source software ecosystems.

Topic: "llms-benchmarking"

steel-dev/awesome-web-agents

🔥 A list of tools, frameworks, and resources for building AI web agents

Size: 228 KB - Last synced at: 12 days ago - Pushed at: about 2 months ago - Stars: 378 - Forks: 31
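Each entry's metadata line above follows a fixed `key: value` pattern with ` - ` as the field separator. As an illustration only (this parser is not part of the API service), a minimal Python sketch that splits such a line back into fields:

```python
def parse_metadata(line: str) -> dict:
    """Parse a 'key: value - key: value - ...' repository metadata line.

    Values such as 'about 2 months ago' contain no ' - ' sequence,
    so splitting on ' - ' is safe for the entries in this listing.
    """
    fields = {}
    for part in line.split(" - "):
        # Split each field at the first ': ' into key and value.
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields

line = ("Size: 228 KB - Last synced at: 12 days ago - "
        "Pushed at: about 2 months ago - Stars: 378 - Forks: 31")
meta = parse_metadata(line)
```

Numeric fields such as `Stars` and `Forks` come back as strings here; converting them with `int()` is left to the caller, since fields like `Size` mix numbers and units.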

karthikv792/LLMs-Planning

An extensible benchmark for evaluating large language models on planning

Language: PDDL - Size: 52 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 351 - Forks: 36

JonathanChavezTamales/llm-leaderboard

A comprehensive set of LLM benchmark scores and provider prices.

Language: JavaScript - Size: 319 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 206 - Forks: 16

bboylyg/BackdoorLLM

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Language: Python - Size: 273 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 146 - Forks: 13

lerogo/MMGenBench

Official repository of MMGenBench

Language: Python - Size: 19.6 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 119 - Forks: 5

ChemFoundationModels/ChemLLMBench

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Language: Jupyter Notebook - Size: 4.28 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 106 - Forks: 5

lechmazur/nyt-connections

Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words

Language: Python - Size: 6.37 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 85 - Forks: 5

lamalab-org/chembench

How good are LLMs at chemistry?

Language: Python - Size: 261 MB - Last synced at: 4 days ago - Pushed at: 19 days ago - Stars: 83 - Forks: 9

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: Python - Size: 5.48 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 76 - Forks: 7

lechmazur/generalization

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

Size: 30 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 52 - Forks: 2

RaptorMai/MLLM-CompBench

[NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.

Language: Jupyter Notebook - Size: 10.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 38 - Forks: 2

multinear/multinear

Develop reliable AI apps

Language: Svelte - Size: 1.12 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 36 - Forks: 1

FSoft-AI4Code/XMainframe

Language Model for Mainframe Modernization

Language: Python - Size: 11.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 35 - Forks: 3

epfl-dlab/cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

Language: Python - Size: 17.6 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 30 - Forks: 1

declare-lab/resta

Restore safety in fine-tuned language models through task arithmetic

Language: Python - Size: 75.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 28 - Forks: 2

SuperBruceJia/Awesome-Mixture-of-Experts

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Size: 438 KB - Last synced at: 26 days ago - Pushed at: 5 months ago - Stars: 27 - Forks: 3

rajpurkarlab/craft-md

Language: Python - Size: 318 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 24 - Forks: 6

Laoyu84/4onebench

A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.

Language: Python - Size: 1.34 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 17 - Forks: 1

Paulescu/text-embedding-evaluation

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️

Language: Python - Size: 615 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 16 - Forks: 2

gautierdag/plancraft

Plancraft is a Minecraft environment and agent suite for testing the planning capabilities of LLMs

Language: Python - Size: 124 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 13 - Forks: 1

minnesotanlp/cobbler

Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 1

logikon-ai/cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

Language: Jupyter Notebook - Size: 2.41 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 12 - Forks: 2

RUC-GSAI/YuLan-SwarmIntell

🐝 SwarmBench: Benchmarking LLMs' Swarm Intelligence

Language: Python - Size: 84.4 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 10 - Forks: 2

microsoft/private-benchmarking

A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.

Language: Python - Size: 4.32 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 10 - Forks: 2

SergioV3005/llm-belief-bias

Belief-Bias evaluation of local LLMs

Language: Python - Size: 1.46 MB - Last synced at: about 5 hours ago - Pushed at: about 6 hours ago - Stars: 9 - Forks: 0

Kartik-3004/facexbench

FaceXBench: Evaluating Multimodal LLMs on Face Understanding

Language: Python - Size: 409 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 9 - Forks: 1

microsoft/MEGAVERSE

Official codebase for MEGAVERSE (published at NAACL 2024)

Language: Python - Size: 51.3 MB - Last synced at: 6 days ago - Pushed at: 28 days ago - Stars: 8 - Forks: 0

cosmaadrian/romath

Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"

Language: Python - Size: 276 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 8 - Forks: 0

amazon-science/llm-code-preference

Training and Benchmarking LLMs for Code Preference.

Language: Python - Size: 156 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 8 - Forks: 0

whisk/leetgptsolver

Benchmarking large language models (LLMs) on LeetCode algorithmic challenges

Language: Go - Size: 289 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 7 - Forks: 0

desboisGIT/OpenPrompt

OpenPromptBank is an AI prompt library platform where users can explore, rank, and contribute AI prompts categorized by various topics. This platform features a searchable library, community-driven rankings, prompt performance benchmarks, and user profiles.

Language: Python - Size: 18.9 MB - Last synced at: about 15 hours ago - Pushed at: about 16 hours ago - Stars: 5 - Forks: 0

ronniross/asi-core-protocol

A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems rather than from a single model deployment. It also presents its orientation document as a dynamic, self-evolving Magna Carta intended to guide the emergence of such phenomena.

Size: 122 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 5 - Forks: 1

parea-ai/parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: TypeScript - Size: 2.94 MB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 4 - Forks: 1

nachoDRT/MERIT-Dataset

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. This repository is actively maintained, and new features are continuously added.

Language: Python - Size: 495 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 4 - Forks: 0

matus-pikuliak/genderbench

GenderBench - Evaluation suite for gender biases in LLMs

Language: Python - Size: 14.3 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 3 - Forks: 2

stair-lab/melt

Multilingual Evaluation Toolkits

Language: Python - Size: 204 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 3 - Forks: 3

ronniross/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 143 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

lwachowiak/LLMs-for-Social-Robotics

Code and data for our IROS paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"

Language: Jupyter Notebook - Size: 6.33 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 3

melvinebenezer/Liah-Lie_in_a_haystack

A needle-in-a-haystack test for LLMs

Language: Python - Size: 2.42 MB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

jwallat/temporalrobustness

A Study Into Temporal Robustness of LLMs

Language: Jupyter Notebook - Size: 3.66 MB - Last synced at: 10 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

unable12/LLM_Side_By_Side_Comparison

Language: CSS - Size: 675 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

i-partalas/industrial-rag-qna-benchmark

Benchmarking the performance of proprietary vs open-source LLMs in industrial QnA tasks using various RAG-based implementations and evaluation metrics.

Language: Python - Size: 1.27 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

erdemormann/kanarya-and-trendyol-classification-tests

Test results of Kanarya and Trendyol models with and without fine-tuning techniques on the Turkish tweet hate speech detection dataset.

Language: Jupyter Notebook - Size: 293 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

Chotom/guardrails-dss-ml-2024

Demo showcase highlighting the capabilities of Guardrails in LLMs.

Language: Python - Size: 95.7 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

s2e-lab/RegexEval

Source code for the accepted paper in ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.

Language: Python - Size: 24 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 1

aflah02/Humans-v-s-LLM-Benchmarks

LLM benchmarks play a crucial role in assessing the performance of large language models (LLMs). However, it is essential to recognize that these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.

Language: Python - Size: 40.9 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

bgonzalezbustamante/TextClass-Benchmark

TextClass Benchmark Leaderboards

Language: Jupyter Notebook - Size: 147 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

686f6c61/LLM-Psyche-Modelo-Multidimensional-Personalidad-LLM

LLM-Psyche is a theoretical and methodological framework for evaluating the "personality" of Large Language Models (LLMs) through a dual evaluation system. It combines the principles of the well-known 16PF-5 test with LLM-specific dimensions, creating a multidimensional model that captures the behavioral tendencies of these advanced systems.

Language: HTML - Size: 2.58 MB - Last synced at: 19 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

Fbxfax/llm-confidence-scorer

A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.

Language: Python - Size: 96.7 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

dinessa-ga/LLM-benchmark-frontend

This study underscores the importance of combining AI-driven automation with solid software engineering principles. Although LLMs offer a significant productivity advantage, their use must be guided by good practices to avoid the hidden costs associated with technical debt.

Language: JavaScript - Size: 8.87 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Vignesh010101/Intelligent-Health-LLM-System

An Intelligent Health LLM System for Personalized Medication Guidance and Support.

Language: Jupyter Notebook - Size: 615 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

realPJL/ExplainAI.dev

Keep track of all the AI tools

Language: TypeScript - Size: 120 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

andrewimpellitteri/creativity_bench

A benchmark for the creativity of LLMs based on Gwern's post

Language: Python - Size: 1.6 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

isaac-berlin/LLM_Model_Bias

Exploring Political Bias in LLMs Through Debate: A Multi-Agent Framework

Language: Jupyter Notebook - Size: 5.04 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Santhoshi-Ravi/minerva

Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative multi-agent debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies for boosting LLM capabilities in mathematical reasoning.

Language: Jupyter Notebook - Size: 129 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash

A framework using the game Balderdash to evaluate creativity and logical reasoning in Large Language Models (LLMs). Multiple LLMs generate fictitious definitions to deceive others and identify correct ones, analyzing creativity, deception, and performance.

Language: Jupyter Notebook - Size: 489 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

Paraskevi-KIvroglou/Hackathon-LlamaEval

LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.

Language: Python - Size: 66.8 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 0 - Forks: 1

rsinghal757/babyARC

BabyARC is a tiny abstraction and reasoning dataset inspired by the original Abstraction and Reasoning Corpus by Francois Chollet.

Language: Jupyter Notebook - Size: 544 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

saqib727/Medical-Analyst

Vital Image Analytics is an AI-powered application designed to assist healthcare professionals in analyzing medical images for diagnostic purposes.

Language: Python - Size: 334 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

saqib727/Blog-assistant

BlogCraft is a web application built with Streamlit that leverages AI to assist in crafting blog posts effortlessly.

Language: Python - Size: 5.86 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

SharathHebbar/eval_llms

Language: Jupyter Notebook - Size: 131 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

dinesh-kumar-mr/MediVQA

Part of our final year project work involving complex NLP tasks along with experimentation on various datasets and different LLMs

Language: HTML - Size: 1.98 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

PrincySinghal/Html-code-generation-from-LLM

Fine-Tuning and Evaluating a Falcon 7B Model for generating HTML code from input prompts.

Language: Jupyter Notebook - Size: 294 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Related Topics
llms (29), llm (25), llm-evaluation (9), llms-reasoning (8), benchmark (8), large-language-models (7), ai (6), llm-evaluation-framework (6), evaluation (5), llm-evaluation-toolkit (4), dataset (4), machine-learning (4), code-generation (3), nlp (3), prompt-engineering (3), llms-evalution (3), llm-training (3), datasets (3), gpt-4o (3), openai (3), multimodal-large-language-models (3), streamlit (3), llm-eval (3), planning (2), reasoning (2), bias-detection (2), foundation-models (2), medical-application (2), artificial-intelligence (2), multilingual (2), mixture-of-experts (2), llm-tools (2), llms-efficency (2), chemistry (2), llm-evaluation-metrics (2), alignment (2), llama3 (2), safety (2), llmops (2), agents (2), llm-framework (2), metrics (2), synthetic-dataset-generation (2), sonnet3-7 (2), gpt-4-5 (2), llm-agents (2), benchmarking (2), retrieval-augmented-generation (1), balderdash (1), generative-ai (1), leaderboard (1), good-first-issue (1), gen-ai (1), chain-of-thought (1), program-synthesis (1), t3-stack (1), react (1), prompt-tuning (1), alignment-algorithm (1), prompt-toolkit (1), llm-safety (1), llm-safety-benchmark (1), embeddings (1), agentic-ai (1), interactive-environments (1), kilobots (1), swarm (1), pytorch (1), langchain (1), huggingface (1), swarm-intelligence (1), swarm-robotics (1), mathematics (1), docker (1), chunking (1), romanian (1), chromadb (1), azureopenai (1), psychology (1), personality-test (1), multifactor (1), mmgenbench (1), mllm (1), natural-language-processing (1), multi-agent-simulation (1), materials-science (1), elo-rating (1), gpt-4 (1), leaderboards (1), llama (1), misinformation (1), mistral (1), nous-hermes (1), ollama (1), perspective-api (1), qwen2-5 (1), text-as-data (1), text-classification (1), toxicity (1), toxicity-classification (1)