An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-benchmarking

azaki2005/llmscript

🧑💻 Write your shell scripts in natural language

Language: Go - Size: 154 KB - Last synced at: about 3 hours ago - Pushed at: about 3 hours ago - Stars: 0 - Forks: 0

mags0ft/hle-eval-ollama

An easy-to-use evaluation tool for running Humanity's Last Exam on (locally) hosted Ollama instances.

Language: Python - Size: 25.4 KB - Last synced at: about 7 hours ago - Pushed at: about 7 hours ago - Stars: 0 - Forks: 0

atadml/txt-2-sql-benchmark

An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions

Language: Jupyter Notebook - Size: 296 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 18 - Forks: 0

lechmazur/confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

Language: HTML - Size: 23.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 122 - Forks: 3

tongye98/Awesome-Code-Benchmark

A comprehensive code domain benchmark review of LLM researches.

Size: 771 KB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 9 - Forks: 3

alopatenko/LLMEvaluation

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

Language: HTML - Size: 3.96 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 111 - Forks: 9

levitation-opensource/bioblue

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLM-s with simplified observation format. The benchmark themes include multi-objective homeostasis, (multi-objective) diminishing returns, complementary goods, sustainability, multi-agent resource sharing.

Language: Python - Size: 5.75 MB - Last synced at: 2 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 2

cburst/LLMscripting

This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting

Language: Python - Size: 7.42 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

ronniross/coreAGIprotocol

The Core AGI Protocol provides a framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.

Size: 144 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 4 - Forks: 1

AKSW/LLM-KG-Bench

LLM-KG-Bench is a Framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.

Language: Python - Size: 19.7 MB - Last synced at: 3 days ago - Pushed at: 9 days ago - Stars: 33 - Forks: 5

lechmazur/deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

Size: 36.1 KB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 26 - Forks: 2

petmal/MindTrial

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek), custom tasks in YAML, and HTML/CSV reports.

Language: Go - Size: 130 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

lakeraai/pint-benchmark

A benchmark for prompt injection detection systems.

Language: Jupyter Notebook - Size: 2.28 MB - Last synced at: 16 days ago - Pushed at: 2 months ago - Stars: 99 - Forks: 11

Keshavpatel2/local-llm-workbench

🧠 A comprehensive toolkit for benchmarking, optimizing, and deploying local Large Language Models. Includes performance testing tools, optimized configurations for CPU/GPU/hybrid setups, and detailed guides to maximize LLM performance on your hardware.

Language: Shell - Size: 8.79 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

Cristian-Curaba/CryptoFormalEval

We introduce a benchmark for testing how well LLMs can find vulnerabilities in cryptographic protocols. By combining LLMs with symbolic reasoning tools like Tamarin, we aim to improve the efficiency and thoroughness of protocol analysis, paving the way for future AI-powered cybersecurity defenses.

Language: Haskell - Size: 7.43 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 5 - Forks: 1

AUCOHL/RTL-Repo

RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24

Language: Python - Size: 202 KB - Last synced at: 15 days ago - Pushed at: 11 months ago - Stars: 13 - Forks: 1

asimsinan/LLM-Research

A collection of LLM related papers, thesis, tools, datasets, courses, open source models, benchmarks

Language: Python - Size: 3.12 MB - Last synced at: 11 days ago - Pushed at: 7 months ago - Stars: 52 - Forks: 8

nl4opt/ORQA

[AAAI 2025] ORQA is a new QA benchmark designed to assess the reasoning capabilities of LLMs in a specialized technical domain of Operations Research. The benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when presented with complex optimization modeling tasks.

Size: 2.48 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 37 - Forks: 0

mrigankpawagi/HinglishEval

Evaluating the Effectiveness of Code-generation Models on Hinglish Prompts

Language: Python - Size: 11 MB - Last synced at: 13 days ago - Pushed at: 18 days ago - Stars: 4 - Forks: 2

henryle97/llm-serving-benchmark

LLM Serving Libs Benchmark

Language: Python - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

MJ-Bench/MJ-Bench

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

Language: Jupyter Notebook - Size: 218 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 43 - Forks: 5

damianomarsili/VADAR

Program synthesis for 3D spatial reasoning

Language: Jupyter Notebook - Size: 6.18 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 15 - Forks: 0

danielrosehill/LLM-Experiment-Notebook

Experiments in evaluating various prompting strategies and LLM performance generally

Size: 682 KB - Last synced at: 4 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

redblock-ai/parrot-python

PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

Language: Python - Size: 5.97 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

ricardo-agz/LLMChess

Benchmark LLMs' abilities to plan, strategize, and reason by making them play chess against each other.

Language: Python - Size: 337 KB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

aws-samples/fm-leaderboarder

FM-Leaderboard-er allows you to create leaderboard to find the best LLM/prompt for your own business use case based on your data, task, prompts

Language: Python - Size: 511 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 14 - Forks: 4

Gabrielstav/llm_benchmarking

Framework to benchmark LLMs performance on domain categorization, done as part of my internship at iQ Global.

Language: Python - Size: 288 KB - Last synced at: 9 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

robertvacareanu/llm4regression

Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update

Language: Python - Size: 12.3 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 66 - Forks: 7

Related Keywords
llm-benchmarking 28 llm 14 llms 7 llm-evaluation 4 llm-inference 4 large-language-models 4 openai 3 llm-datasets 3 benchmark 3 claude 3 ai-safety 2 nlp 2 llama 2 machine-learning 2 language-model 2 llm-reasoning 2 gemini 2 evaluation 2 anthropic 2 ai-evaluation 2 python 2 code-generation 2 ollama 2 llm-evaluation-framework 2 llm-apps 2 llm-agents 2 education 2 wsl-ai-setup 1 ollama-optimization 1 communication-protocol 1 cryptography 1 llm-based-agents 1 vulnerability-detection 1 rtl-design 1 verilog 1 arxiv-papers 1 buyuk-dil-modelleri 1 llm-as-a-judge 1 bug-fixing 1 llm-comparison 1 mozilla-public-license 1 opensource 1 yaml-configuration 1 llm-security 1 prompt-injection 1 context-window-scaling 1 cpu-inference 1 cuda 1 gpu-acceleration 1 hybrid-inference 1 inference-optimization 1 llama-cpp 1 llm-deployment 1 local-llm 1 model-management 1 model-quantization 1 multimodal-foundation-model 1 multimodal-judge 1 reward-models 1 3d 1 program-synthesis 1 spatial-reasoning 1 prompt-engineering 1 benchmarking-framework 1 llm-qa-document 1 ai 1 chess 1 llm-agent 1 domain-risk-protection 1 linear-regression 1 regression 1 regression-models 1 sklearn 1 llm-frameworks 1 llm-research 1 llm-theses 1 llm-tools 1 aaai2025 1 ai4or 1 linear-programming 1 llm4math 1 llm4opt 1 llm4or 1 mathematical-modelling 1 mixed-integer-programming 1 multi-choice 1 operations-research 1 optimization 1 hinglish-dataset 1 bentoml 1 llm-serving 1 code-completion 1 code-efficiency 1 codellm 1 codellms 1 data-science 1 multimodal 1 reasoning 1 generative-ai-benchmarking 1 ai-alignment 1