Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llms-benchmarking

parea-ai/parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: Python - Size: 5.11 MB - Last synced: about 20 hours ago - Pushed: about 20 hours ago - Stars: 42 - Forks: 4

nachoDRT/MERIT-Dataset

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, areas we are actively working on. This repository is actively maintained, and new features are continuously being added.

Language: Python - Size: 495 MB - Last synced: 7 days ago - Pushed: 8 days ago - Stars: 4 - Forks: 0

lamalab-org/chem-bench

How good are LLMs at chemistry?

Size: 118 MB - Last synced: 6 days ago - Pushed: 6 days ago - Stars: 32 - Forks: 1

parea-ai/parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Language: TypeScript - Size: 450 KB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 4 - Forks: 1

declare-lab/resta

Restore safety in fine-tuned language models through task arithmetic

Language: Python - Size: 75.6 MB - Last synced: 15 days ago - Pushed: 2 months ago - Stars: 20 - Forks: 1
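The "task arithmetic" this entry refers to treats the element-wise difference between two sets of weights as a vector that can be added or subtracted to steer model behavior. A minimal sketch of the general idea (not resta's actual implementation), using plain Python dicts of floats in place of real parameter tensors:

```python
# Task arithmetic sketch: a "task vector" is the element-wise difference
# between fine-tuned and base weights; adding or negating it moves a
# model toward or away from the behavior the fine-tune introduced.

def task_vector(finetuned, base):
    """Element-wise difference finetuned - base, per parameter name."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_vector(weights, vector, scale=1.0):
    """Add a scaled task vector to a set of weights."""
    return {k: weights[k] + scale * vector[k] for k in weights}

# Toy 2-parameter "models": a safety-aligned base and a fine-tuned model
# that drifted away from it.
base = {"w0": 1.0, "w1": -2.0}
finetuned = {"w0": 1.5, "w1": -2.5}

# Negating the drift vector moves the fine-tuned model back toward base.
drift = task_vector(finetuned, base)
restored = apply_vector(finetuned, drift, scale=-1.0)
print(restored)  # {'w0': 1.0, 'w1': -2.0}
```

In practice the same arithmetic is applied per tensor across all model parameters, with the scale chosen to trade off the fine-tuned task against the restored behavior.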

EvilPsyCHo/Open-LLM-Benchmark

Evaluate open-source language models on Agent, formatted output, instruction following, long text, multilingual, coding, and custom task capabilities.

Language: Python - Size: 1.79 MB - Last synced: 25 days ago - Pushed: 26 days ago - Stars: 0 - Forks: 0

Santhoshi-Ravi/minerva

Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets through an innovative Multi-Agent Debate Architecture, rather than traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.

Language: Jupyter Notebook - Size: 91.8 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

logikon-ai/cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

Language: Jupyter Notebook - Size: 1.39 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 5 - Forks: 1

Paulescu/text-embedding-evaluation

Join 14k builders in the Real-World ML Newsletter

Language: Python - Size: 615 KB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 7 - Forks: 0

minnesotanlp/cobbler

Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Language: Jupyter Notebook - Size: 3.92 MB - Last synced: about 1 month ago - Pushed: 4 months ago - Stars: 13 - Forks: 1
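One bias this line of work probes is order (position) bias in pairwise LLM-as-evaluator judgments: judge each item twice with the candidate order swapped, and measure how often the evaluator picks the same underlying response. A hedged sketch of that consistency metric (not the paper's exact protocol):

```python
# Order-bias probe for a pairwise evaluator: a consistent judge that
# picked 'A' with the original order should pick 'B' after the two
# responses are swapped, since the same response now sits in slot B.

def consistency_rate(judgments_ab, judgments_ba):
    """
    judgments_ab[i]: winner ('A' or 'B') with the original order.
    judgments_ba[i]: winner with the responses swapped, so a consistent
    evaluator flips its letter ('A' <-> 'B').
    """
    flipped = {"A": "B", "B": "A"}
    agree = sum(1 for ab, ba in zip(judgments_ab, judgments_ba)
                if ba == flipped[ab])
    return agree / len(judgments_ab)

# Toy judgments over 4 items: the third stayed 'B' after the swap,
# suggesting a preference for whatever sits in the second slot.
ab = ["A", "A", "B", "B"]
ba = ["B", "B", "B", "A"]
print(consistency_rate(ab, ba))  # 0.75
```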

melvinebenezer/Liah-Lie_in_a_haystack

Needle-in-a-haystack testing for LLMs

Language: Python - Size: 2.42 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0
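A needle-in-a-haystack test buries one fact in long filler text at a chosen depth, asks the model to retrieve it, and scores the answer. A minimal sketch of that setup (not this repo's actual harness; `query_model` is a hypothetical stand-in for a real LLM call):

```python
# Needle-in-a-haystack sketch: build a long prompt with one buried fact,
# then check whether a model's answer recovers it.

FILLER = "The sky was clear and the market was quiet that day."
NEEDLE = "The secret passcode is 7421."

def build_haystack(n_filler: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_filler
    pos = int(depth * len(sentences))
    sentences.insert(pos, NEEDLE)
    return " ".join(sentences)

def score(answer: str) -> bool:
    """Pass if the answer contains the buried passcode."""
    return "7421" in answer

prompt = build_haystack(n_filler=50, depth=0.5)
prompt += "\n\nQuestion: What is the secret passcode?"
# In a real run: answer = query_model(prompt)
print(score("The passcode is 7421."))  # True
print(score("I could not find it."))   # False
```

Real harnesses sweep both context length and needle depth to map where retrieval starts to fail.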

lwachowiak/LLMs-for-Social-Robotics

Code and data for the paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"

Language: Jupyter Notebook - Size: 6.2 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 1 - Forks: 0

epfl-dlab/cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

Language: Python - Size: 17.6 MB - Last synced: about 1 month ago - Pushed: 4 months ago - Stars: 30 - Forks: 1

ChemFoundationModels/ChemLLMBench

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Language: Jupyter Notebook - Size: 4.28 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 71 - Forks: 5

SharathHebbar/eval_llms

Language: Jupyter Notebook - Size: 131 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

dinesh-kumar-mr/MediVQA

Part of our final year project work involving complex NLP tasks along with experimentation on various datasets and different LLMs

Language: HTML - Size: 1.98 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

PrincySinghal/Html-code-generation-from-LLM

Fine-tuning and evaluating a Falcon 7B model for generating HTML code from input prompts.

Language: Jupyter Notebook - Size: 294 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

s2e-lab/RegexEval

Source code for the ICSE-NIER'24 paper: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.

Language: Python - Size: 24 MB - Last synced: 21 days ago - Pushed: 3 months ago - Stars: 1 - Forks: 1