Topic: "llm-evaluation"
langfuse/langfuse
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Language: TypeScript - Size: 19.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 10,594 - Forks: 959
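
A minimal tracing sketch, assuming the v2 Python SDK's @observe decorator and credentials supplied via the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables:

```python
from langfuse.decorators import observe

@observe()  # captures inputs, outputs, and timing as a trace span
def generate_answer(question: str) -> str:
    # call your LLM of choice here; nested @observe functions become child spans
    return "..."

generate_answer("What is LLM observability?")
```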

comet-ml/opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Language: Python - Size: 145 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6,668 - Forks: 481

promptfoo/promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
Language: TypeScript - Size: 342 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 6,241 - Forks: 513

confident-ai/deepeval
The LLM Evaluation Framework
Language: Python - Size: 78.5 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 6,045 - Forks: 526
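
A minimal sketch following deepeval's documented quickstart shape; the default AnswerRelevancyMetric uses an LLM judge, so an OPENAI_API_KEY is assumed:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="Orders usually ship within 2-3 business days.",
)
# threshold sets the pass/fail cutoff for the metric's 0-1 relevancy score
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric(threshold=0.7)])
```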

Arize-ai/phoenix
AI Observability & Evaluation
Language: Jupyter Notebook - Size: 300 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 5,433 - Forks: 401
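
A sketch of the local entry point as shown in the project's quickstart; traces arrive via OpenTelemetry instrumentation once the app is running:

```python
import phoenix as px

# starts the local Phoenix UI and returns a session handle
session = px.launch_app()
print(session.url)  # open this in a browser to inspect traces and evals
```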

Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
Language: Python - Size: 175 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 4,486 - Forks: 318

NVIDIA/garak
the LLM vulnerability scanner
Language: Python - Size: 4.81 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 4,317 - Forks: 422

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 70 MB - Last synced at: about 8 hours ago - Pushed at: about 2 months ago - Stars: 3,852 - Forks: 304

Helicone/helicone
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
Language: TypeScript - Size: 409 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 3,640 - Forks: 364
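
The "one line" is a proxy base URL plus an auth header on an existing OpenAI client; a sketch assuming the documented gateway endpoint and a HELICONE_API_KEY environment variable:

```python
import os
from openai import OpenAI

# all requests now flow through Helicone's proxy and show up in its dashboard
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
```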

PacktPublishing/LLM-Engineers-Handbook
The LLM engineer's practical guide: from the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices
Language: Python - Size: 4.46 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2,838 - Forks: 572

Agenta-AI/agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Language: TypeScript - Size: 163 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,606 - Forks: 305

lmnr-ai/lmnr
Laminar - an open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, evals, datasets, labels. YC S24.
Language: TypeScript - Size: 30.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,866 - Forks: 113

msoedov/agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
Language: Python - Size: 21.7 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1,283 - Forks: 200

microsoft/prompty
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
Language: Python - Size: 5.53 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 856 - Forks: 82

cyberark/FuzzyAI
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
Language: Jupyter Notebook - Size: 14.4 MB - Last synced at: about 13 hours ago - Pushed at: about 14 hours ago - Stars: 520 - Forks: 53

onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of foundation LLMs, aiming to probe the technical limits of generative AI.
Size: 12.6 MB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 514 - Forks: 44

relari-ai/continuous-eval
Data-Driven Evaluation for LLM-Powered Applications
Language: Python - Size: 1.7 MB - Last synced at: 5 months ago - Pushed at: 8 months ago - Stars: 446 - Forks: 29

Value4AI/Awesome-LLM-in-Social-Science
Awesome papers involving LLMs in Social Science.
Size: 121 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 426 - Forks: 29

kimtth/awesome-azure-openai-llm
A curated list of 🌌 Azure OpenAI, 🦙 Large Language Models (incl. RAG, Agent), and references with memos.
Language: Python - Size: 285 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 351 - Forks: 43

palico-ai/palico-ai
Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework
Language: TypeScript - Size: 13.7 MB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 339 - Forks: 27

athina-ai/athina-evals
Python SDK for running evaluations on LLM generated responses
Language: Python - Size: 1.84 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 276 - Forks: 17

iMeanAI/WebCanvas
All-in-one Web Agent framework for post-training. Start building with a few clicks!
Language: Python - Size: 5.84 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 239 - Forks: 17

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 235 - Forks: 41

PetroIvaniuk/llms-tools
A list of LLMs Tools & Projects
Size: 187 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 222 - Forks: 26

cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
Language: Python - Size: 30.1 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 201 - Forks: 32

JonathanChavezTamales/LLMStats
A comprehensive set of LLM benchmark scores and provider prices.
Language: JavaScript - Size: 167 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 130 - Forks: 10

raga-ai-hub/raga-llm-hub
Framework for LLM evaluation, guardrails and security
Language: Python - Size: 1.22 MB - Last synced at: about 19 hours ago - Pushed at: 8 months ago - Stars: 112 - Forks: 14

alopatenko/LLMEvaluation
A comprehensive guide to LLM evaluation methods, designed to help identify the most suitable evaluation techniques for various use cases, promote best practices in LLM assessment, and critically examine the effectiveness of these methods.
Language: HTML - Size: 3.96 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 111 - Forks: 9

villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
Language: Python - Size: 11.2 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 108 - Forks: 3

kolenaIO/autoarena
Rank LLMs, RAG systems, and prompts using automated head-to-head evaluation
Language: TypeScript - Size: 2.52 MB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 103 - Forks: 8
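
Head-to-head ranking of this kind typically reduces automated judge verdicts to Elo-style rating updates; a generic sketch of that arithmetic, not AutoArena's actual API:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# two models start at 1000; model A wins one automated judgment
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```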

hkust-nlp/dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
Language: Jupyter Notebook - Size: 4.18 MB - Last synced at: 18 days ago - Pushed at: 4 months ago - Stars: 100 - Forks: 4

rungalileo/hallucination-index
Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.
Size: 1.4 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 91 - Forks: 6

allenai/CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
Language: Python - Size: 1.28 MB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 89 - Forks: 3

Re-Align/just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
Language: Python - Size: 17.7 MB - Last synced at: about 22 hours ago - Pushed at: about 1 year ago - Stars: 85 - Forks: 6

open-compass/GTA
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
Language: Python - Size: 9.98 MB - Last synced at: 16 days ago - Pushed at: 26 days ago - Stars: 82 - Forks: 7

parea-ai/parea-sdk-py
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: Python - Size: 5.48 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 76 - Forks: 6

loganrjmurphy/LeanEuclid
LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.
Language: Lean - Size: 3.59 MB - Last synced at: 5 months ago - Pushed at: 11 months ago - Stars: 76 - Forks: 5

Addepto/contextcheck
MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrates into CI pipelines for automated testing.
Language: Python - Size: 464 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 61 - Forks: 7

azminewasi/Awesome-LLMs-ICLR-24
It is a comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
Size: 821 KB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 60 - Forks: 3

PromptMixerDev/prompt-mixer-app-ce
A desktop application for comparing outputs from different Large Language Models (LLMs).
Language: TypeScript - Size: 3.2 MB - Last synced at: 14 days ago - Pushed at: 17 days ago - Stars: 50 - Forks: 6

deshwalmahesh/PHUDGE
Official repo for the paper "PHUDGE: Phi-3 as Scalable Judge". Evaluate your LLMs with or without a custom rubric or reference answer, in absolute or relative mode, and more. It also lists available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and much more.
Language: Jupyter Notebook - Size: 13.1 MB - Last synced at: 23 days ago - Pushed at: 10 months ago - Stars: 49 - Forks: 7
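
The underlying LLM-as-judge pattern, sketched generically with the OpenAI client (PHUDGE itself serves a fine-tuned Phi-3 judge; the model name and rubric here are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

def judge(question: str, reference: str, answer: str) -> str:
    # absolute grading against a rubric; relative mode would compare two answers
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model, not PHUDGE's
        messages=[
            {"role": "system", "content": "Score the answer 1-5 for factual accuracy "
             "against the reference. Reply with the score and one sentence of reasoning."},
            {"role": "user", "content": f"Question: {question}\nReference: {reference}\nAnswer: {answer}"},
        ],
    )
    return resp.choices[0].message.content
```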

Chainlit/literalai-cookbooks
Cookbooks and tutorials on Literal AI
Language: Jupyter Notebook - Size: 8.65 MB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 48 - Forks: 13

cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created with Gemini 2.5 Pro using a technique called "vibe coding".
Language: HTML - Size: 8.64 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 45 - Forks: 5

SajiJohnMiranda/DoCoreAI
DoCoreAI is a next-gen open-source AI profiler that optimizes reasoning, creativity, precision, and temperature in a single step, cutting token usage by 15-30% and lowering LLM API costs.
Language: Python - Size: 1.88 MB - Last synced at: 4 days ago - Pushed at: 7 days ago - Stars: 43 - Forks: 1

multinear/multinear
Develop reliable AI apps
Language: Svelte - Size: 1.12 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 36 - Forks: 1

zhuohaoyu/KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Language: Python - Size: 10.6 MB - Last synced at: 23 days ago - Pushed at: 9 months ago - Stars: 36 - Forks: 2

ZeroSumEval/ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
Language: Python - Size: 10.4 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 33 - Forks: 4

adithya-s-k/indic_eval
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
Language: Python - Size: 555 KB - Last synced at: 22 days ago - Pushed at: 11 months ago - Stars: 33 - Forks: 7

google/litmus
Litmus is a comprehensive LLM testing and evaluation tool designed for GenAI application development. It provides a robust platform with a user-friendly UI that streamlines building and assessing the performance of your LLM-powered applications.
Language: Vue - Size: 303 MB - Last synced at: 4 days ago - Pushed at: 9 days ago - Stars: 31 - Forks: 4

fuxiAIlab/CivAgent
CivAgent is an LLM-based, human-like agent that acts as a digital player within the strategy game Unciv.
Language: Python - Size: 53.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 4

SapienzaNLP/ita-bench
A collection of Italian benchmarks for LLM evaluation
Language: Python - Size: 728 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 30 - Forks: 0

ChanLiang/CONNER
The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators".
Language: Python - Size: 15.8 MB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 30 - Forks: 2

kereva-dev/kereva-scanner
Code scanner to check for issues in prompts and LLM calls
Language: Python - Size: 7.12 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 2

Yifan-Song793/GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Language: Python - Size: 2.04 MB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 28 - Forks: 1
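
The paper's point lends itself to a small sketch: score several sampled generations per prompt and report the spread, rather than trusting a single greedy run (`generate` and `score` are hypothetical stand-ins):

```python
import statistics

def score_with_spread(generate, score, prompt: str, n_samples: int = 8) -> tuple[float, float]:
    # sample the same prompt repeatedly at nonzero temperature
    scores = [score(prompt, generate(prompt)) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.stdev(scores)
```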

alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
Language: Python - Size: 6.7 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 27 - Forks: 0
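
Asynchronous fan-out of this kind generally looks like the following; a stdlib-only sketch of the idea, not prompto's actual API:

```python
import asyncio

async def query_endpoint(prompt: str) -> str:
    # hypothetical stand-in for a real HTTP call to an LLM endpoint
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

async def run_batch(prompts: list[str]) -> list[str]:
    # issue all requests concurrently instead of awaiting them one by one
    return await asyncio.gather(*(query_endpoint(p) for p in prompts))

print(asyncio.run(run_batch(["a", "b", "c"])))
```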

mts-ai/rurage
Language: Python - Size: 3.85 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 27 - Forks: 0

dannylee1020/openpo
Language: Python - Size: 10.7 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 27 - Forks: 0

Praful932/llmsearch
Find better generation parameters for your LLM
Language: Python - Size: 5.04 MB - Last synced at: 23 days ago - Pushed at: 11 months ago - Stars: 27 - Forks: 1
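
Generation-parameter search is essentially hyperparameter tuning over sampling settings; a generic grid-search sketch with hypothetical `generate` and `score` callables, not llmsearch's API:

```python
from itertools import product

def grid_search(generate, score, prompts: list[str]) -> tuple[dict, float]:
    grid = {"temperature": [0.2, 0.7, 1.0], "top_p": [0.9, 1.0]}
    best, best_score = None, float("-inf")
    for temperature, top_p in product(grid["temperature"], grid["top_p"]):
        params = {"temperature": temperature, "top_p": top_p}
        outputs = [generate(p, **params) for p in prompts]
        avg = sum(score(p, o) for p, o in zip(prompts, outputs)) / len(prompts)
        if avg > best_score:
            best, best_score = params, avg
    return best, best_score
```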

CodeEval-Pro/CodeEval-Pro
Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"
Language: Python - Size: 4.1 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 26 - Forks: 2

LLM-Evaluation-s-Always-Fatiguing/leaf-playground
A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
Language: Python - Size: 868 KB - Last synced at: 19 days ago - Pushed at: 10 months ago - Stars: 24 - Forks: 0

VITA-Group/llm-kick
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. "Compressing LLMs: The Truth Is Rarely Pure and Never Simple."
Language: Python - Size: 7.11 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 23 - Forks: 5

AgiFlow/agiflow-sdks
LLM QA, Observability, Evaluation and User Feedback
Language: TypeScript - Size: 2.5 MB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 22 - Forks: 2

Supahands/llm-comparison
An open-source project for comparing two LLMs head to head on a given prompt. It supports a wide range of models, from open-source Ollama models to the likes of OpenAI and Claude.
Language: TypeScript - Size: 888 KB - Last synced at: 19 days ago - Pushed at: about 1 month ago - Stars: 20 - Forks: 1

kieranklaassen/leva
LLM Evaluation Framework for Rails apps to be used with production data.
Language: HTML - Size: 262 KB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 1

Babelscape/ALERT
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Language: Python - Size: 177 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 18 - Forks: 1

rhesis-ai/rhesis-sdk
Open-source test generation SDK for LLM applications. Access curated test sets. Build context-specific test sets and collaborate with subject matter experts.
Language: Python - Size: 420 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 17 - Forks: 0

Cybonto/OllaDeck
OllaDeck is a purple technology stack for Generative AI (text modality) cybersecurity. It provides a comprehensive set of tools for both blue team and red team operations in the context of text-based generative AI.
Language: Python - Size: 82.9 MB - Last synced at: about 11 hours ago - Pushed at: 7 months ago - Stars: 17 - Forks: 2

equinor/promptly
A prompt collection for testing and evaluation of LLMs.
Language: Jupyter Notebook - Size: 1.74 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 16 - Forks: 1

intuit-ai-research/DCR-consistency
DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
Language: Python - Size: 2.07 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 16 - Forks: 1

honeyhiveai/realign
Realign is a testing and simulation framework for AI applications.
Language: Python - Size: 27.3 MB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 15 - Forks: 1

EveripediaNetwork/grokit
Unofficial Python SDK for Grok, usable with any X Premium account.
Language: Python - Size: 34.2 KB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 15 - Forks: 5

aws-samples/fm-leaderboarder
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.
Language: Python - Size: 511 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 14 - Forks: 4

Now-Join-Us/OmniEvalKit (fork of AIDC-AI/M3Bench)
The code repository for "OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions"
Language: Python - Size: 3.82 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 13 - Forks: 2

ksm26/Pretraining-LLMs
Master the essential steps of pretraining large language models (LLMs). Learn to create high-quality datasets, configure model architectures, execute training runs, and assess model performance for efficient and effective LLM pretraining.
Language: Jupyter Notebook - Size: 29.3 KB - Last synced at: 26 days ago - Pushed at: 9 months ago - Stars: 13 - Forks: 5

minnesotanlp/cobbler
Code and data for ACL ARR 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Language: Jupyter Notebook - Size: 3.92 MB - Last synced at: 12 months ago - Pushed at: about 1 year ago - Stars: 13 - Forks: 1

AI4Bharat/Anudesh-Frontend
Language: JavaScript - Size: 166 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 12 - Forks: 7

aigc-apps/PertEval
[NeurIPS '24 Spotlight] PertEval: Unveiling Real Knowledge Capacity of LLMs via Knowledge-invariant Perturbations
Language: Jupyter Notebook - Size: 12.9 MB - Last synced at: 10 days ago - Pushed at: 6 months ago - Stars: 12 - Forks: 2

VILA-Lab/Open-LLM-Leaderboard
Open-LLM-Leaderboard: Open-Style Question Evaluation. Paper at https://arxiv.org/abs/2406.07545
Language: Python - Size: 4.33 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 12 - Forks: 0

AI4Bharat/MILU
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.
Language: Python - Size: 1.73 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 4

gretelai/navigator-helpers 📦
Navigator Helpers
Language: Python - Size: 9.31 MB - Last synced at: 19 days ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

armingh2000/FactScoreLite
FactScoreLite is an implementation of the FactScore metric, designed for detailed accuracy assessment in text generation. This package builds upon the framework provided by the original FactScore repository, which is no longer maintained and contains outdated functions.
Language: Python - Size: 674 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 11 - Forks: 1
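
The metric itself is simple to state: decompose a generation into atomic facts and report the fraction a knowledge source supports. A sketch of that definition (`is_supported` is a hypothetical verifier; FactScoreLite uses GPT-based checks):

```python
def factscore(atomic_facts: list[str], is_supported) -> float:
    # fraction of atomic facts the knowledge source supports
    if not atomic_facts:
        return 0.0
    return sum(1 for fact in atomic_facts if is_supported(fact)) / len(atomic_facts)
```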

genia-dev/vibraniumdome
LLM Security Platform.
Language: Python - Size: 2.87 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 10 - Forks: 2

bowen-upenn/llm_token_bias
[EMNLP 2024] This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.
Language: Python - Size: 57.4 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 9 - Forks: 1

Networks-Learning/prediction-powered-ranking
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
Language: Jupyter Notebook - Size: 4.74 MB - Last synced at: about 8 hours ago - Pushed at: 6 months ago - Stars: 9 - Forks: 1

LRudL/sad
Situational Awareness Dataset
Language: HTML - Size: 551 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 9 - Forks: 0

danilop/llm-test-mate
A simple testing framework to evaluate and validate LLM-generated content using string similarity, semantic similarity, and model-based evaluation.
Language: Python - Size: 83 KB - Last synced at: about 9 hours ago - Pushed at: 3 months ago - Stars: 8 - Forks: 0
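
The string-similarity leg of such a framework can be illustrated with the stdlib alone; the function name and threshold are illustrative, not llm-test-mate's API:

```python
import difflib

def passes_string_similarity(expected: str, actual: str, threshold: float = 0.8) -> bool:
    # ratio() is 2*matches/(len_a+len_b), so 1.0 means identical strings
    ratio = difflib.SequenceMatcher(None, expected, actual).ratio()
    return ratio >= threshold

print(passes_string_similarity("ships in 2-3 days", "ships within 2-3 days"))  # True
```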

amazon-science/llm-code-preference
Training and Benchmarking LLMs for Code Preference.
Language: Python - Size: 156 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 8 - Forks: 0

brucewlee/nutcracker
Large Model Evaluation Experiments
Language: Python - Size: 431 KB - Last synced at: 12 days ago - Pushed at: 7 months ago - Stars: 7 - Forks: 1

prompt-foundry/python-sdk
The prompt engineering, prompt management, and prompt evaluation tool for Python
Language: Python - Size: 20.7 MB - Last synced at: about 12 hours ago - Pushed at: 7 months ago - Stars: 7 - Forks: 0

yandex-research/mind-your-format
Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Language: Jupyter Notebook - Size: 9.12 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 0

HillPhelmuth/LlmAsJudgeEvalPlugins
LLM-as-judge evals as Semantic Kernel Plugins
Language: C# - Size: 890 KB - Last synced at: 14 days ago - Pushed at: 4 months ago - Stars: 6 - Forks: 1

mtuann/llm-updated-papers
Papers related to Large Language Models in all top venues
Size: 553 KB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 5 - Forks: 1

langfuse/oss-llmops-stack
Modular, open source LLMOps stack that separates concerns: LiteLLM unifies LLM APIs, manages routing and cost controls, and ensures high-availability, while Langfuse focuses on detailed observability, prompt versioning, and performance evaluations.
Size: 316 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 5 - Forks: 0

simranjeet97/Learn_RAG_from_Scratch_LLM
Learn Retrieval-Augmented Generation (RAG) from scratch using LLMs from Hugging Face, with LangChain or plain Python.
Language: Jupyter Notebook - Size: 425 KB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 5 - Forks: 3

AI4Bharat/Anudesh
An open-source platform to annotate data for large language models, at scale.
Size: 23.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

parea-ai/parea-sdk-ts
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language: TypeScript - Size: 450 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 4 - Forks: 1

johnsonhk88/Deep-Research-With-Web-Scraping-by-LLM-And-AI-Agent
Use LLM/AI agents for web scraping (data collection) and data analysis with deep research.
Language: Jupyter Notebook - Size: 217 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 3 - Forks: 1

zabir-nabil/bangla-multilingual-llm-eval
Evaluation of Open and Closed-Source Multi-lingual LLMs for Low-Resource Bangla Language
Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

yukinagae/genkitx-promptfoo
Community Plugin for Genkit to use Promptfoo
Language: TypeScript - Size: 553 KB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

svilupp/Spehulak.jl
GenAI observability application in Julia
Language: Julia - Size: 1.9 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0
