An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: llm-serving

alibaba/ServeGen

A framework for generating realistic LLM serving workloads

Language: Python - Size: 115 MB - Last synced at: about 11 hours ago - Pushed at: about 12 hours ago - Stars: 14 - Forks: 2

efeslab/Nanoflow

A throughput-oriented high-performance serving framework for LLMs

Language: C++ - Size: 32.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 816 - Forks: 37

tdchaitanya/looplm

🔄 LoopLM: Command line tool accessing LLMs directly from your terminal

Language: Python - Size: 1.35 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models.

Language: Python - Size: 20.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 14,847 - Forks: 1,900

skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Language: Python - Size: 154 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8,191 - Forks: 661

ray-project/ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Language: Python - Size: 524 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 37,335 - Forks: 6,335

vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Language: Python - Size: 53.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 48,708 - Forks: 7,727

ajithvcoder/TSAI-EMLO-4.0

Contains solutions for assignments and learning notes from the Extensive Machine Learning Operations course of The School of AI

Language: Python - Size: 32.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 1

borisdev/stack-sandbox Fork of michaeloliverx/python-poetry-docker-example

Stack Sandbox: uv & FastAPI & NextJS & Azure

Language: Dockerfile - Size: 47.9 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

NexusGPU/tensor-fusion

Tensor Fusion is a state-of-the-art GPU virtualization and pooling solution designed to optimize GPU cluster utilization to its fullest potential.

Language: Go - Size: 1.02 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 40 - Forks: 11

alibaba/rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Language: C++ - Size: 296 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 780 - Forks: 65

mosecorg/mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

Language: Python - Size: 1.14 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 843 - Forks: 61

vllm-project/vllm-ascend

Community maintained hardware plugin for vLLM on Ascend

Language: Python - Size: 1.67 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 703 - Forks: 176

helixml/helix

♾️ Helix is a private GenAI stack for building AI applications with declarative pipelines, knowledge (RAG), API bindings, and first-class testing.

Language: Go - Size: 51 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 500 - Forks: 51

interestingLSY/swiftLLM

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

Language: Python - Size: 226 KB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 200 - Forks: 23

predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Language: Python - Size: 6.62 MB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 2,989 - Forks: 215

Adarshreddyash/surfing-weights

Surfing weights to edge devices

Language: Python - Size: 8.36 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

gpustack/gpustack

Manage GPU clusters for running AI models

Language: Python - Size: 94.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2,783 - Forks: 282

gty111/gLLM

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

Language: Python - Size: 1.26 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 12 - Forks: 1

thu-pacman/chitu

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Language: Python - Size: 5.21 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,124 - Forks: 73

friendliai/friendli-client

Friendli: the fastest serving engine for generative AI

Language: Python - Size: 4.88 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 46 - Forks: 7

liguodongiot/llm-action

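This project shares the technical principles and hands-on experience behind large models (LLM engineering and putting LLM applications into production)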

Language: HTML - Size: 22.1 MB - Last synced at: 9 days ago - Pushed at: 11 days ago - Stars: 18,073 - Forks: 2,121

rohan-paul/LLM-FineTuning-Large-Language-Models

LLM (Large Language Model) FineTuning

Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 538 - Forks: 126

superduper-io/superduper

Superduper: End-to-end framework for building custom AI applications and agents.

Language: Python - Size: 73.7 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 5,066 - Forks: 494

bentoml/BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

Language: Python - Size: 95.6 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 7,740 - Forks: 840

microsoft/aici

AICI: Prompts as (Wasm) Programs

Language: Rust - Size: 9.71 MB - Last synced at: 10 days ago - Pushed at: 5 months ago - Stars: 2,027 - Forks: 83

MoonshotAI/MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

Language: Python - Size: 2.4 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 1,779 - Forks: 106

France-Travail/happy_vllm

A REST API for vLLM, production ready

Language: Python - Size: 859 KB - Last synced at: 7 days ago - Pushed at: 19 days ago - Stars: 21 - Forks: 2

bentoml/OpenLLM

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.

Language: Python - Size: 41.1 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 11,288 - Forks: 722

zhihu/ZhiLight

A highly optimized LLM inference acceleration engine for Llama and its variants.

Language: C++ - Size: 996 KB - Last synced at: 9 days ago - Pushed at: 22 days ago - Stars: 890 - Forks: 104

sugarcane-ai/sugarcane-ai

npm-like package ecosystem for Prompts 🤖

Language: TypeScript - Size: 11.5 MB - Last synced at: about 4 hours ago - Pushed at: 4 months ago - Stars: 49 - Forks: 14

powerserve-project/PowerServe

High-speed and easy-to-use LLM serving framework for local deployment

Language: C++ - Size: 1.11 MB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 107 - Forks: 9

EmbeddedLLM/embeddedllm

EmbeddedLLM: API server for Embedded Device Deployment. Currently supports CUDA/OpenVINO/IpexLLM/DirectML/CPU

Language: Python - Size: 12.6 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 39 - Forks: 1

cortecs-ai/cortecs-py

Lightweight wrapper for cortecs' provisioning API

Language: Python - Size: 418 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 6 - Forks: 0

bigai-nlco/TokenSwift

[ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

Language: Python - Size: 61.6 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 92 - Forks: 8

hpcaitech/SwiftInfer

Efficient AI Inference & Serving

Language: Python - Size: 508 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 469 - Forks: 29

galeselee/Awesome_LLM_System-PaperList

Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focusing mainly on inference acceleration; related work will be added gradually. Contributions are welcome!

Size: 616 KB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 249 - Forks: 12

genlm/genlm-backend

High-performance backend for language model probabilistic programs

Language: Python - Size: 2.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 0

azminewasi/Awesome-LLMs-ICLR-24

It is a comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.

Size: 821 KB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 61 - Forks: 3

ray-project/ray-educational-materials 📦

This is a suite of hands-on training materials that show how to scale CV, NLP, and time-series forecasting workloads with Ray.

Language: Jupyter Notebook - Size: 24 MB - Last synced at: 29 days ago - Pushed at: over 1 year ago - Stars: 393 - Forks: 76

Moha111-h/Qwen3

Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.

Language: Shell - Size: 3.07 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AdrianMosnegutu/docscribe.nvim

A Neovim plugin for generating inline documentation for your functions using LLMs.

Language: Lua - Size: 7.32 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

mani-kantap/llm-inference-solutions

A collection of all available inference solutions for the LLMs

Size: 30.3 KB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 87 - Forks: 3

nuhmanpk/quick-llama

Run Ollama models anywhere easily

Language: Python - Size: 319 KB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

theneildave/ml-engineering

Machine Engineering Comprehensive Guide

Size: 1000 Bytes - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

torchpipe/torchpipe

Serving Inside PyTorch

Language: C++ - Size: 41.6 MB - Last synced at: 14 days ago - Pushed at: 30 days ago - Stars: 160 - Forks: 13

efficientscaling/Z1

Repo for "Z1: Efficient Test-time Scaling with Code"

Language: Python - Size: 422 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 45 - Forks: 1

HPMLL/BurstGPT

A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems

Language: Python - Size: 19 MB - Last synced at: about 2 months ago - Pushed at: 8 months ago - Stars: 159 - Forks: 9

Neural-Dragon-AI/Cynde

A Framework For Intelligence Farming

Language: Python - Size: 1.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 13 - Forks: 0

ehsanghaffar/llm-practice

A self-hosted personal chatbot API with FastAPI. It allows you to interact with the Llama2 LLM (and other open-source LLMs) to have natural language conversations, generate text, and perform various language-related tasks.

Language: Jupyter Notebook - Size: 108 KB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 11 - Forks: 2

France-Travail/benchmark_llm_serving

A library to benchmark LLMs via their API exposure

Language: Python - Size: 8.04 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 6 - Forks: 0

AkshaySyal/End-to-End-Basketball-QA-RAG-Capstone

Created a QA chatbot powered by a fine-tuned text-to-SQL LLM deployed on a personal gaming laptop (Nvidia GTX 1650) using Ollama and Ngrok

Language: Jupyter Notebook - Size: 3.54 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 1

jbchouinard/llmailbot

A service for chatting with LLMs via email.

Language: Python - Size: 296 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Kira94-hkz/PowerServe

High-speed and easy-to-use LLM serving framework for local deployment

Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

ivynya/illm

internet llm - access your ollama (or any other local llm) instance from across the internet

Language: Go - Size: 85.9 KB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

henryle97/llm-serving-benchmark

LLM Serving Libs Benchmark

Language: Python - Size: 0 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

MinjaeKIM753/ClaudeComputerUseBeta-Win64

Claude 3.5 Sonnet ComputerUse (Beta) for Win64

Language: Python - Size: 198 KB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 4

CentML/llm-inference-bench

Lightweight and extensible LLM Inference serving benchmark tool written in Rust.

Language: Rust - Size: 18.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

genia-dev/vibraniumdome

LLM Security Platform.

Language: Python - Size: 2.87 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 2

george-mountain/web-app-builder--LLM

Building Static Web Applications using a Large Language Model. From hand-sketched documents, images, and screenshots to proper web pages.

Language: Python - Size: 2.11 MB - Last synced at: 27 days ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

Jason-cs18/HetServe-Foundation

An Overview of Efficiently Serving Foundation Models across Edge Devices

Size: 358 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 13 - Forks: 0

IvanLuLyf/bunny-llm

Deno LLM API Service

Language: TypeScript - Size: 132 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 7 - Forks: 1

romitjain/gpt-benchmark

Making small models as fast as possible

Language: Python - Size: 1.91 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

unaidedelf8777/faster-outlines

A Lazy, high throughput and blazing fast structured text generation backend.

Language: Rust - Size: 3.68 MB - Last synced at: 7 days ago - Pushed at: 7 months ago - Stars: 5 - Forks: 0

oscinis-com/Awesome-LLM-Productization

Awesome-LLM-Productization: a curated list of tools/tricks/news/regulations about AI and Large Language Model (LLM) productization

Size: 275 KB - Last synced at: 28 days ago - Pushed at: 4 months ago - Stars: 23 - Forks: 4

slai-labs/get-beam

Run GPU inference and training jobs on serverless infrastructure that scales with you.

Language: Shell - Size: 5.96 MB - Last synced at: about 2 months ago - Pushed at: 12 months ago - Stars: 102 - Forks: 23

diverged/tavily-go

An unofficial Go port of the official Tavily API Python Wrapper.

Language: Go - Size: 17.6 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

fork123aniket/LLM-RAG-powered-QA-App

A Production-Ready, Scalable RAG-powered LLM-based Context-Aware QA App

Language: Python - Size: 22.5 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 5 - Forks: 1

EAlmazanG/sentiment-analysis-reviews

A cost-effective solution for stores and startups to analyze customer reviews, classify sentiment (positive, neutral, negative), and gain actionable insights through an interactive dashboard.

Language: Jupyter Notebook - Size: 34.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

biosfood/intel-llm-guide

A guide on how to run LLMs on Intel CPUs

Language: Python - Size: 20.5 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

chenhunghan/ialacol 📦

🪶 Lightweight OpenAI drop-in replacement for Kubernetes

Language: Python - Size: 250 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 144 - Forks: 17

friendliai/lm-evaluation-harness Fork of EleutherAI/lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.

Language: Python - Size: 28.1 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

genia-dev/vibraniumdome-docs

LLM Security Platform Docs

Language: MDX - Size: 635 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

ray-project/ray-llm 📦

RayLLM - LLMs on Ray

Language: Python - Size: 1.98 MB - Last synced at: 7 months ago - Pushed at: about 1 year ago - Stars: 1,230 - Forks: 94

biomchen/llm-serving

Basic APIs for serving LLMs locally.

Language: Python - Size: 31.3 KB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

KevinZeng08/efficient-large-model-papers

A Curated Paper List for Efficient Large Models

Size: 1.95 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 1

substratusai/runbooks 📦

Finetune LLMs on K8s by using Runbooks

Language: Go - Size: 5.22 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 168 - Forks: 14

okikorg/okik

Okik is a serving framework to deploy LLMs and much more.

Language: Python - Size: 5.13 MB - Last synced at: 28 days ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

valyu-network/Stitch

Stitch simplifies and scales LLM application deployment, reducing infrastructure complexity and costs.

Language: Python - Size: 2.53 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

LoopGlitch26/Hinglish-AI-Mentor

Hinglish Chatbot powered by Azure Cognitive Services, Google Translate and Open AI

Language: Jupyter Notebook - Size: 974 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 1

ray-project/llms-in-prod-workshop-2023 📦

Deploy and Scale LLM-based applications

Language: Jupyter Notebook - Size: 13.1 MB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 23 - Forks: 3

ray-project/anyscale-berkeley-ai-hackathon 📦

Ray and Anyscale for UC Berkeley AI Hackathon!

Language: Jupyter Notebook - Size: 77.1 KB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 11 - Forks: 0

george-mountain/LLM-Local-Streaming

Streaming of LLM responses in real time using FastAPI and Streamlit.

Language: Python - Size: 32.2 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

sugarcane-ai/sugarcane-ai.github.io

Language: Astro - Size: 17.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 15 - Forks: 3

InquestGeronimo/horizon-takeoff

Automating the deployment of the Takeoff Server on AWS for LLMs

Language: Python - Size: 1.08 MB - Last synced at: 15 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

emmaalecrim/llm-ws

TypeScript LLM WebSocket reverse proxy built for streaming of various inference tasks

Language: TypeScript - Size: 673 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

suleymansevimli/run-llm-model-locally

You can run any large language model on your local machine with this repository.

Language: Python - Size: 1.95 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

asprenger/ray_vllm_inference

A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.

Language: Python - Size: 81.1 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

mddunlap924/LLM-Inference-Serving

This repository demonstrates LLM execution on CPUs using packages like llamafile, emphasizing low-latency, high-throughput, and cost-effective benefits for inference and serving.

Language: Jupyter Notebook - Size: 6.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

liux2/DL_env_Setups

Deep learning environment setups

Language: Shell - Size: 23.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ray-project/llm-application 📦

Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 2

Stosan/commentator

Language: Python - Size: 12.2 MB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 1

awesome-software/EasyEdit Fork of zjunlp/EasyEdit

An Easy-to-use Knowledge Editing Framework for LLMs.

Size: 15.5 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0