GitHub topics: llm-serving
alibaba/ServeGen
A framework for generating realistic LLM serving workloads
Language: Python - Size: 115 MB - Last synced at: about 11 hours ago - Pushed at: about 12 hours ago - Stars: 14 - Forks: 2

efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
Language: C++ - Size: 32.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 816 - Forks: 37

tdchaitanya/looplm
🔄 LoopLM: Command-line tool for accessing LLMs directly from your terminal
Language: Python - Size: 1.35 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
Language: Python - Size: 20.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 14,847 - Forks: 1,900

skypilot-org/skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Language: Python - Size: 154 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8,191 - Forks: 661

ray-project/ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Language: Python - Size: 524 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 37,335 - Forks: 6,335

vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Language: Python - Size: 53.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 48,708 - Forks: 7,727

ajithvcoder/TSAI-EMLO-4.0
Contains solutions for assignments and learning notes from the Extensive Machine Learning Operations course of The School of AI
Language: Python - Size: 32.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 1

borisdev/stack-sandbox Fork of michaeloliverx/python-poetry-docker-example
Stack Sandbox: uv & FastAPI & NextJS & Azure
Language: Dockerfile - Size: 47.9 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

NexusGPU/tensor-fusion
Tensor Fusion is a state-of-the-art GPU virtualization and pooling solution designed to optimize GPU cluster utilization to its fullest potential.
Language: Go - Size: 1.02 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 40 - Forks: 11

alibaba/rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Language: C++ - Size: 296 MB - Last synced at: 5 days ago - Pushed at: 24 days ago - Stars: 780 - Forks: 65

mosecorg/mosec
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
Language: Python - Size: 1.14 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 843 - Forks: 61

vllm-project/vllm-ascend
Community-maintained hardware plugin for vLLM on Ascend
Language: Python - Size: 1.67 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 703 - Forks: 176

helixml/helix
♾️ Helix is a private GenAI stack for building AI applications with declarative pipelines, knowledge (RAG), API bindings, and first-class testing.
Language: Go - Size: 51 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 500 - Forks: 51

interestingLSY/swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Language: Python - Size: 226 KB - Last synced at: 5 days ago - Pushed at: 17 days ago - Stars: 200 - Forks: 23

predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Language: Python - Size: 6.62 MB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 2,989 - Forks: 215

Adarshreddyash/surfing-weights
Surfing weights to edge devices
Language: Python - Size: 8.36 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

gpustack/gpustack
Manage GPU clusters for running AI models
Language: Python - Size: 94.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2,783 - Forks: 282

gty111/gLLM
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Language: Python - Size: 1.26 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 12 - Forks: 1

thu-pacman/chitu
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Language: Python - Size: 5.21 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,124 - Forks: 73

friendliai/friendli-client
Friendli: the fastest serving engine for generative AI
Language: Python - Size: 4.88 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 46 - Forks: 7

liguodongiot/llm-action
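This project aims to share the technical principles and practical experience of large models (large-model engineering and deploying large-model applications in production)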
Language: HTML - Size: 22.1 MB - Last synced at: 9 days ago - Pushed at: 11 days ago - Stars: 18,073 - Forks: 2,121

rohan-paul/LLM-FineTuning-Large-Language-Models
LLM (Large Language Model) FineTuning
Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 538 - Forks: 126

superduper-io/superduper
Superduper: End-to-end framework for building custom AI applications and agents.
Language: Python - Size: 73.7 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 5,066 - Forks: 494

bentoml/BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Language: Python - Size: 95.6 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 7,740 - Forks: 840

microsoft/aici
AICI: Prompts as (Wasm) Programs
Language: Rust - Size: 9.71 MB - Last synced at: 10 days ago - Pushed at: 5 months ago - Stars: 2,027 - Forks: 83

MoonshotAI/MoBA
MoBA: Mixture of Block Attention for Long-Context LLMs
Language: Python - Size: 2.4 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 1,779 - Forks: 106

France-Travail/happy_vllm
A REST API for vLLM, production ready
Language: Python - Size: 859 KB - Last synced at: 7 days ago - Pushed at: 19 days ago - Stars: 21 - Forks: 2

bentoml/OpenLLM
Run any open-source LLM, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud.
Language: Python - Size: 41.1 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 11,288 - Forks: 722

zhihu/ZhiLight
A highly optimized LLM inference acceleration engine for Llama and its variants.
Language: C++ - Size: 996 KB - Last synced at: 9 days ago - Pushed at: 22 days ago - Stars: 890 - Forks: 104

sugarcane-ai/sugarcane-ai
npm-like package ecosystem for Prompts 🤖
Language: TypeScript - Size: 11.5 MB - Last synced at: about 4 hours ago - Pushed at: 4 months ago - Stars: 49 - Forks: 14

powerserve-project/PowerServe
High-speed and easy-to-use LLM serving framework for local deployment
Language: C++ - Size: 1.11 MB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 107 - Forks: 9

EmbeddedLLM/embeddedllm
EmbeddedLLM: API server for Embedded Device Deployment. Currently supports CUDA/OpenVINO/IpexLLM/DirectML/CPU
Language: Python - Size: 12.6 MB - Last synced at: 5 days ago - Pushed at: 8 months ago - Stars: 39 - Forks: 1

cortecs-ai/cortecs-py
Lightweight wrapper for cortecs' provisioning API
Language: Python - Size: 418 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 6 - Forks: 0

bigai-nlco/TokenSwift
[ICML 2025] TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Language: Python - Size: 61.6 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 92 - Forks: 8

hpcaitech/SwiftInfer
Efficient AI Inference & Serving
Language: Python - Size: 508 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 469 - Forks: 29

galeselee/Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focusing mainly on inference acceleration; related works will be added gradually. Contributions are welcome!
Size: 616 KB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 249 - Forks: 12

genlm/genlm-backend
High-performance backend for language model probabilistic programs
Language: Python - Size: 2.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 0

azminewasi/Awesome-LLMs-ICLR-24
It is a comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
Size: 821 KB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 61 - Forks: 3

ray-project/ray-educational-materials 📦
A suite of hands-on training materials that show how to scale CV, NLP, and time-series forecasting workloads with Ray.
Language: Jupyter Notebook - Size: 24 MB - Last synced at: 29 days ago - Pushed at: over 1 year ago - Stars: 393 - Forks: 76

Moha111-h/Qwen3
Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.
Language: Shell - Size: 3.07 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AdrianMosnegutu/docscribe.nvim
A Neovim plugin for generating inline documentation for your functions using LLMs.
Language: Lua - Size: 7.32 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

mani-kantap/llm-inference-solutions
A collection of all available inference solutions for LLMs
Size: 30.3 KB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 87 - Forks: 3

nuhmanpk/quick-llama
Run Ollama models anywhere easily
Language: Python - Size: 319 KB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 4 - Forks: 0

theneildave/ml-engineering
Machine Engineering Comprehensive Guide
Size: 1000 Bytes - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

torchpipe/torchpipe
Serving Inside PyTorch
Language: C++ - Size: 41.6 MB - Last synced at: 14 days ago - Pushed at: 30 days ago - Stars: 160 - Forks: 13

efficientscaling/Z1
Repo for "Z1: Efficient Test-time Scaling with Code"
Language: Python - Size: 422 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 45 - Forks: 1

HPMLL/BurstGPT
A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
Language: Python - Size: 19 MB - Last synced at: about 2 months ago - Pushed at: 8 months ago - Stars: 159 - Forks: 9

Neural-Dragon-AI/Cynde
A Framework For Intelligence Farming
Language: Python - Size: 1.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 13 - Forks: 0

ehsanghaffar/llm-practice
A self-hosted personal chatbot API with FastAPI. It allows you to interact with the Llama2 LLM (and other open-source LLMs) to have natural language conversations, generate text, and perform various language-related tasks.
Language: Jupyter Notebook - Size: 108 KB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 11 - Forks: 2

France-Travail/benchmark_llm_serving
A library to benchmark LLMs via their API exposure
Language: Python - Size: 8.04 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 6 - Forks: 0

AkshaySyal/End-to-End-Basketball-QA-RAG-Capstone
A QA chatbot powered by a fine-tuned text-to-SQL LLM, deployed on a personal gaming laptop (Nvidia GTX 1650) using Ollama and Ngrok
Language: Jupyter Notebook - Size: 3.54 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 1

jbchouinard/llmailbot
A service for chatting with LLMs via email.
Language: Python - Size: 296 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Kira94-hkz/PowerServe
High-speed and easy-to-use LLM serving framework for local deployment
Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

ivynya/illm
internet llm - access your ollama (or any other local llm) instance from across the internet
Language: Go - Size: 85.9 KB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

henryle97/llm-serving-benchmark
LLM Serving Libs Benchmark
Language: Python - Size: 0 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

MinjaeKIM753/ClaudeComputerUseBeta-Win64
Claude 3.5 Sonnet ComputerUse (Beta) for Win64
Language: Python - Size: 198 KB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 4

CentML/llm-inference-bench
Lightweight and extensible LLM Inference serving benchmark tool written in Rust.
Language: Rust - Size: 18.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

genia-dev/vibraniumdome
LLM Security Platform.
Language: Python - Size: 2.87 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 2

george-mountain/web-app-builder--LLM
Building static web applications using large language models: from hand-sketched documents, images, and screenshots to proper web pages.
Language: Python - Size: 2.11 MB - Last synced at: 27 days ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

Jason-cs18/HetServe-Foundation
An Overview of Efficiently Serving Foundation Models across Edge Devices
Size: 358 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 13 - Forks: 0

IvanLuLyf/bunny-llm
Deno LLM API Service
Language: TypeScript - Size: 132 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 7 - Forks: 1

romitjain/gpt-benchmark
Making small models as fast as possible
Language: Python - Size: 1.91 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

unaidedelf8777/faster-outlines
A lazy, high-throughput, and blazing-fast structured text generation backend.
Language: Rust - Size: 3.68 MB - Last synced at: 7 days ago - Pushed at: 7 months ago - Stars: 5 - Forks: 0

oscinis-com/Awesome-LLM-Productization
Awesome-LLM-Productization: a curated list of tools/tricks/news/regulations about AI and Large Language Model (LLM) productization
Size: 275 KB - Last synced at: 28 days ago - Pushed at: 4 months ago - Stars: 23 - Forks: 4

slai-labs/get-beam
Run GPU inference and training jobs on serverless infrastructure that scales with you.
Language: Shell - Size: 5.96 MB - Last synced at: about 2 months ago - Pushed at: 12 months ago - Stars: 102 - Forks: 23

diverged/tavily-go
An unofficial Go port of the official Tavily API Python Wrapper.
Language: Go - Size: 17.6 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

fork123aniket/LLM-RAG-powered-QA-App
A Production-Ready, Scalable RAG-powered LLM-based Context-Aware QA App
Language: Python - Size: 22.5 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 5 - Forks: 1

EAlmazanG/sentiment-analysis-reviews
A cost-effective solution for stores and startups to analyze customer reviews, classify sentiment (positive, neutral, negative), and gain actionable insights through an interactive dashboard.
Language: Jupyter Notebook - Size: 34.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

biosfood/intel-llm-guide
A guide on how to run LLMs on Intel CPUs
Language: Python - Size: 20.5 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

chenhunghan/ialacol 📦
🪶 Lightweight OpenAI drop-in replacement for Kubernetes
Language: Python - Size: 250 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 144 - Forks: 17

friendliai/lm-evaluation-harness Fork of EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
Language: Python - Size: 28.1 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

genia-dev/vibraniumdome-docs
LLM Security Platform Docs
Language: MDX - Size: 635 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

ray-project/ray-llm 📦
RayLLM - LLMs on Ray
Language: Python - Size: 1.98 MB - Last synced at: 7 months ago - Pushed at: about 1 year ago - Stars: 1,230 - Forks: 94

biomchen/llm-serving
Basic APIs for serving LLMs locally.
Language: Python - Size: 31.3 KB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

KevinZeng08/efficient-large-model-papers
A Curated Paper List for Efficient Large Models
Size: 1.95 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 1

substratusai/runbooks 📦
Finetune LLMs on K8s by using Runbooks
Language: Go - Size: 5.22 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 168 - Forks: 14

okikorg/okik
Okik is a serving framework to deploy LLMs and much more.
Language: Python - Size: 5.13 MB - Last synced at: 28 days ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

valyu-network/Stitch
Stitch simplifies and scales LLM application deployment, reducing infrastructure complexity and costs.
Language: Python - Size: 2.53 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

LoopGlitch26/Hinglish-AI-Mentor
Hinglish Chatbot powered by Azure Cognitive Services, Google Translate, and OpenAI
Language: Jupyter Notebook - Size: 974 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 1

ray-project/llms-in-prod-workshop-2023 📦
Deploy and Scale LLM-based applications
Language: Jupyter Notebook - Size: 13.1 MB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 23 - Forks: 3

ray-project/anyscale-berkeley-ai-hackathon 📦
Ray and Anyscale for UC Berkeley AI Hackathon!
Language: Jupyter Notebook - Size: 77.1 KB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 11 - Forks: 0

george-mountain/LLM-Local-Streaming
Streaming of LLM responses in real time using FastAPI and Streamlit.
Language: Python - Size: 32.2 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

sugarcane-ai/sugarcane-ai.github.io
Language: Astro - Size: 17.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 15 - Forks: 3

InquestGeronimo/horizon-takeoff
Automating the deployment of the Takeoff Server on AWS for LLMs
Language: Python - Size: 1.08 MB - Last synced at: 15 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

emmaalecrim/llm-ws
TypeScript LLM WebSocket reverse proxy built for streaming of various inference tasks
Language: TypeScript - Size: 673 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

suleymansevimli/run-llm-model-locally
You can run any large language model on your local machine with this repository.
Language: Python - Size: 1.95 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

asprenger/ray_vllm_inference
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
Language: Python - Size: 81.1 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

mddunlap924/LLM-Inference-Serving
This repository demonstrates LLM execution on CPUs using packages like llamafile, emphasizing low-latency, high-throughput, and cost-effective benefits for inference and serving.
Language: Jupyter Notebook - Size: 6.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

liux2/DL_env_Setups
Deep learning environment setups
Language: Shell - Size: 23.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ray-project/llm-application 📦
Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 2

Stosan/commentator
Language: Python - Size: 12.2 MB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 1

awesome-software/EasyEdit Fork of zjunlp/EasyEdit
An Easy-to-use Knowledge Editing Framework for LLMs.
Size: 15.5 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0
