Topic: "llm-inference"
nomic-ai/gpt4all
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Language: C++ - Size: 42.6 MB - Last synced at: 6 days ago - Pushed at: 18 days ago - Stars: 73,539 - Forks: 8,044

ray-project/ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Language: Python - Size: 530 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 37,439 - Forks: 6,363

gitleaks/gitleaks
Find secrets with Gitleaks 🔑
Language: Go - Size: 5.87 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 20,146 - Forks: 1,619
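Gitleaks works by matching file contents against a ruleset of secret-shaped regexes. A minimal sketch of that idea in Python — the two patterns below are illustrative stand-ins, not Gitleaks' actual rules:

```python
import re

# Illustrative patterns in the spirit of Gitleaks' ruleset (not its real rules).
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic-api-key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{20,}['\"]"),
}

def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_text) pairs for every secret-like string found."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    return findings

# AWS's documented example key, plus a fake generic key.
sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\napi_key = "abcdef0123456789abcdef01"'
findings = scan_text(sample)
```

The real tool adds entropy checks, allowlists, and git-history traversal on top of this core loop.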

liguodongiot/llm-action
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and putting LLM applications into production).
Language: HTML - Size: 22.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 18,357 - Forks: 2,159

Lightning-AI/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Language: Python - Size: 5.34 MB - Last synced at: about 5 hours ago - Pushed at: 1 day ago - Stars: 12,262 - Forks: 1,247

bentoml/OpenLLM
Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.
Language: Python - Size: 41.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,334 - Forks: 727
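"OpenAI-compatible" means any standard client can POST to `/v1/chat/completions`. A stdlib-only sketch of building such a request — the base URL and model name are assumptions, not values from the repo:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST against an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local endpoint; a running server (e.g. from `openllm serve`) would answer it.
req = build_chat_request("http://localhost:3000", "my-model", "Hello!")
```

Sending it is then just `urllib.request.urlopen(req)` once a server is listening.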

mistralai/mistral-inference
Official inference library for Mistral models
Language: Jupyter Notebook - Size: 550 KB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 10,287 - Forks: 919

openvinotoolkit/openvino
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
Language: C++ - Size: 850 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8,413 - Forks: 2,626

SJTU-IPADS/PowerInfer
High-speed Large Language Model Serving for Local Deployment
Language: C++ - Size: 11.1 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 8,217 - Forks: 432

bentoml/BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Language: Python - Size: 95.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 7,777 - Forks: 843

InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Language: Python - Size: 8.18 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 6,504 - Forks: 556

superduper-io/superduper
Superduper: End-to-end framework for building custom AI applications and agents.
Language: Python - Size: 73.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5,069 - Forks: 496

kserve/kserve
Standardized Serverless ML Inference Platform on Kubernetes
Language: Python - Size: 426 MB - Last synced at: about 10 hours ago - Pushed at: 3 days ago - Stars: 4,245 - Forks: 1,186

xlite-dev/Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, Parallelism, MLA, etc.
Language: Python - Size: 115 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 4,100 - Forks: 283
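One technique this list covers, PagedAttention, stores each sequence's KV cache in fixed-size blocks indexed through a block table, so sequences grow without reserving contiguous max-length buffers. A toy block-table sketch (block size and allocation policy are illustrative):

```python
BLOCK_SIZE = 4  # tokens per physical KV block (illustrative)

class PagedKVCache:
    """Toy block table: maps each sequence's logical blocks to physical blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))        # free physical block pool
        self.tables: dict[int, list[int]] = {}     # seq_id -> physical block list
        self.lengths: dict[int, int] = {}          # seq_id -> token count

    def append_token(self, seq_id: int) -> int:
        """Reserve a KV slot for one new token; returns its physical slot index."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:               # current block full: grab a new one
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        # physical slot = physical block index * block size + offset within block
        return table[-1] * BLOCK_SIZE + length % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0) for _ in range(5)]
```

The fifth token lands in a second, non-contiguous physical block — the property that lets real servers pack many sequences into one GPU.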

FellouAI/eko
Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflow with Natural Language - eko.fellou.ai
Language: TypeScript - Size: 1.39 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,971 - Forks: 305

NVIDIA/GenerativeAIExamples
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Language: Jupyter Notebook - Size: 91.8 MB - Last synced at: about 13 hours ago - Pushed at: 4 days ago - Stars: 3,184 - Forks: 764

neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
Language: Python - Size: 137 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3,147 - Forks: 186

flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
Language: Cuda - Size: 3.95 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3,124 - Forks: 324

predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Language: Python - Size: 6.62 MB - Last synced at: 1 day ago - Pushed at: 24 days ago - Stars: 3,008 - Forks: 215
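Multi-LoRA serving scales because the large base weight `W` is shared across all requests while each fine-tune contributes only a small low-rank pair `(A, B)`: `y = Wx + alpha * B(Ax)`. A pure-Python sketch with made-up adapter names and tiny matrices:

```python
def matvec(m, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, adapters, adapter_id, x, alpha=1.0):
    """y = Wx + alpha * B(Ax): W is shared; only (A, B) differs per fine-tune."""
    base = matvec(W, x)
    A, B = adapters[adapter_id]                  # A: r x d_in, B: d_out x r
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]

W = [[1, 0], [0, 1]]                             # shared base weight (identity here)
adapters = {                                     # hypothetical rank-1 adapters
    "math-ft": ([[1, 0]], [[0], [1]]),
    "code-ft": ([[0, 1]], [[1], [0]]),
}
y_math = lora_forward(W, adapters, "math-ft", [2, 3])
y_code = lora_forward(W, adapters, "code-ft", [2, 3])
```

Each request selects its adapter by id; a real server batches the `B(Ax)` computations across requests as well.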

gpustack/gpustack
Simple, scalable AI model deployment on GPU clusters
Language: Python - Size: 94.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,843 - Forks: 286

katanemo/archgw
The AI-native proxy server for agents. Arch handles the pesky low-level work in building agentic apps like calling specific tools, routing prompts to the right agents, clarifying vague inputs, unifying access and observability to any LLM, etc.
Language: Rust - Size: 19.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,716 - Forks: 152

databricks/dbrx
Code examples and resources for DBRX, a large language model developed by Databricks
Language: Python - Size: 63.5 KB - Last synced at: about 18 hours ago - Pushed at: about 1 year ago - Stars: 2,561 - Forks: 242

FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Language: Jupyter Notebook - Size: 4.76 MB - Last synced at: 17 days ago - Pushed at: 12 months ago - Stars: 2,530 - Forks: 176
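The core of Medusa-style (and speculative-decoding) acceleration is verification: cheap heads propose several tokens at once, and the target model accepts the longest prefix it agrees with. A toy verification loop, with a stand-in "target model" that just increments the last token:

```python
def verify_draft(target_next, prefix, draft):
    """Greedy verification: accept draft tokens while they match what the
    target model would produce; on the first mismatch, take the target's
    token instead and stop. If everything matches, the target contributes
    one bonus token for free."""
    accepted = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok == expected:
            accepted.append(tok)
            context.append(tok)
        else:
            accepted.append(expected)            # correction from the target model
            break
    else:
        accepted.append(target_next(context))    # bonus token when all drafts match
    return accepted

# Toy target model (an assumption for illustration): next token = last token + 1.
target_next = lambda ctx: ctx[-1] + 1
out = verify_draft(target_next, prefix=[1], draft=[2, 3, 9])
```

Here two draft tokens are accepted and the third is replaced, so one target step yields three tokens instead of one — without changing the output distribution.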

codelion/optillm
Optimizing inference proxy for LLMs
Language: Python - Size: 1.88 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 2,508 - Forks: 191
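A proxy like this can improve answers without touching the model, e.g. by self-consistency: sample several completions and return the majority answer. A minimal sketch (the deterministic sampler below is a stand-in for real stochastic model calls):

```python
from collections import Counter

def self_consistency(sample_fn, n=5):
    """Sample the model n times and return the most common answer —
    trading extra inference calls for accuracy, all on the proxy side."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in sampler returning canned "model" answers.
samples = iter(["42", "41", "42", "42", "7"])
best = self_consistency(lambda: next(samples), n=5)
```

In practice `sample_fn` would call the upstream LLM with temperature > 0 and extract the final answer from each completion.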

codelion/openevolve
Open-source implementation of AlphaEvolve
Language: Python - Size: 2.99 MB - Last synced at: about 13 hours ago - Pushed at: about 14 hours ago - Stars: 2,437 - Forks: 292

intel/intel-extension-for-transformers 📦
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Language: Python - Size: 585 MB - Last synced at: 12 days ago - Pushed at: 8 months ago - Stars: 2,171 - Forks: 213

b4rtaz/distributed-llama
Connect home devices into a powerful cluster to accelerate LLM inference. More devices mean faster inference.
Language: C++ - Size: 3.23 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,169 - Forks: 151
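Splitting a model across devices starts with assigning contiguous runs of transformer layers to each node. A sketch of an even partition (the scheme is a generic pipeline split, not necessarily the repo's exact strategy):

```python
def partition_layers(num_layers, num_devices):
    """Split layers as evenly as possible across devices; earlier devices
    absorb the remainder, keeping runs contiguous for pipeline execution."""
    base, extra = divmod(num_layers, num_devices)
    assignment, start = [], 0
    for d in range(num_devices):
        count = base + (1 if d < extra else 0)
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment

layout = partition_layers(num_layers=10, num_devices=3)
```

Activations then flow device-to-device in layer order, so adding a node shrinks each node's share of the compute.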

microsoft/aici
AICI: Prompts as (Wasm) Programs
Language: Rust - Size: 9.71 MB - Last synced at: 17 days ago - Pushed at: 5 months ago - Stars: 2,027 - Forks: 83

liltom-eth/llama2-webui
Run any Llama 2 model locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local Llama 2 backend for generative agents and apps.
Language: Jupyter Notebook - Size: 1.03 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 1,958 - Forks: 207

SafeAILab/EAGLE
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
Language: Python - Size: 68.6 MB - Last synced at: 6 days ago - Pushed at: 12 days ago - Stars: 1,289 - Forks: 155

taielab/awesome-hacking-lists
A curated collection of top-tier penetration testing tools and productivity utilities across multiple domains. Join us to explore, contribute, and enhance your hacking toolkit!
Size: 6.43 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1,099 - Forks: 229

sauravpanda/BrowserAI
Run local LLMs like llama, deepseek-distill, kokoro and more inside your browser
Language: TypeScript - Size: 293 MB - Last synced at: 24 days ago - Pushed at: about 2 months ago - Stars: 1,098 - Forks: 95

lean-dojo/LeanCopilot
LLMs as Copilots for Theorem Proving in Lean
Language: C++ - Size: 1.21 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,093 - Forks: 99

character-ai/prompt-poet
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
Language: Python - Size: 580 KB - Last synced at: about 15 hours ago - Pushed at: about 15 hours ago - Stars: 1,068 - Forks: 92

Lizonghang/prima.cpp
prima.cpp: Speeding up 70B-scale LLM inference on low-resource everyday home clusters
Language: C++ - Size: 55 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 937 - Forks: 62

felladrin/awesome-ai-web-search
List of software that allows searching the web with the assistance of AI: https://hf.co/spaces/felladrin/awesome-ai-web-search
Language: HTML - Size: 145 KB - Last synced at: about 15 hours ago - Pushed at: about 16 hours ago - Stars: 904 - Forks: 63

zhihu/ZhiLight
A highly optimized LLM inference acceleration engine for Llama and its variants.
Language: C++ - Size: 996 KB - Last synced at: 16 days ago - Pushed at: 29 days ago - Stars: 890 - Forks: 104

beam-cloud/beta9
Scalable infrastructure for running your AI workloads.
Language: Go - Size: 11.7 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 819 - Forks: 52

harleyszhang/llm_note
LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.
Language: Python - Size: 177 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 778 - Forks: 81

cactus-compute/cactus
Framework for running AI locally on mobile devices and wearables. Hardware-aware C/C++ backend with wrappers for Flutter & React Native. Kotlin & Swift coming soon.
Language: C++ - Size: 688 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 747 - Forks: 43

mukel/llama3.java
Practical Llama 3 inference in Java
Language: Java - Size: 187 KB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 736 - Forks: 90

stoyan-stoyanov/llmflows
LLMFlows - Simple, Explicit and Transparent LLM Apps
Language: Python - Size: 36.1 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 695 - Forks: 33

shyamsaktawat/OpenAlpha_Evolve
OpenAlpha_Evolve is an open-source Python framework inspired by the groundbreaking research on autonomous coding agents like DeepMind's AlphaEvolve.
Language: Python - Size: 204 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 694 - Forks: 109

inspector-apm/neuron-ai
The PHP Agent Development Kit - powered by Inspector.dev
Language: PHP - Size: 12.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 693 - Forks: 57

ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.
Language: Jupyter Notebook - Size: 7.07 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 673 - Forks: 118

dalisoft/awesome-hosting
List of awesome hosting sorted by minimal plan price
Size: 240 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 626 - Forks: 72

eastriverlee/LLM.swift
LLM.swift is a simple, readable library for interacting with large language models locally on macOS, iOS, watchOS, tvOS, and visionOS.
Language: C - Size: 169 MB - Last synced at: 23 days ago - Pushed at: about 1 month ago - Stars: 626 - Forks: 67

foldl/chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)
Language: C++ - Size: 5.5 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 619 - Forks: 48

run-ai/genv
GPU environment and cluster management with LLM support
Language: Python - Size: 9.41 MB - Last synced at: 22 days ago - Pushed at: about 1 year ago - Stars: 606 - Forks: 34

zeux/calm
CUDA/Metal accelerated language model inference
Language: C - Size: 1.27 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 578 - Forks: 26

rohan-paul/LLM-FineTuning-Large-Language-Models
LLM (Large Language Model) FineTuning
Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: 1 day ago - Pushed at: 2 months ago - Stars: 542 - Forks: 129

stanford-mast/blast
Browser-LLM Auto-Scaling Technology
Language: Python - Size: 3.12 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 522 - Forks: 23

feifeibear/long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
Language: Python - Size: 4.61 MB - Last synced at: 17 days ago - Pushed at: 19 days ago - Stars: 505 - Forks: 46

anarchy-ai/LLM-VM
irresponsible innovation. Try now at https://chat.dev/
Language: Python - Size: 1.74 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 488 - Forks: 142

hpcaitech/SwiftInfer
Efficient AI Inference & Serving
Language: Python - Size: 508 KB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 470 - Forks: 29

deeppowers/deeppowers
DEEPPOWERS is a Fully Homomorphic Encryption (FHE) framework built for MCP (Model Context Protocol), aiming to provide end-to-end privacy protection and high-efficiency computation for the upstream and downstream ecosystem of the MCP protocol.
Language: C++ - Size: 1.74 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 453 - Forks: 38

TilmanGriesel/chipper
✨ AI interface for tinkerers (Ollama, Haystack RAG, Python)
Language: Python - Size: 84.5 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 450 - Forks: 42

felladrin/MiniSearch
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space
Language: TypeScript - Size: 27.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 447 - Forks: 48

vectorch-ai/ScaleLLM
A high-performance inference system for large language models, designed for production environments.
Language: C++ - Size: 19 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 446 - Forks: 37

FlagAI-Open/Aquila2
The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.
Language: Python - Size: 30.9 MB - Last synced at: 14 days ago - Pushed at: 8 months ago - Stars: 441 - Forks: 31

Kenza-AI/sagify
LLMs and Machine Learning done easily
Language: Python - Size: 36.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 439 - Forks: 69

rizerphe/local-llm-function-calling
A tool for generating function arguments and choosing what function to call with local LLMs
Language: Python - Size: 163 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 425 - Forks: 41

preternatural-explore/mlx-swift-chat
A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.
Language: Swift - Size: 39.1 KB - Last synced at: 3 days ago - Pushed at: 8 months ago - Stars: 412 - Forks: 27

ray-project/ray-educational-materials 📦
A suite of hands-on training materials showing how to scale CV, NLP, and time-series forecasting workloads with Ray.
Language: Jupyter Notebook - Size: 24 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 393 - Forks: 76

jax-ml/scaling-book
Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs
Language: HTML - Size: 54.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 388 - Forks: 50

EulerSearch/embedding_studio
Embedding Studio is a framework that lets you transform your vector database into a feature-rich search engine.
Language: Python - Size: 10.2 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 380 - Forks: 5

microsoft/sarathi-serve
A low-latency & high-throughput serving engine for LLMs
Language: Python - Size: 2.34 MB - Last synced at: 1 day ago - Pushed at: 15 days ago - Stars: 378 - Forks: 48

NVIDIA/Star-Attention
Efficient LLM Inference over Long Sequences
Language: Python - Size: 1.05 MB - Last synced at: about 13 hours ago - Pushed at: 10 days ago - Stars: 377 - Forks: 19

andrewkchan/yalm
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
Language: C++ - Size: 347 KB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 367 - Forks: 35

intel/neural-speed 📦
An innovative library for efficient LLM inference via low-bit quantization
Language: C++ - Size: 16.2 MB - Last synced at: 21 days ago - Pushed at: 10 months ago - Stars: 348 - Forks: 38

zjhellofss/KuiperLLama
A great hands-on project for campus recruitment and internships: build an LLM inference framework from scratch that supports Llama 2/3 and Qwen2.5.
Language: C++ - Size: 2.27 MB - Last synced at: 22 days ago - Pushed at: 2 months ago - Stars: 346 - Forks: 88

AI-Hypercomputer/JetStream
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
Language: Python - Size: 6.32 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 339 - Forks: 46

devflowinc/uzi
CLI for running large numbers of coding agents in parallel with git worktrees
Language: Go - Size: 63.5 MB - Last synced at: about 20 hours ago - Pushed at: 10 days ago - Stars: 325 - Forks: 8

alipay/PainlessInferenceAcceleration
Accelerate inference without tears
Language: Python - Size: 18.8 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 315 - Forks: 21

ugorsahin/TalkingHeads
A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi
Language: Python - Size: 258 KB - Last synced at: 23 days ago - Pushed at: 3 months ago - Stars: 313 - Forks: 63

morpheuslord/HackBot
AI-powered cybersecurity chatbot that provides helpful, accurate answers to cybersecurity questions and also performs code analysis and scan analysis.
Language: Python - Size: 56.6 KB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 308 - Forks: 48

unifyai/unify
Notion for AI Observability 📊
Language: Python - Size: 1.92 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 302 - Forks: 31

andrewkchan/deepseek.cpp
CPU inference for the DeepSeek family of large language models in C++
Language: C++ - Size: 656 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 296 - Forks: 32

structuredllm/syncode
Efficient and general syntactical decoding for Large Language Models
Language: Python - Size: 55.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 274 - Forks: 27
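Syntactical (grammar-constrained) decoding masks, at each step, every vocabulary token whose addition would break the target format. A toy sketch where a fullmatch regex stands in for the grammar — a real constrained decoder like this repo's tracks parser state incrementally instead:

```python
import re

def allowed_tokens(vocab, generated, pattern):
    """Keep only tokens whose concatenation with the text generated so far
    still satisfies the target pattern (toy stand-in for a grammar check)."""
    return [t for t in vocab if re.fullmatch(pattern, generated + t)]

# With "12" generated and an integers-only pattern, letters and punctuation
# are masked out before sampling.
allowed = allowed_tokens(["3", "a", "45", "!"], generated="12", pattern=r"\d+")
```

In a real decoder, the surviving tokens keep their model probabilities and the rest are set to minus infinity before softmax sampling.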

dusty-nv/NanoLLM
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
Language: Python - Size: 3.63 MB - Last synced at: 12 days ago - Pushed at: 8 months ago - Stars: 271 - Forks: 46

armbues/SiLLM
SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.
Language: Python - Size: 612 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 266 - Forks: 26

expectedparrot/edsl
Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
Language: Python - Size: 60.3 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 251 - Forks: 25

modelscope/dash-infer
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
Language: C - Size: 41.5 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 251 - Forks: 27

Infini-AI-Lab/TriForce
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Language: Python - Size: 71.7 MB - Last synced at: 29 days ago - Pushed at: 10 months ago - Stars: 250 - Forks: 17

galeselee/Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. This is a list of papers on LLM acceleration, currently focused mainly on inference acceleration; related work will be added over time. Contributions welcome!
Size: 616 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 249 - Forks: 12

Picovoice/picollm
On-device LLM Inference Powered by X-Bit Quantization
Language: Python - Size: 98 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 247 - Forks: 14
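Low-bit on-device inference rests on quantization: weights are rescaled to a small integer range and dequantized on the fly. A minimal symmetric int8 sketch (per-tensor scaling for illustration; picoLLM's actual X-Bit scheme is more sophisticated):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map weights into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

q, scale = quantize_int8([0.4, -1.0, 0.25])
approx = dequantize_int8(q, scale)
```

The round-trip error is bounded by the scale, which is why smaller bit widths (and per-channel or per-group scales) trade accuracy for memory.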

JinjieNi/MixEval
The official evaluation suite and dynamic data release for MixEval.
Language: Python - Size: 9.37 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 242 - Forks: 41

inferflow/inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
Language: C++ - Size: 1.89 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 242 - Forks: 25

promptslab/LLMtuner
FineTune LLMs in few lines of code (Text2Text, Text2Speech, Speech2Text)
Language: Python - Size: 591 KB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 240 - Forks: 15

ubergarm/r1-ktransformers-guide
run DeepSeek-R1 GGUFs on KTransformers
Language: Python - Size: 1.91 MB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 228 - Forks: 15

arc53/llm-price-compass
This project collects GPU benchmarks from various cloud providers and converts them to fixed per-token costs, so you can make efficient GPU selections for LLM workloads and compare provider prices per token.
Language: TypeScript - Size: 411 KB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 221 - Forks: 8
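The per-token cost comparison boils down to simple arithmetic over published prices. A sketch with hypothetical prices (the $0.50/$1.50 per-million figures below are made up, not from the project's data):

```python
def cost_per_request(prompt_tokens, completion_tokens, in_price_per_m, out_price_per_m):
    """Request cost in dollars, given per-million-token input/output prices."""
    return (prompt_tokens * in_price_per_m + completion_tokens * out_price_per_m) / 1_000_000

# Hypothetical pricing: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
cost = cost_per_request(prompt_tokens=1200, completion_tokens=300, in_price_per_m=0.50, out_price_per_m=1.50)
```

For self-hosted GPUs the same idea runs in reverse: measured tokens/second times instance $/hour yields an effective price per token.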

bytedance/ABQ-LLM
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
Language: C++ - Size: 53.9 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 221 - Forks: 21

C0deMunk33/bespoke_automata
Bespoke Automata is a GUI and deployment pipeline for building complex AI agents locally and offline.
Language: JavaScript - Size: 68 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 219 - Forks: 25

sophgo/LLM-TPU
Run generative AI models on Sophgo BM1684X/BM1688 chips.
Language: C++ - Size: 240 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 218 - Forks: 37

efeslab/fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
Language: Python - Size: 1.72 MB - Last synced at: 28 days ago - Pushed at: 7 months ago - Stars: 210 - Forks: 20

interestingLSY/swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Language: Python - Size: 226 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 209 - Forks: 24

MorpheusAIs/Morpheus
Morpheus - A Network For Powering Smart Agents - Compute + Code + Capital + Community
Language: JavaScript - Size: 112 MB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 203 - Forks: 143

FasterDecoding/REST
REST: Retrieval-Based Speculative Decoding, NAACL 2024
Language: C - Size: 1.06 MB - Last synced at: about 5 hours ago - Pushed at: 6 months ago - Stars: 202 - Forks: 14

EfficientMoE/MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
Language: Python - Size: 592 KB - Last synced at: 42 minutes ago - Pushed at: about 2 hours ago - Stars: 197 - Forks: 17

ByteDance-Seed/ShadowKV
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Language: Python - Size: 20.5 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 197 - Forks: 14
