An open API service providing repository metadata for many open source software ecosystems.

Topic: "llm-inference"

nomic-ai/gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

Language: C++ - Size: 42.6 MB - Last synced at: 6 days ago - Pushed at: 18 days ago - Stars: 73,539 - Forks: 8,044
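
For context, local generation with GPT4All is typically a few lines through its Python bindings; a minimal sketch, assuming the `gpt4all` package is installed and using an illustrative GGUF model name:

```python
# Minimal sketch of local inference with the GPT4All Python bindings.
# The model filename is illustrative; the SDK downloads it on first use.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
with model.chat_session():
    print(model.generate("Summarize what llm-inference means.", max_tokens=128))
```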

ray-project/ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Language: Python - Size: 530 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 37,439 - Forks: 6,363
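
Ray's core API is a small set of primitives for turning Python functions into distributed tasks; a minimal sketch (assuming `pip install ray`):

```python
# Minimal sketch of Ray's remote-task primitive: fan a function out
# across a local cluster and gather the results.
import ray

ray.init()  # starts a local cluster if none is running

@ray.remote
def square(x: int) -> int:
    return x * x

futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```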

gitleaks/gitleaks

Find secrets with Gitleaks 🔑

Language: Go - Size: 5.87 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 20,146 - Forks: 1,619

liguodongiot/llm-action

This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and bringing LLM applications into production).

Language: HTML - Size: 22.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 18,357 - Forks: 2,159

Lightning-AI/litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

Language: Python - Size: 5.34 MB - Last synced at: about 5 hours ago - Pushed at: 1 day ago - Stars: 12,262 - Forks: 1,247
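
LitGPT exposes both a CLI and a high-level Python API for loading and prompting these models; a minimal sketch of the Python side, where the checkpoint id is an illustrative assumption:

```python
# Minimal sketch of litgpt's high-level Python API; the checkpoint id is illustrative.
from litgpt import LLM

llm = LLM.load("microsoft/phi-2")           # downloads/loads the checkpoint
print(llm.generate("What do llamas eat?"))  # single-prompt generation
```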

bentoml/OpenLLM

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

Language: Python - Size: 41.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,334 - Forks: 727
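
Because OpenLLM exposes an OpenAI-compatible endpoint, any OpenAI client can talk to it by overriding the base URL; a minimal sketch in which the port and model id are assumptions for illustration:

```python
# Minimal sketch of querying an OpenAI-compatible endpoint served locally.
# The base URL, API key, and model id below are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```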

mistralai/mistral-inference

Official inference library for Mistral models

Language: Jupyter Notebook - Size: 550 KB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 10,287 - Forks: 919

openvinotoolkit/openvino

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

Language: C++ - Size: 850 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8,413 - Forks: 2,626
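
The OpenVINO runtime follows a read-compile-infer pattern; a minimal sketch of the Python API, where the model path and input shape are assumptions:

```python
# Minimal sketch of OpenVINO's read -> compile -> infer flow.
# "model.xml" and the (1, 3, 224, 224) input shape are illustrative.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")
compiled = core.compile_model(model, "CPU")   # or "GPU", "AUTO", ...
outputs = compiled(np.zeros((1, 3, 224, 224), dtype=np.float32))
print(next(iter(outputs.values())).shape)
```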

SJTU-IPADS/PowerInfer

High-speed Large Language Model Serving for Local Deployment

Language: C++ - Size: 11.1 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 8,217 - Forks: 432

bentoml/BentoML

The easiest way to serve AI apps and models - build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!

Language: Python - Size: 95.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 7,777 - Forks: 843
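
BentoML turns plain Python classes into servable inference APIs via decorators; a minimal sketch in the BentoML 1.2+ style, where the class name and endpoint logic are illustrative:

```python
# Minimal sketch of a BentoML service (1.2+ decorator style); the model logic is a stub.
import bentoml

@bentoml.service
class Echo:
    @bentoml.api
    def generate(self, prompt: str) -> str:
        return prompt.upper()  # placeholder for real model inference

# serve locally with: bentoml serve <module_name>:Echo
```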

InternLM/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Language: Python - Size: 8.18 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 6,504 - Forks: 556
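
LMDeploy's high-level entry point is a `pipeline` that wraps its serving engines; a minimal sketch, where the model id is an illustrative assumption:

```python
# Minimal sketch of LMDeploy's pipeline API; the model id is illustrative.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")
responses = pipe(["What is KV-cache quantization?"])
print(responses[0].text)
```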

superduper-io/superduper

Superduper: End-to-end framework for building custom AI applications and agents.

Language: Python - Size: 73.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5,069 - Forks: 496

kserve/kserve

Standardized Serverless ML Inference Platform on Kubernetes

Language: Python - Size: 426 MB - Last synced at: about 10 hours ago - Pushed at: 3 days ago - Stars: 4,245 - Forks: 1,186

xlite-dev/Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, Parallelism, MLA, etc.

Language: Python - Size: 115 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 4,100 - Forks: 283

FellouAI/eko

Eko (Eko Keeps Operating) - Build production-ready agentic workflows with natural language - eko.fellou.ai

Language: TypeScript - Size: 1.39 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 3,971 - Forks: 305

NVIDIA/GenerativeAIExamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.

Language: Jupyter Notebook - Size: 91.8 MB - Last synced at: about 13 hours ago - Pushed at: 4 days ago - Stars: 3,184 - Forks: 764

neuralmagic/deepsparse

Sparsity-aware deep learning inference runtime for CPUs

Language: Python - Size: 137 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3,147 - Forks: 186

flashinfer-ai/flashinfer

FlashInfer: Kernel Library for LLM Serving

Language: Cuda - Size: 3.95 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3,124 - Forks: 324

predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Language: Python - Size: 6.62 MB - Last synced at: 1 day ago - Pushed at: 24 days ago - Stars: 3,008 - Forks: 215

gpustack/gpustack

Simple, scalable AI model deployment on GPU clusters

Language: Python - Size: 94.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,843 - Forks: 286

katanemo/archgw

The AI-native proxy server for agents. Arch handles the pesky low-level work of building agentic apps: calling specific tools, routing prompts to the right agents, clarifying vague inputs, unifying access and observability across LLMs, and more.

Language: Rust - Size: 19.2 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,716 - Forks: 152

databricks/dbrx

Code examples and resources for DBRX, a large language model developed by Databricks

Language: Python - Size: 63.5 KB - Last synced at: about 18 hours ago - Pushed at: about 1 year ago - Stars: 2,561 - Forks: 242

FasterDecoding/Medusa

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Language: Jupyter Notebook - Size: 4.76 MB - Last synced at: 17 days ago - Pushed at: 12 months ago - Stars: 2,530 - Forks: 176

codelion/optillm

Optimizing inference proxy for LLMs

Language: Python - Size: 1.88 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 2,508 - Forks: 191

codelion/openevolve

Open-source implementation of AlphaEvolve

Language: Python - Size: 2.99 MB - Last synced at: about 13 hours ago - Pushed at: about 14 hours ago - Stars: 2,437 - Forks: 292

intel/intel-extension-for-transformers 📦

⚡ Build your chatbot within minutes on your favorite device; apply SOTA compression techniques for LLMs; run LLMs efficiently on Intel platforms ⚡

Language: Python - Size: 585 MB - Last synced at: 12 days ago - Pushed at: 8 months ago - Stars: 2,171 - Forks: 213

b4rtaz/distributed-llama

Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.

Language: C++ - Size: 3.23 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2,169 - Forks: 151

microsoft/aici

AICI: Prompts as (Wasm) Programs

Language: Rust - Size: 9.71 MB - Last synced at: 17 days ago - Pushed at: 5 months ago - Stars: 2,027 - Forks: 83

liltom-eth/llama2-webui

Run any Llama 2 model locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local Llama 2 backend for generative agents and apps.

Language: Jupyter Notebook - Size: 1.03 MB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 1,958 - Forks: 207

SafeAILab/EAGLE

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.

Language: Python - Size: 68.6 MB - Last synced at: 6 days ago - Pushed at: 12 days ago - Stars: 1,289 - Forks: 155

taielab/awesome-hacking-lists

A curated collection of top-tier penetration testing tools and productivity utilities across multiple domains. Join us to explore, contribute, and enhance your hacking toolkit!

Size: 6.43 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1,099 - Forks: 229

sauravpanda/BrowserAI

Run local LLMs like Llama, DeepSeek-distill, Kokoro, and more inside your browser

Language: TypeScript - Size: 293 MB - Last synced at: 24 days ago - Pushed at: about 2 months ago - Stars: 1,098 - Forks: 95

lean-dojo/LeanCopilot

LLMs as Copilots for Theorem Proving in Lean

Language: C++ - Size: 1.21 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,093 - Forks: 99

character-ai/prompt-poet

Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.

Language: Python - Size: 580 KB - Last synced at: about 15 hours ago - Pushed at: about 15 hours ago - Stars: 1,068 - Forks: 92

Lizonghang/prima.cpp

prima.cpp: Speeding up 70B-scale LLM inference on low-resource everyday home clusters

Language: C++ - Size: 55 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 937 - Forks: 62

felladrin/awesome-ai-web-search

List of software that allows searching the web with the assistance of AI: https://hf.co/spaces/felladrin/awesome-ai-web-search

Language: HTML - Size: 145 KB - Last synced at: about 15 hours ago - Pushed at: about 16 hours ago - Stars: 904 - Forks: 63

zhihu/ZhiLight

A highly optimized LLM inference acceleration engine for Llama and its variants.

Language: C++ - Size: 996 KB - Last synced at: 16 days ago - Pushed at: 29 days ago - Stars: 890 - Forks: 104

beam-cloud/beta9

Scalable infrastructure for running your AI workloads

Language: Go - Size: 11.7 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 819 - Forks: 52

harleyszhang/llm_note

LLM notes covering model inference, transformer model structure, and LLM framework code analysis.

Language: Python - Size: 177 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 778 - Forks: 81

cactus-compute/cactus

Framework for running AI locally on mobile devices and wearables. Hardware-aware C/C++ backend with wrappers for Flutter & React Native. Kotlin & Swift coming soon.

Language: C++ - Size: 688 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 747 - Forks: 43

mukel/llama3.java

Practical Llama 3 inference in Java

Language: Java - Size: 187 KB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 736 - Forks: 90

stoyan-stoyanov/llmflows

LLMFlows - Simple, Explicit and Transparent LLM Apps

Language: Python - Size: 36.1 MB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 695 - Forks: 33

shyamsaktawat/OpenAlpha_Evolve

OpenAlpha_Evolve is an open-source Python framework inspired by the groundbreaking research on autonomous coding agents like DeepMind's AlphaEvolve.

Language: Python - Size: 204 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 694 - Forks: 109

inspector-apm/neuron-ai

The PHP Agent Development Kit - powered by Inspector.dev

Language: PHP - Size: 12.5 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 693 - Forks: 57

ghimiresunil/LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing

LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.

Language: Jupyter Notebook - Size: 7.07 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 673 - Forks: 118

dalisoft/awesome-hosting

List of awesome hosting sorted by minimal plan price

Size: 240 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 626 - Forks: 72

eastriverlee/LLM.swift

LLM.swift is a simple and readable library that lets you interact with large language models locally with ease on macOS, iOS, watchOS, tvOS, and visionOS.

Language: C - Size: 169 MB - Last synced at: 23 days ago - Pushed at: about 1 month ago - Stars: 626 - Forks: 67

foldl/chatllm.cpp

Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)

Language: C++ - Size: 5.5 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 619 - Forks: 48

run-ai/genv

GPU environment and cluster management with LLM support

Language: Python - Size: 9.41 MB - Last synced at: 22 days ago - Pushed at: about 1 year ago - Stars: 606 - Forks: 34

zeux/calm

CUDA/Metal accelerated language model inference

Language: C - Size: 1.27 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 578 - Forks: 26

rohan-paul/LLM-FineTuning-Large-Language-Models

LLM (Large Language Model) FineTuning

Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: 1 day ago - Pushed at: 2 months ago - Stars: 542 - Forks: 129

stanford-mast/blast

Browser-LLM Auto-Scaling Technology

Language: Python - Size: 3.12 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 522 - Forks: 23

feifeibear/long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

Language: Python - Size: 4.61 MB - Last synced at: 17 days ago - Pushed at: 19 days ago - Stars: 505 - Forks: 46

anarchy-ai/LLM-VM

irresponsible innovation. Try now at https://chat.dev/

Language: Python - Size: 1.74 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 488 - Forks: 142

hpcaitech/SwiftInfer

Efficient AI Inference & Serving

Language: Python - Size: 508 KB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 470 - Forks: 29

deeppowers/deeppowers

DEEPPOWERS is a Fully Homomorphic Encryption (FHE) framework built for MCP (Model Context Protocol), aiming to provide end-to-end privacy protection and high-efficiency computation for the upstream and downstream ecosystem of the MCP protocol.

Language: C++ - Size: 1.74 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 453 - Forks: 38

TilmanGriesel/chipper

✨ AI interface for tinkerers (Ollama, Haystack RAG, Python)

Language: Python - Size: 84.5 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 450 - Forks: 42

felladrin/MiniSearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space

Language: TypeScript - Size: 27.8 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 447 - Forks: 48

vectorch-ai/ScaleLLM

A high-performance inference system for large language models, designed for production environments.

Language: C++ - Size: 19 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 446 - Forks: 37

FlagAI-Open/Aquila2

The official repo of the Aquila2 series from BAAI, including pretrained and chat large language models.

Language: Python - Size: 30.9 MB - Last synced at: 14 days ago - Pushed at: 8 months ago - Stars: 441 - Forks: 31

Kenza-AI/sagify

LLMs and Machine Learning done easily

Language: Python - Size: 36.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 439 - Forks: 69

rizerphe/local-llm-function-calling

A tool for generating function arguments and choosing what function to call with local LLMs

Language: Python - Size: 163 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 425 - Forks: 41

preternatural-explore/mlx-swift-chat

A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.

Language: Swift - Size: 39.1 KB - Last synced at: 3 days ago - Pushed at: 8 months ago - Stars: 412 - Forks: 27

ray-project/ray-educational-materials 📦

A suite of hands-on training materials showing how to scale CV, NLP, and time-series forecasting workloads with Ray.

Language: Jupyter Notebook - Size: 24 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 393 - Forks: 76

jax-ml/scaling-book

Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs

Language: HTML - Size: 54.6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 388 - Forks: 50

EulerSearch/embedding_studio

Embedding Studio is a framework that lets you transform your vector database into a feature-rich search engine.

Language: Python - Size: 10.2 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 380 - Forks: 5

microsoft/sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Language: Python - Size: 2.34 MB - Last synced at: 1 day ago - Pushed at: 15 days ago - Stars: 378 - Forks: 48

NVIDIA/Star-Attention

Efficient LLM Inference over Long Sequences

Language: Python - Size: 1.05 MB - Last synced at: about 13 hours ago - Pushed at: 10 days ago - Stars: 377 - Forks: 19

andrewkchan/yalm

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

Language: C++ - Size: 347 KB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 367 - Forks: 35

intel/neural-speed 📦

An innovative library for efficient LLM inference via low-bit quantization

Language: C++ - Size: 16.2 MB - Last synced at: 21 days ago - Pushed at: 10 months ago - Stars: 348 - Forks: 38

zjhellofss/KuiperLLama

A great project for campus recruiting (fall/spring hiring) and internships: build an LLM inference framework from scratch that supports Llama 2/3 and Qwen2.5.

Language: C++ - Size: 2.27 MB - Last synced at: 22 days ago - Pushed at: 2 months ago - Stars: 346 - Forks: 88

AI-Hypercomputer/JetStream

JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).

Language: Python - Size: 6.32 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 339 - Forks: 46

devflowinc/uzi

CLI for running large numbers of coding agents in parallel with git worktrees

Language: Go - Size: 63.5 MB - Last synced at: about 20 hours ago - Pushed at: 10 days ago - Stars: 325 - Forks: 8

alipay/PainlessInferenceAcceleration

Accelerate inference without tears

Language: Python - Size: 18.8 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 315 - Forks: 21

ugorsahin/TalkingHeads

A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi

Language: Python - Size: 258 KB - Last synced at: 23 days ago - Pushed at: 3 months ago - Stars: 313 - Forks: 63

morpheuslord/HackBot

AI-powered cybersecurity chatbot designed to provide helpful and accurate answers to cybersecurity-related queries, and to perform code analysis and scan analysis.

Language: Python - Size: 56.6 KB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 308 - Forks: 48

unifyai/unify

Notion for AI Observability 📊

Language: Python - Size: 1.92 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 302 - Forks: 31

andrewkchan/deepseek.cpp

CPU inference for the DeepSeek family of large language models in C++

Language: C++ - Size: 656 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 296 - Forks: 32

structuredllm/syncode

Efficient and general syntactical decoding for Large Language Models

Language: Python - Size: 55.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 274 - Forks: 27

dusty-nv/NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.

Language: Python - Size: 3.63 MB - Last synced at: 12 days ago - Pushed at: 8 months ago - Stars: 271 - Forks: 46

armbues/SiLLM

SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.

Language: Python - Size: 612 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 266 - Forks: 26

expectedparrot/edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.

Language: Python - Size: 60.3 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 251 - Forks: 25

modelscope/dash-infer

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

Language: C - Size: 41.5 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 251 - Forks: 27

Infini-AI-Lab/TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Language: Python - Size: 71.7 MB - Last synced at: 29 days ago - Pushed at: 10 months ago - Stars: 250 - Forks: 17

galeselee/Awesome_LLM_System-PaperList

Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. This is a list of papers on accelerating LLMs, currently focused mainly on inference acceleration; related works will be added over time. Contributions welcome!

Size: 616 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 249 - Forks: 12

Picovoice/picollm

On-device LLM Inference Powered by X-Bit Quantization

Language: Python - Size: 98 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 247 - Forks: 14

JinjieNi/MixEval

The official evaluation suite and dynamic data release for MixEval.

Language: Python - Size: 9.37 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 242 - Forks: 41

inferflow/inferflow

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

Language: C++ - Size: 1.89 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 242 - Forks: 25

promptslab/LLMtuner

Fine-tune LLMs in a few lines of code (Text2Text, Text2Speech, Speech2Text)

Language: Python - Size: 591 KB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 240 - Forks: 15

ubergarm/r1-ktransformers-guide

Run DeepSeek-R1 GGUFs on KTransformers

Language: Python - Size: 1.91 MB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 228 - Forks: 15

arc53/llm-price-compass

Collects GPU benchmarks from various cloud providers and compares them to fixed per-token costs, helping you select LLM GPUs efficiently and choose cost-effective models: provider price comparison, benchmark-to-price-per-token calculation, and a GPU benchmark table.

Language: TypeScript - Size: 411 KB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 221 - Forks: 8

bytedance/ABQ-LLM

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

Language: C++ - Size: 53.9 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 221 - Forks: 21

C0deMunk33/bespoke_automata

Bespoke Automata is a GUI and deployment pipeline for building complex AI agents locally and offline

Language: JavaScript - Size: 68 MB - Last synced at: 5 months ago - Pushed at: about 1 year ago - Stars: 219 - Forks: 25

sophgo/LLM-TPU

Run generative AI models on Sophgo BM1684X/BM1688

Language: C++ - Size: 240 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 218 - Forks: 37

efeslab/fiddler

[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration

Language: Python - Size: 1.72 MB - Last synced at: 28 days ago - Pushed at: 7 months ago - Stars: 210 - Forks: 20

interestingLSY/swiftLLM

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

Language: Python - Size: 226 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 209 - Forks: 24

MorpheusAIs/Morpheus

Morpheus - A Network For Powering Smart Agents - Compute + Code + Capital + Community

Language: JavaScript - Size: 112 MB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 203 - Forks: 143

FasterDecoding/REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024

Language: C - Size: 1.06 MB - Last synced at: about 5 hours ago - Pushed at: 6 months ago - Stars: 202 - Forks: 14

EfficientMoE/MoE-Infinity

PyTorch library for cost-effective, fast and easy serving of MoE models.

Language: Python - Size: 592 KB - Last synced at: 42 minutes ago - Pushed at: about 2 hours ago - Stars: 197 - Forks: 17

ByteDance-Seed/ShadowKV

[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Language: Python - Size: 20.5 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 197 - Forks: 14