Topic: "inference-acceleration"
thu-ml/SageAttention
Quantized attention that achieves speedups of 2-3x and 3-5x over FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
Language: Cuda - Size: 46.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,357 - Forks: 94

ali-vilab/TeaCache
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
Language: Python - Size: 22.5 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 686 - Forks: 26

thu-ml/SpargeAttn
SpargeAttention: A training-free sparse attention that can accelerate any model inference.
Language: Cuda - Size: 55.4 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 459 - Forks: 28

autonomi-ai/nos
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW.
Language: Python - Size: 16.5 MB - Last synced at: 20 days ago - Pushed at: 11 months ago - Stars: 139 - Forks: 12

czg1225/AsyncDiff
Official implementation of "AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising"
Language: Python - Size: 64.7 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 130 - Forks: 6

dvlab-research/Q-LLM
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
Language: Python - Size: 6.84 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 29 - Forks: 0

jagennath-hari/DepthStream-Accelerator-ROS2-Integrated-Monocular-Depth-Inference
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
Language: C++ - Size: 10.9 MB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 17 - Forks: 0

marty1885/scirknn
Convert and run scikit-learn MLPs on Rockchip NPU.
Language: Python - Size: 34.2 KB - Last synced at: 20 days ago - Pushed at: almost 2 years ago - Stars: 7 - Forks: 2

fangvv/TLEE
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
Language: Python - Size: 85 KB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 0

fangvv/MTACP
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"
Language: Python - Size: 1020 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 1

Bisonai/ncnn (fork of Tencent/ncnn)
Modified inference engine for quantized convolution using product quantization
Language: C++ - Size: 7.96 MB - Last synced at: 12 months ago - Pushed at: almost 3 years ago - Stars: 4 - Forks: 0

bzluan/AdaptPrune
The official repo for “Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models”.
Language: Python - Size: 13.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0
