An open API service providing repository metadata for many open source software ecosystems.

Topic: "inference-acceleration"

thu-ml/SageAttention

Quantized attention that achieves speedups of 2-3x and 3-5x over FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.

Language: Cuda - Size: 46.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,357 - Forks: 94

ali-vilab/TeaCache

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Language: Python - Size: 22.5 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 686 - Forks: 26

thu-ml/SpargeAttn

SpargeAttention: A training-free sparse attention that can accelerate any model inference.

Language: Cuda - Size: 55.4 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 459 - Forks: 28

autonomi-ai/nos

⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW.

Language: Python - Size: 16.5 MB - Last synced at: 20 days ago - Pushed at: 11 months ago - Stars: 139 - Forks: 12

czg1225/AsyncDiff

Official implementation of "AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising"

Language: Python - Size: 64.7 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 130 - Forks: 6

dvlab-research/Q-LLM

This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"

Language: Python - Size: 6.84 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 29 - Forks: 0

jagennath-hari/DepthStream-Accelerator-ROS2-Integrated-Monocular-Depth-Inference

DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.

Language: C++ - Size: 10.9 MB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 17 - Forks: 0

marty1885/scirknn

Convert and run scikit-learn MLPs on Rockchip NPU.

Language: Python - Size: 34.2 KB - Last synced at: 20 days ago - Pushed at: almost 2 years ago - Stars: 7 - Forks: 2

fangvv/TLEE

Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"

Language: Python - Size: 85 KB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 0

fangvv/MTACP

Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"

Language: Python - Size: 1020 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 1

Bisonai/ncnn (fork of Tencent/ncnn)

Modified inference engine for quantized convolution using product quantization

Language: C++ - Size: 7.96 MB - Last synced at: 12 months ago - Pushed at: almost 3 years ago - Stars: 4 - Forks: 0

bzluan/AdaptPrune

The official repo for "Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models".

Language: Python - Size: 13.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0