GitHub topics: inference-optimization
HanzhiZhang-Ulrica/DAM
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
Language: Python - Size: 9.77 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

google/XNNPACK
High-efficiency floating-point neural network inference operators for mobile, server, and Web
Language: C - Size: 172 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,039 - Forks: 419

Jench2103/transformer-roofline-analyzer
CLI tool for estimating compute, memory bandwidth, and operational intensity of transformer models from Hugging Face configuration files. Ideal for performance and hardware deployment analysis.
Language: Python - Size: 59.6 KB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

keli-wen/AGI-Study
Blog posts, reading reports, and code examples on AGI/LLM-related topics.
Language: Python - Size: 19.5 MB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 39 - Forks: 1

vbdi/divprune
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Language: Python - Size: 11 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 18 - Forks: 0

alibaba/BladeDISC
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
Language: C++ - Size: 21.2 MB - Last synced at: 14 days ago - Pushed at: 5 months ago - Stars: 869 - Forks: 164

levisstrauss/Advanced-Food-Classification-System-EfficientNetB2
Computer vision project that classifies 101 food categories with 80.2% accuracy using a fine-tuned EfficientNetB2 and PyTorch. Features an interactive Gradio UI, optimized inference (~100 ms/image), and training on 20% of the Food101 dataset for efficient resource utilization.
Language: Python - Size: 3.03 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

jiazhihao/TASO
The Tensor Algebra SuperOptimizer for Deep Learning
Language: C++ - Size: 1.21 MB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 712 - Forks: 94

ccs96307/fast-llm-inference
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
Language: Jupyter Notebook - Size: 168 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 8 - Forks: 1

mit-han-lab/inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Language: C++ - Size: 3.13 MB - Last synced at: 28 days ago - Pushed at: about 3 years ago - Stars: 199 - Forks: 33

Wb-az/YOLOv8-disease-detection-agriculture
YOLOv8 - Object detection
Language: Jupyter Notebook - Size: 131 MB - Last synced at: 13 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 2

imedslab/pytorch_bn_fusion 📦
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
Language: Python - Size: 54.7 KB - Last synced at: 24 days ago - Pushed at: about 5 years ago - Stars: 197 - Forks: 29

ZFTurbo/Keras-inference-time-optimizer
Optimizes the layer structure of a Keras model to reduce computation time
Language: Python - Size: 77.1 KB - Last synced at: 2 days ago - Pushed at: almost 5 years ago - Stars: 157 - Forks: 18

ksm26/Efficiently-Serving-LLMs
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase’s LoRAX framework inference server.
Language: Jupyter Notebook - Size: 2.34 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 3

shreyansh26/Accelerating-Cross-Encoder-Inference
Leveraging torch.compile to accelerate cross-encoder inference
Language: Python - Size: 3.84 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

grazder/template.cpp
A template for getting started writing code using GGML
Language: C++ - Size: 40 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 0

EZ-Optimium/Optimium
Your AI Catalyst: inference backend to maximize your model's inference performance
Language: C++ - Size: 101 MB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 5 - Forks: 0

OneAndZero24/TRTTL
TensorRT C++ Template Library
Language: C++ - Size: 423 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Harly-1506/Faster-Inference-yolov8
Faster inference YOLOv8: Optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢
Language: Python - Size: 49.8 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 8 - Forks: 1

amazon-science/mlp-rank-pruning
MLP-Rank: A graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the related thesis.
Language: Python - Size: 60.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 1

piotrostr/infer-trt
Interface for TensorRT engine inference, along with an example using a YOLOv4 engine.
Language: Python - Size: 17.6 KB - Last synced at: 9 days ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

yester31/TensorRT_Examples
Useful sample code for TensorRT models using ONNX
Language: Python - Size: 240 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 1

matteo-stat/transformers-nlp-multi-label-classification
This repo provides scripts for fine-tuning HuggingFace Transformers, setting up pipelines, and optimizing multi-label classification models for inference. They are based on my experience developing a custom chatbot; I’m sharing them in the hope they will help others quickly fine-tune and use models in their projects! 😊
Language: Python - Size: 31.3 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

matteo-stat/transformers-nlp-ner-token-classification
This repo provides scripts for fine-tuning HuggingFace Transformers, setting up pipelines, and optimizing token classification models for inference. They are based on my experience developing a custom chatbot; I’m sharing them in the hope they will help others quickly fine-tune and use models in their projects! 😊
Language: Python - Size: 22.5 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

manickavela29/EmoTwitter
ONNX Runtime-based inference optimization of a RoBERTa model trained for sentiment analysis on a Twitter dataset
Language: Jupyter Notebook - Size: 12.7 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Bisonai/ncnn (fork of Tencent/ncnn)
Modified inference engine for quantized convolution using product quantization
Language: C++ - Size: 7.96 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 4 - Forks: 0

cedrickchee/pytorch-mobile-android (fork of pytorch/android-demo-app)
PyTorch Mobile: Android examples of usage in applications
Size: 53 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 1

cedrickchee/pytorch-mobile-ios (fork of pytorch/ios-demo-app)
PyTorch Mobile: iOS examples
Size: 47.3 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

Rapternmn/PyTorch-Onnx-Tensorrt
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3.
Language: Python - Size: 2.83 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 80 - Forks: 18

ankdeshm/inference-optimization
A compilation of various ML and DL models and ways to optimize their inference.
Language: Jupyter Notebook - Size: 6.17 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

kiritigowda/mivisionx-inference-analyzer
MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize individual image results
Language: Python - Size: 11.7 MB - Last synced at: 2 months ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 3

goshaQ/inference-optimizer
A simple tool that applies structure-level optimizations (e.g., quantization) to a TensorFlow model
Language: Python - Size: 6.84 KB - Last synced at: almost 2 years ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 1

lmaxwell/Armednn
Cross-platform modular neural network inference library, small and efficient
Language: C++ - Size: 1.05 MB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 13 - Forks: 2

zhliuworks/Fast-MobileNetV2
🤖️ Optimized CUDA Kernels for Fast MobileNetV2 Inference
Language: Cuda - Size: 15 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 1

sjlee25/batch-partitioning
Batch Partitioning for Multi-PE Inference with TVM (2020)
Language: Python - Size: 3.79 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

aalbaali/LieBatch
Batch estimation on Lie groups
Language: MATLAB - Size: 3.5 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

effrosyni-papanastasiou/constrained-em
A constrained expectation-maximization algorithm for feasible graph inference.
Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 0

ieee820/ncnn (fork of Tencent/ncnn)
ncnn is a high-performance neural network inference framework optimized for the mobile platform
Language: C++ - Size: 6.81 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0
