An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: inference-optimization

google/XNNPACK

High-efficiency floating-point neural network inference operators for mobile, server, and Web

Language: C - Size: 178 MB - Last synced at: about 16 hours ago - Pushed at: about 16 hours ago - Stars: 2,111 - Forks: 440

yester31/TensorRT_Examples

TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization

Language: Python - Size: 5.61 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 2 - Forks: 2

bentoml/llm-inference-handbook

Everything you need to know about LLM inference

Language: TypeScript - Size: 10.7 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 229 - Forks: 21

keli-wen/AGI-Study

Blog posts, reading reports, and code examples covering AGI/LLM-related topics.

Language: Python - Size: 19.5 MB - Last synced at: 4 days ago - Pushed at: 7 months ago - Stars: 44 - Forks: 1

alibaba/BladeDISC

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.

Language: C++ - Size: 21.2 MB - Last synced at: 5 days ago - Pushed at: 9 months ago - Stars: 891 - Forks: 170

amazon-science/llm-rank-pruning

LLM-Rank: a graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the related paper.

Language: Python - Size: 30.3 KB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 7 - Forks: 3
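The core idea mentioned above can be sketched in a few lines: treat units as nodes in a graph weighted by connection magnitudes, score them with weighted PageRank, and prune the lowest-ranked. This is an illustrative toy, not the repository's actual implementation; the graph and pruning ratio are made up.

```python
# Illustrative sketch: score 4 "neurons" by weighted PageRank over a
# graph of absolute weight magnitudes, then keep the top-ranked units.
# Not the repo's implementation; adjacency values are hypothetical.

def weighted_pagerank(adj, damping=0.85, iters=100):
    """Power iteration on a row-normalized version of `adj` (list of lists)."""
    n = len(adj)
    rows = []
    for row in adj:
        s = sum(row)
        # Normalize outgoing edges; dangling nodes spread rank uniformly.
        rows.append([w / s if s else 1.0 / n for w in row])
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [
            (1 - damping) / n + damping * sum(rank[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank

adj = [
    [0.0, 0.9, 0.1, 0.0],
    [0.2, 0.0, 0.7, 0.1],
    [0.0, 0.3, 0.0, 0.7],
    [0.5, 0.0, 0.5, 0.0],
]
scores = weighted_pagerank(adj)
keep = sorted(range(len(scores)), key=lambda i: -scores[i])[:3]  # prune 1 of 4
print(keep)
```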

yester31/Monocular_Depth_Estimation_TRT

Language: Python - Size: 6.38 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 3 - Forks: 0

mvish7/dycoke_token_compression

This repo integrates DyCoke's token compression method with VLMs such as Gemma3 and InternVL3

Size: 3.91 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

mit-han-lab/inter-operator-scheduler

[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration

Language: C++ - Size: 3.13 MB - Last synced at: 10 days ago - Pushed at: over 3 years ago - Stars: 201 - Forks: 33

Jench2103/transformer-roofline-analyzer

CLI tool for estimating compute, memory bandwidth, and operational intensity of transformer models from Hugging Face configuration files. Ideal for performance and hardware deployment analysis.

Language: Python - Size: 108 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0
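Roofline analysis of the kind described above reduces to one ratio: FLOPs performed per byte moved (operational intensity). A minimal sketch for a single dense layer, with hypothetical sizes rather than values derived from a Hugging Face config as the tool does:

```python
# Roofline-style estimate for one dense layer: operational intensity =
# FLOPs / bytes moved. Layer sizes here are hypothetical examples.

def dense_layer_stats(m, k, n, bytes_per_elem=2):  # fp16 in/out/weights
    flops = 2 * m * k * n                                   # MAC counts as 2 ops
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # input + weights + output
    return flops, bytes_moved, flops / bytes_moved

# A 4096x4096 projection applied to a single token (m=1) is heavily
# memory-bound: intensity stays near 1 FLOP/byte.
flops, bts, oi = dense_layer_stats(1, 4096, 4096)
print(f"{flops} FLOPs, {bts} bytes, intensity {oi:.2f} FLOP/byte")
```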

zaki0521/Hands-On-Large-Language-Models

Explore hands-on projects with large language models. Learn techniques and best practices to harness AI effectively. Join the journey!

Language: Jupyter Notebook - Size: 13.5 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

ksm26/Efficiently-Serving-LLMs

Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX framework inference server.

Language: Jupyter Notebook - Size: 2.34 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 17 - Forks: 4

Wb-az/yolov8-disease-detection-agriculture

YOLOv8 object detection

Language: Jupyter Notebook - Size: 131 MB - Last synced at: 19 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 2

ccs96307/fast-llm-inference

Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.

Language: Python - Size: 168 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 9 - Forks: 1
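Of the techniques listed above, speculative decoding is the easiest to sketch: a cheap draft model proposes a block of tokens, the target model verifies them, and the longest agreeing prefix is accepted plus one corrected token. This greedy toy (stand-in models over small integers, all hypothetical) omits the rejection-sampling step used in the full algorithm:

```python
# Greedy sketch of speculative decoding. The "models" are toy rules:
# the draft follows (x + 1) % 5, the target (x + 1) % 7, so they agree
# on small tokens and diverge later.

def draft_model(prefix, k):
    ctx = list(prefix)
    out = []
    for _ in range(k):
        nxt = (ctx[-1] + 1) % 5   # cheap rule, wrong past token 4
        out.append(nxt)
        ctx.append(nxt)
    return out

def target_model_next(prefix):
    return (prefix[-1] + 1) % 7   # "ground truth" next token

def speculative_step(prefix, k=4):
    proposed = draft_model(prefix, k)
    ctx = list(prefix)
    for tok in proposed:
        correct = target_model_next(ctx)
        if tok == correct:
            ctx.append(tok)           # draft token verified, keep it
        else:
            ctx.append(correct)       # fix-up token from the target
            break
    return ctx

print(speculative_step([0]))  # draft and target agree for 4 tokens
print(speculative_step([3]))  # draft diverges after 1 token
```

The win is that all verified tokens cost one target-model pass instead of one pass per token.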

HanzhiZhang-Ulrica/DAM

Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead, without fine-tuning.

Language: Python - Size: 9.77 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

BaiTheBest/SparseLLM

Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)

Language: Python - Size: 145 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 61 - Forks: 9

vbdi/divprune

[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Language: Python - Size: 11 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 18 - Forks: 0

levisstrauss/Advanced-Food-Classification-System-EfficientNetB2

Computer vision project that classifies 101 food categories with 80.2% accuracy using fine-tuned EfficientNetB2 and PyTorch. Features interactive Gradio UI, optimized inference (~100ms/image), and strategic training on 20% of Food101 dataset for efficient resource utilization.

Language: Python - Size: 3.03 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

jiazhihao/TASO

The Tensor Algebra SuperOptimizer for Deep Learning

Language: C++ - Size: 1.21 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 712 - Forks: 94

EZ-Optimium/Optimium

Your AI Catalyst: inference backend to maximize your model's inference performance

Language: C++ - Size: 101 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

imedslab/pytorch_bn_fusion

Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.

Language: Python - Size: 54.7 KB - Last synced at: about 1 month ago - Pushed at: over 5 years ago - Stars: 197 - Forks: 29
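The transform this repo automates is simple to state: at inference time batch norm is a fixed affine map, so it can be folded into the preceding convolution's weights and bias. A per-channel scalar sketch of the folding algebra (numbers are arbitrary; this is not the repo's code):

```python
# BN folding: y = BN(conv(x)) collapses to a single affine op.
# With scale = gamma / sqrt(var + eps):
#   w' = w * scale,   b' = (b - mean) * scale + beta
import math

def fuse_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into one per-channel weight/bias pair."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Check that conv-then-BN equals the fused op on an arbitrary input.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.05, 0.9
w_f, b_f = fuse_bn(w, b, gamma, beta, mean, var)

x = 2.0
conv_bn = (x * w + b - mean) / math.sqrt(var + 1e-5) * gamma + beta
fused = x * w_f + b_f
print(abs(conv_bn - fused) < 1e-9)  # the two paths agree
```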

ZFTurbo/Keras-inference-time-optimizer

Optimize the layer structure of a Keras model to reduce computation time

Language: Python - Size: 77.1 KB - Last synced at: 2 months ago - Pushed at: about 5 years ago - Stars: 157 - Forks: 18

shreyansh26/Accelerating-Cross-Encoder-Inference

Leveraging torch.compile to accelerate cross-encoder inference

Language: Python - Size: 3.84 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

cedrickchee/pytorch-mobile-ios Fork of pytorch/ios-demo-app

PyTorch Mobile: iOS examples

Size: 47.3 MB - Last synced at: 3 days ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

cedrickchee/pytorch-mobile-android Fork of pytorch/android-demo-app

PyTorch Mobile: Android examples of usage in applications

Size: 53 MB - Last synced at: 3 days ago - Pushed at: almost 6 years ago - Stars: 2 - Forks: 1

grazder/template.cpp

A template for getting started writing code using GGML

Language: C++ - Size: 40 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 9 - Forks: 0

OneAndZero24/TRTTL

TensorRT C++ Template Library

Language: C++ - Size: 423 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

Harly-1506/Faster-Inference-yolov8

Faster inference YOLOv8: optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy

Language: Python - Size: 49.8 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 8 - Forks: 1

amazon-science/mlp-rank-pruning

MLP-Rank: a graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the related thesis.

Language: Python - Size: 60.5 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 1

piotrostr/infer-trt

Interface for TensorRT engine inference, with an example using a YOLOv4 engine.

Language: Python - Size: 17.6 KB - Last synced at: 3 months ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

matteo-stat/transformers-nlp-multi-label-classification

This repo provides scripts for fine-tuning Hugging Face Transformers, setting up pipelines, and optimizing multi-label classification models for inference. They are based on my experience developing a custom chatbot; I'm sharing them in the hope that they help others quickly fine-tune and use models in their own projects!

Language: Python - Size: 31.3 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

matteo-stat/transformers-nlp-ner-token-classification

This repo provides scripts for fine-tuning Hugging Face Transformers, setting up pipelines, and optimizing token classification models for inference. They are based on my experience developing a custom chatbot; I'm sharing them in the hope that they help others quickly fine-tune and use models in their own projects!

Language: Python - Size: 22.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

manickavela29/EmoTwitter

ONNX Runtime-based inference optimization of a RoBERTa model trained for sentiment analysis on a Twitter dataset

Language: Jupyter Notebook - Size: 12.7 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Bisonai/ncnn Fork of Tencent/ncnn

Modified inference engine for quantized convolution using product quantization

Language: C++ - Size: 7.96 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 4 - Forks: 0

Rapternmn/PyTorch-Onnx-Tensorrt

A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3.

Language: Python - Size: 2.83 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 80 - Forks: 18

ankdeshm/inference-optimization

A compilation of various ML and DL models and ways to optimize their inference.

Language: Jupyter Notebook - Size: 6.17 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

kiritigowda/mivisionx-inference-analyzer

MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize individual image results

Language: Python - Size: 11.7 MB - Last synced at: 5 months ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 3

goshaQ/inference-optimizer

A simple tool that applies structure-level optimizations (e.g., quantization) to a TensorFlow model

Language: Python - Size: 6.84 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 1
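The kind of structure-level quantization such tools apply can be sketched as an affine (asymmetric) int8 mapping: a float range [min, max] is mapped onto [0, 255] via a scale and zero-point. A minimal, dependency-free illustration (toy values, not this tool's API):

```python
# Post-training affine int8 quantization sketch: quantize a float range
# to [0, 255], then dequantize and observe the bounded rounding error.

def quant_params(x_min, x_max, qmin=0, qmax=255):
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point

def quantize(xs, scale, zp, qmin=0, qmax=255):
    # Round to the nearest grid point and clamp into the int8 range.
    return [min(qmax, max(qmin, round(x / scale + zp))) for x in xs]

def dequantize(qs, scale, zp):
    return [(q - zp) * scale for q in qs]

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
scale, zp = quant_params(min(xs), max(xs))
qs = quantize(xs, scale, zp)
back = dequantize(qs, scale, zp)
print(qs, [round(v, 3) for v in back])
```

The round trip is lossy, but each reconstructed value stays within one quantization step (`scale`) of the original.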

lmaxwell/Armednn

Cross-platform, modular neural network inference library; small and efficient

Language: C++ - Size: 1.05 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 2

zhliuworks/Fast-MobileNetV2

๐Ÿค–๏ธ Optimized CUDA Kernels for Fast MobileNetV2 Inference

Language: Cuda - Size: 15 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 1

sjlee25/batch-partitioning

Batch Partitioning for Multi-PE Inference with TVM (2020)

Language: Python - Size: 3.79 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

aalbaali/LieBatch

Batch estimation on Lie groups

Language: MATLAB - Size: 3.5 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

effrosyni-papanastasiou/constrained-em

A constrained expectation-maximization algorithm for feasible graph inference.

Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

ieee820/ncnn Fork of Tencent/ncnn

ncnn is a high-performance neural network inference framework optimized for the mobile platform

Language: C++ - Size: 6.81 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

Related Keywords
inference-optimization (44), deep-learning (9), machine-learning (6), onnx (6), pytorch (6), pruning (5), tensorrt (5), neural-network (5), quantization (4), onnxruntime (4), large-language-models (4), llm (4), computer-vision (3), object-detection (3), huggingface (3), inference (3), inference-engine (3), deep-learning-techniques (2), transformers (2), large-scale-deployment (2), machine-learning-operations (2), tensorflow (2), graph-theory (2), fine-tuning (2), pagerank (2), weighted-pagerank (2), huggingface-pipelines (2), huggingface-transformers (2), openvino-toolkit (2), nlp (2), convolutional-neural-networks (2), cpu (2), edge-ai (2), ultralytics (2), libtorch (2), pytorch-mobile (2), depth-pro (2), yolov8 (2), cpp (2), acceleration (2), serving-infrastructure (2), amd (2), deep-neural-networks (2), efficient-ai (2), model-serving (2), model-inference-service (2), model-acceleration (2), avx2 (1), token-classification (1), convolutional-neural-network (1), avx512 (1), bert-models (1), roberta-model (1), sentiment-analysis (1), edge-machine-learning (1), inference-acceleration (1), mobile-deep-learning (1), rocm (1), torch-compile (1), ios-app (1), android-app (1), ggml (1), nvidia (1), template-library (1), image-processing (1), numpy-arrays (1), numpy-implementation (1), opencv (1), openvino (1), segmentation (1), torch (1), centrality-measures (1), multilayer-perceptron (1), structured-sparsity (1), multi-label-classification (1), text-classification (1), named-entity-recognition (1), ner (1), squeezenet (1), vgg (1), tensorflow-models (1), conv1d (1), eigen (1), eigen3 (1), lstm (1), cuda-kernels (1), mobilenet-v2 (1), data-parallelism (1), dl-compiler (1), dl-optimization (1), tvm (1), batch-optimization (1), g2o (1), lie-groups (1), state-estimation (1), expectation-maximisation-algorithm (1), expectation-maximization (1), feasibility (1), network-inference (1), arm-neon (1)