Topic: "quantization"
hiyouga/LLaMA-Factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Language: Python - Size: 44 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 46,849 - Forks: 5,727

ymcui/Chinese-LLaMA-Alpaca
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
Language: Python - Size: 23 MB - Last synced at: 8 days ago - Pushed at: 12 months ago - Stars: 18,795 - Forks: 1,890

SYSTRAN/faster-whisper
Faster Whisper transcription with CTranslate2
Language: Python - Size: 36.6 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 15,467 - Forks: 1,299

UFund-Me/Qbot
[🔥 updating ...] AI-powered automated quantitative trading bot (fully local deployment); an AI-powered quantitative investment research platform. 📃 Online docs: https://ufund-me.github.io/Qbot ✨ News: qbot-mini: https://github.com/Charmve/iQuant
Language: Jupyter Notebook - Size: 387 MB - Last synced at: 12 days ago - Pushed at: 5 months ago - Stars: 10,984 - Forks: 1,579

bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
Language: Python - Size: 2.73 MB - Last synced at: about 18 hours ago - Pushed at: 2 days ago - Stars: 6,933 - Forks: 686
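To make the "k-bit quantization" in the description above concrete, here is a minimal absmax int8 round trip in plain Python — a sketch of the underlying idea only, not bitsandbytes' actual API or CUDA kernels:

```python
# Minimal sketch of absmax int8 quantization, the core idea behind k-bit
# schemes like those in bitsandbytes (illustrative only, not the
# library's actual API).
def quantize_absmax_int8(xs):
    """Map floats to int8 codes using a per-tensor absmax scale."""
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

weights = [0.5, -1.0, 0.25, 0.9]
qs, scale = quantize_absmax_int8(weights)
restored = dequantize(qs, scale)
assert all(-128 <= q <= 127 for q in qs)
# The round trip is lossy, but the error is bounded by half a step.
assert max(abs(a - b) for a, b in zip(restored, weights)) <= scale / 2 + 1e-12
```

Real libraries refine this with per-block scales and outlier handling, but the quantize/dequantize round trip is the same shape.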

kornelski/pngquant
Lossy PNG compressor — pngquant command based on libimagequant library
Language: C - Size: 1.71 MB - Last synced at: 11 months ago - Pushed at: 12 months ago - Stars: 5,045 - Forks: 477
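Palette quantization of the kind pngquant performs reduces, at its core, to a nearest-color lookup against a small palette. The sketch below assumes a hand-picked palette; pngquant/libimagequant additionally choose the palette adaptively and apply dithering:

```python
# Illustrative palette quantization: map each RGB pixel to its nearest
# palette color by squared Euclidean distance. Not pngquant's algorithm,
# which also builds the palette and dithers.
def nearest_palette_color(pixel, palette):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(palette, key=lambda c: dist2(pixel, c))

palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 0, 255)]
image = [(250, 10, 5), (12, 12, 240), (200, 200, 200)]
quantized = [nearest_palette_color(p, palette) for p in image]
assert quantized == [(255, 0, 0), (0, 0, 255), (255, 255, 255)]
```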

AutoGPTQ/AutoGPTQ 📦
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
Language: Python - Size: 8.01 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 4,804 - Forks: 513

IntelLabs/distiller 📦
Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Language: Jupyter Notebook - Size: 40.5 MB - Last synced at: 11 months ago - Pushed at: almost 2 years ago - Stars: 4,310 - Forks: 797

OpenNMT/CTranslate2
Fast inference engine for Transformer models
Language: C++ - Size: 14 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 3,380 - Forks: 300

neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
Language: Python - Size: 137 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 3,130 - Forks: 183

huawei-noah/Pretrained-Language-Model
Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.
Language: Python - Size: 29 MB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 3,080 - Forks: 635

IntelLabs/nlp-architect 📦
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
Language: Python - Size: 531 MB - Last synced at: 11 months ago - Pushed at: over 2 years ago - Stars: 2,931 - Forks: 443

huggingface/optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
Language: Python - Size: 5.32 MB - Last synced at: about 22 hours ago - Pushed at: 1 day ago - Stars: 2,855 - Forks: 521

aaron-xichen/pytorch-playground
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)
Language: Python - Size: 45.9 KB - Last synced at: 6 days ago - Pushed at: over 2 years ago - Stars: 2,658 - Forks: 614

stochasticai/xTuring
Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our discord community: https://discord.gg/TgHXuSJEk6
Language: Python - Size: 18.4 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 2,642 - Forks: 205

intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
Language: Python - Size: 468 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 2,375 - Forks: 267

dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops
Language: Python - Size: 261 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 2,303 - Forks: 233

quic/aimet
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
Language: Python - Size: 18.8 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,274 - Forks: 400

666DZY666/micronet
micronet, a model compression and deployment library. Compression: 1) quantization: quantization-aware training (QAT) — high-bit (>2b) (DoReFa; "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference") and low-bit (≤2b)/ternary and binary (TWN/BNN/XNOR-Net) — plus post-training quantization (PTQ), 8-bit (TensorRT); 2) pruning: normal, regular, and group convolutional channel pruning; 3) group convolution structure; 4) batch-normalization fusion for quantization. Deployment: TensorRT, fp32/fp16/int8 (PTQ calibration), op adaptation (upsample), dynamic shape.
Language: Python - Size: 6.58 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2,239 - Forks: 476
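The quantization-aware training (QAT) mentioned in the description above rests on "fake quantization": the forward pass snaps values to the low-bit grid while training otherwise proceeds in float. A minimal scalar sketch, assuming a fixed [-1, 1] range (real QAT learns or observes the range per tensor):

```python
# Sketch of fake quantization as used in QAT (illustrative, not
# micronet's API). Values are clamped to [x_min, x_max] and rounded to a
# uniform grid with 2**num_bits levels' worth of steps.
def fake_quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    x_clamped = min(max(x, x_min), x_max)
    q = round((x_clamped - x_min) / scale)
    return q * scale + x_min

# 8-bit fake quantization keeps values close to the original...
assert abs(fake_quantize(0.3) - 0.3) < 1 / 255
# ...while 2-bit quantization snaps to one of only four levels.
assert abs(fake_quantize(0.3, num_bits=2) - 1 / 3) < 1e-9
```

During backprop, frameworks treat the rounding as identity (the straight-through estimator) so gradients still flow.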

Efficient-ML/Awesome-Model-Quantization
A list of papers, docs, and code about model quantization. This repo aims to collect resources for model quantization research and is continuously improved. PRs adding works (papers, repositories) the repo has missed are welcome.
Size: 61.5 MB - Last synced at: 8 days ago - Pushed at: about 2 months ago - Stars: 2,053 - Forks: 220

pytorch/ao
PyTorch native quantization and sparsity for training and inference
Language: Python - Size: 29.3 MB - Last synced at: about 11 hours ago - Pushed at: about 12 hours ago - Stars: 1,973 - Forks: 245

intel/intel-extension-for-pytorch
A Python package that extends official PyTorch to deliver additional performance on Intel platforms
Language: Python - Size: 102 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,826 - Forks: 268

OpenPPL/ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
Language: Python - Size: 5.57 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 1,677 - Forks: 249

PaddlePaddle/PaddleSlim
PaddleSlim is an open-source library for deep model compression and architecture search.
Language: Python - Size: 16.3 MB - Last synced at: 10 days ago - Pushed at: 5 months ago - Stars: 1,587 - Forks: 350

open-mmlab/mmrazor
OpenMMLab Model Compression Toolbox and Benchmark.
Language: Python - Size: 11.1 MB - Last synced at: 10 days ago - Pushed at: 10 months ago - Stars: 1,576 - Forks: 235

tensorflow/model-optimization
A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
Language: Python - Size: 2.22 MB - Last synced at: 11 days ago - Pushed at: 2 months ago - Stars: 1,530 - Forks: 325
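Alongside quantization, toolkits like this one implement pruning; the simplest variant is magnitude pruning, which zeroes the smallest-magnitude weights to hit a target sparsity. A minimal sketch (illustrative, not the toolkit's API):

```python
# Magnitude pruning sketch: drop the smallest |w| until the requested
# fraction of weights is zero. Ties at the threshold may over-prune
# slightly; real toolkits handle this per-layer with schedules.
def magnitude_prune(weights, sparsity):
    k = int(len(weights) * sparsity)  # number of weights to drop
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.05, -0.8, 0.02, 1.2, -0.3, 0.07]
pruned = magnitude_prune(w, sparsity=0.5)
assert pruned == [0.0, -0.8, 0.0, 1.2, -0.3, 0.0]
```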

RWKV/rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
Language: C++ - Size: 42.1 MB - Last synced at: 9 days ago - Pushed at: 28 days ago - Stars: 1,501 - Forks: 103

mit-han-lab/nunchaku
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Language: Cuda - Size: 85.6 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1,425 - Forks: 84

thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Language: Cuda - Size: 46.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,294 - Forks: 89

Xilinx/brevitas
Brevitas: neural network quantization in PyTorch
Language: Python - Size: 20.1 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 1,292 - Forks: 210

RahulSChand/gpu_poor
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
Language: JavaScript - Size: 1.56 MB - Last synced at: 7 days ago - Pushed at: 5 months ago - Stars: 1,284 - Forks: 68

huawei-noah/Efficient-Computing
Efficient computing methods developed by Huawei Noah's Ark Lab
Language: Jupyter Notebook - Size: 100 MB - Last synced at: 9 days ago - Pushed at: 6 months ago - Stars: 1,257 - Forks: 217

vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Language: Python - Size: 22.5 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 1,220 - Forks: 114

open-edge-platform/training_extensions
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
Language: Python - Size: 409 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1,163 - Forks: 450

openvinotoolkit/nncf
Neural Network Compression Framework for enhanced OpenVINO™ inference
Language: Python - Size: 60.9 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 997 - Forks: 251

huggingface/optimum-quanto
A pytorch quantization backend for optimum
Language: Python - Size: 2.58 MB - Last synced at: 10 days ago - Pushed at: 13 days ago - Stars: 917 - Forks: 72

mit-han-lab/tinyengine
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 256KB Memory
Language: C - Size: 235 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 843 - Forks: 137

guan-yuan/awesome-AutoML-and-Lightweight-Models
A list of high-quality (newest) AutoML works and lightweight models including 1.) Neural Architecture Search, 2.) Lightweight Structures, 3.) Model Compression, Quantization and Acceleration, 4.) Hyperparameter Optimization, 5.) Automated Feature Engineering.
Size: 150 KB - Last synced at: 11 months ago - Pushed at: almost 4 years ago - Stars: 827 - Forks: 160

ImageOptim/libimagequant
Palette quantization library that powers pngquant and other PNG optimizers
Language: Rust - Size: 1.34 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 822 - Forks: 134

Xilinx/finn
Dataflow compiler for QNN inference on FPGAs
Language: Python - Size: 84.5 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 804 - Forks: 254

mobiusml/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
Language: Python - Size: 521 KB - Last synced at: 1 day ago - Pushed at: 7 days ago - Stars: 786 - Forks: 80

PINTO0309/onnx2tf
Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
Language: Python - Size: 3.75 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 782 - Forks: 77

csarron/awesome-emdl
Embedded and mobile deep learning research resources
Size: 88.9 KB - Last synced at: 8 days ago - Pushed at: about 2 years ago - Stars: 746 - Forks: 167

mit-han-lab/TinyChatEngine
TinyChatEngine: On-Device LLM Inference Library
Language: C++ - Size: 83.3 MB - Last synced at: 5 months ago - Pushed at: 10 months ago - Stars: 745 - Forks: 72

SqueezeAILab/SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
Language: Python - Size: 1.5 MB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 685 - Forks: 45

DeepVAC/deepvac
PyTorch Project Specification.
Language: Python - Size: 791 KB - Last synced at: 12 days ago - Pushed at: over 3 years ago - Stars: 679 - Forks: 105

IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Language: Python - Size: 708 KB - Last synced at: 4 months ago - Pushed at: 8 months ago - Stars: 661 - Forks: 52
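The "FP16xINT4" above implies weights stored two-per-byte and unpacked on the fly. A minimal sketch of that packing (illustrative; Marlin's actual tile layout and sign handling are considerably more elaborate):

```python
# Pack two signed int4 values into one byte and recover them.
def pack_int4_pair(lo, hi):
    """Pack two values in [-8, 7] into one byte (lo in the low nibble)."""
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return (lo & 0xF) | ((hi & 0xF) << 4)

def unpack_int4_pair(byte):
    def sign_extend(nibble):
        # Nibbles 8..15 encode the negative values -8..-1.
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

packed = pack_int4_pair(-3, 7)
assert packed == 125
assert unpack_int4_pair(packed) == (-3, 7)
```

An inference kernel would unpack nibbles like this in registers, dequantize with a group scale, and multiply against FP16 activations.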

OpenGVLab/OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Language: Python - Size: 8.16 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 629 - Forks: 49

SforAiDl/KD_Lib
A Pytorch Knowledge Distillation library for benchmarking and extending works in the domains of Knowledge Distillation, Pruning, and Quantization.
Language: Python - Size: 22.2 MB - Last synced at: 8 days ago - Pushed at: about 2 years ago - Stars: 622 - Forks: 59

Ki6an/fastT5
⚡ Boost inference speed of T5 models by 5x and reduce model size by 3x.
Language: Python - Size: 277 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 578 - Forks: 73

Maknee/minigpt4.cpp
Port of MiniGPT4 in C++ (4bit, 5bit, 6bit, 8bit, 16bit CPU inference with GGML)
Language: C++ - Size: 2.12 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 565 - Forks: 27

google/qkeras
QKeras: a quantization deep learning library for Tensorflow Keras
Language: Python - Size: 1.53 MB - Last synced at: about 18 hours ago - Pushed at: 16 days ago - Stars: 562 - Forks: 105

thulab/DeepHash
An Open-Source Package for Deep Learning to Hash (DeepHash)
Language: Python - Size: 7.71 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 541 - Forks: 126

cedrickchee/awesome-ml-model-compression
Awesome machine learning model compression research papers, quantization, tools, and learning material.
Size: 213 KB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 510 - Forks: 61

MPolaris/onnx2tflite
Tool for onnx->keras or onnx->tflite. Hope this tool can help you.
Language: Python - Size: 175 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 503 - Forks: 39

Zhen-Dong/Awesome-Quantization-Papers
List of papers related to neural network quantization in recent AI conferences and journals.
Size: 309 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 478 - Forks: 39

DerryHub/BEVFormer_tensorrt
BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins (float/half/half2/int8).
Language: Python - Size: 403 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 466 - Forks: 76

ModelCloud/GPTQModel
Production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
Language: Python - Size: 11.8 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 465 - Forks: 68

huggingface/optimum-intel
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
Language: Jupyter Notebook - Size: 17 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 459 - Forks: 128

ModelTC/llmc
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Language: Python - Size: 28.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 456 - Forks: 53

thu-ml/SpargeAttn
SpargeAttention: A training-free sparse attention that can accelerate any model inference.
Language: Cuda - Size: 55.4 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 435 - Forks: 27

intel/auto-round
Advanced Quantization Algorithm for LLMs/VLMs.
Language: Python - Size: 10.4 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 431 - Forks: 33

hailo-ai/hailo_model_zoo
The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment
Language: Python - Size: 5.78 MB - Last synced at: 8 days ago - Pushed at: 10 days ago - Stars: 425 - Forks: 54

1duo/awesome-ai-infrastructures
Infrastructures™ for Machine Learning Training/Inference in Production.
Size: 11.8 MB - Last synced at: 8 days ago - Pushed at: almost 6 years ago - Stars: 411 - Forks: 73

sony/model_optimization
Model Compression Toolkit (MCT) is an open-source project for optimizing neural network models for efficient deployment on constrained hardware. It provides researchers, developers, and engineers with advanced quantization and compression tools for deploying state-of-the-art neural networks.
Language: Python - Size: 22 MB - Last synced at: about 4 hours ago - Pushed at: about 5 hours ago - Stars: 386 - Forks: 66

mit-han-lab/haq
[CVPR 2019, Oral] HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Language: Python - Size: 64.5 KB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 380 - Forks: 85

tpoisonooo/llama.onnx
LLaMa/RWKV onnx models, quantization and testcase
Language: Python - Size: 1.3 MB - Last synced at: 1 day ago - Pushed at: almost 2 years ago - Stars: 361 - Forks: 31

Zhen-Dong/HAWQ
Quantization library for PyTorch. Support low-precision and mixed-precision quantization, with hardware implementation through TVM.
Language: Python - Size: 691 KB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 361 - Forks: 80

neuralmagic/sparsezoo
Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes
Language: Python - Size: 1.79 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 359 - Forks: 23

Xiuyu-Li/q-diffusion
[ICCV 2023] Q-Diffusion: Quantizing Diffusion Models.
Language: Python - Size: 5.97 MB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 347 - Forks: 24

inisis/brocolli
Everything in Torch Fx
Language: Python - Size: 5.9 MB - Last synced at: 7 days ago - Pushed at: 11 months ago - Stars: 341 - Forks: 61

SqueezeAILab/KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Language: Python - Size: 19.8 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 339 - Forks: 30

megvii-research/FQ-ViT
[IJCAI 2022] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
Language: Python - Size: 729 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 329 - Forks: 48

megvii-research/Sparsebit
A model compression and acceleration toolbox based on pytorch.
Language: Python - Size: 7.45 MB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 327 - Forks: 40

neuralmagic/sparsify
ML model optimization product to accelerate inference.
Language: Python - Size: 7.18 MB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 326 - Forks: 30

caoyue10/DeepHash-Papers
Must-read papers on deep learning to hash (DeepHash)
Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 319 - Forks: 78

jy-yuan/KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Language: Python - Size: 16.7 MB - Last synced at: 1 day ago - Pushed at: 3 months ago - Stars: 288 - Forks: 29
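KIVI's 2-bit scheme belongs to the asymmetric (zero-point) family: the observed [min, max] range is mapped onto an unsigned grid, so skewed distributions waste no levels. A minimal sketch of that family, not KIVI's per-channel/per-token layout:

```python
# Asymmetric min-max quantization sketch (illustrative only).
def quantize_asymmetric(xs, num_bits=2):
    lo, hi = min(xs), max(xs)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    qs = [round((x - lo) / scale) for x in xs]
    return qs, scale, lo

def dequantize(qs, scale, lo):
    return [q * scale + lo for q in qs]

xs = [0.0, 0.9, 2.1, 6.0]          # skewed, all-positive values
qs, scale, lo = quantize_asymmetric(xs, num_bits=2)  # 4 levels: 0..3
assert qs == [0, 0, 1, 3]
assert dequantize(qs, scale, lo) == [0.0, 0.0, 2.0, 6.0]
```

A symmetric signed grid would spend half its levels on negative values this data never takes; the zero-point shift avoids that.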

Beomi/BitNet-Transformers
0️⃣1️⃣🤗 BitNet-Transformers: Huggingface Transformers Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch with Llama(2) Architecture
Language: Python - Size: 588 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 288 - Forks: 34

datawhalechina/awesome-compression
A beginner-friendly introductory tutorial on model compression
Size: 302 MB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 265 - Forks: 35

sinanuozdemir/quick-start-guide-to-llms
The Official Repo for "Quick Start Guide to Large Language Models"
Language: Jupyter Notebook - Size: 90.6 MB - Last synced at: 8 days ago - Pushed at: 13 days ago - Stars: 264 - Forks: 145

Bisonai/awesome-edge-machine-learning
A curated list of awesome edge machine learning resources, including research papers, inference engines, challenges, books, meetups and others.
Language: Python - Size: 135 KB - Last synced at: 9 days ago - Pushed at: about 2 years ago - Stars: 260 - Forks: 51

amirgholami/ZeroQ
[CVPR'20] ZeroQ: A Novel Zero Shot Quantization Framework
Language: Python - Size: 5.47 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 258 - Forks: 52

THU-MIG/torch-model-compression
An automated toolset for analyzing and modifying the structure of PyTorch models, including a model compression algorithm library with automatic model structure analysis
Language: Python - Size: 132 KB - Last synced at: 7 days ago - Pushed at: about 2 years ago - Stars: 250 - Forks: 41

blue-oil/blueoil 📦
Bring Deep Learning to small devices
Language: Python - Size: 291 MB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 249 - Forks: 85

microsoft/LQ-Nets 📦
LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
Language: Python - Size: 28.3 KB - Last synced at: about 6 hours ago - Pushed at: over 2 years ago - Stars: 242 - Forks: 69

kssteven418/I-BERT
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
Language: Python - Size: 6.38 MB - Last synced at: 14 days ago - Pushed at: about 2 years ago - Stars: 241 - Forks: 34

datawhalechina/llm-deploy
Theory and practice of large model / LLM inference and deployment
Size: 100 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 240 - Forks: 35

jakc4103/DFQ
PyTorch implementation of Data Free Quantization Through Weight Equalization and Bias Correction.
Language: Python - Size: 140 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 239 - Forks: 42

Picovoice/picollm
On-device LLM Inference Powered by X-Bit Quantization
Language: Python - Size: 94.2 MB - Last synced at: 1 day ago - Pushed at: 10 days ago - Stars: 233 - Forks: 13

j-marple-dev/model_compression
PyTorch Model Compression
Language: Python - Size: 31 MB - Last synced at: 5 months ago - Pushed at: about 2 years ago - Stars: 230 - Forks: 25

aredden/flux-fp8-api
Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, which is ~2x faster on consumer devices.
Language: Python - Size: 157 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 227 - Forks: 28

zcemycl/TF2DeepFloorplan
TF2 Deep FloorPlan Recognition using a Multi-task Network with Room-boundary-Guided Attention. Enable tensorboard, quantization, flask, tflite, docker, github actions and google colab.
Language: Python - Size: 7.93 MB - Last synced at: 11 days ago - Pushed at: almost 2 years ago - Stars: 222 - Forks: 72

submission2019/cnn-quantization
Quantization of Convolutional Neural networks.
Language: Python - Size: 2.71 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 216 - Forks: 58

ikergarcia1996/Easy-Translate
Easy-Translate is a script for translating large text files with a SINGLE COMMAND. Easy-Translate is designed to be as easy as possible for beginners and as seamless and customizable as possible for advanced users.
Language: Python - Size: 656 KB - Last synced at: 13 days ago - Pushed at: 5 months ago - Stars: 212 - Forks: 326

Aaronhuang-778/BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Language: Python - Size: 1.73 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 207 - Forks: 14

natasha/navec
Compact high quality word embeddings for Russian language
Language: Python - Size: 1.86 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 198 - Forks: 18

FasterDecoding/BitDelta
Language: Jupyter Notebook - Size: 7.21 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 197 - Forks: 15

dbohdan/hicolor
🎨 Convert images to 15/16-bit RGB color with dithering
Language: C - Size: 639 KB - Last synced at: 6 months ago - Pushed at: 7 months ago - Stars: 195 - Forks: 5
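The 16-bit color format above (RGB565) keeps 5 bits of red, 6 of green, and 5 of blue. A sketch of the conversion — hicolor additionally dithers, which this truncating version does not:

```python
# RGB888 <-> RGB565 sketch (illustrative; hicolor adds dithering).
def rgb888_to_rgb565(r, g, b):
    # Keep the top 5/6/5 bits of each channel.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def rgb565_to_rgb888(c):
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    # Expand back to 8 bits by replicating the high bits, so that
    # full-scale values map back to exactly 255.
    return (r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2)

assert rgb888_to_rgb565(255, 255, 255) == 0xFFFF
assert rgb565_to_rgb888(rgb888_to_rgb565(200, 100, 50)) == (206, 101, 49)
```

The round-trip error (here 200 → 206, 100 → 101, 50 → 49) is exactly the banding that dithering trades for noise.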

VITA-Group/Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
Language: Python - Size: 343 KB - Last synced at: 8 days ago - Pushed at: 9 months ago - Stars: 195 - Forks: 16

wenwei202/terngrad
Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)
Language: Python - Size: 5.59 MB - Last synced at: 12 months ago - Pushed at: over 6 years ago - Stars: 181 - Forks: 48
