Topic: "gemm"
OpenNMT/CTranslate2
Fast inference engine for Transformer models
Language: C++ - Size: 14.5 MB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 3,810 - Forks: 358

flame/how-to-optimize-gemm
Language: C - Size: 2.18 MB - Last synced at: 6 days ago - Pushed at: almost 2 years ago - Stars: 1,874 - Forks: 357

CNugteren/CLBlast
Tuned OpenCL BLAS
Language: C++ - Size: 6.7 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 1,103 - Forks: 207

flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
Language: C - Size: 6.8 MB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 513 - Forks: 107

Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions (an illustrative WMMA sketch follows below).
Language: Cuda - Size: 1.1 MB - Last synced at: 3 days ago - Pushed at: 9 months ago - Stars: 414 - Forks: 79
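
The WMMA path this repository covers is built from warp-level 16x16x16 fragment multiplies. The following is a minimal illustrative sketch, not code from the repo: it assumes row-major half inputs, M/N/K divisible by 16, compute capability 7.0+, one warp (32 threads) per block, a grid of (N/16, M/16) blocks, and wmma_hgemm_tile is a hypothetical name.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (half inputs, float accumulation).
// Shared-memory staging, swizzling, and pipelining (the parts such repos actually
// optimize) are deliberately omitted.
__global__ void wmma_hgemm_tile(const half* A, const half* B, float* C,
                                int M, int N, int K) {
    int tileM = blockIdx.y;  // 16-row tile index of C
    int tileN = blockIdx.x;  // 16-column tile index of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 fragments of A and B and issue one tensor-core MMA.
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```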

salykova/sgemm.c
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs (an illustrative blocked OpenMP sketch follows below)
Language: C - Size: 2.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 347 - Forks: 22
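
As a point of reference for what such CPU matmul write-ups start from, here is a minimal cache-blocked, OpenMP-parallel SGEMM sketch (host-only code). The function name sgemm_blocked and the block sizes MC/NC/KC are illustrative, not taken from the repo; it assumes row-major matrices and compilation with OpenMP enabled (e.g. -fopenmp). Packing and SIMD micro-kernels, which the repo focuses on, are left out.

```cuda
#include <stddef.h>

#define MC 256   // rows of A kept hot in cache per block
#define NC 256   // columns of B per block
#define KC 128   // slice of the shared dimension

// C[M x N] += A[M x K] * B[K x N], all row-major.
void sgemm_blocked(int M, int N, int K,
                   const float* A, const float* B, float* C) {
    // Each thread owns disjoint (i0, j0) tiles of C, so no synchronization is needed.
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i0 = 0; i0 < M; i0 += MC)
        for (int j0 = 0; j0 < N; j0 += NC)
            for (int k0 = 0; k0 < K; k0 += KC) {
                int iMax = i0 + MC < M ? i0 + MC : M;
                int jMax = j0 + NC < N ? j0 + NC : N;
                int kMax = k0 + KC < K ? k0 + KC : K;
                for (int i = i0; i < iMax; ++i)
                    for (int k = k0; k < kMax; ++k) {
                        float a = A[(size_t)i * K + k];
                        for (int j = j0; j < jMax; ++j)
                            C[(size_t)i * N + j] += a * B[(size_t)k * N + j];
                    }
            }
}
```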

yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
Language: Cuda - Size: 1.09 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 297 - Forks: 46

mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Language: Nim - Size: 3.65 MB - Last synced at: about 21 hours ago - Pushed at: over 1 year ago - Stars: 285 - Forks: 14

coderonion/awesome-cuda-and-hpc
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Size: 55.7 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 268 - Forks: 31

ROCm/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
Language: Python - Size: 95.1 MB - Last synced at: 3 days ago - Pushed at: 7 days ago - Stars: 241 - Forks: 162

cp2k/dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
Language: Fortran - Size: 618 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 142 - Forks: 48

yui0/slibs
Single file libraries for C/C++
Language: C - Size: 12.9 MB - Last synced at: 20 days ago - Pushed at: 10 months ago - Stars: 121 - Forks: 11

ROCm/hipBLASLt
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
Language: Assembly - Size: 1010 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 96 - Forks: 134

enp1s0/ozIMMU
FP64-equivalent GEMM via INT8 tensor cores using the Ozaki scheme (a high-level sketch of the splitting idea follows below).
Language: Cuda - Size: 186 KB - Last synced at: about 6 hours ago - Pushed at: 2 months ago - Stars: 65 - Forks: 5
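
For orientation, the splitting idea behind this kind of FP64 emulation can be summarized as below. This is a hedged, high-level sketch of the Ozaki scheme as commonly described, not the repo's exact formulation; the slice count s and the diagonal scaling matrices D are illustrative symbols.

```latex
% Each FP64 matrix is split into s integer "slice" matrices with diagonal scalings,
% chosen so that every slice-pair product is exact in INT8 x INT8 -> INT32 arithmetic.
A \approx \sum_{p=1}^{s} D_A^{(p)} \hat{A}^{(p)}, \qquad
B \approx \sum_{q=1}^{s} \hat{B}^{(q)} D_B^{(q)}, \qquad
\hat{A}^{(p)}, \hat{B}^{(q)} \ \text{integer matrices with entries in } [-127, 127]

% The product is recombined from exact integer partial products in FP64;
% accuracy is controlled by the number of slices s.
AB \approx \sum_{p=1}^{s} \sum_{q=1}^{s}
    D_A^{(p)} \big( \hat{A}^{(p)} \hat{B}^{(q)} \big) D_B^{(q)}
```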

yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually surpassing Intel MKL performance, even under multithreading (an illustrative AVX-512 micro-kernel sketch follows below).
Language: C - Size: 3.33 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 65 - Forks: 16
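
The end point of such stepwise DGEMM tutorials is usually a small FMA micro-kernel. The sketch below is illustrative only, not the repo's kernel: dgemm_ukernel_8x8 is a hypothetical name, the 8x8 register block is one common choice, and it assumes AVX-512F, a row-major A panel with leading dimension K, and a packed row-major B panel with leading dimension 8. The surrounding packing, blocking, and prefetching are omitted.

```cuda
#include <immintrin.h>   // host-only code: AVX-512 intrinsics

// C[8 x 8] += A[8 x K] * B[K x 8]; C has leading dimension ldc.
void dgemm_ukernel_8x8(int K, const double* A, const double* B,
                       double* C, int ldc) {
    __m512d c[8];
    for (int i = 0; i < 8; ++i)
        c[i] = _mm512_loadu_pd(C + i * ldc);          // keep 8 rows of C in registers

    for (int k = 0; k < K; ++k) {
        __m512d b = _mm512_loadu_pd(B + k * 8);       // one row of the packed B panel
        for (int i = 0; i < 8; ++i) {
            __m512d a = _mm512_set1_pd(A[i * K + k]); // broadcast A(i, k)
            c[i] = _mm512_fmadd_pd(a, b, c[i]);       // C(i, :) += A(i, k) * B(k, :)
        }
    }
    for (int i = 0; i < 8; ++i)
        _mm512_storeu_pd(C + i * ldc, c[i]);
}
```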

Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Language: Cuda - Size: 459 KB - Last synced at: 14 days ago - Pushed at: 9 months ago - Stars: 61 - Forks: 5

BoooC/CNN-Accelerator-Based-on-Eyeriss-v2
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
Language: Verilog - Size: 156 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 45 - Forks: 3

aredden/torch-cublas-hgemm
PyTorch half-precision GEMM library with optional fused bias and optional ReLU/GELU.
Language: Cuda - Size: 45.9 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 41 - Forks: 3

hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm.
Language: Cuda - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 35 - Forks: 11

iVishalr/GEMM
Fast matrix multiplication implementation in the C programming language, using an algorithm similar to what NumPy uses to compute dot products.
Language: C - Size: 12.7 KB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 31 - Forks: 4

eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
Language: C++ - Size: 929 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 29 - Forks: 7

CoffeeBeforeArch/mmul
Serial and parallel implementations of matrix multiplication
Language: C++ - Size: 1.35 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 29 - Forks: 3

andylolu2/simpleGEMM
The simplest, yet fast, implementation of matrix multiplication in CUDA (a naive baseline sketch follows below).
Language: Cuda - Size: 92.8 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 26 - Forks: 2
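
The usual starting point for repos like this is the one-thread-per-output-element kernel, handy as a correctness reference and as the baseline for speedup numbers. A minimal illustrative sketch, not the repo's code; sgemm_naive is a hypothetical name and row-major layout is assumed.

```cuda
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];   // no tiling, no shared memory
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
// Typical launch: dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16);
```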

szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
Language: Python - Size: 1.95 KB - Last synced at: about 2 months ago - Pushed at: over 8 years ago - Stars: 20 - Forks: 4

hma02/cublasgemm-benchmark
Code for benchmarking GPU GEMM performance with cublasSgemm and cublasHgemm (a timing sketch follows below).
Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 19 - Forks: 16
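
A typical shape for such a benchmark is sketched below: call cublasSgemm a few times, bracket the calls with CUDA events, and convert the elapsed time to GFLOP/s. This is an illustrative sketch, not the repo's code; the matrix size, iteration count, and absence of error checking are simplifications, and it must be linked with -lcublas.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 4096, iters = 10;
    const size_t bytes = sizeof(float) * n * n;
    float *A, *B, *C;
    cudaMalloc(&A, bytes);  cudaMemset(A, 0, bytes);
    cudaMalloc(&B, bytes);  cudaMemset(B, 0, bytes);
    cudaMalloc(&C, bytes);  cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // cuBLAS assumes column-major storage; the layout choice does not affect timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);          // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gflops = 2.0 * n * n * n * iters / (ms * 1e6);  // 2*n^3 FLOPs per GEMM
    printf("cublasSgemm %dx%d: %.1f GFLOP/s\n", n, n, gflops);

    cublasDestroy(handle);
    return 0;
}
```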

KarhouTam/cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
Language: Cuda - Size: 57.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 17 - Forks: 1

Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators constructed with CUTLASS to support LLM inference.
Language: C++ - Size: 2.14 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

XiaoSong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance.
Language: C++ - Size: 73.2 KB - Last synced at: 13 days ago - Pushed at: about 3 years ago - Stars: 17 - Forks: 0

enp1s0/cuMpSGEMM
Fast SGEMM emulation on Tensor Cores
Language: Cuda - Size: 476 KB - Last synced at: about 6 hours ago - Pushed at: 3 months ago - Stars: 12 - Forks: 1

Bruce-Lee-LY/cuda_back2back_hgemm
Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Language: Cuda - Size: 854 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

mz24cn/gemm_optimization
The repository targets performance optimization of the OpenCL GEMM function. It compares several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, vendor hardware, and operating systems. Out-of-the-box x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS), ready to use.
Language: C - Size: 87.1 MB - Last synced at: almost 2 years ago - Pushed at: about 6 years ago - Stars: 10 - Forks: 5

merledu/magma-si
Matrix accelerator generator for GEMM operations based on the SIGMA architecture, written in Chisel HDL
Language: Scala - Size: 46.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 3

xziya/gemm-opt
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
Language: C++ - Size: 39.1 KB - Last synced at: 6 months ago - Pushed at: almost 4 years ago - Stars: 8 - Forks: 0

yui0/ugemm
GEMM
Language: C - Size: 103 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 7 - Forks: 3

blackccpie/fastconv
Fast 2D convolution implementation benchmark
Language: C++ - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 6 - Forks: 2

yester31/CUDA_EX
CUDA kernel functions
Language: Cuda - Size: 92.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 2

coderonion/moblas
BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.
Language: Mojo - Size: 5.86 KB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

dev0x13/gemm-benchmark-2023
Benchmarks of some modern (2023) high-performance floating-point GEMM implementations compared against the Mojo language
Language: Mojo - Size: 688 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

cyrusmsk/gemm_apple
GEMM on Apple Silicon
Language: Python - Size: 7.81 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0

ZhangGe6/how-to-optimize-playground
High-performance computing (HPC) demos written since I was a freshman.
Language: C - Size: 1020 KB - Last synced at: about 2 months ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

yester31/OpenCL_EX
Development of deep learning inference code using OpenCL kernel functions.
Language: C++ - Size: 27.7 MB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

yester31/GEMM_Conv2d_CUDA
CUDA GEMM-based convolution implementation
Language: C++ - Size: 564 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

KaiserKlayton/lpa_cnn
Low Precision Arithmetic for Convolutional Neural Network Inference
Language: C++ - Size: 263 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 3 - Forks: 0

JoeruCodes/CUDA-GEMM-kernel
My attempt at making a GEMM kernel...
Language: Cuda - Size: 67.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

fattorib/thunderkittens-simple-gemm
Simple Tensorcore GEMM in ThunderKittens
Language: Cuda - Size: 13.7 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

jiegec/sgemm-optimize
Optimization of SGEMM on the Kunpeng platform
Language: C - Size: 1.89 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

jhson989/fast-conv
Fast convolution implementation via CUDA
Language: Cuda - Size: 195 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

scocoyash/Convolution-To-Gemm
My experiments with convolution
Language: C++ - Size: 23.4 KB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 1

xylcbd/gemm_base
GEMM baseline code.
Language: C++ - Size: 4.88 KB - Last synced at: almost 2 years ago - Pushed at: over 7 years ago - Stars: 2 - Forks: 0

govindansriram/sm89-kernels
SM89 Optimized CUDA Kernels
Language: Cuda - Size: 65.4 KB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 1 - Forks: 0

ArcAII/chat
Xode Open Source Code assistant
Language: JavaScript - Size: 35.3 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

daedalus/AIscripts
Scripts created with AI assistance from white papers and publications
Language: Jupyter Notebook - Size: 910 KB - Last synced at: 23 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

p-anastas/PARALiA-GEMMex
Language: C++ - Size: 1.58 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

TensorBFS/CuTropicalGEMM.jl
The fastest tropical-number matrix multiplication on GPU (a naive max-plus kernel sketch follows below)
Language: Julia - Size: 1.46 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0
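
In the tropical (max-plus) semiring, GEMM keeps its loop structure but swaps the scalar operations: "multiply" becomes addition and "add" becomes max, so C[i][j] = max_k (A[i][k] + B[k][j]). A naive illustrative CUDA sketch of that kernel follows; it is not taken from the Julia package, and tropical_gemm is a hypothetical name.

```cuda
#include <float.h>

__global__ void tropical_gemm(int M, int N, int K,
                              const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float best = -FLT_MAX;                          // tropical "zero" is -infinity
        for (int k = 0; k < K; ++k) {
            float v = A[row * K + k] + B[k * N + col];  // tropical "multiply" is +
            best = fmaxf(best, v);                      // tropical "add" is max
        }
        C[row * N + col] = best;
    }
}
```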

LRZ-BADW/OMMOP
OpenMP Matrix Multiplication Offloading Playground
Language: C++ - Size: 31.3 KB - Last synced at: about 15 hours ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

pminhtam/xnor_conv_pytorch_extension
XNOR-Net with binary conv2d kernels using an XNOR GEMM op, supporting both CPU and GPU (an illustrative bit-packed GEMM sketch follows below).
Language: C - Size: 66.4 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0
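
The core trick of XNOR GEMM: values in {-1, +1} are bit-packed (bit 1 encoding +1), so the dot product of two 32-element chunks reduces to 32 - 2 * popcount(a XOR b). A minimal illustrative CUDA sketch follows, not the extension's code; xnor_gemm is a hypothetical name and K is assumed to be a multiple of 32.

```cuda
__global__ void xnor_gemm(int M, int N, int Kwords,  // Kwords = K / 32
                          const unsigned int* A,     // M x Kwords, bit-packed rows
                          const unsigned int* B,     // Kwords x N, word k of column j at B[k * N + j]
                          float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        int mismatches = 0;
        for (int k = 0; k < Kwords; ++k)
            mismatches += __popc(A[row * Kwords + k] ^ B[k * N + col]);
        // Each word contributes 32 - 2 * popcount(xor); sum over all Kwords words.
        C[row * N + col] = (float)(32 * Kwords - 2 * mismatches);
    }
}
```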

jerinphilip/MozIntGemm
Wrapper around intgemm (x86_64) and ruy (ARM) that switches between the two based on architecture, providing a fast matrix multiplication backend for Mozilla Firefox's translation feature.
Language: C++ - Size: 234 KB - Last synced at: 5 days ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 2

a-sidorova/gpu_opencl_cource
Coursework for Programming on New Architectures 1 (GPU), autumn 2021
Language: C++ - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

andreytkachenko/yarblas
Yet another Rust BLAS
Language: Rust - Size: 32.2 KB - Last synced at: 2 months ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

adelj88/rocm_wmma_gemm
WMMA GEMM in ROCm for RDNA GPUs
Language: C++ - Size: 44.9 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

snalo/SPOGA
Repo for the SPOGA Accelerator - Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels
Language: Python - Size: 86.9 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Serge45/amdgpu-arch-gemm
Yet another AMD GCN assembly generator for GEMM
Language: C++ - Size: 118 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

TeamBipartite/csc485b-202409-a4
High-throughput data-parallel GEMM implementations in CUDA using CUDA cores and tensor cores
Language: C++ - Size: 834 KB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

nirw4nna/YAMI
Yet Another Machine Inference framework
Language: C++ - Size: 794 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

DongqiShen/iLLM
Implementing an LLM from scratch (in development).
Language: C - Size: 24.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

digital-nomad-cheng/matmul_cuda_kernel_tvm
Generate optimized MatMul CUDA kernels automatically using the TVM auto-scheduler
Language: Jupyter Notebook - Size: 48.8 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

junyoung1992/OpenCL-GEMM
GEMM Optimization
Language: C - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

martins0n/matmul
Benchmarking of matrix-matrix multiplication implementations
Language: Rust - Size: 43.9 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

PhuNH/hpc-aa
High Performance Computing - Algorithms and Applications Course in WS18-19 at TUM
Language: C++ - Size: 634 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

koallen/gemm-optimization
My experiments on optimizing GEMM
Language: C - Size: 47.9 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 1

riskybacon/mnist_arma_blas
Language: C++ - Size: 49.8 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0
