GitHub topics: gemm
daedalus/AIscripts
Scripts created with AI of white papers and publications
Language: Jupyter Notebook - Size: 34.2 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

ROCm/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
Language: Python - Size: 95 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 235 - Forks: 158

OpenNMT/CTranslate2
Fast inference engine for Transformer models
Language: C++ - Size: 14.5 MB - Last synced at: 1 day ago - Pushed at: 15 days ago - Stars: 3,753 - Forks: 350

salykova/sgemm.c
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
Language: C - Size: 2.82 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 347 - Forks: 22

JoeruCodes/CUDA-GEMM-kernel
My attempt at making a GEMM kernel...
Language: Cuda - Size: 67.4 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0

KarhouTam/cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
Language: Cuda - Size: 57.6 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 17 - Forks: 1

coderonion/awesome-cuda-and-hpc
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Size: 54.7 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 249 - Forks: 29

ROCm/hipBLASLt
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
Language: Assembly - Size: 875 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 90 - Forks: 117

CNugteren/CLBlast
Tuned OpenCL BLAS
Language: C++ - Size: 6.7 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,096 - Forks: 204

xlite-dev/CUDA-Learn-Notes
📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.
Language: Cuda - Size: 262 MB - Last synced at: 9 days ago - Pushed at: 11 days ago - Stars: 3,433 - Forks: 367

cp2k/dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
Language: Fortran - Size: 608 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 142 - Forks: 48

flame/how-to-optimize-gemm
Language: C - Size: 2.18 MB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 1,855 - Forks: 356

enp1s0/ozIMMU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Language: Cuda - Size: 186 KB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 59 - Forks: 5

Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Language: Cuda - Size: 459 KB - Last synced at: 16 days ago - Pushed at: 8 months ago - Stars: 60 - Forks: 5

eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
Language: C++ - Size: 929 KB - Last synced at: 10 days ago - Pushed at: 10 months ago - Stars: 29 - Forks: 7

flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
Language: C - Size: 6.8 MB - Last synced at: 20 days ago - Pushed at: almost 4 years ago - Stars: 513 - Forks: 107

Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Language: Cuda - Size: 1.1 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 379 - Forks: 77

snalo/SPOGA
Repo for the SPOGA Accelerator - Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels
Language: Python - Size: 86.9 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

cyrusmsk/gemm_apple
GEMM on Apple Silicon
Language: Python - Size: 7.81 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 3 - Forks: 0

Serge45/amdgpu-arch-gemm
Yet another AMD GCN assembly generator for GEMM
Language: C++ - Size: 118 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

LRZ-BADW/OMMOP
OpenMP Matrix Multiplication Offloading Playground
Language: C++ - Size: 31.3 KB - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
Language: C++ - Size: 2.14 MB - Last synced at: 10 days ago - Pushed at: 7 months ago - Stars: 17 - Forks: 2

enp1s0/cuMpSGEMM
Fast SGEMM emulation on Tensor Cores
Language: Cuda - Size: 476 KB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 10 - Forks: 1

fattorib/thunderkittens-simple-gemm
Simple Tensorcore GEMM in ThunderKittens
Language: Cuda - Size: 13.7 KB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

iVishalr/GEMM
Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.
Language: C - Size: 12.7 KB - Last synced at: 20 days ago - Pushed at: almost 4 years ago - Stars: 31 - Forks: 4

BoooC/CNN-Accelerator-Based-on-Eyeriss-v2
A Flexible and Energy Efficient Accelerator For Sparse Convolution Neural Network
Language: Verilog - Size: 156 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 45 - Forks: 3

mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Language: Nim - Size: 3.65 MB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 285 - Forks: 14

ArcAII/chat
Xode Open Source Code assistant
Language: JavaScript - Size: 35.3 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Language: Cuda - Size: 1.09 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 297 - Forks: 46

TeamBipartite/csc485b-202409-a4
High-throughput data-parallel GEMM implementations in CUDA using CUDA cores and Tensor Cores
Language: C++ - Size: 834 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

aredden/torch-cublas-hgemm
PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu
Language: Cuda - Size: 45.9 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 41 - Forks: 3

yester31/CUDA_EX
CUDA kernel functions
Language: Cuda - Size: 92.9 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 2

nirw4nna/YAMI
Yet Another Machine Inference framework
Language: C++ - Size: 794 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

p-anastas/PARALiA-GEMMex
Language: C++ - Size: 1.58 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

coderonion/moblas
BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.
Language: Mojo - Size: 5.86 KB - Last synced at: 5 days ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

yui0/slibs
Single file libraries for C/C++
Language: C - Size: 12.9 MB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 117 - Forks: 12

Bruce-Lee-LY/cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Language: Cuda - Size: 854 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

ZhangGe6/how-to-optimize-playground
High-performance computing (HPC) demos since I was a freshman.
Language: C - Size: 1020 KB - Last synced at: 17 days ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

jiegec/sgemm-optimize
Optimization of sgemm on the Kunpeng platform
Language: C - Size: 1.89 MB - Last synced at: 22 days ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

andylolu2/simpleGEMM
The simplest but fast implementation of matrix multiplication in CUDA.
Language: Cuda - Size: 92.8 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 26 - Forks: 2

dev0x13/gemm-benchmark-2023
Benchmarks for some modern (2023) high-performance floating-point GEMM implementations compared to Mojo language
Language: Mojo - Size: 688 KB - Last synced at: 17 days ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

CoffeeBeforeArch/mmul
Serial and parallel implementations of matrix multiplication
Language: C++ - Size: 1.35 MB - Last synced at: 12 months ago - Pushed at: about 4 years ago - Stars: 29 - Forks: 3

a-sidorova/gpu_opencl_cource
Course Programming on new Architecture-1 (GPU), autumn 2021
Language: C++ - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

martins0n/matmul
Matrix-matrix multiplication implementations benchmarking
Language: Rust - Size: 43.9 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

xziya/gemm-opt
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
Language: C++ - Size: 39.1 KB - Last synced at: 5 months ago - Pushed at: over 3 years ago - Stars: 8 - Forks: 0

merledu/magma-si
Matrix Accelerator Generator for GeMM Operations based on SIGMA Architecture in CHISEL HDL
Language: Scala - Size: 46.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 3

jhson989/fast-conv
Fast Convolution Implementation via CUDA
Language: Cuda - Size: 195 KB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

junyoung1992/OpenCL-GEMM
GEMM Optimization
Language: C - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
Language: Python - Size: 1.95 KB - Last synced at: 17 days ago - Pushed at: about 8 years ago - Stars: 20 - Forks: 4

TensorBFS/CuTropicalGEMM.jl
The fastest Tropical number matrix multiplication on GPU
Language: Julia - Size: 1.46 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
Language: C - Size: 3.33 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 65 - Forks: 16

blackccpie/fastconv
fast 2D convolution implementation benchmark
Language: C++ - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 6 - Forks: 2

DongqiShen/iLLM
Implementing LLM from scratch. (Developing...)
Language: C - Size: 24.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm
Language: Cuda - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 35 - Forks: 11

yui0/ugemm
GEMM
Language: C - Size: 103 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 3

xylcbd/gemm_base
gemm baseline code.
Language: C++ - Size: 4.88 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 2 - Forks: 0

mz24cn/gemm_optimization
This repository targets performance optimization of the OpenCL GEMM function. It compares the sgemm performance of several libraries (clBLAS, clBLAST, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendors' hardware, and operating systems. Ready-to-use x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS).
Language: C - Size: 87.1 MB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 10 - Forks: 5

XiaoSong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance
Language: C++ - Size: 73.2 KB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 4 - Forks: 0

jerinphilip/MozIntGemm
Wrapper around intgemm (x86_64) and ruy (ARM) to switch between both based on architecture and provide a fast matrix multiplication backend for Mozilla Firefox's translation feature.
Language: C++ - Size: 234 KB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 2

hma02/cublasgemm-benchmark
code for benchmarking GPU performance based on cublasSgemm and cublasHgemm
Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 19 - Forks: 16

digital-nomad-cheng/matmul_cuda_kernel_tvm
Automatically generate an optimized MatMul CUDA kernel using TVM auto-scheduling
Language: Jupyter Notebook - Size: 48.8 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

pminhtam/xnor_conv_pytorch_extension
XNOR-Net with binary conv2d kernels using an XNOR GEMM op; supports both CPU and GPU.
Language: C - Size: 66.4 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

XiaoSong9905/cuda-v100-kernels
CUDA Kernels on V100
Language: Cuda - Size: 29.3 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

yester31/OpenCL_EX
Development of deep learning inference code using OpenCL kernel functions.
Language: C++ - Size: 27.7 MB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

yester31/GEMM_Conv2d_CUDA
CUDA Gemm Convolution implementation
Language: C++ - Size: 564 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 3 - Forks: 0

scocoyash/Convolution-To-Gemm
My experiments with convolution
Language: C++ - Size: 23.4 KB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 1

KaiserKlayton/lpa_cnn
Low Precision Arithmetic for Convolutional Neural Network Inference
Language: C++ - Size: 263 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 3 - Forks: 0

PhuNH/hpc-aa
High Performance Computing - Algorithms and Applications Course in WS18-19 at TUM
Language: C++ - Size: 634 KB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

andreytkachenko/yarblas
Yet another rust BLAS
Language: Rust - Size: 32.2 KB - Last synced at: about 1 month ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

koallen/gemm-optimization
My experiments on optimizing GEMM
Language: C - Size: 47.9 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 1

riskybacon/mnist_arma_blas
Language: C++ - Size: 49.8 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0
