GitHub topics: gemm

Repositories

daedalus/AIscripts

AI-generated scripts based on white papers and publications

Language: Jupyter Notebook - Size: 34.2 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

ROCm/Tensile

Stretching GPU performance for GEMMs and tensor contractions.

Language: Python - Size: 95 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 235 - Forks: 158

OpenNMT/CTranslate2

Fast inference engine for Transformer models

Language: C++ - Size: 14.5 MB - Last synced at: 1 day ago - Pushed at: 15 days ago - Stars: 3,753 - Forks: 350

salykova/sgemm.c

Multi-Threaded FP32 Matrix Multiplication on x86 CPUs

Language: C - Size: 2.82 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 347 - Forks: 22

JoeruCodes/CUDA-GEMM-kernel

My attempt at making a GEMM kernel...

Language: Cuda - Size: 67.4 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0

KarhouTam/cuda-kernels

Some common CUDA kernel implementations (Not the fastest).

Language: Cuda - Size: 57.6 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 17 - Forks: 1

coderonion/awesome-cuda-and-hpc

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

Size: 54.7 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 249 - Forks: 29

ROCm/hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

Language: Assembly - Size: 875 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 90 - Forks: 117

CNugteren/CLBlast

Tuned OpenCL BLAS

Language: C++ - Size: 6.7 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,096 - Forks: 204

xlite-dev/CUDA-Learn-Notes

📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.

Language: Cuda - Size: 262 MB - Last synced at: 9 days ago - Pushed at: 11 days ago - Stars: 3,433 - Forks: 367

cp2k/dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library

Language: Fortran - Size: 608 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 142 - Forks: 48

flame/how-to-optimize-gemm

Language: C - Size: 2.18 MB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 1,855 - Forks: 356

enp1s0/ozIMMU

FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

Language: Cuda - Size: 186 KB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 59 - Forks: 5

Bruce-Lee-LY/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

Language: Cuda - Size: 459 KB - Last synced at: 16 days ago - Pushed at: 8 months ago - Stars: 60 - Forks: 5

eth-cscs/spla

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

Language: C++ - Size: 929 KB - Last synced at: 10 days ago - Pushed at: 10 months ago - Stars: 29 - Forks: 7

flame/blislab

BLISlab: A Sandbox for Optimizing GEMM

Language: C - Size: 6.8 MB - Last synced at: 20 days ago - Pushed at: almost 4 years ago - Stars: 513 - Forks: 107

Bruce-Lee-LY/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.

Language: Cuda - Size: 1.1 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 379 - Forks: 77

snalo/SPOGA

Repo for the SPOGA Accelerator - Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels

Language: Python - Size: 86.9 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

cyrusmsk/gemm_apple

GEMM on Apple Silicon

Language: Python - Size: 7.81 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 3 - Forks: 0

Serge45/amdgpu-arch-gemm

Yet another AMD GCN assembly generator for GEMM

Language: C++ - Size: 118 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

LRZ-BADW/OMMOP

OpenMP Matrix Multiplication Offloading Playground

Language: C++ - Size: 31.3 KB - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

Bruce-Lee-LY/cutlass_gemm

Multiple GEMM operators built with CUTLASS to support LLM inference.

Language: C++ - Size: 2.14 MB - Last synced at: 10 days ago - Pushed at: 7 months ago - Stars: 17 - Forks: 2

enp1s0/cuMpSGEMM

Fast SGEMM emulation on Tensor Cores

Language: Cuda - Size: 476 KB - Last synced at: 2 days ago - Pushed at: 2 months ago - Stars: 10 - Forks: 1

fattorib/thunderkittens-simple-gemm

Simple Tensorcore GEMM in ThunderKittens

Language: Cuda - Size: 13.7 KB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

iVishalr/GEMM

Fast matrix multiplication implementation in the C programming language. This matrix multiplication algorithm is similar to what NumPy uses to compute dot products.

Language: C - Size: 12.7 KB - Last synced at: 20 days ago - Pushed at: almost 4 years ago - Stars: 31 - Forks: 4

BoooC/CNN-Accelerator-Based-on-Eyeriss-v2

A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks

Language: Verilog - Size: 156 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 45 - Forks: 3

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

Language: Nim - Size: 3.65 MB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 285 - Forks: 14

ArcAII/chat

Xode Open Source Code assistant

Language: JavaScript - Size: 35.3 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.

Language: Cuda - Size: 1.09 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 297 - Forks: 46

TeamBipartite/csc485b-202409-a4

High-throughput data-parallel GEMM implementations in CUDA using CUDA cores and Tensor Cores

Language: C++ - Size: 834 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

aredden/torch-cublas-hgemm

PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu

Language: Cuda - Size: 45.9 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 41 - Forks: 3

yester31/CUDA_EX

CUDA kernel functions

Language: Cuda - Size: 92.9 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 2

nirw4nna/YAMI

Yet Another Machine Inference framework

Language: C++ - Size: 794 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

p-anastas/PARALiA-GEMMex

Language: C++ - Size: 1.58 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

coderonion/moblas

BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.

Language: Mojo - Size: 5.86 KB - Last synced at: 5 days ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

yui0/slibs

Single file libraries for C/C++

Language: C - Size: 12.9 MB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 117 - Forks: 12

Bruce-Lee-LY/cuda_back2back_hgemm

Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

Language: Cuda - Size: 854 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

ZhangGe6/how-to-optimize-playground

High-performance computing (HPC) demos since I was a freshman.

Language: C - Size: 1020 KB - Last synced at: 17 days ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

jiegec/sgemm-optimize

Optimization of sgemm on the Kunpeng platform

Language: C - Size: 1.89 MB - Last synced at: 22 days ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

andylolu2/simpleGEMM

The simplest but fast implementation of matrix multiplication in CUDA.

Language: Cuda - Size: 92.8 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 26 - Forks: 2

dev0x13/gemm-benchmark-2023

Benchmarks for some modern (2023) high-performance floating-point GEMM implementations compared to Mojo language

Language: Mojo - Size: 688 KB - Last synced at: 17 days ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

CoffeeBeforeArch/mmul

Serial and parallel implementations of matrix multiplication

Language: C++ - Size: 1.35 MB - Last synced at: 12 months ago - Pushed at: about 4 years ago - Stars: 29 - Forks: 3

a-sidorova/gpu_opencl_cource

Course Programming on new Architecture-1 (GPU), autumn 2021

Language: C++ - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

martins0n/matmul

Matrix-matrix multiplication implementations benchmarking

Language: Rust - Size: 43.9 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

xziya/gemm-opt

Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.

Language: C++ - Size: 39.1 KB - Last synced at: 5 months ago - Pushed at: over 3 years ago - Stars: 8 - Forks: 0

merledu/magma-si

Matrix Accelerator Generator for GeMM Operations based on SIGMA Architecture in CHISEL HDL

Language: Scala - Size: 46.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 3

jhson989/fast-conv

Fast Convolution Implementation via CUDA

Language: Cuda - Size: 195 KB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

junyoung1992/OpenCL-GEMM

GEMM Optimization

Language: C - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

szagoruyko/openai-gemm.pytorch

PyTorch bindings for openai-gemm

Language: Python - Size: 1.95 KB - Last synced at: 17 days ago - Pushed at: about 8 years ago - Stars: 20 - Forks: 4

TensorBFS/CuTropicalGEMM.jl

The fastest Tropical number matrix multiplication on GPU

Language: Julia - Size: 1.46 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F

Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.

Language: C - Size: 3.33 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 65 - Forks: 16

blackccpie/fastconv

fast 2D convolution implementation benchmark

Language: C++ - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 6 - Forks: 2

DongqiShen/iLLM

Implementing LLM from scratch. (Developing...)

Language: C - Size: 24.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

hma02/cublasHgemm-P100

Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm

Language: Cuda - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 35 - Forks: 11

yui0/ugemm

GEMM

Language: C - Size: 103 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 3

xylcbd/gemm_base

gemm baseline code.

Language: C++ - Size: 4.88 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 2 - Forks: 0

mz24cn/gemm_optimization

This repository targets performance optimization of the OpenCL gemm function. It compares the sgemm performance of several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendor hardware, and operating systems. Ready-to-use x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS).

Language: C - Size: 87.1 MB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 10 - Forks: 5

Related topics

gemm 71 cuda 27 matrix-multiplication 17 gpu 15 blas 14 cublas 11 opencl 8 gemm-optimization 7 hpc 7 cpp 6 convolution 6 nvidia 5 openmp 5 pytorch 5 mkl 5 simd 5 cuda-kernels 5 cuda-programming 5 deep-learning 4 hgemm 4 linear-algebra 4 matrix-multiply 4 machine-learning 4 tensor-core 4 sgemm 4 llm 3 blis 3 cutlass 3 c 3 gpgpu 3 avx 3 assembly 3 matmul 3 math 3 parallel-computing 3 accelerator 3 gemv 2 clblas 2 tvm 2 rocm 2 cudnn 2 x86 2 float16 2 llm-inference 2 optimization 2 mojo 2 tensor 2 openblas 2 parallel 2 im2col 2 convolutional-neural-networks 2 half-precision 2 glsl 2 benchmark 2 single-header-lib 2 tensorcores 2 tensorcore 2 mixed-precision 2 code-optimization 2 sparse-matrix 2 mpi 2 ai 2 gpu-computing 2 hip 2 python 2 cpu 2 amd 2 benchmarking 2 deep-neural-networks 2 avx2 2 alsa 1 ascii 1 audio 1 transpose 1 codec 1 encoder 1 scan 1 flac 1 reduce 1 xnor-net 1 kms 1 xnor-convolutions 1 m4a 1 mp3 1 pytorch-extension 1 bloomfilter 1 math-library 1 rust-lang 1 performance 1 autotuning 1 gpus 1 multi-gpu 1 uncut-gemms 1 eigen 1 fortran 1 gonum 1 lapack 1 rust 1 fermi 1 numpy 1