Topic: "gemm"
OpenNMT/CTranslate2
Fast inference engine for Transformer models
Language: C++ - Size: 14.5 MB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 3,810 - Forks: 358

flame/how-to-optimize-gemm
Language: C - Size: 2.18 MB - Last synced at: 6 days ago - Pushed at: almost 2 years ago - Stars: 1,874 - Forks: 357

CNugteren/CLBlast
Tuned OpenCL BLAS
Language: C++ - Size: 6.7 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 1,103 - Forks: 207

flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
Language: C - Size: 6.8 MB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 513 - Forks: 107

Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions (an illustrative WMMA sketch follows below).
Language: Cuda - Size: 1.1 MB - Last synced at: 3 days ago - Pushed at: 9 months ago - Stars: 414 - Forks: 79
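
The WMMA path this repository covers is built from warp-level 16x16x16 fragment multiplies. The following is a minimal illustrative sketch, not code from the repo: it assumes row-major half inputs, M/N/K divisible by 16, compute capability 7.0+, one warp (32 threads) per block, a grid of (N/16, M/16) blocks, and wmma_hgemm_tile is a hypothetical name.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (half inputs, float accumulation).
// Shared-memory staging, swizzling, and pipelining (the parts such repos actually
// optimize) are deliberately omitted.
__global__ void wmma_hgemm_tile(const half* A, const half* B, float* C,
                                int M, int N, int K) {
    int tileM = blockIdx.y;  // 16-row tile index of C
    int tileN = blockIdx.x;  // 16-column tile index of C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 fragments of A and B and issue one tensor-core MMA.
        wmma::load_matrix_sync(aFrag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + tileN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, cFrag, N,
                            wmma::mem_row_major);
}
```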

salykova/sgemm.c
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs (an illustrative blocked OpenMP sketch follows below)
Language: C - Size: 2.82 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 347 - Forks: 22
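
As a point of reference for what such CPU matmul write-ups start from, here is a minimal cache-blocked, OpenMP-parallel SGEMM sketch (host-only code). The function name sgemm_blocked and the block sizes MC/NC/KC are illustrative, not taken from the repo; it assumes row-major matrices and compilation with OpenMP enabled (e.g. -fopenmp). Packing and SIMD micro-kernels, which the repo focuses on, are left out.

```cuda
#include <stddef.h>

#define MC 256   // rows of A kept hot in cache per block
#define NC 256   // columns of B per block
#define KC 128   // slice of the shared dimension

// C[M x N] += A[M x K] * B[K x N], all row-major.
void sgemm_blocked(int M, int N, int K,
                   const float* A, const float* B, float* C) {
    // Each thread owns disjoint (i0, j0) tiles of C, so no synchronization is needed.
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i0 = 0; i0 < M; i0 += MC)
        for (int j0 = 0; j0 < N; j0 += NC)
            for (int k0 = 0; k0 < K; k0 += KC) {
                int iMax = i0 + MC < M ? i0 + MC : M;
                int jMax = j0 + NC < N ? j0 + NC : N;
                int kMax = k0 + KC < K ? k0 + KC : K;
                for (int i = i0; i < iMax; ++i)
                    for (int k = k0; k < kMax; ++k) {
                        float a = A[(size_t)i * K + k];
                        for (int j = j0; j < jMax; ++j)
                            C[(size_t)i * N + j] += a * B[(size_t)k * N + j];
                    }
            }
}
```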

yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernels on NVIDIA GPUs to close-to-cuBLAS performance.
Language: Cuda - Size: 1.09 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 297 - Forks: 46

mratsim/laser
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Language: Nim - Size: 3.65 MB - Last synced at: about 21 hours ago - Pushed at: over 1 year ago - Stars: 285 - Forks: 14

coderonion/awesome-cuda-and-hpc
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Size: 55.7 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 268 - Forks: 31

ROCm/Tensile
Stretching GPU performance for GEMMs and tensor contractions.
Language: Python - Size: 95.1 MB - Last synced at: 3 days ago - Pushed at: 7 days ago - Stars: 241 - Forks: 162

cp2k/dbcsr
DBCSR: Distributed Block Compressed Sparse Row matrix library
Language: Fortran - Size: 618 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 142 - Forks: 48

yui0/slibs
Single file libraries for C/C++
Language: C - Size: 12.9 MB - Last synced at: 20 days ago - Pushed at: 10 months ago - Stars: 121 - Forks: 11

ROCm/hipBLASLt
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
Language: Assembly - Size: 1010 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 96 - Forks: 134

enp1s0/ozIMMU
FP64-equivalent GEMM via INT8 tensor cores using the Ozaki scheme (a high-level sketch of the splitting idea follows below).
Language: Cuda - Size: 186 KB - Last synced at: about 6 hours ago - Pushed at: 2 months ago - Stars: 65 - Forks: 5
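
For orientation, the splitting idea behind this kind of FP64 emulation can be summarized as below. This is a hedged, high-level sketch of the Ozaki scheme as commonly described, not the repo's exact formulation; the slice count s and the diagonal scaling matrices D are illustrative symbols.

```latex
% Each FP64 matrix is split into s integer "slice" matrices with diagonal scalings,
% chosen so that every slice-pair product is exact in INT8 x INT8 -> INT32 arithmetic.
A \approx \sum_{p=1}^{s} D_A^{(p)} \hat{A}^{(p)}, \qquad
B \approx \sum_{q=1}^{s} \hat{B}^{(q)} D_B^{(q)}, \qquad
\hat{A}^{(p)}, \hat{B}^{(q)} \ \text{integer matrices with entries in } [-127, 127]

% The product is recombined from exact integer partial products in FP64;
% accuracy is controlled by the number of slices s.
AB \approx \sum_{p=1}^{s} \sum_{q=1}^{s}
    D_A^{(p)} \big( \hat{A}^{(p)} \hat{B}^{(q)} \big) D_B^{(q)}
```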

yzhaiustc/Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F
Stepwise optimizations of DGEMM on CPU, eventually surpassing Intel MKL performance, even under multithreading (an illustrative AVX-512 micro-kernel sketch follows below).
Language: C - Size: 3.33 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 65 - Forks: 16
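
The end point of such stepwise DGEMM tutorials is usually a small FMA micro-kernel. The sketch below is illustrative only, not the repo's kernel: dgemm_ukernel_8x8 is a hypothetical name, the 8x8 register block is one common choice, and it assumes AVX-512F, a row-major A panel with leading dimension K, and a packed row-major B panel with leading dimension 8. The surrounding packing, blocking, and prefetching are omitted.

```cuda
#include <immintrin.h>   // host-only code: AVX-512 intrinsics

// C[8 x 8] += A[8 x K] * B[K x 8]; C has leading dimension ldc.
void dgemm_ukernel_8x8(int K, const double* A, const double* B,
                       double* C, int ldc) {
    __m512d c[8];
    for (int i = 0; i < 8; ++i)
        c[i] = _mm512_loadu_pd(C + i * ldc);          // keep 8 rows of C in registers

    for (int k = 0; k < K; ++k) {
        __m512d b = _mm512_loadu_pd(B + k * 8);       // one row of the packed B panel
        for (int i = 0; i < 8; ++i) {
            __m512d a = _mm512_set1_pd(A[i * K + k]); // broadcast A(i, k)
            c[i] = _mm512_fmadd_pd(a, b, c[i]);       // C(i, :) += A(i, k) * B(k, :)
        }
    }
    for (int i = 0; i < 8; ++i)
        _mm512_storeu_pd(C + i * ldc, c[i]);
}
```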

Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Language: Cuda - Size: 459 KB - Last synced at: 14 days ago - Pushed at: 9 months ago - Stars: 61 - Forks: 5

BoooC/CNN-Accelerator-Based-on-Eyeriss-v2
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
Language: Verilog - Size: 156 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 45 - Forks: 3

aredden/torch-cublas-hgemm
PyTorch half-precision GEMM library with optional fused bias and optional ReLU/GELU.
Language: Cuda - Size: 45.9 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 41 - Forks: 3

hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm.
Language: Cuda - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 35 - Forks: 11

iVishalr/GEMM
Fast matrix multiplication implementation in the C programming language, using an algorithm similar to what NumPy uses to compute dot products.
Language: C - Size: 12.7 KB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 31 - Forks: 4

eth-cscs/spla
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
Language: C++ - Size: 929 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 29 - Forks: 7

CoffeeBeforeArch/mmul
Serial and parallel implementations of matrix multiplication
Language: C++ - Size: 1.35 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 29 - Forks: 3

andylolu2/simpleGEMM
The simplest, yet fast, implementation of matrix multiplication in CUDA (a naive baseline sketch follows below).
Language: Cuda - Size: 92.8 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 26 - Forks: 2
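
The usual starting point for repos like this is the one-thread-per-output-element kernel, handy as a correctness reference and as the baseline for speedup numbers. A minimal illustrative sketch, not the repo's code; sgemm_naive is a hypothetical name and row-major layout is assumed.

```cuda
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];   // no tiling, no shared memory
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
// Typical launch: dim3 block(16, 16); dim3 grid((N + 15) / 16, (M + 15) / 16);
```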

szagoruyko/openai-gemm.pytorch
PyTorch bindings for openai-gemm
Language: Python - Size: 1.95 KB - Last synced at: about 2 months ago - Pushed at: over 8 years ago - Stars: 20 - Forks: 4

hma02/cublasgemm-benchmark
Code for benchmarking GPU GEMM performance with cublasSgemm and cublasHgemm (a timing sketch follows below).
Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 19 - Forks: 16
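
A typical shape for such a benchmark is sketched below: call cublasSgemm a few times, bracket the calls with CUDA events, and convert the elapsed time to GFLOP/s. This is an illustrative sketch, not the repo's code; the matrix size, iteration count, and absence of error checking are simplifications, and it must be linked with -lcublas.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 4096, iters = 10;
    const size_t bytes = sizeof(float) * n * n;
    float *A, *B, *C;
    cudaMalloc(&A, bytes);  cudaMemset(A, 0, bytes);
    cudaMalloc(&B, bytes);  cudaMemset(B, 0, bytes);
    cudaMalloc(&C, bytes);  cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // cuBLAS assumes column-major storage; the layout choice does not affect timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);          // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gflops = 2.0 * n * n * n * iters / (ms * 1e6);  // 2*n^3 FLOPs per GEMM
    printf("cublasSgemm %dx%d: %.1f GFLOP/s\n", n, n, gflops);

    cublasDestroy(handle);
    return 0;
}
```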

KarhouTam/cuda-kernels
Some common CUDA kernel implementations (Not the fastest).
Language: Cuda - Size: 57.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 17 - Forks: 1

Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators constructed with CUTLASS to support LLM inference.
Language: C++ - Size: 2.14 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

XiaoSong9905/dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance.
Language: C++ - Size: 73.2 KB - Last synced at: 13 days ago - Pushed at: about 3 years ago - Stars: 17 - Forks: 0

enp1s0/cuMpSGEMM
Fast SGEMM emulation on Tensor Cores
Language: Cuda - Size: 476 KB - Last synced at: about 6 hours ago - Pushed at: 3 months ago - Stars: 12 - Forks: 1

Bruce-Lee-LY/cuda_back2back_hgemm
Uses tensor cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Language: Cuda - Size: 854 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

mz24cn/gemm_optimization
The repository targets performance optimization of the OpenCL GEMM function. It compares several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, vendor hardware, and operating systems. Out-of-the-box x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS), ready to use.
Language: C - Size: 87.1 MB - Last synced at: almost 2 years ago - Pushed at: about 6 years ago - Stars: 10 - Forks: 5

merledu/magma-si
Matrix accelerator generator for GEMM operations based on the SIGMA architecture, written in Chisel HDL
Language: Scala - Size: 46.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 3

xziya/gemm-opt
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
Language: C++ - Size: 39.1 KB - Last synced at: 6 months ago - Pushed at: almost 4 years ago - Stars: 8 - Forks: 0

yui0/ugemm
GEMM
Language: C - Size: 103 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 7 - Forks: 3

blackccpie/fastconv
Fast 2D convolution implementation benchmark
Language: C++ - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 6 - Forks: 2

yester31/CUDA_EX
CUDA kernel functions
Language: Cuda - Size: 92.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 2

coderonion/moblas
BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.
Language: Mojo - Size: 5.86 KB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

dev0x13/gemm-benchmark-2023
Benchmarks of some modern (2023) high-performance floating-point GEMM implementations compared against the Mojo language
Language: Mojo - Size: 688 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

cyrusmsk/gemm_apple
GEMM on Apple Silicon
Language: Python - Size: 7.81 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0

ZhangGe6/how-to-optimize-playground
High-performance computing (HPC) demos written since I was a freshman.
Language: C - Size: 1020 KB - Last synced at: about 2 months ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

yester31/OpenCL_EX
Development of deep learning inference code using OpenCL kernel functions.
Language: C++ - Size: 27.7 MB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 0

yester31/GEMM_Conv2d_CUDA
CUDA GEMM-based convolution implementation
Language: C++ - Size: 564 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

KaiserKlayton/lpa_cnn
Low Precision Arithmetic for Convolutional Neural Network Inference
Language: C++ - Size: 263 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 3 - Forks: 0

JoeruCodes/CUDA-GEMM-kernel
My attempt at making a GEMM kernel...
Language: Cuda - Size: 67.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

fattorib/thunderkittens-simple-gemm
Simple Tensorcore GEMM in ThunderKittens
Language: Cuda - Size: 13.7 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

jiegec/sgemm-optimize
Optimization of SGEMM on the Kunpeng platform
Language: C - Size: 1.89 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

jhson989/fast-conv
Fast convolution implementation via CUDA
Language: Cuda - Size: 195 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

scocoyash/Convolution-To-Gemm
My experiments with convolution
Language: C++ - Size: 23.4 KB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 2 - Forks: 1

xylcbd/gemm_base
GEMM baseline code.
Language: C++ - Size: 4.88 KB - Last synced at: almost 2 years ago - Pushed at: over 7 years ago - Stars: 2 - Forks: 0

govindansriram/sm89-kernels
SM89 Optimized CUDA Kernels
Language: Cuda - Size: 65.4 KB - Last synced at: about 8 hours ago - Pushed at: about 9 hours ago - Stars: 1 - Forks: 0

ArcAII/chat
Xode Open Source Code assistant
Language: JavaScript - Size: 35.3 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

daedalus/AIscripts
Scripts created with AI assistance from white papers and publications
Language: Jupyter Notebook - Size: 910 KB - Last synced at: 23 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

p-anastas/PARALiA-GEMMex
Language: C++ - Size: 1.58 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

TensorBFS/CuTropicalGEMM.jl
The fastest tropical-number matrix multiplication on GPU (a naive max-plus kernel sketch follows below)
Language: Julia - Size: 1.46 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0
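
In the tropical (max-plus) semiring, GEMM keeps its loop structure but swaps the scalar operations: "multiply" becomes addition and "add" becomes max, so C[i][j] = max_k (A[i][k] + B[k][j]). A naive illustrative CUDA sketch of that kernel follows; it is not taken from the Julia package, and tropical_gemm is a hypothetical name.

```cuda
#include <float.h>

__global__ void tropical_gemm(int M, int N, int K,
                              const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float best = -FLT_MAX;                          // tropical "zero" is -infinity
        for (int k = 0; k < K; ++k) {
            float v = A[row * K + k] + B[k * N + col];  // tropical "multiply" is +
            best = fmaxf(best, v);                      // tropical "add" is max
        }
        C[row * N + col] = best;
    }
}
```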

LRZ-BADW/OMMOP
OpenMP Matrix Multiplication Offloading Playground
Language: C++ - Size: 31.3 KB - Last synced at: about 15 hours ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

pminhtam/xnor_conv_pytorch_extension
XNOR-Net with binary conv2d kernels using an XNOR GEMM op, supporting both CPU and GPU (an illustrative bit-packed GEMM sketch follows below).
Language: C - Size: 66.4 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0
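
The core trick of XNOR GEMM: values in {-1, +1} are bit-packed (bit 1 encoding +1), so the dot product of two 32-element chunks reduces to 32 - 2 * popcount(a XOR b). A minimal illustrative CUDA sketch follows, not the extension's code; xnor_gemm is a hypothetical name and K is assumed to be a multiple of 32.

```cuda
__global__ void xnor_gemm(int M, int N, int Kwords,  // Kwords = K / 32
                          const unsigned int* A,     // M x Kwords, bit-packed rows
                          const unsigned int* B,     // Kwords x N, word k of column j at B[k * N + j]
                          float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        int mismatches = 0;
        for (int k = 0; k < Kwords; ++k)
            mismatches += __popc(A[row * Kwords + k] ^ B[k * N + col]);
        // Each word contributes 32 - 2 * popcount(xor); sum over all Kwords words.
        C[row * N + col] = (float)(32 * Kwords - 2 * mismatches);
    }
}
```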

jerinphilip/MozIntGemm
Wrapper around intgemm (x86_64) and ruy (ARM) that switches between the two based on architecture, providing a fast matrix multiplication backend for Mozilla Firefox's translation feature.
Language: C++ - Size: 234 KB - Last synced at: 5 days ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 2

a-sidorova/gpu_opencl_cource
Coursework for Programming on New Architectures 1 (GPU), autumn 2021
Language: C++ - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

andreytkachenko/yarblas
Yet another Rust BLAS
Language: Rust - Size: 32.2 KB - Last synced at: 2 months ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

adelj88/rocm_wmma_gemm
WMMA GEMM in ROCm for RDNA GPUs
Language: C++ - Size: 44.9 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

snalo/SPOGA
Repo for the SPOGA Accelerator - Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels
Language: Python - Size: 86.9 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Serge45/amdgpu-arch-gemm
Yet another AMD GCN assembly generator for GEMM
Language: C++ - Size: 118 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

TeamBipartite/csc485b-202409-a4
High-throughput data-parallel GEMM implementations in CUDA using CUDA cores and tensor cores
Language: C++ - Size: 834 KB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

nirw4nna/YAMI
Yet Another Machine Inference framework
Language: C++ - Size: 794 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

DongqiShen/iLLM
Implementing an LLM from scratch (in development).
Language: C - Size: 24.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

digital-nomad-cheng/matmul_cuda_kernel_tvm
Generate optimized MatMul CUDA kernels automatically using the TVM auto-scheduler
Language: Jupyter Notebook - Size: 48.8 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

junyoung1992/OpenCL-GEMM
GEMM Optimization
Language: C - Size: 26.4 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

martins0n/matmul
Benchmarking of matrix-matrix multiplication implementations
Language: Rust - Size: 43.9 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

PhuNH/hpc-aa
High Performance Computing - Algorithms and Applications Course in WS18-19 at TUM
Language: C++ - Size: 634 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

koallen/gemm-optimization
My experiments on optimizing GEMM
Language: C - Size: 47.9 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 1

riskybacon/mnist_arma_blas
Language: C++ - Size: 49.8 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0
