Topic: "matmul"
eth-cscs/COSMA
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
Language: C++ - Size: 8.39 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 205 - Forks: 29

eth-cscs/Tiled-MM
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Language: C++ - Size: 759 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 6

gha3mi/formatmul
ForMatmul - A Fortran library that overloads the matmul function to enable efficient matrix multiplication with or without coarrays.
Language: Fortran - Size: 11.2 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

paxbun/float-matmul
Floating-point matrix multiplication implementation (arbitrary precision)
Language: Verilog - Size: 37.1 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 3 - Forks: 2

sagi21805/matmul-npu
Matrix multiplication on the NPU inside RK3588
Language: C++ - Size: 74.2 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

LaserBorg/circuitpython_benchmark
Raspberry Pi Pico (RP2040) and Adafruit Metro M7 (NXP IMXRT10XX) benchmark
Language: Python - Size: 8.79 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

Awrsha/Advanced-CUDA-Programming-GPU-Architecture
This repository provides a comprehensive guide to optimizing GPU kernels for performance, with a focus on NVIDIA GPUs. It covers key tools and techniques such as CUDA, PyTorch, and Triton, aimed at improving computational efficiency for deep learning and scientific computing tasks.
Language: Cuda - Size: 25.2 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

LRZ-BADW/OMMOP
OpenMP Matrix Multiplication Offloading Playground
Language: C++ - Size: 31.3 KB - Last synced at: 4 days ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

Alexieviri/Parallel-Computing-on-CUDA (fork of Russia163Samara/CUDA-labs)
📰 This repository contains time measurements of various algorithms on the CPU and GPU using PyCUDA: matrix multiplication, Pi computation, and bilateral filtering.
Size: 4.96 MB - Last synced at: 12 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

alprn42/Instruction-Counter
This project counts the instructions in a C program using Pin and C++.
Language: C++ - Size: 19.5 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

delveopers/Axon
Lightweight multi-dimensional array manipulation library powered by GPU
Language: C++ - Size: 121 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

eduand-alvarez/CUDA_Custom_MatMul_Experiment
This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.
Language: Python - Size: 14.6 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

akifejaz/HwVerification
This repo contains the Python scripts for testing all of MatMul's modules.
Language: Python - Size: 30.2 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

digital-nomad-cheng/matmul_cuda_kernel_tvm
Automatically generate optimized MatMul CUDA kernels using the TVM auto-scheduler
Language: Jupyter Notebook - Size: 48.8 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

WilliamSpanfelner/day-76-computation_with_numpy
Check out the power of NumPy
Language: Jupyter Notebook - Size: 2.35 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

akifejaz/matmul-testbench
A simple script that generates 4-by-4 matrices for testing MatMul.
Language: Python - Size: 21 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

jhson989/matmul_cublas
cuBLAS GEMM Example for FP32 MatMul
Language: Cuda - Size: 7.81 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

jhson989/SYCL-heterogeneous
CPU, GPU, and FPGA matrix multiplication examples via SYCL
Language: C++ - Size: 3.91 KB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

martins0n/matmul
Matrix-matrix multiplication implementations benchmarking
Language: Rust - Size: 43.9 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0
