Topic: "matmul"

eth-cscs/COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm

Language: C++ - Size: 8.39 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 205 - Forks: 29

eth-cscs/Tiled-MM

Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.

Language: C++ - Size: 759 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 6

gha3mi/formatmul

ForMatmul - A Fortran library that overloads the matmul function to enable efficient matrix multiplication with or without coarrays.

Language: Fortran - Size: 11.2 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

paxbun/float-matmul

Floating-point matrix multiplication implementation (arbitrary precision)

Language: Verilog - Size: 37.1 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 3 - Forks: 2

sagi21805/matmul-npu

Matrix multiplication on the NPU inside RK3588

Language: C++ - Size: 74.2 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

LaserBorg/circuitpython_benchmark

Raspberry Pi Pico (RP2040) and Adafruit Metro M7 (NXP IMXRT10XX) benchmark

Language: Python - Size: 8.79 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

Awrsha/Advanced-CUDA-Programming-GPU-Architecture

This repository provides a comprehensive guide to optimizing GPU kernels for performance, with a focus on NVIDIA GPUs. It covers key tools and techniques such as CUDA, PyTorch, and Triton, aimed at improving computational efficiency for deep learning and scientific computing tasks.

Language: Cuda - Size: 25.2 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

LRZ-BADW/OMMOP

OpenMP Matrix Multiplication Offloading Playground

Language: C++ - Size: 31.3 KB - Last synced at: 4 days ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

Alexieviri/Parallel-Computing-on-CUDA (fork of Russia163Samara/CUDA-labs)

📰 This repository contains time measurements of various algorithms on the CPU and GPU using PyCUDA: matrix multiplication, Pi computation, and bilateral filtering.

Size: 4.96 MB - Last synced at: 12 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

alprn42/Instruction-Counter

In this project, the instructions executed by a C program are counted using Pin and C++.

Language: C++ - Size: 19.5 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

delveopers/Axon

Lightweight multi-dimensional array manipulation library powered by GPU

Language: C++ - Size: 121 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

eduand-alvarez/CUDA_Custom_MatMul_Experiment

This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.

Language: Python - Size: 14.6 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0