GitHub topics: tensor-core
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and the MMA PTX instruction.
Language: Cuda - Size: 1.1 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 408 - Forks: 79

Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Language: Cuda - Size: 459 KB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 61 - Forks: 5

Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
Language: C++ - Size: 2.14 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Language: C++ - Size: 1.99 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 35 - Forks: 4

Bruce-Lee-LY/cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction.
Language: Cuda - Size: 854 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

fan1997/DTC-SpMM-ASPLOS24
Code for DTC-SpMM (ASPLOS'24)
Language: C++ - Size: 1.23 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

junqi-xie-learning/CS4302-Assignments 📦
Lab assignments from CS4302 Parallel and Distributed Programming (Fall 2022), with my solutions
Language: C++ - Size: 4.86 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0
