GitHub topics: tensor-core
Dartayous/FP16-vs-FP32-A-GPU-Lab-in-Frames
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.
Language: Python - Size: 9.33 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Language: C++ - Size: 1.99 MB - Last synced at: 12 days ago - Pushed at: 6 months ago - Stars: 40 - Forks: 6

Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators built with CUTLASS to support LLM inference.
Language: C++ - Size: 2.81 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 2

Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Language: Cuda - Size: 459 KB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 62 - Forks: 7

Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Language: Cuda - Size: 1.1 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 414 - Forks: 79

Bruce-Lee-LY/cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Language: Cuda - Size: 854 KB - Last synced at: 5 months ago - Pushed at: almost 2 years ago - Stars: 11 - Forks: 2

fan1997/DTC-SpMM-ASPLOS24
Code for DTC-SpMM (ASPLOS'24).
Language: C++ - Size: 1.23 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

junqi-xie-learning/CS4302-Assignments 📦
Lab assignments from CS4302 Parallel and Distributed Programming (Fall 2022), with my solutions.
Language: C++ - Size: 4.86 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0
