An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: tensor-core

Dartayous/FP16-vs-FP32-A-GPU-Lab-in-Frames

A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.

Language: Python - Size: 9.33 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

Bruce-Lee-LY/flash_attention_inference

Performance of the C++ interfaces of FlashAttention and FlashAttention v2 in large language model (LLM) inference scenarios.

Language: C++ - Size: 1.99 MB - Last synced at: 12 days ago - Pushed at: 6 months ago - Stars: 40 - Forks: 6

Bruce-Lee-LY/cutlass_gemm

Multiple GEMM operators constructed with CUTLASS to support LLM inference.

Language: C++ - Size: 2.81 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 2

Bruce-Lee-LY/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

Language: Cuda - Size: 459 KB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 62 - Forks: 7

Bruce-Lee-LY/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Language: Cuda - Size: 1.1 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 414 - Forks: 79

Bruce-Lee-LY/cuda_back2back_hgemm

Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

Language: Cuda - Size: 854 KB - Last synced at: 5 months ago - Pushed at: almost 2 years ago - Stars: 11 - Forks: 2

fan1997/DTC-SpMM-ASPLOS24

Code for DTC-SpMM (ASPLOS'24)

Language: C++ - Size: 1.23 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

junqi-xie-learning/CS4302-Assignments 📦

Lab assignments from CS4302 Parallel and Distributed Programming (Fall 2022), with the author's solutions.

Language: C++ - Size: 4.86 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0