GitHub topics: tensor-core

Repositories

Dartayous/FP16-vs-FP32-A-GPU-Lab-in-Frames

A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.

Language: Python - Size: 9.33 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

Bruce-Lee-LY/flash_attention_inference

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

Language: C++ - Size: 1.99 MB - Last synced at: 12 days ago - Pushed at: 6 months ago - Stars: 40 - Forks: 6

Bruce-Lee-LY/cutlass_gemm

Multiple GEMM operators built with CUTLASS to support LLM inference.

Language: C++ - Size: 2.81 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 2

Bruce-Lee-LY/cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

Language: Cuda - Size: 459 KB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 62 - Forks: 7

Bruce-Lee-LY/cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Language: Cuda - Size: 1.1 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 414 - Forks: 79

Bruce-Lee-LY/cuda_back2back_hgemm

Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

Language: Cuda - Size: 854 KB - Last synced at: 5 months ago - Pushed at: almost 2 years ago - Stars: 11 - Forks: 2

fan1997/DTC-SpMM-ASPLOS24

Code for DTC-SpMM (ASPLOS'24)

Language: C++ - Size: 1.23 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

junqi-xie-learning/CS4302-Assignments 📦

The lab assignments from CS4302 Parallel and Distributed Programming (2022 Fall) with my solutions

Language: C++ - Size: 4.86 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Related Keywords

tensor-core (8), cuda (6), nvidia (5), gpu (5), cublas (4), gemm (4), matrix-multiply (4), hgemm (3), llm (2), cutlass (2), openmp (1), cublaslt (1), spmm (1), cuda-core (1), gemv (1), sparse-matrix (1), hgemv (1), back2back-gemm (1), back2back-hgemm (1), fused-gemm (1), fused-hgemm (1), nvidia-gpu (1), reordering (1), cupy (1), deep-learning (1), fp16 (1), fp32 (1), gpu-benchmark (1), mixed-precision (1), nsight (1), nvtx (1), performance-engineering (1), pytorch (1), reproducible-research (1), flash-attention (1), flash-attention-2 (1), inference (1), large-language-model (1), mha (1), multi-head-attention (1)
