GitHub topics: cublas
kevmo314/scuda
SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
Language: C++ - Size: 2.6 MB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 1,714 - Forks: 65

coderonion/awesome-cuda-and-hpc
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
Size: 55.7 KB - Last synced at: 1 day ago - Pushed at: 19 days ago - Stars: 263 - Forks: 31

Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Language: Cuda - Size: 1.1 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 408 - Forks: 79

coreylowman/cudarc
Safe Rust wrapper around the CUDA toolkit
Language: Rust - Size: 2.91 MB - Last synced at: 2 days ago - Pushed at: 8 days ago - Stars: 836 - Forks: 99

cupy/cupy
NumPy & SciPy for GPU
Language: Python - Size: 40.6 MB - Last synced at: 2 days ago - Pushed at: 14 days ago - Stars: 10,198 - Forks: 910

enp1s0/CULiP
Library for profiling the execution time of CUDA official library functions
Language: Cuda - Size: 154 KB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 10 - Forks: 0

rbaygildin/learn-gpgpu
Algorithms implemented in CUDA + resources about GPGPU
Language: Cuda - Size: 226 KB - Last synced at: 1 day ago - Pushed at: over 3 years ago - Stars: 56 - Forks: 14

lebedov/scikit-cuda
Python interface to GPU-powered libraries
Language: Python - Size: 2.44 MB - Last synced at: about 18 hours ago - Pushed at: over 1 year ago - Stars: 991 - Forks: 181

bokutotu/zenu
A deep learning framework with very few dependencies, written in Rust
Language: Rust - Size: 8.12 MB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 63 - Forks: 1

Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Language: Cuda - Size: 459 KB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 61 - Forks: 5

jagennath-hari/CUDA-Accelerated-Visual-Inertial-Odometry-Fusion
Harness the power of GPU acceleration for fusing visual odometry and IMU data with an advanced Unscented Kalman Filter (UKF) implementation. Developed in C++ and utilizing CUDA, cuBLAS, and cuSOLVER, this system offers unparalleled real-time performance in state and covariance estimation for robotics and autonomous system applications.
Language: Cuda - Size: 211 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 2

mnicely/nvml_examples
Examples showing how to utilize the NVML library for GPU monitoring
Language: C++ - Size: 256 KB - Last synced at: 24 days ago - Pushed at: almost 3 years ago - Stars: 28 - Forks: 1

VORTICITY-INC/VTensor
VTensor, a C++ library, facilitates tensor manipulation on GPUs, emulating the python-numpy style for ease of use. It leverages RMM (RAPIDS Memory Manager) for efficient device memory management. It also supports xtensor for host memory operations.
Language: C++ - Size: 6.65 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

Tyler-Hilbert/CUDA-LinearRegression
Linear Regression in CUDA
Language: Cuda - Size: 390 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Bruce-Lee-LY/cuda_hook
Hooks CUDA-related dynamic libraries using automated code generation tools.
Language: C - Size: 717 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 41

Bruce-Lee-LY/cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
Language: C++ - Size: 2.14 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

coderonion/cuda-beginner-course-cpp-version
Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (C++ Edition)"
Language: Cuda - Size: 20.5 KB - Last synced at: 1 day ago - Pushed at: 9 months ago - Stars: 30 - Forks: 4

coderonion/cuda-beginner-course-rust-version
Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (Rust Edition)"
Language: Rust - Size: 10.7 KB - Last synced at: 1 day ago - Pushed at: 9 months ago - Stars: 6 - Forks: 0

mnovak42/leuven
Framework, toolkit and ready-to-use applications for numerical linear algebra dependent machine learning algorithms.
Language: C++ - Size: 45.4 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Smorodov/nano_bfm
Basel morphable face model mesh and texture generator using GPU.
Language: C - Size: 8.97 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 13 - Forks: 6

Black-Phoenix/CUDA-MLP
A multilayer perceptron (for simple image classification), accelerated with CUDA
Language: CMake - Size: 1.33 MB - Last synced at: 3 days ago - Pushed at: over 5 years ago - Stars: 16 - Forks: 1

sasagawa888/deeppipe2
Deep Learning library using GPU(CUDA/cuBLAS)
Language: Elixir - Size: 835 MB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 94 - Forks: 6

AaronJackson/cl-cublas 📦
🐮 Harness the power of the GPU with '((((((cuBLAS in Common Lisp
Language: Common Lisp - Size: 53.7 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 5 - Forks: 1

machineko/SwiftCUBLAS
SwiftCUBLAS is a wrapper for cuBLAS APIs with extra utilities for ease of use, along with a suite of tests. The repository is tested on the newest (v12.5) CUDA runtime API on both Linux and Windows.
Language: Swift - Size: 647 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

conradsnicta/bandicoot-code
Bandicoot: C++ library for GPU linear algebra & scientific computing - https://coot.sourceforge.io
Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 29 - Forks: 5

gritukan/hamkaas
Language: C++ - Size: 802 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 25 - Forks: 1

rxwei/cuda-swift
Parallel Computing Library for Linux and macOS & NVIDIA CUDA Wrapper
Language: Swift - Size: 315 KB - Last synced at: 16 days ago - Pushed at: about 8 years ago - Stars: 82 - Forks: 8

tigercosmos/simple-vgg16-cu
Simple VGG16 implemented in CUDA
Language: Cuda - Size: 2.93 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 7 - Forks: 1

mnicely/computeWorks_examples
Matrix multiplication example performed with OpenMP, OpenACC, BLAS, cuBLAS, and CUDA
Language: C++ - Size: 834 KB - Last synced at: 24 days ago - Pushed at: almost 3 years ago - Stars: 7 - Forks: 1

yester31/CUDA_EX
CUDA kernel functions
Language: Cuda - Size: 92.9 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 2

versi379/Optimized-Matrix-Multiplication
This project utilizes CUDA and cuBLAS to optimize matrix multiplication, achieving up to a 5x speedup on large matrices by leveraging GPU acceleration. It also improves memory efficiency and reduces data transfer times between CPU and GPU.
Language: C++ - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

coderonion/moblas
BLAS (Basic Linear Algebra Subprograms) library written in the Mojo programming language.
Language: Mojo - Size: 5.86 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

Bruce-Lee-LY/cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
Language: Cuda - Size: 854 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

BorisLestsov/CudaInference
Cuda NN inference
Language: C++ - Size: 41.2 MB - Last synced at: 7 months ago - Pushed at: about 5 years ago - Stars: 4 - Forks: 1

TApplencourt/mkl-verbose-toolkit
Tools to run and parse MKL verbose mode
Language: Python - Size: 70.3 KB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 17 - Forks: 4

coderonion/cuda-beginner-course-python-version
Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (Python Edition)"
Language: Python - Size: 3.91 KB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 0

Bruce-Lee-LY/matrix_multiply
Several common methods of matrix multiplication are implemented on CPU and Nvidia GPU using C++11 and CUDA.
Language: C++ - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 2

mojo-cc/monum Fork of codingonion/monum
Mojo BLAS (Basic Linear Algebra Subprograms)
Language: Mojo - Size: 7.81 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

unum-cloud/udsb
Unlimited Data-Science Benchmarks for Numeric, Tabular and Graph Workloads
Language: Jupyter Notebook - Size: 3.57 MB - Last synced at: about 14 hours ago - Pushed at: about 2 years ago - Stars: 9 - Forks: 1

eth-cscs/Tiled-MM
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
Language: C++ - Size: 759 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 21 - Forks: 6

fff-rs/rust-cublas 📦
Safe CUDA cuBLAS wrapper for the Rust language.
Language: Rust - Size: 1.24 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 5 - Forks: 1

zhaocc1106/cuxx-programing
Examples for several CUDA libraries: cuda, cublas, cublaslt, cusparse...
Language: Cuda - Size: 54.7 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

maximilianbehr/cuPolar
Newton's and Halley's methods for the matrix polar decomposition using CUDA
Language: Cuda - Size: 28.3 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

maximilianbehr/cuNMF
Nonnegative matrix factorizations using CUDA
Language: Cuda - Size: 3.51 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

jhson989/matmul_cublas
cuBLAS GEMM Example for FP32 MatMul
Language: Cuda - Size: 7.81 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

maximilianbehr/cuexpm
Matrix Exponential Approximation using CUDA
Language: Cuda - Size: 57.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

tmrob2/cuda2rust_sandpit
Minimal examples to get CUDA linear algebra programs working with Rust using CC & FFI.
Language: Rust - Size: 13.7 KB - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

gigernau/PCAHyperspectralClassifier
Classification of Hyperspectral Images ( HSIs ) with Principal Component Analysis ( PCA ) in CUDA ( cuBLAS ).
Language: Python - Size: 24.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

alignedalignof/cuda-matmul
Explore performance implications of various matrix multiplication approaches using GPU/CUDA compared to CPU side processing
Language: C++ - Size: 407 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

Peter9606/neuralnetwork
A simple neural network implemented using cuDNN and cuBLAS
Language: C++ - Size: 216 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

nikitanosov1/parallel-programming
Lab assignments for the "Parallel Programming" course
Language: C++ - Size: 2.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Zlisch/LargeMM
A CUBLAS‐CUDA Based Implementation of Multi-GPU Large Matrix Multiplication
Language: Cuda - Size: 196 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

nikulukani/pycublasxt
Python interface to the NVIDIA CublasXt API
Language: C++ - Size: 13.7 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 6 - Forks: 2

hma02/cublasHgemm-P100
Code for testing native float16 matrix multiplication performance on Tesla P100 and V100 GPUs, based on cublasHgemm
Language: Cuda - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 35 - Forks: 11

rvcgeeks/rvc-mnist-cnn-gpu
A MNIST handwritten digit classifier written from scratch in Cuda - C
Language: Cuda - Size: 4.47 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 1

devincody/DSAbeamformer
Real-time GPU Beamformer for DSA110 written in C/CUDA
Language: Jupyter Notebook - Size: 4.16 MB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 15 - Forks: 6

Kostisef/cuda-matrix-multiplication
A CUDA approach for computing the multiplication of a transposed matrix with the initial one, using the cuBLAS library.
Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

dc-fukuoka/gpumm
gpumm - matrix-matrix multiplication by using CUDA, cublas, cublasxt and OpenACC.
Language: Cuda - Size: 7.6 MB - Last synced at: 10 months ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

tenxlenx/GpuDct
A library to extract DCT hashes with CUDA
Language: Cuda - Size: 8.79 KB - Last synced at: 10 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

mz24cn/gemm_optimization
This repository targets performance optimization of the OpenCL GEMM function. It compares the sgemm performance of several libraries (clBLAS, CLBlast, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendors' hardware, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided for out-of-the-box use.
Language: C - Size: 87.1 MB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 10 - Forks: 5

ironhide23586/SHMatrix
A neat C++ custom Matrix class to perform super-fast GPU (or CPU) powered Matrix/Vector computations with minimal code, leveraging the power of cuBLAS where applicable.
Language: C++ - Size: 42.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 8 years ago - Stars: 2 - Forks: 1

Erellu/ste-Matrix
C++ CUDA-compatible template class that provides an interface for general-purpose matrix algorithms and computations. Includes Matlab-like functions. This is mainly an example of how to use CUDA code with C++. Don't expect high performance.
Language: C++ - Size: 193 KB - Last synced at: almost 2 years ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

milk-org/milk-package 📦
Modular Image processing Library toolKit (milk)
Language: C - Size: 14.6 MB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

hma02/cublasgemm-benchmark
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm
Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 19 - Forks: 16

chenxuhao/caffe-escoin
Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs
Language: C++ - Size: 37.8 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 12 - Forks: 2

ilwoolyu/HSD
HSD: Hierarchical Spherical Deformation for Cortical Surface Registration
Language: C++ - Size: 4.76 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 7 - Forks: 1

Ending2015a/StableFluid-CUDA
A really old project that implemented the Stable Fluids using CUDA, cuBLAS and cuSPARSE
Language: C++ - Size: 114 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

am-shb/cuda-elm
Extreme Learning Machine for image classification implemented using Cuda C++ and cuBLAS
Language: Jupyter Notebook - Size: 193 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

yester31/GEMM_Conv2d_CUDA
CUDA Gemm Convolution implementation
Language: C++ - Size: 564 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

Himeyama/cuda-nmf
NMF calculations are performed on NVIDIA GPUs using the Cuda API. (GEM released)
Language: C++ - Size: 67.4 KB - Last synced at: 3 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

SuperbTUM/kmeans-pycuda
A general k-means algorithm with L2 distance using pyCUDA
Language: Jupyter Notebook - Size: 499 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

pmontalb/CudaLightKernels
Collection of CUDA wrappers for a simplified kernel call
Language: Cuda - Size: 97.7 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

PanosAntoniadis/cuda-exercises-ntua
Lab exercise for the Parallel Processing course at NTUA on CUDA programming
Language: Cuda - Size: 2.84 MB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 9 - Forks: 0

vxj9800/cuda-matrix-lib
Language: Cuda - Size: 7.28 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

miraliahmadli/YoloV2-C
Language: Python - Size: 231 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

Tecnarca/CPU-GPU-speed-comparison
A comparison of a single-threaded program against multi-threaded and CUDA approaches
Language: C++ - Size: 551 KB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 4 - Forks: 0

prasunanand/cuda_d
D bindings for CUDA
Language: D - Size: 64.5 KB - Last synced at: about 12 hours ago - Pushed at: almost 8 years ago - Stars: 1 - Forks: 0

PineApple777/myCUBLASBasic
Custom basic cuBLAS example, modified from the NVIDIA cuBLAS samples
Language: Makefile - Size: 171 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

zqy767/cl-matrix
Matrix operations using the GPU
Language: Common Lisp - Size: 20.5 KB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

naetherm/DerelictCuBLAS
Dynamic bindings to the CuBLAS library for the D Programming Language.
Language: D - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

itohdak/mnist_euslisp
Learning MNIST with EusLisp
Language: Common Lisp - Size: 468 KB - Last synced at: almost 2 years ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 1

dendisuhubdy/cupy Fork of cupy/cupy
NumPy-like API accelerated with CUDA
Language: Python - Size: 13.6 MB - Last synced at: 12 months ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

leobrl/XBlas
Language: Cuda - Size: 25.4 KB - Last synced at: 8 months ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0
