GitHub topics: cublas

Repositories

kevmo314/scuda

SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.

Language: C++ - Size: 2.6 MB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 1,714 - Forks: 65

coderonion/awesome-cuda-and-hpc

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

Size: 55.7 KB - Last synced at: 1 day ago - Pushed at: 19 days ago - Stars: 263 - Forks: 31

Bruce-Lee-LY/cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Language: Cuda - Size: 1.1 MB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 408 - Forks: 79

coreylowman/cudarc

Safe Rust wrapper around the CUDA toolkit

Language: Rust - Size: 2.91 MB - Last synced at: 2 days ago - Pushed at: 8 days ago - Stars: 836 - Forks: 99

cupy/cupy

NumPy & SciPy for GPU

Language: Python - Size: 40.6 MB - Last synced at: 2 days ago - Pushed at: 14 days ago - Stars: 10,198 - Forks: 910

enp1s0/CULiP

Library for profiling the execution time of CUDA official library functions

Language: Cuda - Size: 154 KB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 10 - Forks: 0

rbaygildin/learn-gpgpu

Algorithms implemented in CUDA + resources about GPGPU

Language: Cuda - Size: 226 KB - Last synced at: 1 day ago - Pushed at: over 3 years ago - Stars: 56 - Forks: 14

lebedov/scikit-cuda

Python interface to GPU-powered libraries

Language: Python - Size: 2.44 MB - Last synced at: about 18 hours ago - Pushed at: over 1 year ago - Stars: 991 - Forks: 181

bokutotu/zenu

A deep learning framework with very few dependencies, written in Rust

Language: Rust - Size: 8.12 MB - Last synced at: 5 days ago - Pushed at: 3 months ago - Stars: 63 - Forks: 1

Bruce-Lee-LY/cuda_hgemv

Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.

Language: Cuda - Size: 459 KB - Last synced at: 1 day ago - Pushed at: 8 months ago - Stars: 61 - Forks: 5

jagennath-hari/CUDA-Accelerated-Visual-Inertial-Odometry-Fusion

Harness the power of GPU acceleration for fusing visual odometry and IMU data with an advanced Unscented Kalman Filter (UKF) implementation. Developed in C++ and utilizing CUDA, cuBLAS, and cuSOLVER, this system offers unparalleled real-time performance in state and covariance estimation for robotics and autonomous system applications.

Language: Cuda - Size: 211 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 25 - Forks: 2

mnicely/nvml_examples

Examples showing how to utilize the NVML library for GPU monitoring

Language: C++ - Size: 256 KB - Last synced at: 24 days ago - Pushed at: almost 3 years ago - Stars: 28 - Forks: 1

VORTICITY-INC/VTensor

VTensor, a C++ library, facilitates tensor manipulation on GPUs, emulating the python-numpy style for ease of use. It leverages RMM (RAPIDS Memory Manager) for efficient device memory management. It also supports xtensor for host memory operations.

Language: C++ - Size: 6.65 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

Tyler-Hilbert/CUDA-LinearRegression

Linear Regression in CUDA

Language: Cuda - Size: 390 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Bruce-Lee-LY/cuda_hook

Hooked CUDA-related dynamic libraries by using automated code generation tools.

Language: C - Size: 717 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 41

Bruce-Lee-LY/cutlass_gemm

Multiple GEMM operators are constructed with cutlass to support LLM inference.

Language: C++ - Size: 2.14 MB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 17 - Forks: 2

coderonion/cuda-beginner-course-cpp-version

Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (C++ Edition)"

Language: Cuda - Size: 20.5 KB - Last synced at: 1 day ago - Pushed at: 9 months ago - Stars: 30 - Forks: 4

coderonion/cuda-beginner-course-rust-version

Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (Rust Edition)"

Language: Rust - Size: 10.7 KB - Last synced at: 1 day ago - Pushed at: 9 months ago - Stars: 6 - Forks: 0

mnovak42/leuven

Framework, toolkit and ready-to-use applications for numerical linear algebra dependent machine learning algorithms.

Language: C++ - Size: 45.4 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Smorodov/nano_bfm

Basel morphable face model mesh and texture generator using GPU.

Language: C - Size: 8.97 MB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 13 - Forks: 6

Black-Phoenix/CUDA-MLP

A multilayer perceptron (for simple image classification), accelerated with CUDA

Language: CMake - Size: 1.33 MB - Last synced at: 3 days ago - Pushed at: over 5 years ago - Stars: 16 - Forks: 1

sasagawa888/deeppipe2

Deep Learning library using GPU (CUDA/cuBLAS)

Language: Elixir - Size: 835 MB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 94 - Forks: 6

AaronJackson/cl-cublas 📦

🐮 Harness the power of the GPU with '((((((cuBLAS in Common Lisp

Language: Common Lisp - Size: 53.7 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 5 - Forks: 1

machineko/SwiftCUBLAS

SwiftCUBLAS is a wrapper for cuBLAS APIs with extra utilities for ease of usage, along with a suite of tests. The repository is tested on the newest (v12.5) CUDA runtime API on both Linux and Windows.

Language: Swift - Size: 647 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

conradsnicta/bandicoot-code

Bandicoot: C++ library for GPU linear algebra & scientific computing - https://coot.sourceforge.io

Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 29 - Forks: 5

gritukan/hamkaas

Language: C++ - Size: 802 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 25 - Forks: 1

rxwei/cuda-swift

Parallel Computing Library for Linux and macOS & NVIDIA CUDA Wrapper

Language: Swift - Size: 315 KB - Last synced at: 16 days ago - Pushed at: about 8 years ago - Stars: 82 - Forks: 8

tigercosmos/simple-vgg16-cu

Simple VGG16 implemented in CUDA

Language: Cuda - Size: 2.93 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 7 - Forks: 1

mnicely/computeWorks_examples

Matrix multiplication example performed with OpenMP, OpenACC, BLAS, cuBLAS, and CUDA

Language: C++ - Size: 834 KB - Last synced at: 24 days ago - Pushed at: almost 3 years ago - Stars: 7 - Forks: 1

yester31/CUDA_EX

CUDA kernel functions

Language: Cuda - Size: 92.9 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 2

versi379/Optimized-Matrix-Multiplication

This project utilizes CUDA and cuBLAS to optimize matrix multiplication, achieving up to a 5x speedup on large matrices by leveraging GPU acceleration. It also improves memory efficiency and reduces data transfer times between CPU and GPU.

Language: C++ - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

coderonion/moblas

BLAS (Basic Linear Algebra Subprograms) library written in mojo programming language.

Language: Mojo - Size: 5.86 KB - Last synced at: 1 day ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

Bruce-Lee-LY/cuda_back2back_hgemm

Use tensor core to calculate back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instruction.

Language: Cuda - Size: 854 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

BorisLestsov/CudaInference

Cuda NN inference

Language: C++ - Size: 41.2 MB - Last synced at: 7 months ago - Pushed at: about 5 years ago - Stars: 4 - Forks: 1

TApplencourt/mkl-verbose-toolkit

Tools to run and parse MKL verbose mode

Language: Python - Size: 70.3 KB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 17 - Forks: 4

coderonion/cuda-beginner-course-python-version

Companion code for the bilibili video course "Introduction to CUDA 12.x Parallel Programming (Python Edition)"

Language: Python - Size: 3.91 KB - Last synced at: 1 day ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 0

Bruce-Lee-LY/matrix_multiply

Several common methods of matrix multiplication are implemented on CPU and Nvidia GPU using C++11 and CUDA.

Language: C++ - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 2

mojo-cc/monum Fork of codingonion/monum

Mojo BLAS (Basic Linear Algebra Subprograms)

Language: Mojo - Size: 7.81 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

unum-cloud/udsb

Unlimited Data-Science Benchmarks for Numeric, Tabular and Graph Workloads

Language: Jupyter Notebook - Size: 3.57 MB - Last synced at: about 14 hours ago - Pushed at: about 2 years ago - Stars: 9 - Forks: 1

eth-cscs/Tiled-MM

Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.

Language: C++ - Size: 759 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 21 - Forks: 6

fff-rs/rust-cublas 📦

Safe CUDA cuBLAS wrapper for the Rust language.

Language: Rust - Size: 1.24 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 5 - Forks: 1

zhaocc1106/cuxx-programing

Examples for several CUDA libraries: cuda, cublas, cublaslt, cusparse...

Language: Cuda - Size: 54.7 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

maximilianbehr/cuPolar

Newton's and Halley's Method for the Matrix Polar Decomposition using CUDA

Language: Cuda - Size: 28.3 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

maximilianbehr/cuNMF

Nonnegative matrix factorizations using CUDA

Language: Cuda - Size: 3.51 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

jhson989/matmul_cublas

cuBLAS GEMM Example for FP32 MatMul

Language: Cuda - Size: 7.81 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

maximilianbehr/cuexpm

Matrix Exponential Approximation using CUDA

Language: Cuda - Size: 57.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

tmrob2/cuda2rust_sandpit

Minimal examples to get CUDA linear algebra programs working with Rust using CC & FFI.

Language: Rust - Size: 13.7 KB - Last synced at: 1 day ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

gigernau/PCAHyperspectralClassifier

Classification of Hyperspectral Images (HSIs) with Principal Component Analysis (PCA) in CUDA (cuBLAS).

Language: Python - Size: 24.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

alignedalignof/cuda-matmul

Explore performance implications of various matrix multiplication approaches using GPU/CUDA compared to CPU side processing

Language: C++ - Size: 407 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

Peter9606/neuralnetwork

A simple neural network implemented by using cudnn and cublas

Language: C++ - Size: 216 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

nikitanosov1/parallel-programming

Lab assignments for the "Parallel Programming" course

Language: C++ - Size: 2.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Zlisch/LargeMM

A CUBLAS‐CUDA Based Implementation of Multi-GPU Large Matrix Multiplication

Language: Cuda - Size: 196 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

nikulukani/pycublasxt

Python interface to the NVIDIA CublasXt API

Language: C++ - Size: 13.7 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 6 - Forks: 2

hma02/cublasHgemm-P100

Code for testing the native float16 matrix multiplication performance on Tesla P100 and V100 GPU based on cublasHgemm

Language: Cuda - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 35 - Forks: 11

rvcgeeks/rvc-mnist-cnn-gpu

A MNIST handwritten digit classifier written from scratch in Cuda - C

Language: Cuda - Size: 4.47 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 1

devincody/DSAbeamformer

Real-time GPU Beamformer for DSA110 written in C/CUDA

Language: Jupyter Notebook - Size: 4.16 MB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 15 - Forks: 6

Kostisef/cuda-matrix-multiplication

A CUDA approach for computing the multiplication of a transposed matrix with the initial one, using the cuBLAS library.

Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

dc-fukuoka/gpumm

gpumm - matrix-matrix multiplication by using CUDA, cublas, cublasxt and OpenACC.

Language: Cuda - Size: 7.6 MB - Last synced at: 10 months ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

tenxlenx/GpuDct

A library to extract DCT hashes with CUDA

Language: Cuda - Size: 8.79 KB - Last synced at: 10 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

mz24cn/gemm_optimization

This repository targets performance optimization of the OpenCL GEMM function. It compares the sgemm performance of several BLAS libraries (clBLAS, clBLAST, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across matrix sizes, hardware vendors, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided for out-of-the-box use.

Language: C - Size: 87.1 MB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 10 - Forks: 5

ironhide23586/SHMatrix

A neat C++ custom Matrix class to perform super-fast GPU (or CPU) powered Matrix/Vector computations with minimal code, leveraging the power of cuBLAS where applicable.

Language: C++ - Size: 42.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 8 years ago - Stars: 2 - Forks: 1

Erellu/ste-Matrix

C++ CUDA-compatible template class that provides an interface for general-purpose matrix algorithms and computations. Includes Matlab-like functions. This is mainly an example of how to use CUDA code with C++, so don't expect high performance.

Language: C++ - Size: 193 KB - Last synced at: almost 2 years ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

milk-org/milk-package 📦

Modular Image processing Library toolKit (milk)

Language: C - Size: 14.6 MB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

hma02/cublasgemm-benchmark

code for benchmarking GPU performance based on cublasSgemm and cublasHgemm

Language: Cuda - Size: 9.77 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 19 - Forks: 16

chenxuhao/caffe-escoin

Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs

Language: C++ - Size: 37.8 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 12 - Forks: 2

ilwoolyu/HSD

HSD: Hierarchical Spherical Deformation for Cortical Surface Registration

Language: C++ - Size: 4.76 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 7 - Forks: 1

Ending2015a/StableFluid-CUDA

A really old project that implemented the Stable Fluids using CUDA, cuBLAS and cuSPARSE

Language: C++ - Size: 114 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

am-shb/cuda-elm

Extreme Learning Machine for image classification implemented using Cuda C++ and cuBLAS

Language: Jupyter Notebook - Size: 193 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

yester31/GEMM_Conv2d_CUDA

CUDA Gemm Convolution implementation

Language: C++ - Size: 564 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

Himeyama/cuda-nmf

NMF calculations are performed on NVIDIA GPUs using the Cuda API. (GEM released)

Language: C++ - Size: 67.4 KB - Last synced at: 3 months ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

SuperbTUM/kmeans-pycuda

A general k-means algorithm with L2 distance using pyCUDA

Language: Jupyter Notebook - Size: 499 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

pmontalb/CudaLightKernels

Collection of CUDA wrappers for a simplified kernel call

Language: Cuda - Size: 97.7 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

PanosAntoniadis/cuda-exercises-ntua

Lab exercises on CUDA programming for the Parallel Processing course at NTUA

Language: Cuda - Size: 2.84 MB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 9 - Forks: 0

vxj9800/cuda-matrix-lib

Language: Cuda - Size: 7.28 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

miraliahmadli/YoloV2-C

Language: Python - Size: 231 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

Tecnarca/CPU-GPU-speed-comparison

A comparison of a single-threaded program against multi-threaded and CUDA approaches

Language: C++ - Size: 551 KB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 4 - Forks: 0

prasunanand/cuda_d

D bindings for CUDA

Language: D - Size: 64.5 KB - Last synced at: about 12 hours ago - Pushed at: almost 8 years ago - Stars: 1 - Forks: 0

PineApple777/myCUBLASBasic

Custom basic cuBLAS example, modified from the NVIDIA cuBLAS samples

Language: Makefile - Size: 171 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

zqy767/cl-matrix

matrix operation using gpu

Language: Common Lisp - Size: 20.5 KB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

naetherm/DerelictCuBLAS

Dynamic bindings to the CuBLAS library for the D Programming Language.

Language: D - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

itohdak/mnist_euslisp

learning mnist with euslisp

Language: Common Lisp - Size: 468 KB - Last synced at: almost 2 years ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 1

dendisuhubdy/cupy Fork of cupy/cupy

NumPy-like API accelerated with CUDA

Language: Python - Size: 13.6 MB - Last synced at: 12 months ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

leobrl/XBlas

Language: Cuda - Size: 25.4 KB - Last synced at: 8 months ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0

Related Keywords

cublas (83), cuda (67), gpu (29), nvidia (14), cudnn (13), cusolver (12), gemm (11), cpp (9), cuda-programming (9), blas (9), matrix-multiplication (8), gpu-acceleration (8), cusparse (7), python (7), rust (6), curand (6), numpy (6), matrix (6), hpc (6), deep-learning (5), gpu-computing (5), linear-algebra (5), openmp (5), matrix-multiply (5), machine-learning (5), cuda-kernels (5), lapack (4), cublasxt (4), tensor-core (4), openblas (4), gpu-programming (4), scientific-computing (4), tensor (4), mkl (4), cublaslt (3), opencl (3), nvrtc (3), high-performance-computing (3), nvml (3), hgemm (3), nvcc (3), pytorch (3), parallel-programming (3), matmul (2), swift (2), clblas (2), convolutional-neural-networks (2), simd (2), matrix-functions (2), mojo (2), inference (2), math (2), gonum (2), fortran (2), nsight (2), mnist (2), opencv (2), openacc (2), cutlass (2), llm (2), dlang (2), matrix-factorization (2), c (2), matrix-calculations (2), tiling (2), nccl (2), cupy (2), shared-memory (2), nvtx (2), rocm (2), cufft (2), ai (2), image-processing (2), parallel-computing (2), pycuda (2), clang (1), half-precision (1), matrix-decompositions (1), p100 (1), precision (1), hyperspectral-image-classification (1), classification (1), v100 (1), rust-lang (1), artificial-intelligence (1), rocblasxt (1), cnn-classification (1), handwritten-digit-recognition (1), edge-computing (1), rocblas (1), beamforming (1), amd (1), cc (1), matrix-computations (1), neural-networks (1), neuralnetwork (1), deadlock-avoidance (1), mpi (1), exponential (1), neural-network (1)