An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: cuda-programming

FaresArgus/artaxerxes

Adaptive high-performance stress tester "artaxerxes" supports GPU, io_uring, DPDK, and eBPF/XDP for advanced cybersecurity labs. Ideal for network testing. 🚀🛠️

Language: C - Size: 26.4 KB - Last synced at: about 10 hours ago - Pushed at: about 12 hours ago - Stars: 0 - Forks: 0

NVIDIA/cccl

CUDA Core Compute Libraries

Language: C++ - Size: 84.2 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1,736 - Forks: 234

artarchi/TaskFlow

TaskFlow is a MERN stack Todo application that enables users to manage their tasks efficiently. With features like JWT authentication and a responsive UI, it provides a seamless experience for both desktop and mobile users. 🐙🌐

Language: CSS - Size: 3.33 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

sail-sg/Adan

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Language: Python - Size: 1.31 MB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 795 - Forks: 69

Rust-GPU/Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.

Language: Rust - Size: 5.99 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4,510 - Forks: 191

MuGdxy/muda

μ-Cuda, COVER THE LAST MILE OF CUDA. With features: intellisense-friendly, structured launch, automatic cuda graph generation and updating.

Language: C++ - Size: 14.7 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 179 - Forks: 9

harleyszhang/llm_note

LLM notes, including model inference, transformer model structure, and llm framework code analysis notes.

Language: Python - Size: 184 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 793 - Forks: 86

trieck/pixienn

A modern C++ reimplementation of Darknet with CUDA support for efficient neural network inference

Language: C++ - Size: 4.57 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

phbastosa/WASMEM2D

Wavefield Analysis for Seismic Modeling in Elastic Media.

Language: Cuda - Size: 30.3 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

berserk-23115/GPU-Specialisation-IP

Independent Project Submission for GPU programming specialisation : Anushk Kumar

Language: Cuda - Size: 16.6 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

berserk-23115/GPU-Specialisation-Capstone

GPU Programming Specialisation Capstone Project submission by Anushk Kumar

Language: Cuda - Size: 20.5 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

HenryNdubuaku/cuda-tutorials

CUDA tutorials for Maths & ML with examples, covering multi-GPU, fused attention, Winograd convolution, and reinforcement learning.

Language: Cuda - Size: 428 KB - Last synced at: about 7 hours ago - Pushed at: about 1 month ago - Stars: 185 - Forks: 5

phbastosa/SeisFAT3D

Modeling, inversion and migration focusing on seismic first-arrivals.

Language: Cuda - Size: 236 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 10 - Forks: 2

toxy4ny/artaxerxes

Artaxerxes - Adaptive High-Performance Stress Tester v.1.0. A rebuild of the older Xerxes DDoS tool. Supports GPU+io_uring, DPDK, eBPF/XDP with intelligent fallbacks. Educational tool for advanced cybersecurity labs

Language: C - Size: 27.3 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 2 - Forks: 0

ZaidMohsin457/Parallelizing-GNN

This project demonstrates parallelization techniques for Graph Neural Networks (GNNs) using CUDA for GPU acceleration, MPI (mpi4py) for distributed computing, and Python multiprocessing for parallel processing. The implementation uses the PubMed dataset from PyTorch Geometric and a 2-layer GCN model.

Language: Python - Size: 448 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 1

taskflow/taskflow

A General-purpose Task-parallel Programming System using Modern C++

Language: C++ - Size: 138 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 10,993 - Forks: 1,287

harshrajhrj/cuda-programming

This repository consists of CUDA programming (specifically for Deep Learning) in C++ and Python. Links: https://github.com/Infatoshi/mnist-cuda

Language: Cuda - Size: 906 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

PaddleJitLab/CUDATutorial

A self-learning tutorial for CUDA High Performance Programming.

Language: JavaScript - Size: 108 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 658 - Forks: 69

MuhammadMuazen/Simple-Matrices-Multiplication-Using-Cuda

Just a simple matrices multiplication using cuda

Language: Cuda - Size: 8.63 MB - Last synced at: 12 days ago - Pushed at: about 1 year ago - Stars: 4 - Forks: 0

RainerMtb/cuvista

Accelerated Optical Video Stabilizer, Cuda, OpenCL, Avx512

Language: C++ - Size: 45.1 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 9 - Forks: 1

real-space/AngstromCube

A parallel and GPU-accelerated Code for Real-Space All-Electron Linear-Scaling Density Functional Theory

Language: C++ - Size: 33.2 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 7 - Forks: 2

emptysoal/YOLOv5-TensorRT-lib-Python

YOLOv5 inference code using the TensorRT C++ API, packaged into a dynamic link library and then called from Python.

Language: Cuda - Size: 750 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 15 - Forks: 1

branebb/nn-framework

Framework for creating neural networks using C++ and CUDA platform. This project is part of my final university assignment for bachelor's degree.

Language: Cuda - Size: 64.5 KB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

coreylowman/cudarc

Safe rust wrapper around CUDA toolkit

Language: Rust - Size: 2.91 MB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 862 - Forks: 106

hetan-official/CUDA_C_Best_Practices_Guide-In-Chinese

This is a Chinese translation of the CUDA_C_Best_Practices_Guide

Size: 35.2 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 0

emptysoal/TensorRT-v8-YOLOv5-v5.0

Based on TensorRT v8.2, build network for YOLOv5-v5.0 by myself, speed up YOLOv5-v5.0 inferencing

Language: C++ - Size: 431 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 12 - Forks: 1

Krasnomakov/EventDrivenArchitecture

Prototypes of Event-Driven Architecture with Computer Vision, games, animation and LLM models

Language: Python - Size: 110 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

jrajan14/CUDA_Programs

Nvidia CUDA Programs. High-performance computing with my collection of CUDA programs, meticulously crafted to harness the immense power of NVIDIA's GPU architecture. From blazingly fast simulations to data-intensive parallel processing, these programs showcase my passion for pushing the boundaries of performance optimization.

Language: Cuda - Size: 30.8 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 5 - Forks: 2

Guillaume-Helbecque/GPU-accelerated-tree-search-Chapel

GPU-accelerated tree search: Investigating Chapel versus CUDA/HIP+X

Language: C - Size: 488 KB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 2 - Forks: 1

phbastosa/WASMEM3D

Wavefield Analysis for Seismic Modeling in Elastic Media.

Language: Cuda - Size: 26.4 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

Cat-Gawr/AI-Python

A small AI whose best response time so far was 0.02 seconds | Konata ~ 2025

Language: Python - Size: 898 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 2 - Forks: 0

jaredhoberock/ubu

Language: C++ - Size: 1.97 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 1 - Forks: 0

eyalroz/cuda-api-wrappers

Thin, unified, C++-flavored wrappers for the CUDA APIs

Language: C++ - Size: 2.85 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 842 - Forks: 84

AlexJMercer/Fractal-Art

Generating Fractals in C++ using SFML. For the ultimate visual stimulation and in-depth code!

Language: C++ - Size: 4.84 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 2 - Forks: 0

govindansriram/sm89-kernels

SM89 Optimized CUDA Kernels

Language: Cuda - Size: 75.2 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Kaminyou/Flash-Attention-Practice

A minimal CUDA implementation of FlashAttention v1 and v2

Language: Python - Size: 19.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Mazharuddin-Mohammed/QDSim

High-performance 2D Quantum Dot (QD) Simulator implemented in C++ and Python

Language: C++ - Size: 1.26 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

dpetrosy/Fractal

This project is a Fractal Visualizer developed in C++ with SFML and CUDA.

Language: C++ - Size: 4.86 MB - Last synced at: 22 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

RRZE-HPC/MD-Bench

A performance-oriented prototyping harness for state of the art Molecular Dynamics algorithms

Language: C - Size: 4.57 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 15 - Forks: 8

Fantasya63/DistributedRayTracer

A small path tracer that runs on the GPU using Numba CUDA in Python.

Language: Python - Size: 26.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

florist-notes/aicore_s

AI, IoT and Robotics Hardware + ROS

Language: Jupyter Notebook - Size: 361 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 1

LuongHuuPhuc/Project_2024-2

Parallel programming for Merge sort algorithm using OpenMP and CUDA

Language: Cuda - Size: 3.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

marcoplaitano/counting-sort-cuda

Parallelized version of Counting Sort using CUDA

Language: C - Size: 26.4 KB - Last synced at: 22 days ago - Pushed at: over 3 years ago - Stars: 4 - Forks: 0

MatteoFasulo/Multi-layer-Neural-Network

A Parallel implementation for a particular kind of multi-layer Neural Network

Language: Cuda - Size: 3.76 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

guoriyue/warp-from-device

Language: Cuda - Size: 1.71 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 1

jerry060599/KittenGpuLBVH

A high performance and friendly GPU LBVH implementation.

Language: Cuda - Size: 90.8 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 24 - Forks: 4

mit-han-lab/TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library

Language: C++ - Size: 83.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 852 - Forks: 85

ashvardanian/cuda-python-starter-kit

Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake using PyBind11

Language: Cuda - Size: 238 KB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 26 - Forks: 3

Momijiichigo/particle_field_simulation

Simulating the particle fields by using the time-evolution equations extracted from Euler-Lagrange equations of fields

Language: Jupyter Notebook - Size: 523 KB - Last synced at: 15 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

coderonion/cuda-beginner-course-cpp-version

Companion code for the bilibili video series "Introduction to CUDA 12.x Parallel Programming (C++ Edition)"

Language: Cuda - Size: 20.5 KB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 29 - Forks: 5

lokk798/HPC-Quiz-Bank

A collection of multiple choice questions (MCQs) on High Performance Computing (HPC) and Lab solutions

Size: 15.6 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

AdroitAnandAI/Parallel-RNG-using-GPU

Parallel implementation of inherently sequential algorithms using mathematical hacks. Random Number Generators - Additive LFG and GFSR - implemented with NVIDIA CUDA using Continuous Subsequence Technique and Leap Frog Technique

Language: Cuda - Size: 3.27 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

huygnguyen04/cpu-gpu-ndp-work

Exploring CPU/GPU memory hierarchies, cache modeling, DRAM simulation, GPU programming with CUDA, and near-data processing using PIMeval-PIMbench - CS 6501 CPU/GPU Memory Systems @ UVA Spring '25

Language: C++ - Size: 21.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

FahimFBA/CUDA-WSL2-Ubuntu

Install CUDA on Windows11 using WSL2

Language: Jupyter Notebook - Size: 10.4 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 62 - Forks: 4

artuppp/EllipseFitCUDA

Ellipse Fit Implementation in CUDA

Language: Cuda - Size: 41 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 0

artuppp/PupilTrackingGPUPublic

GPU implementations of new, high-performance pupil tracking algorithms, as presented in our paper [cuElSe and cuExCuSe: Highly Parallel and Accurate GPU-based Pupil Tracking for Real-World Applications]

Language: Cuda - Size: 1.34 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0

DMMutua/CUDA_Projects

An Implementation of a variety of Algorithms & Technical Papers Mostly Related to Machine Learning & Deep Learning in CUDA C

Language: Cuda - Size: 3.91 KB - Last synced at: 18 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

PerHuepenbecker/Cudyn

CUDA library for irregular tasks using a dynamic block-internal balancing mechanism

Language: Cuda - Size: 44.1 MB - Last synced at: 21 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Malisha4065/CUDA-Tutorials

CUDA Tutorials for beginners

Language: Cuda - Size: 4.88 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

tgautam03/xGeMM

Accelerated General (FP32) Matrix Multiplication from scratch in CUDA

Language: Cuda - Size: 5.8 MB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 115 - Forks: 7

akileshas/gpuX

100 days of GPU programming !!!

Size: 31.6 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

brucefan1983/CUDA-Programming

Sample codes for my CUDA programming book

Language: Cuda - Size: 9.13 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 1,712 - Forks: 347

sundar2k22/Attention-Mechanism-HPC

High Performance Computing project accelerating Transformer Attention using OpenMP and CUDA.

Language: C++ - Size: 381 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

ayushraina2028/DS295-Parallel-Programming-2025

This repository contains my LaTeX notes for Parallel Programming and all my implementations using CUDA C/C++, OpenMP and MPI

Language: Jupyter Notebook - Size: 44.2 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

ishankkumar-007/canny_CUDA-MPI

parallel version of Canny Edge Detection algorithm -- CUDA, MPI

Language: C++ - Size: 258 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Masoudjafaripour/Transformer-CUDA Fork of saimeghana-y/Transformer-CUDA

Building upon original repo, trying to implement encoder-decoder transformer using CUDA

Language: Python - Size: 20.5 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

maya-undefined/gpu-desktop-calculator

Language: Cuda - Size: 48.8 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 7 - Forks: 0

mikeroyal/CUDA-Guide

CUDA Guide

Language: Cuda - Size: 83 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 64 - Forks: 7

nosferalatu/SimpleGPUHashTable

A simple GPU hash table implemented in CUDA using lock free techniques

Language: Cuda - Size: 297 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 394 - Forks: 41

Ramshankar07/CUDA-implementation-llama3.1

This repository is CUDA implementation for LLAMA 3.1 open models

Language: Cuda - Size: 393 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

lokk798/cuda-prgoramming-basics

This repository contains CUDA-based implementations of several parallel computing algorithms and operations, focusing on high-performance GPU computations using NVIDIA's CUDA framework.

Language: Cuda - Size: 10.7 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Cyxuan0311/CUDA-Studying-Notes

This repository contains notes from studying CUDA.

Language: Cuda - Size: 23.4 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

JoeruCodes/CUDA-GEMM-kernel

My attempt of making a GEMM kernel...

Language: Cuda - Size: 67.4 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

KarhouTam/cuda-kernels

Some common CUDA kernel implementations (Not the fastest).

Language: Cuda - Size: 57.6 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 17 - Forks: 1

kartavyaantani/CUDA_IMAGE_PROCESSING

A CUDA-accelerated image processing project featuring multiple GPU-based filters and enhancement techniques. Implements convolution, edge detection, Non-Local Means (NLM) denoising, K-Nearest Neighbors (KNN), and pixelization. Each operation is optimized using CUDA kernels for real-time performance on large images. The project supports command-line

Language: Jupyter Notebook - Size: 5.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

yashkathe/Image-Noise-Reduction-with-CUDA

This project analyzes the median blur image denoising technique, comparing GPU-accelerated (Numba) and CPU-based (OpenCV) processing speeds.

Language: Jupyter Notebook - Size: 25.4 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

lehoangan2906/CUDA_basics

A simple implementation of operations on vectors and matrices, optimized for running on Nvidia GPU with CUDA

Language: Jupyter Notebook - Size: 2.91 MB - Last synced at: 26 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

m15kh/Cuda_Programming

CUDA programming enables parallel computing on NVIDIA GPUs for high-performance tasks like deep learning and scientific computing

Language: Cuda - Size: 790 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

sy-project/yuuna

Opensource GameEngine for DX11 & CUDA

Language: C++ - Size: 115 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

EmonRezaBD/CUDA-Programming

This repo contains CUDA Programming with C++. Projects are done to learn CUDA from scratch.

Language: Cuda - Size: 42.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

alexnet819/Cuda_Playground

Language: Cuda - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Koushikphy/Intro-to-CUDA-Fortran

A Complete beginner's introduction to programming with CUDA Fortran

Size: 200 KB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 26 - Forks: 1

colesmcintosh/pycuda-numpy-vector-ops

Accelerating NumPy Vector Operations with PyCUDA

Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

HuangCongQing/cuda-learning

A beginner's introduction to learning CUDA programming

Language: Cuda - Size: 5.66 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 35 - Forks: 6

MolSSI-Education/gpu_programming_beginner

Fundamentals of heterogeneous parallel programming with CUDA C/C++ at the beginner level.

Language: Python - Size: 5.25 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 11 - Forks: 2

saeedahmadicp/Fundamentals-of-Accelerated-Computing-with-CUDA-Python

Fundamentals of Accelerated Computing with CUDA Python

Language: Jupyter Notebook - Size: 6.63 MB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

tgautam03/tGeMM

General Matrix Multiplication using NVIDIA Tensor Cores

Language: Cuda - Size: 47.9 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 13 - Forks: 3

GPUEngineering/GPUtils

A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood)

Language: Cuda - Size: 401 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

emptysoal/cuda-image-preprocess

Speed up image preprocessing with CUDA when handling images or running TensorRT inference

Language: Cuda - Size: 91.8 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 63 - Forks: 5

jacobolarsson/Cuda-Pathtracer

Hardware accelerated pathtracing graphics engine made with C++ and CUDA

Language: C++ - Size: 19.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

SunsetQuest/CudaPAD

CudaPAD is a PTX/SASS viewer for NVIDIA Cuda kernels and provides an on-the-fly view of the assembly.

Language: C# - Size: 1.18 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 117 - Forks: 16

pragati-chaturvedi/Basic-Matrix-Multiplication-CUDA

This project demonstrates Basic Matrix Multiplication implemented using CUDA to accelerate matrix computations on an NVIDIA GPU. The code is designed to take two square matrices as input and multiply them in parallel, showcasing the power of CUDA in optimizing computational tasks.

Language: Cuda - Size: 321 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

tgautam03/xFilters

GPU (CUDA) accelerated filters using 2D convolution for high resolution images.

Language: C++ - Size: 58.2 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 6 - Forks: 1

Awrsha/Advanced-CUDA-Programming-GPU-Architecture

This repository provides a comprehensive guide to optimizing GPU kernels for performance, with a focus on NVIDIA GPUs. It covers key tools and techniques such as CUDA, PyTorch, and Triton, aimed at improving computational efficiency for deep learning and scientific computing tasks.

Language: Cuda - Size: 25.2 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

Jorgedavyd/nsight.nvim

A developer oriented Neovim framework for CUDA performance profiling and analysis.

Language: Lua - Size: 230 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

TheUnsolvedDev/CUDA_NN_FS

This repository features a from-scratch implementation of a neural network using CUDA and C. The primary goal of this project is to leverage CUDA's parallel computing capabilities to significantly accelerate the training and inference processes of neural networks, utilizing the computational power of NVIDIA GPUs.

Language: Cuda - Size: 61.3 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 9 - Forks: 0

seieric/gst-dsobjectsmask

📀NVIDIA DeepStream integrated GStreamer Plugin. Mask objects with cuda cores on Jetson boards. Fast and smooth since everything is done on NVMM.🏎

Language: C++ - Size: 4.83 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 1

awaldis/cuda-experiments

A place to explore the capabilities and limits of CUDA parallel processing.

Language: C++ - Size: 28.3 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

chinmaydk99/Flash_Attention_v2_GPU_Optimized

Custom implementation of FlashAttention v2 from scratch using CUDA and Triton. Optimized for high-performance memory-efficient attention in Transformer models.

Language: Jupyter Notebook - Size: 65.4 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

jayeshthk/Parallel_Computing Fork of ShashankDavalgi/Parallel_Computing

CUDA computing example repo with complex matrix multiplication.

Language: C - Size: 14.6 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0