GitHub topics: flash-attention
xlite-dev/Awesome-LLM-Inference
📚A curated list of awesome LLM/VLM inference papers with code: Flash-Attention, Paged-Attention, WINT8/4, parallelism, etc.
Language: Python - Size: 115 MB - Last synced at: about 7 hours ago - Pushed at: 6 days ago - Stars: 4,157 - Forks: 287

xlite-dev/LeetCUDA
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA.
Language: Cuda - Size: 263 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,833 - Forks: 524

Delxrius/MiniMax-01
MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used decision-making strategy for two-player turn-based games such as Tic-Tac-Toe. The algorithm minimizes the maximum possible loss for a player, which makes it a popular choice for building AI opponents.
Size: 1000 Bytes - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4 - Forks: 0

ymcui/Chinese-LLaMA-Alpaca-2
Phase-2 project for Chinese LLaMA-2 & Alpaca-2 large models, plus 64K long-context models (中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型).
Language: Python - Size: 8.15 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 7,156 - Forks: 571

QwenLM/Qwen
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Language: Python - Size: 35.4 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 18,501 - Forks: 1,515

Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Language: C++ - Size: 867 KB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 37 - Forks: 4
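
As a rough illustration of what decode-stage attention computes (not this repo's CUDA kernels): a single new query token attends over the cached keys/values, and in GQA several query heads share one KV head. A minimal PyTorch sketch with purely illustrative shapes:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 2, 8 query heads sharing 2 KV heads (GQA),
# a KV cache of 128 past tokens, head dim 64, and a single new query token.
B, Hq, Hkv, T, D = 2, 8, 2, 128, 64
q = torch.randn(B, Hq, 1, D)         # one decode-step query per head
k_cache = torch.randn(B, Hkv, T, D)  # cached keys
v_cache = torch.randn(B, Hkv, T, D)  # cached values

# Expand each KV head to the query heads that share it (GQA -> MHA view).
group = Hq // Hkv
k = k_cache.repeat_interleave(group, dim=1)
v = v_cache.repeat_interleave(group, dim=1)

# Decode-stage attention: scores are (B, Hq, 1, T); no causal mask is needed
# because every cached position is in the past.
scores = q @ k.transpose(-2, -1) / D ** 0.5
out = scores.softmax(dim=-1) @ v     # (B, Hq, 1, D)

# Same result via PyTorch's fused kernel, for comparison.
ref = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(out, ref, atol=1e-5)
```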

LuluW8071/Building-LLM-from-Scratch
GPT-2 from scratch with Flash Attention
Language: Jupyter Notebook - Size: 212 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0
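
For context, the usual way a from-scratch GPT-2 block picks up Flash Attention in PyTorch is torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention kernel when one is available. A minimal causal self-attention sketch (illustrative, not taken from this notebook):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """GPT-2-style multi-head self-attention; illustrative sizes only."""
    def __init__(self, d_model: int = 768, n_head: int = 12):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        # is_causal=True applies the lower-triangular mask; on supported GPUs
        # this call routes to the FlashAttention backend.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

x = torch.randn(2, 16, 768)
print(CausalSelfAttention()(x).shape)  # torch.Size([2, 16, 768])
```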

xlite-dev/ffpa-attn
📚FFPA (Split-D): extends FlashAttention with Split-D for large head dimensions; O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA.
Language: Cuda - Size: 4.21 MB - Last synced at: 15 days ago - Pushed at: about 2 months ago - Stars: 185 - Forks: 8
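
The kernel-level details live in the CUDA code, but the arithmetic that makes a Split-D scheme possible is easy to sketch: the QK^T scores can be accumulated over chunks of the head dimension, and the output can be produced one head-dim chunk of V at a time, so no tile has to hold the full head dimension at once. A rough PyTorch illustration of that decomposition, under my own assumptions rather than this repo's exact algorithm:

```python
import torch

B, H, T, D, CHUNK = 1, 2, 64, 256, 64   # large head dim; illustrative sizes
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Scores accumulate over head-dim chunks: S = sum_d Q[..., d] @ K[..., d]^T.
scores = torch.zeros(B, H, T, T)
for d0 in range(0, D, CHUNK):
    scores += q[..., d0:d0 + CHUNK] @ k[..., d0:d0 + CHUNK].transpose(-2, -1)
probs = (scores / D ** 0.5).softmax(dim=-1)

# The output is independent across head-dim chunks of V.
out = torch.empty(B, H, T, D)
for d0 in range(0, D, CHUNK):
    out[..., d0:d0 + CHUNK] = probs @ v[..., d0:d0 + CHUNK]

ref = (q @ k.transpose(-2, -1) / D ** 0.5).softmax(-1) @ v
assert torch.allclose(out, ref, atol=1e-5)
```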

CoinCheung/gdGPT
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode; faster than ZeRO/ZeRO++/FSDP.
Language: Python - Size: 1.1 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 96 - Forks: 9

kyegomez/FlashMHA
A simple PyTorch implementation of Flash Multi-Head Attention.
Language: Jupyter Notebook - Size: 85 KB - Last synced at: 17 days ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 2

erfanzar/jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Language: Python - Size: 6.19 MB - Last synced at: 12 days ago - Pushed at: 4 months ago - Stars: 24 - Forks: 0

InternLM/InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without extensive dependencies.
Language: Python - Size: 6.8 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 389 - Forks: 67

MoonshotAI/MoBA
MoBA: Mixture of Block Attention for Long-Context LLMs
Language: Python - Size: 2.4 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 1,779 - Forks: 106
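
At a high level, block attention of this kind lets each query attend to only a few key/value blocks selected by a gate, e.g. scoring each block by the query's similarity to a pooled representation of that block's keys. A toy single-head sketch of that selection step, as an assumption of how block gating can look (causality and current-block handling omitted), not the repository's implementation:

```python
import torch

T, D, BLOCK, TOPK = 256, 64, 64, 2
q, k, v = (torch.randn(T, D) for _ in range(3))

# Pool each key block to a single vector and score blocks per query.
k_blocks = k.view(-1, BLOCK, D)                 # (n_blocks, BLOCK, D)
block_keys = k_blocks.mean(dim=1)               # (n_blocks, D)
gate = q @ block_keys.T                         # (T, n_blocks)
topk = gate.topk(TOPK, dim=-1).indices          # blocks each query may attend to

# Build a block-sparse attention mask: a query sees only its selected blocks.
n_blocks = block_keys.shape[0]
block_mask = torch.zeros(T, n_blocks, dtype=torch.bool)
block_mask[torch.arange(T).unsqueeze(-1), topk] = True
token_mask = block_mask.repeat_interleave(BLOCK, dim=1)   # (T, T)

scores = (q @ k.T) / D ** 0.5
scores = scores.masked_fill(~token_mask, float("-inf"))
out = scores.softmax(dim=-1) @ v
print(out.shape)  # torch.Size([256, 64])
```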

InternLM/InternLM
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
Language: Python - Size: 7.12 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 6,908 - Forks: 485

pxl-th/NNop.jl
Flash Attention & friends in pure Julia
Language: Julia - Size: 43.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 1

DAMO-NLP-SG/Inf-CLIP
[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
Language: Python - Size: 3.79 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 245 - Forks: 10
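
The memory trick in this line of work is that the contrastive loss never needs the full similarity matrix at once: the logsumexp over negatives can be accumulated chunk by chunk. A small PyTorch sketch of that streaming logsumexp, as an illustration of the idea rather than Inf-CL's tiled CUDA implementation (shapes and temperature are made up):

```python
import torch
import torch.nn.functional as F

N, D, CHUNK = 512, 128, 128
img = F.normalize(torch.randn(N, D), dim=-1)
txt = F.normalize(torch.randn(N, D), dim=-1)
scale = 100.0  # illustrative temperature

# Streaming logsumexp over column chunks of the similarity matrix.
lse = torch.full((N,), float("-inf"))
for c0 in range(0, N, CHUNK):
    sim_chunk = scale * img @ txt[c0:c0 + CHUNK].T       # (N, CHUNK)
    lse = torch.logaddexp(lse, sim_chunk.logsumexp(dim=-1))

pos = scale * (img * txt).sum(dim=-1)                     # matching pairs
loss = (lse - pos).mean()                                 # image-to-text CE

# Reference: the same loss with the full N x N matrix materialized.
ref = F.cross_entropy(scale * img @ txt.T, torch.arange(N))
assert torch.allclose(loss, ref, atol=1e-4)
```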

dcarpintero/pangolin-guard
Open, Lightweight Model for AI Safety.
Language: Jupyter Notebook - Size: 11.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

MasterSkepticista/gpt2
Training GPT-2 on FineWeb-Edu in JAX/Flax
Language: Python - Size: 104 KB - Last synced at: 22 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

kklemon/FlashPerceiver
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
Language: Python - Size: 712 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 26 - Forks: 3
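
The Perceiver's core step is cross-attention from a small set of learned latent queries to a long input sequence, which is exactly the shape of workload FlashAttention kernels accelerate. A minimal sketch of that cross-attention using PyTorch's SDPA (an illustration under my own assumptions, not this package's API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCrossAttention(nn.Module):
    """Perceiver-style cross-attention: few latents attend to many inputs."""
    def __init__(self, dim: int = 256, n_latents: int = 64, n_head: int = 4):
        super().__init__()
        self.n_head = n_head
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        h, d = self.n_head, C // self.n_head
        q = self.to_q(self.latents).expand(B, -1, -1)          # (B, L, C)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        q, k, v = (t.reshape(B, -1, h, d).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v)             # flash path if available
        return y.transpose(1, 2).reshape(B, -1, C)              # (B, L, C)

x = torch.randn(2, 4096, 256)                  # long input sequence
print(LatentCrossAttention()(x).shape)         # torch.Size([2, 64, 256])
```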

Luis355/qw
qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.
Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Bruce-Lee-LY/flash_attention_inference
Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Language: C++ - Size: 1.99 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 35 - Forks: 4

etasnadi/VulkanCooperativeMatrixAttention
Vulkan & GLSL implementation of FlashAttention-2
Language: C++ - Size: 39.1 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

kreasof-ai/Homunculus-Project
Long-term project on a custom AI architecture, consisting of cutting-edge machine-learning techniques such as Flash-Attention, Group-Query-Attention, ZeRO-Infinity, BitNet, etc.
Language: Python - Size: 4.63 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 5 - Forks: 0

LukasDrews97/DumbleLLM
Decoder-only LLM trained on the Harry Potter books.
Language: Python - Size: 235 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/vllm-hpc-installer
🚀 Automated installation script for vLLM on HPC systems with ROCm support, optimized for AMD MI300X GPUs.
Language: Shell - Size: 6.84 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/amd-mi300-ml-stack
Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations
Language: Shell - Size: 5.86 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

gietema/attention
Toy Flash Attention implementation in torch
Language: Python - Size: 21.5 KB - Last synced at: 7 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0
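
The part that makes a toy implementation "flash" is the online softmax: key/value blocks are streamed through SRAM-sized tiles while a running max and normalizer keep the result exact. A compact single-head sketch of that recurrence (an illustration of the algorithm, not this repo's code):

```python
import torch

T, D, BLOCK = 128, 64, 32
q, k, v = (torch.randn(T, D) for _ in range(3))
scale = D ** -0.5

# Running statistics per query row: max m, normalizer l, unnormalized output acc.
m = torch.full((T, 1), float("-inf"))
l = torch.zeros(T, 1)
acc = torch.zeros(T, D)

for j in range(0, T, BLOCK):
    s = scale * q @ k[j:j + BLOCK].T           # scores against this KV block
    m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
    p = torch.exp(s - m_new)                   # block probabilities (unnormalized)
    correction = torch.exp(m - m_new)          # rescale old stats to the new max
    l = l * correction + p.sum(dim=-1, keepdim=True)
    acc = acc * correction + p @ v[j:j + BLOCK]
    m = m_new

out = acc / l
ref = (scale * q @ k.T).softmax(dim=-1) @ v
assert torch.allclose(out, ref, atol=1e-5)
```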

alexzhang13/flashattention2-custom-mask
Triton implementation of FlashAttention-2 that adds custom masks.
Language: Python - Size: 2.27 MB - Last synced at: 10 months ago - Pushed at: 11 months ago - Stars: 46 - Forks: 5
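
Stock FlashAttention kernels typically expose only causal or padding masks, which is why a Triton variant with arbitrary masks is useful. For comparison, PyTorch's SDPA already accepts an arbitrary boolean attn_mask, falling back to a non-flash backend when needed; a small sketch with a made-up mask pattern:

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 128, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# A custom mask: causal, but each query may also look 16 tokens ahead
# (True = position may be attended to). Purely illustrative.
i = torch.arange(T)
custom_mask = (i[None, :] <= i[:, None] + 16)          # (T, T) bool

out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)

# Equivalent dense computation.
scores = q @ k.transpose(-2, -1) / D ** 0.5
scores = scores.masked_fill(~custom_mask, float("-inf"))
ref = scores.softmax(dim=-1) @ v
assert torch.allclose(out, ref, atol=1e-5)
```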

graphcore-research/flash-attention-ipu
Poplar implementation of FlashAttention for IPU
Language: C++ - Size: 575 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

RulinShao/FastCkpt
Python package for rematerialization-aware gradient checkpointing
Language: Python - Size: 15.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 3
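
Gradient checkpointing itself is a one-line change in PyTorch: activations inside a checkpointed block are dropped in the forward pass and rematerialized during backward, and rematerialization-aware variants like this one additionally decide what is worth recomputing. A minimal sketch of the vanilla mechanism it builds on:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed in backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024])
```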

Naman-ntc/FastCode
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Language: Python - Size: 75.2 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2
