An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: flash-attention

xlite-dev/Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.

Language: Python - Size: 115 MB - Last synced at: about 7 hours ago - Pushed at: 6 days ago - Stars: 4,157 - Forks: 287

8e8bdba457c18cf692a95fe2ec67000b/VulkanCooperativeMatrixAttention

Vulkan & GLSL implementation of FlashAttention-2

Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
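
FlashAttention-2 appears throughout this list, so a quick reference may help: it computes exact attention tile by tile with an online softmax, never materializing the full score matrix. Below is a minimal NumPy sketch of that tiling (illustrative only; the function name and tile size are made up and unrelated to the Vulkan code above).

```python
import numpy as np

def flash_attention_ref(Q, K, V, tile=64):
    """Tiled attention with online softmax: O(n * d) extra memory
    instead of the full (n x n) score matrix."""
    n, d = Q.shape
    out = np.zeros((n, d))
    scale = 1.0 / np.sqrt(d)
    for i in range(0, n, tile):
        q = Q[i:i + tile] * scale                 # one tile of queries
        m = np.full(q.shape[0], -np.inf)          # running row max
        l = np.zeros(q.shape[0])                  # running softmax denominator
        acc = np.zeros((q.shape[0], d))           # unnormalized output
        for j in range(0, n, tile):
            s = q @ K[j:j + tile].T               # tile of scores
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            corr = np.exp(m - m_new)              # rescale old accumulators
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ V[j:j + tile]
            m = m_new
        out[i:i + tile] = acc / l[:, None]
    return out

# Sanity check against naive attention.
n, d = 128, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
S = (Q / np.sqrt(d)) @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_ref(Q, K, V), ref)
```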

xlite-dev/LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.

Language: Cuda - Size: 263 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4,833 - Forks: 524

Delxrius/MiniMax-01

MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.

Size: 1000 Bytes - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4 - Forks: 0
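
Since the description above summarizes the algorithm: a generic minimax sketch in Python (not code from this repo; the `game` interface with `is_terminal`/`score`/`moves`/`apply` is hypothetical) looks like this:

```python
def minimax(state, maximizing, game):
    """Generic minimax: the maximizing player picks the move with the
    highest value, assuming the opponent always picks the lowest.
    `game` is any object exposing is_terminal/score/moves/apply."""
    if game.is_terminal(state):
        return game.score(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for move in game.moves(state):
            value, _ = minimax(game.apply(state, move), False, game)
            if value > best:
                best, best_move = value, move
    else:
        best = float("inf")
        for move in game.moves(state):
            value, _ = minimax(game.apply(state, move), True, game)
            if value < best:
                best, best_move = value, move
    return best, best_move
```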

ymcui/Chinese-LLaMA-Alpaca-2

Phase-2 project: Chinese LLaMA-2 & Alpaca-2 LLMs, including 64K ultra-long-context models.

Language: Python - Size: 8.15 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 7,156 - Forks: 571

QwenLM/Qwen

The official repo of Qwen (通义千问), the chat & pretrained large language models proposed by Alibaba Cloud.

Language: Python - Size: 35.4 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 18,501 - Forks: 1,515

Bruce-Lee-LY/decoding_attention

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.

Language: C++ - Size: 867 KB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 37 - Forks: 4
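
For context on the variants this repo names: in the decoding stage, a single new query token attends over the cached keys/values, and in GQA several query heads share one KV head (MQA is the one-KV-head case). A hedged PyTorch sketch of that step (shapes and names are illustrative; this is not the repo's C++ API):

```python
import torch
import torch.nn.functional as F

def gqa_decode_step(q, k_cache, v_cache):
    """One decoding step of grouped-query attention.
    q:       (batch, n_q_heads, 1, head_dim)   - the new token's queries
    k_cache: (batch, n_kv_heads, seq, head_dim)
    v_cache: (batch, n_kv_heads, seq, head_dim)
    n_q_heads must be a multiple of n_kv_heads (MQA: n_kv_heads == 1)."""
    group = q.shape[1] // k_cache.shape[1]
    # Broadcast each KV head to the query heads that share it.
    k = k_cache.repeat_interleave(group, dim=1)
    v = v_cache.repeat_interleave(group, dim=1)
    # No causal mask needed: the cache only holds past positions.
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(2, 8, 1, 64)        # 8 query heads
k = torch.randn(2, 2, 128, 64)      # 2 shared KV heads (GQA)
v = torch.randn(2, 2, 128, 64)
print(gqa_decode_step(q, k, v).shape)  # torch.Size([2, 8, 1, 64])
```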

LuluW8071/Building-LLM-from-Scratch

GPT-2 from scratch with Flash Attention

Language: Jupyter Notebook - Size: 212 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0
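
As a sketch of what "GPT-2 with Flash Attention" typically means in PyTorch (assumed, not copied from this notebook): `F.scaled_dot_product_attention` can dispatch to a fused FlashAttention kernel on supported GPUs, so the attention layer stays compact.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """GPT-2-style multi-head causal self-attention. On supported CUDA
    hardware, F.scaled_dot_product_attention dispatches to a fused
    FlashAttention kernel instead of materializing the score matrix."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.reshape(B, T, self.n_heads, C // self.n_heads)
                    .transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

x = torch.randn(4, 16, 256)
print(CausalSelfAttention(256, 8)(x).shape)  # torch.Size([4, 16, 256])
```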

xlite-dev/ffpa-attn

📚FFPA(Split-D): Extend FlashAttention with Split-D for large headdim, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA.

Language: Cuda - Size: 4.21 MB - Last synced at: 15 days ago - Pushed at: about 2 months ago - Stars: 185 - Forks: 8

CoinCheung/gdGPT

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.

Language: Python - Size: 1.1 MB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 96 - Forks: 9

kyegomez/FlashMHA

A simple PyTorch implementation of Flash Multi-Head Attention

Language: Jupyter Notebook - Size: 85 KB - Last synced at: 17 days ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 2

erfanzar/jax-flash-attn2

A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).

Language: Python - Size: 6.19 MB - Last synced at: 12 days ago - Pushed at: 4 months ago - Stars: 24 - Forks: 0

InternLM/InternEvo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without extensive dependencies.

Language: Python - Size: 6.8 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 389 - Forks: 67

MoonshotAI/MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

Language: Python - Size: 2.4 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 1,779 - Forks: 106

InternLM/InternLM

Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

Language: Python - Size: 7.12 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 6,908 - Forks: 485

pxl-th/NNop.jl

Flash Attention & friends in pure Julia

Language: Julia - Size: 43.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 1

DAMO-NLP-SG/Inf-CLIP

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A super memory-efficient CLIP training scheme.

Language: Python - Size: 3.79 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 245 - Forks: 10
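
The memory barrier referenced in the title is the B x B similarity matrix of the contrastive loss. A rough PyTorch sketch of the tiling idea (one loss direction only; illustrative, not Inf-CL's actual fused kernels):

```python
import torch
import torch.nn.functional as F

def chunked_clip_loss(img, txt, chunk=1024):
    """Image-to-text CLIP loss computed in tiles so the full (B x B)
    similarity matrix is never materialized (streaming logsumexp)."""
    B = img.shape[0]
    pos = (img * txt).sum(dim=-1)              # logits of matching pairs
    lse = torch.full((B,), float("-inf"))
    for j in range(0, B, chunk):
        logits = img @ txt[j:j + chunk].T      # (B, chunk) tile of scores
        lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))
    return (lse - pos).mean()                  # -log softmax at the diagonal

img = F.normalize(torch.randn(4096, 256), dim=-1)
txt = F.normalize(torch.randn(4096, 256), dim=-1)
print(chunked_clip_loss(img, txt).item())
```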

dcarpintero/pangolin-guard

Open, Lightweight Model for AI Safety.

Language: Jupyter Notebook - Size: 11.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

MasterSkepticista/gpt2

Training GPT-2 on FineWeb-Edu in JAX/Flax

Language: Python - Size: 104 KB - Last synced at: 22 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

kklemon/FlashPerceiver

Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.

Language: Python - Size: 712 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 26 - Forks: 3
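
For context: the Perceiver attends from a small set of learned latents to a long input, so the score matrix is (latents x inputs) rather than quadratic in the input, and a FlashAttention backend avoids materializing even that. A minimal sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

B, H, D = 2, 8, 64
n_latents, n_inputs = 64, 4096       # few queries, many keys/values

q = torch.randn(B, H, n_latents, D)  # learned latents provide the queries
k = torch.randn(B, H, n_inputs, D)   # the long input provides keys/values
v = torch.randn(B, H, n_inputs, D)

# Cross-attention: cost scales with n_latents * n_inputs, not n_inputs^2.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 64, 64])
```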

Luis355/qw

qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.

Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Bruce-Lee-LY/flash_attention_inference

Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.

Language: C++ - Size: 1.99 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 35 - Forks: 4

etasnadi/VulkanCooperativeMatrixAttention

Vulkan & GLSL implementation of FlashAttention-2

Language: C++ - Size: 39.1 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

kreasof-ai/Homunculus-Project

Long-term project about a custom AI architecture. Consists of cutting-edge machine-learning techniques such as Flash-Attention, Group-Query-Attention, ZeRO-Infinity, BitNet, etc.

Language: Python - Size: 4.63 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 5 - Forks: 0

LukasDrews97/DumbleLLM

Decoder-only LLM trained on the Harry Potter books.

Language: Python - Size: 235 KB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/vllm-hpc-installer

🚀 Automated installation script for vLLM on HPC systems with ROCm support, optimized for AMD MI300X GPUs.

Language: Shell - Size: 6.84 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/amd-mi300-ml-stack

Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations

Language: Shell - Size: 5.86 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

gietema/attention

Toy Flash Attention implementation in torch

Language: Python - Size: 21.5 KB - Last synced at: 7 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

alexzhang13/flashattention2-custom-mask

Triton implementation of FlashAttention2 that adds Custom Masks.

Language: Python - Size: 2.27 MB - Last synced at: 10 months ago - Pushed at: 11 months ago - Stars: 46 - Forks: 5
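
For context on why custom masks are a feature: stock FlashAttention kernels typically expose only causal masking, whereas arbitrary patterns must be applied to the scores elementwise. In plain PyTorch (a baseline sketch, not this repo's Triton kernel) a custom mask looks like:

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 128, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# A custom boolean mask: True = attend. Here, a sliding window over the
# 16 most recent tokens (a pattern a causal-only kernel can't express).
idx = torch.arange(T)
window = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < 16)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=window)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```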

graphcore-research/flash-attention-ipu

Poplar implementation of FlashAttention for IPU

Language: C++ - Size: 575 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

RulinShao/FastCkpt

Python package for rematerialization-aware gradient checkpointing

Language: Python - Size: 15.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 3
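
For readers unfamiliar with the mechanism: gradient checkpointing drops intermediate activations in the forward pass and rematerializes them during backward, trading compute for memory. A minimal stock-PyTorch illustration (not this package's API):

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
    for _ in range(8)
)

def forward(x):
    for layer in layers:
        # Activations inside `layer` are not kept; they are recomputed
        # from the saved input during backward.
        x = checkpoint(layer, x, use_reentrant=False)
    return x

x = torch.randn(32, 512, requires_grad=True)
forward(x).sum().backward()
print(x.grad.shape)  # torch.Size([32, 512])
```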

Naman-ntc/FastCode

Utilities for efficient fine-tuning, inference and evaluation of code generation models

Language: Python - Size: 75.2 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2