Topic: "flash-attention"
QwenLM/Qwen
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Language: Python - Size: 35.3 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 17,975 - Forks: 1,480

ymcui/Chinese-LLaMA-Alpaca-2
Phase 2 of the Chinese LLaMA-2 & Alpaca-2 project, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Language: Python - Size: 8.15 MB - Last synced at: 4 days ago - Pushed at: 7 months ago - Stars: 7,160 - Forks: 570

InternLM/InternLM
Official release of the InternLM2 7B and 20B base and chat models, with 200K context support.
Language: Python - Size: 4.52 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 5,395 - Forks: 383

xlite-dev/Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism etc.
Language: Python - Size: 115 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3,900 - Forks: 275

xlite-dev/LeetCUDA
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑, 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc.🔥
Language: Cuda - Size: 262 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3,627 - Forks: 393

flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
Language: Cuda - Size: 3.49 MB - Last synced at: 10 days ago - Pushed at: 12 days ago - Stars: 2,716 - Forks: 286

MoonshotAI/MoBA
MoBA: Mixture of Block Attention for Long-Context LLMs
Language: Python - Size: 2.4 MB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 1,732 - Forks: 103

InternLM/InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Language: Python - Size: 6.77 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 382 - Forks: 64

xlite-dev/ffpa-attn-mma
📚FFPA (Split-D): yet another faster flash prefill attention with O(1) SRAM complexity for large head dims (D > 256), ~2x↑🎉 vs SDPA EA.
Language: Cuda - Size: 4.21 MB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 161 - Forks: 7

CoinCheung/gdGPT
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Language: Python - Size: 1.1 MB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 95 - Forks: 8

DAMO-NLP-SG/Inf-CLIP
💣💣 The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
Language: Python - Size: 3.76 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 47 - Forks: 2

alexzhang13/flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
Language: Python - Size: 2.27 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 46 - Forks: 5
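As a rough illustration of what "custom mask" means here, the reference semantics can be written with PyTorch's built-in scaled_dot_product_attention, which accepts an arbitrary boolean attn_mask; the shapes and the banded mask below are assumptions for the sketch, not this repo's Triton API.

```python
# Reference semantics for attention with a custom mask (plain PyTorch SDPA,
# not the Triton kernel from this repo). Shapes and mask are illustrative.
import torch
import torch.nn.functional as F

B, H, L, D = 2, 8, 512, 64                          # batch, heads, seq len, head dim
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))

# Example custom mask: a banded (local) window of radius 128 around each query.
idx = torch.arange(L)
keep = (idx[None, :] - idx[:, None]).abs() <= 128   # (L, L), True = may attend
out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)  # broadcasts over (B, H)
```

With an arbitrary mask like this, stock SDPA typically cannot use its fused flash backend and falls back to a slower, more memory-hungry path; closing that gap is what a masked FlashAttention-2 Triton kernel is for.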

Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Language: C++ - Size: 884 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 35 - Forks: 2
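For context on the decode-stage pattern such kernels target (a single new query token attending to a long cached KV), here is a naive grouped-query attention (GQA) step in plain PyTorch; the shapes and group count are assumptions, and a fused library kernel computes the same result far more efficiently.

```python
# Naive grouped-query attention for one decode step (reference only; shapes assumed).
import torch
import torch.nn.functional as F

B, Hq, Hkv, D, T = 1, 32, 8, 128, 4096        # batch, query heads, KV heads, head dim, cached length
q = torch.randn(B, Hq, 1, D)                  # one new token per sequence
k_cache = torch.randn(B, Hkv, T, D)
v_cache = torch.randn(B, Hkv, T, D)

# Each group of Hq // Hkv query heads shares one KV head (GQA); MQA is the Hkv == 1 case.
k = k_cache.repeat_interleave(Hq // Hkv, dim=1)   # (B, Hq, T, D)
v = v_cache.repeat_interleave(Hq // Hkv, dim=1)

scores = (q @ k.transpose(-2, -1)) / D ** 0.5     # (B, Hq, 1, T)
out = F.softmax(scores, dim=-1) @ v               # (B, Hq, 1, D)
```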

Bruce-Lee-LY/flash_attention_inference
Performance evaluation of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Language: C++ - Size: 1.99 MB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 35 - Forks: 4

kklemon/FlashPerceiver
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
Language: Python - Size: 712 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 26 - Forks: 2
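The Perceiver's core idea is cross-attention from a small set of learned latents to an arbitrarily long input, keeping attention cost linear in the input length; below is a minimal PyTorch sketch of that block (dimensions, module layout, and names are assumptions, not this package's API), with SDPA standing in for the FlashAttention kernel.

```python
# Minimal Perceiver-style latent cross-attention (illustrative; not the FlashPerceiver API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCrossAttention(nn.Module):
    def __init__(self, num_latents=64, dim=256, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                      # x: (B, N, dim), N may be huge
        B, N, dim = x.shape
        h, d = self.heads, dim // self.heads
        q = self.to_q(self.latents).expand(B, -1, -1)          # (B, num_latents, dim)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        split = lambda t: t.reshape(B, -1, h, d).transpose(1, 2)
        # SDPA can dispatch to a fused FlashAttention kernel on supported GPUs.
        o = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.proj(o.transpose(1, 2).reshape(B, -1, dim))  # (B, num_latents, dim)

x = torch.randn(2, 10_000, 256)                 # long input sequence
print(LatentCrossAttention()(x).shape)          # torch.Size([2, 64, 256])
```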

erfanzar/jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Language: Python - Size: 6.19 MB - Last synced at: 21 days ago - Pushed at: 2 months ago - Stars: 23 - Forks: 0
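A fused Flash Attention 2 kernel produces the same output as a naive attention that materializes the full (seq × seq) score matrix, just without ever storing it; a hedged JAX reference like the one below (shapes assumed, not this repo's API) is handy for correctness checks against any fused backend.

```python
# Naive reference attention in JAX, useful for verifying a fused kernel's output.
import jax
import jax.numpy as jnp

def reference_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim). Materializes the full score matrix."""
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bhqk,bhkd->bhqd", weights, v)

kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (2, 8, 512, 64))
k = jax.random.normal(kk, (2, 8, 512, 64))
v = jax.random.normal(kv, (2, 8, 512, 64))
out = reference_attention(q, k, v)    # (2, 8, 512, 64)
```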

kyegomez/FlashMHA
A simple PyTorch implementation of flash multi-head attention.
Language: Jupyter Notebook - Size: 85 KB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 21 - Forks: 2

RulinShao/FastCkpt
Python package for rematerialization-aware gradient checkpointing
Language: Python - Size: 15.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 3
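For orientation, gradient checkpointing (rematerialization) discards intermediate activations in the forward pass and recomputes them during backward, trading compute for memory; the minimal sketch below uses PyTorch's standard torch.utils.checkpoint utility, not this package's own interface.

```python
# Plain activation checkpointing with PyTorch's built-in utility (illustrative only).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```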

Naman-ntc/FastCode
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Language: Python - Size: 75.2 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

kreasof-ai/Homunculus-Project
Long-term project on a custom AI architecture, consisting of cutting-edge machine-learning techniques such as Flash-Attention, Group-Query-Attention, ZeRO-Infinity, BitNet, etc.
Language: Python - Size: 4.63 MB - Last synced at: 29 days ago - Pushed at: 7 months ago - Stars: 5 - Forks: 0

Delxrius/MiniMax-01
MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.
Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 0

graphcore-research/flash-attention-ipu
Poplar implementation of FlashAttention for IPU
Language: C++ - Size: 575 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

etasnadi/VulkanCooperativeMatrixAttention
Vulkan & GLSL implementation of FlashAttention-2
Language: C++ - Size: 39.1 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

MasterSkepticista/gpt2
Training GPT-2 on FineWeb-Edu in JAX/Flax
Language: Python - Size: 104 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

gietema/attention
Toy Flash Attention implementation in torch
Language: Python - Size: 21.5 KB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0
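The core of FlashAttention, which a toy implementation like this typically reproduces, is streaming over key/value tiles while maintaining a running row maximum and softmax normalizer, so the full (seq × seq) attention matrix is never stored; a minimal single-head PyTorch sketch (tile size and shapes are assumptions) follows.

```python
# Single-head tiled attention with an online softmax (the FlashAttention idea,
# written naively in PyTorch for clarity; tile size and shapes are assumptions).
import torch

def flash_attention_tiled(q, k, v, tile=128):
    """q, k, v: (seq, dim). Processes K/V in tiles; never forms the (seq, seq) matrix."""
    seq, dim = q.shape
    scale = dim ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((seq, 1), float("-inf"))     # running row-wise max of the scores
    l = torch.zeros(seq, 1)                     # running softmax normalizer

    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = (q @ k_t.T) * scale                 # scores for this tile only, (seq, tile)
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                # unnormalized probabilities for the tile
        alpha = torch.exp(m - m_new)            # rescales contributions of earlier tiles
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ v_t
        m = m_new
    return out / l

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(flash_attention_tiled(q, k, v), ref, atol=1e-5))  # expect True
```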

8e8bdba457c18cf692a95fe2ec67000b/VulkanCooperativeMatrixAttention
Vulkan & GLSL implementation of FlashAttention-2
Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

dcarpintero/pangolin-guard
Open, Lightweight Model for AI Safety.
Language: Jupyter Notebook - Size: 11.2 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Luis355/qw
qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.
Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

LukasDrews97/DumbleLLM
Decoder-only LLM trained on the Harry Potter books.
Language: Python - Size: 235 KB - Last synced at: 29 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/vllm-hpc-installer
🚀 Automated installation script for vLLM on HPC systems with ROCm support, optimized for AMD MI300X GPUs.
Language: Shell - Size: 6.84 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/amd-mi300-ml-stack
Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations
Language: Shell - Size: 5.86 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
