Topic: "flash-attention"
QwenLM/Qwen
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Language: Python - Size: 35.3 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 17,975 - Forks: 1,480

ymcui/Chinese-LLaMA-Alpaca-2
Phase 2 of the Chinese LLaMA-2 & Alpaca-2 project, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Language: Python - Size: 8.15 MB - Last synced at: 4 days ago - Pushed at: 7 months ago - Stars: 7,160 - Forks: 570

InternLM/InternLM
Official release of the InternLM2 7B and 20B base and chat models, with 200K context support.
Language: Python - Size: 4.52 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 5,395 - Forks: 383

xlite-dev/Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism etc.
Language: Python - Size: 115 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3,900 - Forks: 275

xlite-dev/LeetCUDA
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑, 200+ CUDA/Tensor Core kernels, HGEMM, FA-2 MMA, etc.🔥
Language: Cuda - Size: 262 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3,627 - Forks: 393

flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
Language: Cuda - Size: 3.49 MB - Last synced at: 10 days ago - Pushed at: 12 days ago - Stars: 2,716 - Forks: 286

MoonshotAI/MoBA
MoBA: Mixture of Block Attention for Long-Context LLMs
Language: Python - Size: 2.4 MB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 1,732 - Forks: 103

InternLM/InternEvo
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Language: Python - Size: 6.77 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 382 - Forks: 64

xlite-dev/ffpa-attn-mma
📚FFPA (Split-D): yet another faster flash prefill attention with O(1) SRAM complexity for large head dims (D > 256), ~2x↑🎉 vs SDPA EA.
Language: Cuda - Size: 4.21 MB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 161 - Forks: 7

CoinCheung/gdGPT
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Language: Python - Size: 1.1 MB - Last synced at: 4 days ago - Pushed at: about 1 year ago - Stars: 95 - Forks: 8

DAMO-NLP-SG/Inf-CLIP
💣💣 The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
Language: Python - Size: 3.76 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 47 - Forks: 2

alexzhang13/flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
Language: Python - Size: 2.27 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 46 - Forks: 5
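As a rough illustration of what "custom mask" means here, the reference semantics can be written with PyTorch's built-in scaled_dot_product_attention, which accepts an arbitrary boolean attn_mask; the shapes and the banded mask below are assumptions for the sketch, not this repo's Triton API.

```python
# Reference semantics for attention with a custom mask (plain PyTorch SDPA,
# not the Triton kernel from this repo). Shapes and mask are illustrative.
import torch
import torch.nn.functional as F

B, H, L, D = 2, 8, 512, 64                          # batch, heads, seq len, head dim
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))

# Example custom mask: a banded (local) window of radius 128 around each query.
idx = torch.arange(L)
keep = (idx[None, :] - idx[:, None]).abs() <= 128   # (L, L), True = may attend
out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)  # broadcasts over (B, H)
```

With an arbitrary mask like this, stock SDPA typically cannot use its fused flash backend and falls back to a slower, more memory-hungry path; closing that gap is what a masked FlashAttention-2 Triton kernel is for.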

Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Language: C++ - Size: 884 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 35 - Forks: 2
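For context on the decode-stage pattern such kernels target (a single new query token attending to a long cached KV), here is a naive grouped-query attention (GQA) step in plain PyTorch; the shapes and group count are assumptions, and a fused library kernel computes the same result far more efficiently.

```python
# Naive grouped-query attention for one decode step (reference only; shapes assumed).
import torch
import torch.nn.functional as F

B, Hq, Hkv, D, T = 1, 32, 8, 128, 4096        # batch, query heads, KV heads, head dim, cached length
q = torch.randn(B, Hq, 1, D)                  # one new token per sequence
k_cache = torch.randn(B, Hkv, T, D)
v_cache = torch.randn(B, Hkv, T, D)

# Each group of Hq // Hkv query heads shares one KV head (GQA); MQA is the Hkv == 1 case.
k = k_cache.repeat_interleave(Hq // Hkv, dim=1)   # (B, Hq, T, D)
v = v_cache.repeat_interleave(Hq // Hkv, dim=1)

scores = (q @ k.transpose(-2, -1)) / D ** 0.5     # (B, Hq, 1, T)
out = F.softmax(scores, dim=-1) @ v               # (B, Hq, 1, D)
```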

Bruce-Lee-LY/flash_attention_inference
Performance evaluation of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Language: C++ - Size: 1.99 MB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 35 - Forks: 4

kklemon/FlashPerceiver
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
Language: Python - Size: 712 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 26 - Forks: 2
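The Perceiver's core idea is cross-attention from a small set of learned latents to an arbitrarily long input, keeping attention cost linear in the input length; below is a minimal PyTorch sketch of that block (dimensions, module layout, and names are assumptions, not this package's API), with SDPA standing in for the FlashAttention kernel.

```python
# Minimal Perceiver-style latent cross-attention (illustrative; not the FlashPerceiver API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCrossAttention(nn.Module):
    def __init__(self, num_latents=64, dim=256, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                      # x: (B, N, dim), N may be huge
        B, N, dim = x.shape
        h, d = self.heads, dim // self.heads
        q = self.to_q(self.latents).expand(B, -1, -1)          # (B, num_latents, dim)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        split = lambda t: t.reshape(B, -1, h, d).transpose(1, 2)
        # SDPA can dispatch to a fused FlashAttention kernel on supported GPUs.
        o = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.proj(o.transpose(1, 2).reshape(B, -1, dim))  # (B, num_latents, dim)

x = torch.randn(2, 10_000, 256)                 # long input sequence
print(LatentCrossAttention()(x).shape)          # torch.Size([2, 64, 256])
```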

erfanzar/jax-flash-attn2
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Language: Python - Size: 6.19 MB - Last synced at: 21 days ago - Pushed at: 2 months ago - Stars: 23 - Forks: 0
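A fused Flash Attention 2 kernel produces the same output as a naive attention that materializes the full (seq × seq) score matrix, just without ever storing it; a hedged JAX reference like the one below (shapes assumed, not this repo's API) is handy for correctness checks against any fused backend.

```python
# Naive reference attention in JAX, useful for verifying a fused kernel's output.
import jax
import jax.numpy as jnp

def reference_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim). Materializes the full score matrix."""
    scale = q.shape[-1] ** -0.5
    scores = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bhqk,bhkd->bhqd", weights, v)

kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (2, 8, 512, 64))
k = jax.random.normal(kk, (2, 8, 512, 64))
v = jax.random.normal(kv, (2, 8, 512, 64))
out = reference_attention(q, k, v)    # (2, 8, 512, 64)
```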

kyegomez/FlashMHA
A simple PyTorch implementation of flash multi-head attention.
Language: Jupyter Notebook - Size: 85 KB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 21 - Forks: 2

RulinShao/FastCkpt
Python package for rematerialization-aware gradient checkpointing
Language: Python - Size: 15.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 3
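For orientation, gradient checkpointing (rematerialization) discards intermediate activations in the forward pass and recomputes them during backward, trading compute for memory; the minimal sketch below uses PyTorch's standard torch.utils.checkpoint utility, not this package's own interface.

```python
# Plain activation checkpointing with PyTorch's built-in utility (illustrative only).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```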

Naman-ntc/FastCode
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Language: Python - Size: 75.2 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

kreasof-ai/Homunculus-Project
Long-term project on a custom AI architecture, consisting of cutting-edge machine-learning techniques such as Flash-Attention, Group-Query-Attention, ZeRO-Infinity, BitNet, etc.
Language: Python - Size: 4.63 MB - Last synced at: 29 days ago - Pushed at: 7 months ago - Stars: 5 - Forks: 0

Delxrius/MiniMax-01
MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.
Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 0

graphcore-research/flash-attention-ipu
Poplar implementation of FlashAttention for IPU
Language: C++ - Size: 575 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

etasnadi/VulkanCooperativeMatrixAttention
Vulkan & GLSL implementation of FlashAttention-2
Language: C++ - Size: 39.1 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

MasterSkepticista/gpt2
Training GPT-2 on FineWeb-Edu in JAX/Flax
Language: Python - Size: 104 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

gietema/attention
Toy Flash Attention implementation in torch
Language: Python - Size: 21.5 KB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0
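The core of FlashAttention, which a toy implementation like this typically reproduces, is streaming over key/value tiles while maintaining a running row maximum and softmax normalizer, so the full (seq × seq) attention matrix is never stored; a minimal single-head PyTorch sketch (tile size and shapes are assumptions) follows.

```python
# Single-head tiled attention with an online softmax (the FlashAttention idea,
# written naively in PyTorch for clarity; tile size and shapes are assumptions).
import torch

def flash_attention_tiled(q, k, v, tile=128):
    """q, k, v: (seq, dim). Processes K/V in tiles; never forms the (seq, seq) matrix."""
    seq, dim = q.shape
    scale = dim ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((seq, 1), float("-inf"))     # running row-wise max of the scores
    l = torch.zeros(seq, 1)                     # running softmax normalizer

    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = (q @ k_t.T) * scale                 # scores for this tile only, (seq, tile)
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                # unnormalized probabilities for the tile
        alpha = torch.exp(m - m_new)            # rescales contributions of earlier tiles
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ v_t
        m = m_new
    return out / l

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
print(torch.allclose(flash_attention_tiled(q, k, v), ref, atol=1e-5))  # expect True
```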

8e8bdba457c18cf692a95fe2ec67000b/VulkanCooperativeMatrixAttention
Vulkan & GLSL implementation of FlashAttention-2
Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

dcarpintero/pangolin-guard
Open, Lightweight Model for AI Safety.
Language: Jupyter Notebook - Size: 11.2 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Luis355/qw
qw is a lightweight text editor designed for quick and efficient editing tasks. It offers a simple yet powerful interface for users to easily manipulate text files.
Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

LukasDrews97/DumbleLLM
Decoder-only LLM trained on the Harry Potter books.
Language: Python - Size: 235 KB - Last synced at: 29 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/vllm-hpc-installer
🚀 Automated installation script for vLLM on HPC systems with ROCm support, optimized for AMD MI300X GPUs.
Language: Shell - Size: 6.84 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

AI-DarwinLabs/amd-mi300-ml-stack
Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations
Language: Shell - Size: 5.86 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0
