GitHub topics: decoding-attention
Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference; a minimal sketch of the decoding-stage computation follows below.
Language: C++ - Size: 867 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 37 - Forks: 4
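For context, the decoding stage computes attention for a single new query token against the cached keys and values of all previous tokens, rather than for a full sequence. Below is a minimal, CPU-side C++ sketch of that single-query (MHA-style) computation for one head; it is an illustrative reference, not the repository's CUDA implementation, and all names, signatures, and layouts here are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Single-query attention for one head during decoding:
//   out = softmax(q * K^T / sqrt(d)) * V
// where k_cache/v_cache hold the KV cache for seq_len previous tokens
// and q is the new token's query vector. Hypothetical reference code,
// not the repository's API.
void decode_attention_head(const float* q,        // [head_dim]
                           const float* k_cache,  // [seq_len][head_dim]
                           const float* v_cache,  // [seq_len][head_dim]
                           float* out,            // [head_dim]
                           int seq_len, int head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));

    // 1. Scaled dot-product score of q against every cached key.
    std::vector<float> scores(seq_len);
    float max_score = -INFINITY;
    for (int t = 0; t < seq_len; ++t) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d) {
            dot += q[d] * k_cache[t * head_dim + d];
        }
        scores[t] = dot * scale;
        if (scores[t] > max_score) max_score = scores[t];
    }

    // 2. Numerically stable softmax over the sequence dimension.
    float denom = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        scores[t] = std::exp(scores[t] - max_score);
        denom += scores[t];
    }

    // 3. Weighted sum of cached values.
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        const float w = scores[t] / denom;
        for (int d = 0; d < head_dim; ++d) {
            out[d] += w * v_cache[t * head_dim + d];
        }
    }
}
```

The attention variants the repository names differ mainly in how the KV cache is shared: MQA uses a single KV head for all query heads, GQA shares one KV head per group of query heads, and MLA stores a compressed latent representation of the keys and values. Because decode-stage attention is memory-bound (one query against a long cache), these sharing schemes and the kernel's memory-access pattern dominate performance, which is why a CUDA-core implementation specialized for this stage is worthwhile.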
