An open API service providing repository metadata for many open source software ecosystems.

GitHub / Bruce-Lee-LY / decoding_attention

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.

JSON API: http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bruce-Lee-LY%2Fdecoding_attention
PURL: pkg:github/Bruce-Lee-LY/decoding_attention

Stars: 37
Forks: 4
Open issues: 0

License: BSD-3-Clause
Language: C++
Size: 867 KB
Dependencies parsed at: Pending

Created at: 12 months ago
Updated at: about 1 month ago
Pushed at: about 1 month ago
Last synced at: about 1 month ago

Topics: cuda, cuda-core, decoding-attention, flash-attention, flashinfer, flashmla, gpu, gqa, inference, large-language-model, llm, mha, mla, mqa, multi-head-attention, nvidia