Topic: "wordpiece"
NLPOptimize/flash-tokenizer
Efficient and optimized tokenizer engine for LLM inference serving.
Language: C++ - Size: 197 MB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 456 - Forks: 5

georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
Language: C# - Size: 19.2 MB - Last synced at: 4 days ago - Pushed at: 24 days ago - Stars: 49 - Forks: 11
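The WordPiece scheme these libraries implement segments each word greedily, repeatedly taking the longest vocabulary entry that matches from the current position (continuation pieces carry a `##` prefix). A minimal sketch, with a toy vocabulary assumed for illustration:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece segmentation of one word."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Shrink the candidate substring until it is in the vocabulary.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = prefix + sub  # continuation pieces are ##-prefixed
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary (illustrative, not from any real BERT model).
vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Real implementations such as FastBertTokenizer add Unicode normalization, whitespace/punctuation pre-splitting, and a maximum word length before this per-word loop.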

stephantul/piecelearn
Learning BPE embeddings by first learning a segmentation model and then training word2vec
Language: Python - Size: 22.5 KB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 19 - Forks: 1

NLPOptimize/awesome-tokenizers
A curated list of tokenizer libraries for blazing-fast NLP processing.
Size: 18.6 KB - Last synced at: 25 days ago - Pushed at: 4 months ago - Stars: 5 - Forks: 0

danieldk/wordpieces
Split tokens into word pieces
Language: Rust - Size: 33.2 KB - Last synced at: 8 days ago - Pushed at: almost 3 years ago - Stars: 5 - Forks: 0

SeonbeomKim/Python-Byte_Pair_Encoding
Byte Pair Encoding (BPE)
Language: Python - Size: 51.2 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 5 - Forks: 4
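BPE, as implemented by repos like this one, learns its vocabulary by repeatedly counting adjacent symbol pairs across the corpus and merging the most frequent pair into a new symbol. A minimal sketch over a toy word-frequency corpus (the corpus and merge count here are illustrative assumptions):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Rewrite every word, replacing the pair with its merged symbol."""
    a, b = pair
    out = {}
    for word, freq in corpus.items():
        syms, i, new = word.split(), 0, []
        while i < len(syms):
            if i + 1 < len(syms) and (syms[i], syms[i + 1]) == (a, b):
                new.append(a + b)
                i += 2
            else:
                new.append(syms[i])
                i += 1
        out[" ".join(new)] = freq
    return out

# Toy corpus: words as space-separated characters, mapped to counts.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)
```

On this corpus the first two learned merges are `('e', 's')` and `('es', 't')`, since "est" occurs 9 times; at tokenization time the learned merges are replayed in order on new words.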

Lizhecheng02/Kaggle-LLM-Detect_AI_Generated_Text
Detect whether text is AI-generated, either by training a new tokenizer and combining it with tree-based classification models, or by training language models on a large dataset of human- and AI-generated texts.
Language: Jupyter Notebook - Size: 394 KB - Last synced at: 7 months ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

SeanLee97/BertWordPieceTokenizer.jl
WordPiece Tokenizer for BERT models.
Language: Julia - Size: 23.4 KB - Last synced at: 24 days ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

Daniel-Heo/NemoTokenizer
Fast WordPiece and SentencePiece tokenizer using a trie, OpenMP, SIMD, and a memory pool.
Language: C++ - Size: 231 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0
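The trie approach used by tokenizers like this one replaces the repeated substring-shrinking of naive longest-match with a single forward walk: follow the input character by character through the trie and remember the last node that ended a vocabulary piece. A minimal sketch (class and function names are illustrative, not NemoTokenizer's API):

```python
class TrieNode:
    """One node of a character trie over the vocabulary."""
    __slots__ = ("children", "is_end")

    def __init__(self):
        self.children = {}
        self.is_end = False

def build_trie(vocab):
    """Insert every vocabulary piece, one character per trie edge."""
    root = TrieNode()
    for piece in vocab:
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

def longest_match(root, text, start):
    """Return the end index of the longest piece starting at `start`, or -1."""
    node, best = root, -1
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break  # no piece extends this far
        if node.is_end:
            best = i + 1  # remember the last completed piece
    return best
```

Each match costs one pass over at most the matched prefix, instead of re-hashing every shrinking substring; the C++ version layers OpenMP, SIMD, and pooled allocation on top of the same idea.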

Hank-Kuo/go-bert-tokenizer
go-bert-tokenizer
Language: Go - Size: 119 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

burcgokden/BERT-Subword-Tokenizer-Wrapper
A framework for generating a subword vocabulary from a TensorFlow dataset and building custom BERT tokenizer models.
Language: Python - Size: 12.7 KB - Last synced at: 5 months ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 0

vassef/Implementing-BPE-and-WordPiece-Tokenizers
Language: Jupyter Notebook - Size: 1.32 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0
