GitHub topics: bytepairencoding
10-OASIS-01/BPEtokenizer
This project implements a tokenizer based on the Byte Pair Encoding (BPE) algorithm, with additional custom tokenizers, including one similar to the GPT-4 tokenizer.
Language: Python - Size: 5.21 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 4 - Forks: 0

Hords01/Data_Mining
TF-IDF Calculation
Language: Python - Size: 37 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

gxstxxv/BPE
Byte Pair Encoding (BPE)
Language: Python - Size: 7.38 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

ReshiAdavan/Thoth
Tokenizer for large-scale language models (GPT, Claude, Llama, etc.)
Language: Python - Size: 2.96 MB - Last synced at: 6 days ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

willxxy/superbpe
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
Language: Rust - Size: 530 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

shivendrra/tokenizers
Self-made byte-pair-encoding tokenizer
Language: Python - Size: 3.36 MB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

art-test-stack/tokenizer
A web app to compare pre-built or self-built tokenizers
Language: Python - Size: 367 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

dbtreasure/zig-bpe
Byte Pair Encoding (BPE) in the Zig programming language (0.13.0)
Language: Zig - Size: 1.84 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

madhu102938/BPE-CBOW
Implementation of the BPE algorithm, with training on the generated tokens
Language: Python - Size: 7.69 MB - Last synced at: 4 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

AnimaVR/TOKENIZER-BytePairEncoderDecoder-ModelTrainer-CSharp
A working C# Byte Pair Encoder. Use the bin folder or add your own data to train your own model; the model is then used to encode data into train.bin and val.bin binary files for training an LLM or similar.
Language: C# - Size: 4.61 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

deepanprabhu/fastbpe
Java library implementing Byte-Pair Encoding Tokenization
Language: Java - Size: 236 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

mohsenfayyaz/nlp-course-ut
Natural Language Processing course assignments @ University of Tehran
Language: Jupyter Notebook - Size: 56.3 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

JunhoKim94/Transformer
This repository is a reimplementation of the Transformer model introduced in the 2017 NeurIPS paper "Attention Is All You Need"
Language: Python - Size: 144 KB - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

vatsalsaglani/BytePairEncoding
A Python package that builds a corpus vocabulary using byte pair encoding, plus a tokenizer that tokenizes input text based on the built vocabulary.
Language: Python - Size: 14.6 KB - Last synced at: over 2 years ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0
