GitHub topics: bpe-tokenizer
simonrueba/bpe-visualization
Interactive tool for exploring Byte Pair Encoding tokenization step-by-step.
Language: TypeScript - Size: 0 Bytes - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

SameerManan/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Language: Python - Size: 2.44 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 1

jaco-bro/tokenizer
BPE tokenizer for LLMs in Zig
Language: Zig - Size: 259 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

taabishhh/LLM_Preprocessing
This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.
Language: Scala - Size: 7.37 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

neluca/tinybpe
This is a fast, lightweight and clean code implementation of the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization and NLP tasks.
Language: C - Size: 6.11 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 0

gweidart/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Language: Python - Size: 2.47 MB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

willxxy/superbpe
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
Language: Rust - Size: 530 KB - Last synced at: 15 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

nickscha/bpe
C89, single header, nostdlib byte pair encoding algorithm
Language: C - Size: 35.2 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

xmarva/transformer-architectures
Teaching transformer-based architectures
Language: Jupyter Notebook - Size: 215 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 0

Hords01/Data_Mining
TF-IDF Calculation
Language: Python - Size: 37 MB - Last synced at: 30 days ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

heissanjay/bpe-go
Naive implementation of the byte pair encoding (BPE) tokenization algorithm in Go
Language: Go - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

estnafinema0/russian-jokes-generator
Transformer Models for Humorous Text Generation. Fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU. Plus a custom Byte-level BPE tokenizer.
Language: Jupyter Notebook - Size: 294 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

RahulDey12/tiktoken-php
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Language: PHP - Size: 95.7 KB - Last synced at: about 15 hours ago - Pushed at: 3 months ago - Stars: 7 - Forks: 0

jmaczan/bpe-tokenizer
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Language: Python - Size: 108 KB - Last synced at: 24 days ago - Pushed at: 11 months ago - Stars: 6 - Forks: 1

jmaczan/bpe.c
High performance Byte-Pair Encoding tokenizer for large language models
Language: C - Size: 18.6 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

hyouteki/lanat
processing de LANguage NATural
Language: Jupyter Notebook - Size: 332 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 2

shivendrra/tokenizers
Self-made byte-pair-encoding tokenizer
Language: Python - Size: 3.46 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0
