An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: bytepairencoding

10-OASIS-01/BPEtokenizer

This project implements a tokenizer based on the Byte Pair Encoding (BPE) algorithm, with additional custom tokenizers, including one similar to the GPT-4 tokenizer.

Language: Python - Size: 5.21 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 4 - Forks: 0

Hords01/Data_Mining

TF-IDF Calculation

Language: Python - Size: 37 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

gxstxxv/BPE

Byte Pair Encoding (BPE)

Language: Python - Size: 7.38 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

ReshiAdavan/Thoth

Tokenizer for large-scale language models (GPT, Claude, Llama, etc.)

Language: Python - Size: 2.96 MB - Last synced at: 6 days ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

willxxy/superbpe

Unofficial Rust implementation of "SuperBPE: Space Travel for Language Models"

Language: Rust - Size: 530 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

shivendrra/tokenizers

Self-made byte pair encoding tokenizer

Language: Python - Size: 3.36 MB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

art-test-stack/tokenizer

A web app to compare pre-built or self-built tokenizers

Language: Python - Size: 367 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

dbtreasure/zig-bpe

Byte Pair Encoding (BPE) in the Zig programming language (0.13.0)

Language: Zig - Size: 1.84 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

madhu102938/BPE-CBOW

Implementation of the BPE algorithm and training on the generated tokens

Language: Python - Size: 7.69 MB - Last synced at: 4 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

AnimaVR/TOKENIZER-BytePairEncoderDecoder-ModelTrainer-CSharp

A working C# byte pair encoder. Use the bin folder or add your own data to train your own model; the model is then used to encode data into train.bin and val.bin binary files for training an LLM or similar.

Language: C# - Size: 4.61 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

deepanprabhu/fastbpe

Java library implementing Byte-Pair Encoding Tokenization

Language: Java - Size: 236 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

mohsenfayyaz/nlp-course-ut

Natural Language Processing course assignments @ University of Tehran

Language: Jupyter Notebook - Size: 56.3 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

JunhoKim94/Transformer

This repository is a reimplementation of the Transformer model introduced in the 2017 NeurIPS paper "Attention Is All You Need"

Language: Python - Size: 144 KB - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

vatsalsaglani/BytePairEncoding

A Python package that builds a corpus vocabulary using the byte pair methodology, plus a tokenizer that tokenizes input text based on the built vocabulary.

Language: Python - Size: 14.6 KB - Last synced at: over 2 years ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0