An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: bpe-tokenizer

simonrueba/bpe-visualization

Interactive tool for exploring Byte Pair Encoding tokenization step-by-step.

Language: TypeScript - Size: 0 Bytes - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

SameerManan/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Language: Python - Size: 2.44 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 1

jaco-bro/tokenizer

BPE tokenizer for LLMs in Zig

Language: Zig - Size: 259 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

taabishhh/LLM_Preprocessing

This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.

Language: Scala - Size: 7.37 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

neluca/tinybpe

A fast, lightweight, and cleanly coded implementation of the Byte Pair Encoding (BPE) algorithm, commonly used in LLM tokenization and NLP tasks.

Language: C - Size: 6.11 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 0

gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Language: Python - Size: 2.47 MB - Last synced at: 15 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

willxxy/superbpe

Unofficial Rust implementation of "SuperBPE: Space Travel for Language Models"

Language: Rust - Size: 530 KB - Last synced at: 15 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

nickscha/bpe

C89, single header, nostdlib byte pair encoding algorithm

Language: C - Size: 35.2 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

xmarva/transformer-architectures

Teaching transformer-based architectures

Language: Jupyter Notebook - Size: 215 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 0

Hords01/Data_Mining

TF-IDF Calculation

Language: Python - Size: 37 MB - Last synced at: 30 days ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

heissanjay/bpe-go

Naive implementation of the byte pair encoding (BPE) tokenization algorithm in Go

Language: Go - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

estnafinema0/russian-jokes-generator

Transformer models for humorous text generation. Fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU. Plus a custom byte-level BPE tokenizer.

Language: Jupyter Notebook - Size: 294 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

RahulDey12/tiktoken-php

A PHP implementation of OpenAI's BPE tokenizer tiktoken.

Language: PHP - Size: 95.7 KB - Last synced at: about 15 hours ago - Pushed at: 3 months ago - Stars: 7 - Forks: 0

jmaczan/bpe-tokenizer

Byte-Pair Encoding tokenizer for training large language models on huge datasets

Language: Python - Size: 108 KB - Last synced at: 24 days ago - Pushed at: 11 months ago - Stars: 6 - Forks: 1

jmaczan/bpe.c

High performance Byte-Pair Encoding tokenizer for large language models

Language: C - Size: 18.6 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

hyouteki/lanat

processing de LANguage NATural

Language: Jupyter Notebook - Size: 332 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 2

shivendrra/tokenizers

Self-made byte-pair-encoding tokenizer

Language: Python - Size: 3.46 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0