An open API service providing repository metadata for many open source software ecosystems.

Topic: "byte-pair-encoding"

samber/go-gpt-3-encoder

Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3

Language: Go - Size: 558 KB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 79 - Forks: 21

aallam/ktoken

Kotlin multiplatform BPE tokenizer library for OpenAI models

Language: Kotlin - Size: 10.7 MB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 30 - Forks: 2

ankane/youtokentome-ruby 📦

High performance unsupervised text tokenization for Ruby

Language: Ruby - Size: 31.3 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 1

BobMcDear/minbpe-hs

Byte-level byte pair encoding (BPE) in Haskell

Language: Haskell - Size: 67.4 KB - Last synced at: 11 days ago - Pushed at: 11 months ago - Stars: 15 - Forks: 1

bnosac/tokenizers.bpe

R package for Byte Pair Encoding based on YouTokenToMe

Language: C++ - Size: 7.64 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 15 - Forks: 0

DolbyUUU/byte_pair_encoding_BPE_subword_tokenization_implementation_python

Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementations from scratch in Python

Language: Python - Size: 449 KB - Last synced at: 4 months ago - Pushed at: about 2 years ago - Stars: 12 - Forks: 0

theskyinflames/word2png

A tool that encrypts a sequence of words (or pieces of text) using the AES-256 algorithm and encodes the encrypted result into a PNG image by linking each byte value to a specific color. It also decodes such an image to recover the original sequence of words.

Language: Go - Size: 929 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 0

akhvorov/vgram

Feature extraction from sequential data

Language: C++ - Size: 545 KB - Last synced at: 6 days ago - Pushed at: almost 6 years ago - Stars: 7 - Forks: 0

jmaczan/bpe-tokenizer

Byte-Pair Encoding tokenizer for training large language models on huge datasets

Language: Python - Size: 108 KB - Last synced at: 20 days ago - Pushed at: 11 months ago - Stars: 6 - Forks: 1

AndreyKolomiets/News_Headline_Generation

News headline generation

Language: Python - Size: 161 KB - Last synced at: 16 days ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 1

SeonbeomKim/Python-Byte_Pair_Encoding

Byte Pair Encoding (BPE)

Language: Python - Size: 51.2 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 5 - Forks: 4

mdabir1203/BPE_Tokenizer_Visualizer

A visualizer to check how the BPE tokenizer in an LLM works

Language: JavaScript - Size: 204 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

zouharvi/tokenization-principle

Language: Python - Size: 291 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 1

Ascend-Research/AutoGO

Code repo for the paper "AutoGO: Automated Computation Graph Optimization for Neural Network Evolution", accepted to NeurIPS 2023.

Language: Python - Size: 42.5 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

iamlxb3/PMTC

Code for the publication of WWW'22

Language: Python - Size: 170 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 1

crodriguez1a/bpe-summarizer

Auto summarization from BPE tokenization

Language: Jupyter Notebook - Size: 1.69 MB - Last synced at: 1 day ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 1

gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Language: Python - Size: 2.47 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

zouharvi/formal-bpe

Language: Python - Size: 491 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

floriankark/transformer

Transformer implementation in PyTorch trained on NVIDIA A100 in fp16

Language: Python - Size: 2.33 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

AndreiMoraru123/Neural-Machine-Translation

Modern Eager TensorFlow implementation of Attention Is All You Need

Language: Python - Size: 1.02 MB - Last synced at: 3 days ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

IAmPara0x/fast-bpe

Fast BPE algorithm to generate byte pair encodings from a text corpus. It is written in Rust and is approximately 20x faster than its Python implementation.

Language: Rust - Size: 432 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

cosmaadrian/acumen-compressor

Order-agnostic lossless compressor using BPE and Huffman Coding.

Language: Python - Size: 24.4 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

capjamesg/bpe

Byte-pair encoding implementation in Python.

Language: Python - Size: 1.95 KB - Last synced at: 5 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

parsa-abbasi/intro-to-nlp

An Introduction to Natural Language Processing (NLP)

Language: Jupyter Notebook - Size: 397 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

baselinerepo/LLMs

Understanding Large Language Models

Language: CSS - Size: 29 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

SameerManan/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Language: Python - Size: 2.44 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 1

DVDAGames/pgn-tokenizer

A byte pair encoding (BPE) tokenizer for chess portable game notation (PGN)

Language: Python - Size: 1.5 MB - Last synced at: 16 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

mahdertesf/SentencePiece-and-Byte-Pair-Encoding-BPE-Implementation

This repository provides a hands-on exploration of SentencePiece tokenization and Byte-Pair Encoding (BPE). The code demonstrates data preprocessing steps like NFKC normalization and lossless tokenization, followed by a practical implementation of the BPE algorithm from scratch.

Language: Jupyter Notebook - Size: 615 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

JiauZhang/textok

Text Tokenizer in C++

Language: Python - Size: 14.6 KB - Last synced at: about 4 hours ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

LukasDrews97/DumbleLLM

Decoder-only LLM trained on the Harry Potter books.

Language: Python - Size: 235 KB - Last synced at: 16 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

jonasknobloch/mbpe

Morphologically biased byte-pair encoding pre-tokenization

Language: Rust - Size: 136 KB - Last synced at: 23 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

tizianocitro/tiztoken

A byte-level Byte Pair Encoding (BPE) algorithm for tokenization in Large Language Models (LLMs), similar to those used in GPT, Llama, and Mistral.

Language: Python - Size: 25.4 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

Maria-Antony/Seq2Seq-NMT

A project for a sequence-to-sequence NLP task. We developed a custom model in PyTorch to understand how the task works, and also fine-tuned pre-trained transformer models to improve translation performance.

Language: Jupyter Notebook - Size: 13 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

psycoplankton/GPT-Decoded

An implementation of the GPT (generative pretrained transformer) model from scratch, which produces Shakespearean text by training on dialogues written by Shakespeare, along with the GPT encoder.

Language: Jupyter Notebook - Size: 4.84 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

rashmishreev/Deep-Learning

This repository houses my assignments completed during the Deep Learning course as part of my Master's in Data Analytics program. Explore diverse projects showcasing hands-on applications of advanced neural networks and machine learning techniques.

Language: Jupyter Notebook - Size: 136 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

clabrugere/byte-pair-encoding

Byte pair encoding tokenizer as used in some large language models.

Language: Python - Size: 17.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

amankshihab/TENER-MALAYALAM

Named entity recognition in Malayalam using BiLSTM and TENER (Transformer Encoder)

Language: Jupyter Notebook - Size: 76.7 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0