GitHub topics: tokenizers
SameerManan/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Language: Python - Size: 2.44 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 1

VenkatRamaraju/polydb
A vector database and embedding model written from scratch in Go
Language: Go - Size: 20.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

jshuadvd/LongRoPE
Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"
Language: Python - Size: 562 KB - Last synced at: 5 days ago - Pushed at: 10 months ago - Stars: 136 - Forks: 14

sayakpaul/count-tokens-hf-datasets
This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.
Language: Python - Size: 19.5 KB - Last synced at: 11 days ago - Pushed at: over 2 years ago - Stars: 26 - Forks: 1

gweidart/rs-bpe
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Language: Python - Size: 2.47 MB - Last synced at: 19 days ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 0

xebia-functional/xef
Building applications with LLMs through composability, in Kotlin
Language: Kotlin - Size: 15.4 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 189 - Forks: 14

1kkiRen/Tokenizer-Changer
Python script for manipulating an existing tokenizer.
Language: Python - Size: 83 KB - Last synced at: 2 days ago - Pushed at: about 2 months ago - Stars: 18 - Forks: 1

sappho192/Tokenizers.DotNet
[Unofficial] Simple .NET wrapper of HuggingFace Tokenizers library
Language: C# - Size: 2.82 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 2

infinilabs/pizza-stemmers
🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, built for INFINI Pizza.
Language: Rust - Size: 875 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

mkashirin/cattode
Lil GPT and BPE built from scratch using PyTorch.
Language: Python - Size: 3.85 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Prismadic/magnet
the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly
Language: Python - Size: 11.8 MB - Last synced at: 4 days ago - Pushed at: 7 months ago - Stars: 31 - Forks: 3

wassemgtk/SuperTokenizer
A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.
Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

symanto-research/merge-tokenizers
Package to align tokens from different tokenizations.
Language: Python - Size: 347 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 8 - Forks: 0

arturom/search-analysis
A graphical user interface for the Elasticsearch Analyze API
Language: JavaScript - Size: 4.67 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 6 - Forks: 0

Jeronymous/deep_learning_notebooks
Self-contained notebooks for experimenting with particular concepts in Deep Learning
Language: Jupyter Notebook - Size: 17.1 MB - Last synced at: 25 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

DanielPFlorian/Transformers-Github-Semantic-Search
NLP Dataset Creation and Semantic Search Demonstration
Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

helena-intel/test-prompt-generator
Create prompts with a given token length for testing LLMs and other transformers text models.
Language: Python - Size: 311 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

LogosBible/HfTokenizers
C# wrapper for https://github.com/huggingface/tokenizers/tree/main/tokenizers
Language: C# - Size: 1.38 MB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Anush008/tokenizers
Multi-arch bindings for @huggingface/tokenizers.
Language: Rust - Size: 893 KB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 2

lepisma/tokenizers.el
Fast tokenizers for Emacs Lisp backed by Hugging Face’s Rust library
Language: Rust - Size: 23.4 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

MihranD/HuggingFace-Tokenizers
Hugging Face Transformers provides a powerful and flexible framework for working with state-of-the-art natural language processing models.
Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

cobanov/turkish-bpe-tokenizer
Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language
Language: Python - Size: 8.79 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Arunprakash-A/DL-Pytorch-Workshop
Develop DL models using PyTorch and Hugging Face
Size: 2.88 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 37 - Forks: 11

Hugging-Face-Supporter/tftokenizers
Use Hugging Face Transformers and Tokenizers as reusable TensorFlow SavedModels
Language: Python - Size: 263 KB - Last synced at: 7 days ago - Pushed at: about 3 years ago - Stars: 9 - Forks: 4

kojix2/blingfire-crystal
Language: Crystal - Size: 50.8 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

victoryosiobe/kingchop
Kingchop ⚔️ is a JavaScript library for tokenizing English text (chopping text). It uses an extensive set of tokenizing rules that you can adjust easily.
Language: JavaScript - Size: 85 KB - Last synced at: 4 days ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

Beomi/megatronlm_dataset_autotokenizer
Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.
Language: Python - Size: 498 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 1

megagonlabs/ginza-transformers
Use custom tokenizers in spacy-transformers
Language: Python - Size: 32.2 KB - Last synced at: 5 days ago - Pushed at: almost 3 years ago - Stars: 16 - Forks: 5

willsaliba/LDR_Transformer
ML model designed to learn the compositional structure of LEGO assemblies
Language: Python - Size: 617 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

OmkarBorhade98/Text_Summarization
Text Summarization using NLP
Language: Jupyter Notebook - Size: 116 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mickymultani/LLM-Architecture
Visualize some important concepts related to LLM architectures.
Language: Jupyter Notebook - Size: 9.74 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

unfoldingWord/string-punctuation-tokenizer
Small library that provides functions to tokenize a string into an array of words with or without punctuation
Language: JavaScript - Size: 2.14 MB - Last synced at: 3 days ago - Pushed at: almost 2 years ago - Stars: 8 - Forks: 1

s2458588/wsm-tokenizer
Bachelor Thesis Repository. Wsm-tokenizer (word shape mapping) uses vocabulary comparisons to find probable morphemes in lexemic tokens.
Language: Jupyter Notebook - Size: 2.41 MB - Last synced at: 7 months ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

jungsoh/transformers-question-answering
Fine-tuning pre-trained transformer models in TensorFlow and in PyTorch for question answering
Language: Jupyter Notebook - Size: 379 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0
