GitHub topics: tokenizers

Repositories

SameerManan/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Language: Python - Size: 2.44 MB - Last synced at: about 15 hours ago - Pushed at: about 17 hours ago - Stars: 0 - Forks: 1

Arunprakash-A/DL-Pytorch-Workshop

Develop DL models using PyTorch and Hugging Face

Size: 17.4 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 40 - Forks: 13

sappho192/Tokenizers.DotNet

[Unofficial] Simple .NET wrapper of HuggingFace Tokenizers library

Language: C# - Size: 6.51 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 13 - Forks: 5

jshuadvd/LongRoPE

Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"

Language: Python - Size: 562 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 149 - Forks: 14

xebia-functional/xef

Building applications with LLMs through composability, in Kotlin

Language: Kotlin - Size: 15.4 MB - Last synced at: 26 days ago - Pushed at: 11 months ago - Stars: 190 - Forks: 14

helena-intel/test-prompt-generator

Create prompts with a given token length for testing LLMs and other transformers text models.

Language: Python - Size: 606 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

symanto-research/merge-tokenizers

Package to align tokens from different tokenizations.

Language: Python - Size: 347 KB - Last synced at: 14 days ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 0

ImadSaddik/ElasticSearch_Python_Course

This repository is part of a course on Elasticsearch in Python. It includes notebooks that demonstrate its usage, along with a YouTube series to guide you through the material.

Language: Jupyter Notebook - Size: 26.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 59 - Forks: 47

gweidart/rs-bpe

A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust

Language: Python - Size: 2.47 MB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 12 - Forks: 1

1kkiRen/Tokenizer-Changer

Python script for manipulating the existing tokenizer.

Language: Python - Size: 85 KB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 19 - Forks: 1

sayakpaul/count-tokens-hf-datasets

This project shows how to derive the total number of training tokens from a large text dataset from 🤗 datasets with Apache Beam and Dataflow.

Language: Python - Size: 19.5 KB - Last synced at: 2 days ago - Pushed at: almost 3 years ago - Stars: 27 - Forks: 1

mkashirin/cattode

Lil GPT and BPE built from scratch using PyTorch.

Language: Python - Size: 3.85 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

victoryosiobe/kingchop

Kingchop ⚔️ is a JavaScript library for tokenizing (chopping) English text. It applies an extensive set of tokenization rules, which you can adjust easily.

Language: JavaScript - Size: 85.9 KB - Last synced at: 2 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

Anush008/tokenizers

Multi-arch bindings for @huggingface/tokenizers.

Language: Rust - Size: 893 KB - Last synced at: about 14 hours ago - Pushed at: almost 2 years ago - Stars: 8 - Forks: 4

VenkatRamaraju/polydb

A vector database and embedding model written from scratch in Go

Language: Go - Size: 20.1 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

infinilabs/pizza-stemmers

🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, built for INFINI Pizza.

Language: Rust - Size: 875 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Prismadic/magnet

The small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly

Language: Python - Size: 11.8 MB - Last synced at: 10 days ago - Pushed at: 11 months ago - Stars: 31 - Forks: 3

wassemgtk/SuperTokenizer

A high-performance tokenizer built to rival GPT-4, trained on the C4 dataset.

Language: Jupyter Notebook - Size: 0 Bytes - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

arturom/search-analysis

A graphical user interface for the Elasticsearch Analyze API

Language: JavaScript - Size: 4.67 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 6 - Forks: 0

Jeronymous/deep_learning_notebooks

Self-contained notebooks for experimenting with particular concepts in deep learning

Language: Jupyter Notebook - Size: 17.1 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

DanielPFlorian/Transformers-Github-Semantic-Search

NLP Dataset Creation and Semantic Search Demonstration

Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

LogosBible/HfTokenizers

C# wrapper for https://github.com/huggingface/tokenizers/tree/main/tokenizers

Language: C# - Size: 1.38 MB - Last synced at: 18 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

lepisma/tokenizers.el

Fast tokenizers for Emacs Lisp backed by Hugging Face's Rust library

Language: Rust - Size: 23.4 KB - Last synced at: 6 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

MihranD/HuggingFace-Tokenizers

Hugging Face Transformers provides a powerful and flexible framework for working with state-of-the-art natural language processing models.

Size: 1.95 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

cobanov/turkish-bpe-tokenizer

Byte Pair Encoding (BPE) tokenizer tailored for the Turkish language

Language: Python - Size: 8.79 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

Hugging-Face-Supporter/tftokenizers

Use Hugging Face Transformers and Tokenizers as reusable TensorFlow SavedModels

Language: Python - Size: 263 KB - Last synced at: 9 days ago - Pushed at: over 3 years ago - Stars: 9 - Forks: 4

kojix2/blingfire-crystal

Language: Crystal - Size: 50.8 KB - Last synced at: 5 months ago - Pushed at: 8 months ago - Stars: 3 - Forks: 0

Beomi/megatronlm_dataset_autotokenizer

Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.

Language: Python - Size: 498 KB - Last synced at: 4 months ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 1

megagonlabs/ginza-transformers

Use custom tokenizers in spacy-transformers

Language: Python - Size: 32.2 KB - Last synced at: 17 days ago - Pushed at: about 3 years ago - Stars: 16 - Forks: 5

willsaliba/LDR_Transformer

ML model designed to learn the compositional structure of LEGO assemblies

Language: Python - Size: 617 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

OmkarBorhade98/Text_Summarization

Text Summarization using NLP

Language: Jupyter Notebook - Size: 116 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mickymultani/LLM-Architecture

Visualize some important concepts related to LLM architectures.

Language: Jupyter Notebook - Size: 9.74 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

unfoldingWord/string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

Language: JavaScript - Size: 2.14 MB - Last synced at: 14 days ago - Pushed at: about 2 years ago - Stars: 8 - Forks: 1

s2458588/wsm-tokenizer

Bachelor Thesis Repository. Wsm-tokenizer (word shape mapping) uses vocabulary comparisons to find probable morphemes in lexemic tokens.

Language: Jupyter Notebook - Size: 2.41 MB - Last synced at: 11 months ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

jungsoh/transformers-question-answering

Fine-tuning pre-trained transformer models in TensorFlow and PyTorch for question answering

Language: Jupyter Notebook - Size: 379 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

Related Keywords

tokenizers (35), transformers (11), nlp (8), huggingface (8), natural-language-processing (6), llm (6), bpe (4), machine-learning (4), rust (4), openai (3), tokenizer (3), embeddings (3), pytorch (3), huggingface-transformers (3), tokenization (3), artificial-intelligence (3), tensorflow (2), javascript (2), pypi-package (2), gpt (2), deep-learning (2), llm-inference (2), bpe-tokenizer (2), ai (2), tokens (2), elasticsearch (2), byte-pair-encoding (2), semantic-search (2), byte-pair-tokenizer (2), tiktoken (2), python (2), filters (1), distributed-systems (1), react (1), text-analysis (1), artificial-neural-networks (1), automatic-speech-recognition (1), deep-neural-networks (1), speech-recognition (1), speech-to-text (1), fine-tuning (1), finetuning-llms (1), gemini (1), inference-api (1), langchain (1), llm-training (1), milvus (1), mistral (1), mlx (1), nats (1), nats-messaging (1), nats-streaming (1), sentence-splitting (1), tokenizer-framework (1), analyze-api (1), analyzers (1), question-answering (1), pytorch-api (1), gradient-tape (1), distilbert-model (1), babi-dataset (1), lexemes (1), segmentation (1), scripture-open-components (1), nlp-library (1), llm-architecture (1), attention-mechanism (1), transformer (1), sudachitra (1), spacy-transformers (1), spacy (1), ginza (1), megatron-lm (1), gpt-neox (1), crystal (1), tensorflow-hub (1), sentencepie (1), bert (1), turkish-tokenization (1), turkish (1), emacs-lisp (1), text-embedding (1), tokenizer-nlp (1), distributed-computing (1), knn-algorithm (1), hybrid-search (1), elastic (1), distance (1), benchmarking (1), scala (1), multiplatform (1), kotlin (1), functional-programming (1), chatgpt-api (1), agents (1), transformer-architecture (1), natural-language-understanding (1), natural-language-procressing (1), natural-language-inference (1), natural (1)