tokenization | Topic | Ecosyste.ms: Repos

explosion/spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Language: Python - Size: 194 MB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 32,997 - Forks: 4,630

toon-format/toon

🎒 Token-Oriented Object Notation (TOON) – Compact, human-readable, schema-aware JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.

Language: TypeScript - Size: 1.65 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 20,377 - Forks: 896

AgentOps-AI/tokencost

Easy token price estimates for 400+ LLMs. TokenOps.

Language: Python - Size: 2.16 MB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 1,875 - Forks: 98

NVIDIA/Cosmos-Tokenizer 📦

A suite of image and video neural tokenizers

Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 6 days ago - Pushed at: 11 months ago - Stars: 1,695 - Forks: 85

LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/

Language: TypeScript - Size: 293 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 1,456 - Forks: 168

securitybunker/databunker

Secure Vault for Customer PII/PHI/PCI/KYC Records

Language: Go - Size: 11.1 MB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 1,372 - Forks: 89

RavenProject/Ravencoin

Ravencoin Core integration/staging tree

Language: C - Size: 461 MB - Last synced at: 7 months ago - Pushed at: over 1 year ago - Stars: 1,096 - Forks: 697

VKCOM/YouTokenToMe 📦

Unsupervised text tokenizer focused on computational efficiency

Language: C++ - Size: 192 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 972 - Forks: 110

AmoDinho/datacamp-python-data-science-track

All the slides, accompanying code, and exercises are stored in this repo. 🎈

Language: Python - Size: 74.1 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 872 - Forks: 528

explosion/spacy-streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

Language: Python - Size: 61.5 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 845 - Forks: 119

nlp-uoregon/trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Language: Python - Size: 1.05 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 763 - Forks: 103

cbaziotis/ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two big corpora (English Wikipedia and Twitter, 330 million English tweets).

Language: Python - Size: 778 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 671 - Forks: 93

alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

Language: Go - Size: 734 KB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 600 - Forks: 20

adobe/NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

Language: HTML - Size: 11.1 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 562 - Forks: 96

yooper/php-text-analysis

PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language

Language: PHP - Size: 1.01 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 526 - Forks: 87

WorksApplications/sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy

Language: Rust - Size: 15.8 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 402 - Forks: 43

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

Language: Rust - Size: 1.1 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 384 - Forks: 21

macmade/ClangKit

ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are actually implemented.

Language: C - Size: 15.2 MB - Last synced at: 8 months ago - Pushed at: over 4 years ago - Stars: 365 - Forks: 45

SaberaTalukder/TOTEM

The official code 👩‍💻 for - TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis

Language: Python - Size: 65.4 KB - Last synced at: 4 months ago - Pushed at: 11 months ago - Stars: 342 - Forks: 51

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Language: C++ - Size: 1.69 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 319 - Forks: 76

FoundationVision/OmniTokenizer

[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.

Language: Python - Size: 68.9 MB - Last synced at: 7 months ago - Pushed at: over 1 year ago - Stars: 295 - Forks: 6

natasha/razdel

Rule-based token and sentence segmentation for the Russian language

Language: Python - Size: 37.2 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 275 - Forks: 35

CodeChain-io/codechain

CodeChain's official implementation in Rust.

Language: Rust - Size: 28.7 MB - Last synced at: 7 months ago - Pushed at: almost 3 years ago - Stars: 258 - Forks: 51

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

Language: Rust - Size: 4 MB - Last synced at: 11 days ago - Pushed at: 16 days ago - Stars: 249 - Forks: 10

zjukg/MyGO

[Paper][AAAI 2025] (MyGO) Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation

Language: Python - Size: 91 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 249 - Forks: 7

ScrapeGraphAI/toonify

Toonify: Compact data format reducing LLM token usage by 30-60%

Language: Python - Size: 1.66 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 248 - Forks: 16

janlukasschroeder/nlp-cheat-sheet-python

NLP Cheat Sheet, Python, spaCy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition

Language: Jupyter Notebook - Size: 3.05 MB - Last synced at: 2 months ago - Pushed at: almost 3 years ago - Stars: 248 - Forks: 73

SmartTokenLabs/TokenScript

TokenScript schema, specs and paper

Language: JavaScript - Size: 16.1 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 240 - Forks: 70

ImadSaddik/Train_Your_Language_Model_Course

Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.

Language: Jupyter Notebook - Size: 59.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 208 - Forks: 108

milaan9/Python_Natural_Language_Processing

This repository consists of a complete guide on natural language processing (NLP) in Python where we'll learn various techniques for implementing NLP including parsing & text processing and understand how to use NLP for text feature engineering.

Language: Jupyter Notebook - Size: 182 KB - Last synced at: 4 months ago - Pushed at: over 3 years ago - Stars: 200 - Forks: 175

cutupdev/Solana-RWA-Smart-Contract

Solana RWA (Real World Assets) smart contract

Language: Rust - Size: 35.2 KB - Last synced at: 4 days ago - Pushed at: 10 days ago - Stars: 184 - Forks: 181

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language: Python - Size: 729 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 175 - Forks: 15

GlitchedPolygons/l8w8jwt

Minimal, OpenSSL-less and super lightweight JWT library written in C.

Language: C - Size: 6.25 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 168 - Forks: 47

gautierdag/bpeasy

Fast bare-bones BPE for modern tokenizer training

Language: Python - Size: 1.41 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 164 - Forks: 5

cohere-ai/magikarp

Code for the paper "Fishing for Magikarp"

Language: Python - Size: 2.77 GB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 155 - Forks: 14

rth/vtext

Simple NLP in Rust with Python bindings

Language: Rust - Size: 273 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 153 - Forks: 9

THUDM/icetk

A unified tokenization tool for Images, Chinese and English.

Language: Python - Size: 25.4 KB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 151 - Forks: 17

jshuadvd/LongRoPE

Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper

Language: Python - Size: 562 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 14

alpkeskin/gotoon

Token-Oriented Object Notation for Go – JSON for LLMs at half the token cost

Language: Go - Size: 60.5 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 144 - Forks: 9

bminixhofer/zett

Code for Zero-Shot Tokenizer Transfer

Language: Python - Size: 1.04 MB - Last synced at: 4 months ago - Pushed at: 12 months ago - Stars: 135 - Forks: 12

ash-01xor/bpe.c

Simple byte pair encoding mechanism used for the tokenization process, written purely in C.

Language: C - Size: 86.9 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 120 - Forks: 3

lucidrains/charformer-pytorch

Implementation of the GBST block from the Charformer paper, in PyTorch

Language: Python - Size: 77.1 KB - Last synced at: 4 months ago - Pushed at: over 4 years ago - Stars: 118 - Forks: 11

aymara/lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.

Language: C++ - Size: 276 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 115 - Forks: 20

mit-ccc/TweebankNLP

[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset

Language: Python - Size: 16.8 MB - Last synced at: 8 months ago - Pushed at: almost 2 years ago - Stars: 104 - Forks: 8

mysto/python-fpe

FPE - Format Preserving Encryption with FF3 in Python

Language: Python - Size: 144 KB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 101 - Forks: 20

ARBML/tkseem

Arabic Tokenization Library. It provides many tokenization algorithms.

Language: Jupyter Notebook - Size: 49.6 MB - Last synced at: 10 months ago - Pushed at: almost 2 years ago - Stars: 101 - Forks: 18

dluc/openai-tools

A collection of tools for working with OpenAI

Language: C# - Size: 559 KB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 100 - Forks: 15

johannschopplich/tokenx

📐 Fast token estimation at 94% accuracy of a full tokenizer in a 2kB bundle

Language: TypeScript - Size: 658 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 99 - Forks: 5

JuliaText/WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Language: Julia - Size: 262 KB - Last synced at: about 2 months ago - Pushed at: about 4 years ago - Stars: 99 - Forks: 25

clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

Language: Go - Size: 920 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 92 - Forks: 4

GoogleCloudPlatform/dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP

Language: Java - Size: 47.5 MB - Last synced at: 9 months ago - Pushed at: over 1 year ago - Stars: 92 - Forks: 51

PyThaiNLP/attacut

A Fast and Accurate Neural Thai Word Segmenter

Language: Python - Size: 4.15 MB - Last synced at: about 2 months ago - Pushed at: 12 months ago - Stars: 91 - Forks: 17

PyThaiNLP/wisesight-sentiment

Thai social media text sentiment dataset

Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 87 - Forks: 33

nlpcloud/nlpcloud-python

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...

Language: Python - Size: 61.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 86 - Forks: 8

av/klmbr

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

Language: TeX - Size: 2.24 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 79 - Forks: 3

CMTA/CMTAT

Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.

Language: JavaScript - Size: 119 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 75 - Forks: 29

wongnai/wongnai-corpus

Collection of Wongnai's datasets

Size: 38.7 MB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 74 - Forks: 22

mensfeld/llm-docs-builder

Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.

Language: Ruby - Size: 1.75 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 72 - Forks: 3

adamshamsudeen/Vaaku2Vec

Language Modeling and Text Classification in Malayalam Language using ULMFiT

Language: Jupyter Notebook - Size: 1.09 MB - Last synced at: almost 3 years ago - Pushed at: about 3 years ago - Stars: 68 - Forks: 17

lucidrains/h-net-dynamic-chunking

Implementation of the dynamic chunking mechanism in H-net by Hwang et al. of Carnegie Mellon

Language: Python - Size: 34.4 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 65 - Forks: 1

liuzl/ling

Natural Language Processing Toolkit in Golang

Language: Go - Size: 496 KB - Last synced at: 2 months ago - Pushed at: over 5 years ago - Stars: 64 - Forks: 4

dl-tokenf/contracts

On-chain RWA Tokenization Framework

Language: Solidity - Size: 1.14 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 62 - Forks: 16

vaulty-co/vaulty

Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy

Language: Go - Size: 269 KB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 62 - Forks: 11

winkjs/wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

Language: JavaScript - Size: 2.05 MB - Last synced at: 2 months ago - Pushed at: almost 3 years ago - Stars: 62 - Forks: 12

cedricrupb/code_tokenize

Fast tokenization and structural analysis of any programming language

Language: Python - Size: 152 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 60 - Forks: 10

TGDivy/MBTI-Personality-Classifier

A model which uses your social media posts to predict your MBTI personality type.

Language: Jupyter Notebook - Size: 931 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 60 - Forks: 17

neelkamath/spacy-server 📦

🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec

Language: Python - Size: 87.9 KB - Last synced at: almost 3 years ago - Pushed at: about 4 years ago - Stars: 58 - Forks: 13

typst/unscanny

Painless string scanning.

Language: Rust - Size: 15.6 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 56 - Forks: 7

ChocoWu/SeTok

Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Language: Python - Size: 2.1 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 54 - Forks: 0

zhongbin1/bert_tokenization_for_java

This is a Java version of the Chinese tokenization described in BERT.

Language: Java - Size: 67.4 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 54 - Forks: 8

cashtokens/cashtokens

A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.

Size: 621 KB - Last synced at: 20 days ago - Pushed at: over 2 years ago - Stars: 52 - Forks: 36

georg-jung/FastBertTokenizer

Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.

Language: C# - Size: 19.2 MB - Last synced at: 29 days ago - Pushed at: about 2 months ago - Stars: 50 - Forks: 11

anki-code/xontrib-output-search

Get identifiers, paths, URLs and words from the previous command output and use them for the next command in @xonsh.

Language: Python - Size: 152 KB - Last synced at: 20 days ago - Pushed at: almost 2 years ago - Stars: 50 - Forks: 2

unicode-cookbook/cookbook

The Unicode Cookbook for Linguists

Language: TeX - Size: 105 MB - Last synced at: over 2 years ago - Pushed at: about 5 years ago - Stars: 50 - Forks: 4

Quillhash/Real-World-Assets-RWA

This repository comprises the theoretical and technical aspects of tokenisation of real world assets.

Language: Solidity - Size: 1.39 MB - Last synced at: 8 months ago - Pushed at: over 1 year ago - Stars: 49 - Forks: 10

nlpcloud/nlpcloud-js

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...

Language: JavaScript - Size: 101 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 48 - Forks: 6

PolyCash/polycash

The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.

Language: PHP - Size: 64.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 38

TrainingByPackt/Natural-Language-Processing-Fundamentals

Use Python and NLTK to build out your own text classifiers and solve common NLP problems

Language: Jupyter Notebook - Size: 362 MB - Last synced at: 9 months ago - Pushed at: almost 6 years ago - Stars: 47 - Forks: 48

bminixhofer/tokenkit

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

Language: Python - Size: 397 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 46 - Forks: 6

zouharvi/tokenization-scorer

Simple-to-use scoring function for arbitrarily tokenized texts.

Language: Python - Size: 42 KB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 46 - Forks: 5

anyks/alm

Smart Language Model

Language: C++ - Size: 1.97 MB - Last synced at: 3 months ago - Pushed at: about 3 years ago - Stars: 46 - Forks: 6

GoogleCloudPlatform/auto-data-tokenize

Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow

Language: Java - Size: 1.47 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 43 - Forks: 20

eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

Language: Go - Size: 210 KB - Last synced at: 3 days ago - Pushed at: 22 days ago - Stars: 39 - Forks: 11

andreihar/taibun

Taiwanese Hokkien Transliterator and Tokeniser

Language: Python - Size: 4.57 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 39 - Forks: 2

rosette-api/python

Babel Street Analytics Client Library for Python

Language: Python - Size: 1.63 MB - Last synced at: 4 months ago - Pushed at: 10 months ago - Stars: 38 - Forks: 37

aatimofeev/spacy_russian_tokenizer

Custom Russian tokenizer for spaCy

Language: Python - Size: 30.3 KB - Last synced at: almost 3 years ago - Pushed at: over 6 years ago - Stars: 38 - Forks: 4

bastienbot/nlp-js-tools-french

POS tagger, lemmatizer and stemmer for the French language in JavaScript

Language: JavaScript - Size: 1.04 MB - Last synced at: 3 months ago - Pushed at: over 8 years ago - Stars: 37 - Forks: 8

luccalb/tiptap-annotation-magic

An extension for the Tiptap editor, enabling the annotation of text. Comes with support for overlapping annotations, useful for e.g. NLP tokenization.

Language: TypeScript - Size: 309 KB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 35 - Forks: 1

cmargiotta/e-regex

Fast regular expression library, with full matching support, even at compile time!

Language: C++ - Size: 2.49 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 34 - Forks: 2

esalesky/visrep

This repository contains an extension of fairseq for pixel / visual representations for machine translation.

Language: Python - Size: 97.3 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 34 - Forks: 5

mysto/node-fpe

FPE - Format Preserving Encryption with FF3 in Node.js

Language: JavaScript - Size: 68.4 KB - Last synced at: 4 months ago - Pushed at: 11 months ago - Stars: 33 - Forks: 3

JackHCC/Chinese-Tokenization

Chinese word segmentation implemented using traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-training methods (BERT, etc.)

Language: Python - Size: 45.4 MB - Last synced at: 8 months ago - Pushed at: over 3 years ago - Stars: 33 - Forks: 4

Aboudjem/ERC-3643

ERC-3643 - Raptor Version is a simple, educational look at the T-REX standard. Using Solidity and Web3, this project demystifies tokenized securities. Remember, Raptor is for learning, not production. Dive in for an accessible peek into blockchain finance!

Language: TypeScript - Size: 5.44 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 17