GitHub topics: tokenization
larsulbricht/awesome-digital-assets
Collection of high-quality resources on blockchain, tokenization, and DLT-based capital markets (EU-Focus)
Size: 251 KB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 6 - Forks: 0

AndresEspin1993/b2t-tokenizer
B2T - Tokenizer for the AI Systems.
Language: PowerShell - Size: 240 KB - Last synced at: about 7 hours ago - Pushed at: about 9 hours ago - Stars: 0 - Forks: 0

gnatykdm/b2t-tokenizer
B2T - Tokenizer for the AI Systems.
Language: PowerShell - Size: 1000 Bytes - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

XDuch/aztec-network
A step-by-step guide on how to install the Aztec Network Sequencer on testnet
Size: 16.6 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

ryhkml/ytingest
Extract YouTube video, feed it to any LLM as knowledge
Language: C - Size: 108 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

MuzzammilShah/GPT-TransformerModel-2
An end-to-end PyTorch implementation of a GPT-2 style language model (124M) released by OpenAI and inspired by Karpathy’s NanoGPT. Covers core components like tokenization, multi-head self-attention, transformer blocks, positional embeddings and various other key ML concepts.
Language: Jupyter Notebook - Size: 3.06 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

sebastian2005-RP/GPU-Accelerated-Next-Word-Prediction-Using-LSTM-and-PyTorch
This repository implements a GPU-accelerated next-word prediction model using PyTorch and LSTM. It includes data preprocessing with NLTK, vocabulary creation, training on tokenized text, and generating text predictions, starting from a given input phrase.
Language: Jupyter Notebook - Size: 329 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

RAHEEM12344/content-recommendation-engine
A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.
Language: HTML - Size: 187 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Language: Python - Size: 729 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 155 - Forks: 12

izikeros/count_tokens
Count tokens in a text file.
Language: Python - Size: 137 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 6 - Forks: 0

WorksApplications/sudachi.rs
Sudachi in Rust 🦀 and new generation of SudachiPy
Language: Rust - Size: 15 MB - Last synced at: about 5 hours ago - Pushed at: 8 days ago - Stars: 355 - Forks: 38

explosion/spacy-streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Language: Python - Size: 61.5 KB - Last synced at: about 4 hours ago - Pushed at: 10 months ago - Stars: 839 - Forks: 116

CompLin/nheengatu
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
Language: Python - Size: 31.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8 - Forks: 3

Yoz75/WordGenerator2
Token based word generator
Language: C# - Size: 22.5 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

CLewisMessina/wolfscribe
Turn books into clean, fine-tuning-ready datasets (TXT/CSV). EPUB, PDF, and token-aware. Local, GUI-based, no cloud required.
Language: Python - Size: 33.2 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

3Dpass/3DP
The Implementation of The Ledger of Things Node. Layer 1 decentralized blockchain platform for the tokenization of objects. Proof of Scan protocol. Useful smart-contracts and dApps.
Language: Rust - Size: 66.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 25 - Forks: 17

AgentOps-AI/tokencost
Easy token price estimates for 400+ LLMs. TokenOps.
Language: Python - Size: 1.83 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,651 - Forks: 74

ingwatson/SecureTradeToken
Secure Trade Token – A secure GUI-based application to encrypt and decrypt structured trade data using AES encryption and HMAC authentication. Built with PyQt5 and PyCryptodome.
Language: Python - Size: 24.4 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

12345far/metrics-calculation-precision-recall
Laboratory 7 - Information Retrieval
Size: 1.95 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Basis-Theory/developers.basistheory.com
Basis Theory Developer Documentation
Language: JavaScript - Size: 24.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 4

explosion/spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Language: Python - Size: 194 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 31,514 - Forks: 4,499

CMTA/CMTAT
Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.
Language: JavaScript - Size: 37.9 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 57 - Forks: 24

kawashiro-dev/Tokenizador-y-Lematizador
Word tokenizer and lemmatizer
Language: JavaScript - Size: 7.81 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

KatanaSword/gen-ai_cohort
Learn GenAI
Language: Python - Size: 13.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

RavenProject/Ravencoin
Ravencoin Core integration/staging tree
Language: C - Size: 461 MB - Last synced at: 3 days ago - Pushed at: 12 months ago - Stars: 1,096 - Forks: 694

ImadSaddik/Train_Your_Language_Model_Course
Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.
Language: Jupyter Notebook - Size: 27.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 83 - Forks: 54

trag1c/crossandra-rs
(WIP) A straightforward tokenization library for seamless text processing.
Language: Rust - Size: 681 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 8 - Forks: 1

securitybunker/databunker
Secure Vault for Customer PII/PHI/PCI/KYC Records
Language: Go - Size: 11.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,295 - Forks: 82

PolyCash/polycash
The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.
Language: PHP - Size: 32.2 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 47 - Forks: 38

bminixhofer/tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
Language: Python - Size: 463 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 18 - Forks: 2

soubankhandwani/ai-paper-evaluation-model
This project is an intelligent web application that compares student answers from scanned or typed PDFs against teacher-provided answer PDFs using NLP techniques and machine learning. It performs OCR, text extraction, preprocessing, and semantic similarity scoring to generate marks for each question.
Language: HTML - Size: 195 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

Javas128282/Prediksi_Kalimat_negatif-positif
Predicts whether a sentence is positive or negative, weighting features with TF-IDF and classifying with a MultinomialNB model
Language: Python - Size: 2.93 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

jayanthpotluri5513/Deceptive-news-sequencing-using-LSTM
A project on Fake news detection using ML and DL approaches
Language: Jupyter Notebook - Size: 7.22 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

MUHAMMADAKMAL137/GPU-Accelerated-Next-Word-Prediction-Using-LSTM-and-PyTorch
This repository implements a GPU-accelerated next-word prediction model using PyTorch and LSTM. It includes data preprocessing with NLTK, vocabulary creation, training on tokenized text, and generating text predictions, starting from a given input phrase.
Language: Jupyter Notebook - Size: 332 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

amr080/Smart-Contracts
XFT smart contract dictionary
Language: Solidity - Size: 147 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 1

mahnoorsheikh16/NLP-Approach-to-AI-Text-Classification Fork of andrew-jxhn/STT811_StatsProject
Language: Jupyter Notebook - Size: 40.7 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

0-Adz/AuthService_ExpenseTracker
One of the microservices for my Expense Tracker App.
Language: Java - Size: 61.5 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

Yashrajgithub/MathLang-Compiler
MathLang Compiler is an AI-powered web application that translates natural language mathematical expressions into executable JavaScript code. Built with React, TypeScript, and Vite, it enables seamless code generation and execution from plain English inputs, showcasing the power of language processing in computational logic.
Language: TypeScript - Size: 731 KB - Last synced at: about 8 hours ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

shivendrra/biosaic
Tokenizer for encoding/decoding dna sequences
Language: Python - Size: 71.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

jshuadvd/LongRoPE
Implementation of the paper LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Language: Python - Size: 562 KB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 136 - Forks: 14

NVIDIA/Cosmos-Tokenizer 📦
A suite of image and video neural tokenizers
Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 1,621 - Forks: 78

KathyReid/token-wars-dataviz
A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.
Language: Python - Size: 131 KB - Last synced at: 1 day ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

nlp-uoregon/trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Language: Python - Size: 1.06 MB - Last synced at: about 15 hours ago - Pushed at: 7 months ago - Stars: 753 - Forks: 103

Ssalzi/awesome-digital-assets
Collection of high-quality resources on blockchain, tokenization, and DLT-based capital markets (EU-Focus)
Size: 236 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

VuduVations/GSHI
Strategic Feasibility Analysis for GSHI – A decentralized, compliance-ready crypto banking platform integrating DeFi, CeFi, DAO governance, and algorithmic intelligence - 2021.
Size: 4.88 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

spindle-health/carduus
PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.
Language: Python - Size: 1.67 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 13 - Forks: 1

Maria-Antony/Seq2Seq-NMT
A project on a sequence-to-sequence NLP task. We built a custom model in PyTorch to understand how the task works, and fine-tuned pre-trained transformer models to improve translation performance.
Language: Jupyter Notebook - Size: 13 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

Kaiten-dev/quita_mini
Quita Mini is a text analysis tool designed to calculate various linguistic metrics from text data. It processes a collection of text files, computes statistics such as Type-Token Ratio (TTR), entropy, average token and type lengths, hapax legomena percentages, and more. The results are then saved in an Excel file for further analysis.
Language: Go - Size: 3.53 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

thjbdvlt/solipCysme
spaCy pipeline for French focused on personal pronouns, fiction, and first-person point-of-view texts.
Language: Python - Size: 1.64 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

FerdiKurt/carbon-credits
These smart contracts provide a comprehensive system for carbon credit tokenization, issuance, trading, and retirement.
Language: Solidity - Size: 107 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

thearhamsharif/BSCS-UBIT-2k21
Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.
Language: C++ - Size: 12.8 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

tanerim/ts_tokenizer
TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for tokenizing Turkish texts.
Language: Python - Size: 21.1 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 2 - Forks: 0

VKCOM/YouTokenToMe 📦
Unsupervised text tokenizer focused on computational efficiency
Language: C++ - Size: 192 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 966 - Forks: 103

Tharun007-TK/gpt2-custom
Custom Mini-GPT2 Model built using TensorFlow/Keras. It supports training on custom text data, saving weights to .h5, and generating new text from a prompt using a simple prediction loop.
Language: Python - Size: 9.77 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

divin3circle/NSEChainBridge
NSEChainBridge
Language: TypeScript - Size: 2.4 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

Quillhash/Real-World-Assets-RWA
This repository comprises the theoretical and technical aspects of tokenisation of real world assets.
Language: Solidity - Size: 1.39 MB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 49 - Forks: 10

Devansh-Seth-DEV/LexiC
LexiC is a simple and modular C project that converts source code into a stream of tokens. It handles token counting, segmentation, and full tokenization, forming the first stage of a compiler or interpreter pipeline.
Language: C - Size: 347 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

adobe/NLP-Cube
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
Language: HTML - Size: 11.1 MB - Last synced at: 14 days ago - Pushed at: 6 months ago - Stars: 558 - Forks: 94

williamjsmail/Barracuda
Automatically analyze Cyber Threat Intelligence (CTI) reports using machine learning (ML) to identify MITRE ATT&CK techniques (T-Codes)
Language: HTML - Size: 6.71 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

struktapp/strukt-commons
Strukt Common Utilities
Language: PHP - Size: 159 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

Digital-World-App/RWA
This repository contains the source code for the Mundo Digital Marketplace, a platform built to digitize real-estate records and facilitate international real-estate transactions through innovative technologies. 🔧 Contributions are very welcome!
Language: JavaScript - Size: 114 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

katerinaharana/chatbot
WIP: Building the Cornerstone of a Chatbot: Creating a Clustering-Based Intent Identification Engine
Language: Jupyter Notebook - Size: 178 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

NueLanguage/nue
The nue Programming Language
Language: C - Size: 127 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 3 - Forks: 0

TI-Toolkit/tokens
TI-BASIC token information XMLs for inclusion in other projects
Language: Python - Size: 408 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 8 - Forks: 0

georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
Language: C# - Size: 19.2 MB - Last synced at: about 21 hours ago - Pushed at: 12 days ago - Stars: 49 - Forks: 10

gulcihanglmz/natural-language-processing
📚Utilizes libraries such as NLTK, spaCy, Hugging Face Transformers, and Scikit-learn. Ideal for beginners and developers looking to dive into NLP applications and machine learning models.
Language: Jupyter Notebook - Size: 1.06 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

bithead123/parcel
A simple static language for parsing text information and retrieving any data.
Language: C++ - Size: 1.19 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

simonrueba/bpe-visualization
Interactive tool for exploring Byte Pair Encoding tokenization step-by-step.
Language: TypeScript - Size: 0 Bytes - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

Deed3Labs/Protocol-Contracts
The Deed Protocol Smart Contracts 📑
Language: Solidity - Size: 2.59 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

andreihar/taibun.js
Taiwanese Hokkien Transliterator and Tokeniser
Language: JavaScript - Size: 3.41 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

22P31A0512/Sentimental-Analysis
Build a model to classify text as positive, negative, or neutral. Apply NLP techniques for preprocessing and machine learning for classification. Aim for accurate sentiment prediction on various text formats.
Language: Jupyter Notebook - Size: 280 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

dakofler/simple_tokenizers
Tokenizers is a collection of tokenization implementations focused on transparency and readability
Language: Python - Size: 21.5 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

dhyanid13/Helpify-LSTM-based-approach-for-classifying-mental-health-issues
Employing NLP techniques to classify mental health issues into particular categories
Language: Jupyter Notebook - Size: 5.8 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

bhuvan2018/news_article_classification
This HACKATHON project implements automated news article classification using machine learning and NLP techniques. Built with Flet for the UI, it processes & classifies text-based news content using methods like tokenization, lemmatization, vectorization and BERT-based embeddings.
Language: Jupyter Notebook - Size: 6.43 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 1

clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split words, sentences and graphemes.
Language: Go - Size: 995 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 55 - Forks: 4

Relostar-Devil/CIS-509-Analytics-Unstructured-Data-Yelp-Data-Analysis
In Florida and Pennsylvania, Yelp reviews paint a vivid picture of dining experiences across American, Chinese, and Italian cuisines. Using sentiment analysis and topic modeling, we uncover key themes that shape customer satisfaction. From flavor and service to ambiance and value, one factor stands above all—food quality.
Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

nlpcloud/nlpcloud-python
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...
Language: Python - Size: 61.5 KB - Last synced at: about 5 hours ago - Pushed at: 6 months ago - Stars: 82 - Forks: 8

TI-Toolkit/tivars_lib_py
A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files
Language: Python - Size: 3.61 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 19 - Forks: 1

ChocoWu/SeTok
Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
Language: Python - Size: 2.1 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 54 - Forks: 0

anenbergb/BERT-from-scratch
Implementation of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Language: Python - Size: 475 KB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

verygoodsecurity/vgs-collect-ios
VGS Collect iOS SDK
Language: Swift - Size: 63.6 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 23 - Forks: 15

GoogleCloudPlatform/dlp-dataflow-deidentification
Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP
Language: Java - Size: 47.5 MB - Last synced at: 22 days ago - Pushed at: 9 months ago - Stars: 92 - Forks: 51

andreihar/taibun
Taiwanese Hokkien Transliterator and Tokeniser
Language: Python - Size: 4.57 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 34 - Forks: 2

Dexaran/TokenStandardConverter
Language: Solidity - Size: 95.7 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 1

eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Language: Go - Size: 200 KB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 28 - Forks: 2

vaibhavdangar09/NER-WITH-BERT
The goal of this project is to develop a Named Entity Recognition (NER) system that can identify and classify named entities (such as names of people, organizations, locations, dates, etc.) in a given text using the BERT model from Hugging Face's Transformers library.
Language: Jupyter Notebook - Size: 19.5 KB - Last synced at: 24 days ago - Pushed at: 25 days ago - Stars: 4 - Forks: 1

cashtokens/cashtokens
A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.
Size: 621 KB - Last synced at: 11 days ago - Pushed at: almost 2 years ago - Stars: 48 - Forks: 32

flolu/mongo-search
Fuzzy Text Search And Autocompletion With MongoDB And Node.js
Language: TypeScript - Size: 554 KB - Last synced at: 14 days ago - Pushed at: over 2 years ago - Stars: 29 - Forks: 9

alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
Language: Go - Size: 734 KB - Last synced at: 12 days ago - Pushed at: 10 months ago - Stars: 576 - Forks: 21

natasha/razdel
Rule-based token and sentence segmentation for the Russian language
Language: Python - Size: 37.2 MB - Last synced at: 36 minutes ago - Pushed at: almost 2 years ago - Stars: 266 - Forks: 32

alexandermorgan/BatchBPE Fork of karpathy/minbpe
Lightweight batched implementation of the Byte Pair Encoding (BPE) algorithm for LLM tokenization.
Language: Python - Size: 1.63 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 4 - Forks: 0

ThalesGroup/CipherTrust_Application_Protection
Public code samples and resources for the Thales CipherTrust Application Protection products of the CipherTrust Data Security Platform
Language: Java - Size: 37.8 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 26 - Forks: 16

eklem/words-n-numbers
Tokenizing strings of text. Regex extracting arrays of words and optionally numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. When you need more than just [a-z] regular expressions.
Language: JavaScript - Size: 1.28 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 12 - Forks: 0

OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Language: C++ - Size: 1.69 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 302 - Forks: 74

infinilabs/pizza-stemmers
🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, for INFINI Pizza.
Language: Rust - Size: 875 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

levysoft/chatgpt-token-cost-analysis
Python script and HTML page to analyze token costs from ChatGPT export chats. Extracts messages, calculates token usage, and determines monthly costs. The Python script saves results to a CSV file, while the HTML page provides an interactive, local analysis tool with support for multiple models and ensures data privacy.
Language: HTML - Size: 1.38 MB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

rosette-api/java
Babel Street Analytics Client Library for Java
Language: Java - Size: 64.8 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 11 - Forks: 35

DavZim/rtiktoken
BPE Tokenizer for OpenAI's models
Language: R - Size: 21.9 MB - Last synced at: 5 days ago - Pushed at: 27 days ago - Stars: 12 - Forks: 1

LoopscaleLabs/rwa-token
The RWA Token Program is a wrapper and extension program for Solana Token Extensions that creates a uniform approach to permissioned tokens on SVM blockchains.
Language: TypeScript - Size: 51.2 MB - Last synced at: 5 days ago - Pushed at: 7 months ago - Stars: 18 - Forks: 9

sinaahmadi/KurdishTokenization
Tokenization resources for Kurdish (Sorani & Kurmanji dialects)
Language: Lex - Size: 5.81 MB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 8 - Forks: 0
