An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: tokenization

larsulbricht/awesome-digital-assets

Collection of high-quality resources on blockchain, tokenization, and DLT-based capital markets (EU-Focus)

Size: 251 KB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 6 - Forks: 0

AndresEspin1993/b2t-tokenizer

B2T - Tokenizer for AI systems.

Language: PowerShell - Size: 240 KB - Last synced at: about 7 hours ago - Pushed at: about 9 hours ago - Stars: 0 - Forks: 0

gnatykdm/b2t-tokenizer

B2T - Tokenizer for AI systems.

Language: PowerShell - Size: 1000 Bytes - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

XDuch/aztec-network

A step-by-step guide on how to install the Aztec Network Sequencer on testnet

Size: 16.6 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

ryhkml/ytingest

Extract YouTube video, feed it to any LLM as knowledge

Language: C - Size: 108 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

MuzzammilShah/GPT-TransformerModel-2

An end-to-end PyTorch implementation of a GPT-2 style language model (124M) released by OpenAI and inspired by Karpathy’s NanoGPT. Covers core components like tokenization, multi-head self-attention, transformer blocks, positional embeddings and various other key ML concepts.

Language: Jupyter Notebook - Size: 3.06 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

sebastian2005-RP/GPU-Accelerated-Next-Word-Prediction-Using-LSTM-and-PyTorch

This repository implements a GPU-accelerated next-word prediction model using PyTorch and LSTM. It includes data preprocessing with NLTK, vocabulary creation, training on tokenized text, and generating text predictions, starting from a given input phrase.

Language: Jupyter Notebook - Size: 329 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

RAHEEM12344/content-recommendation-engine

A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.

Language: HTML - Size: 187 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language: Python - Size: 729 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 155 - Forks: 12

izikeros/count_tokens

Count tokens in a text file.

Language: Python - Size: 137 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 6 - Forks: 0

WorksApplications/sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy

Language: Rust - Size: 15 MB - Last synced at: about 5 hours ago - Pushed at: 8 days ago - Stars: 355 - Forks: 38

explosion/spacy-streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

Language: Python - Size: 61.5 KB - Last synced at: about 4 hours ago - Pushed at: 10 months ago - Stars: 839 - Forks: 116

CompLin/nheengatu

Tools and resources for the computational processing of Nheengatu (Modern Tupi)

Language: Python - Size: 31.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 8 - Forks: 3

Yoz75/WordGenerator2

Token based word generator

Language: C# - Size: 22.5 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

CLewisMessina/wolfscribe

Turn books into clean, fine-tuning-ready datasets (TXT/CSV). EPUB, PDF, and token-aware. Local, GUI-based, no cloud required.

Language: Python - Size: 33.2 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

3Dpass/3DP

The Implementation of The Ledger of Things Node. Layer 1 decentralized blockchain platform for the tokenization of objects. Proof of Scan protocol. Useful smart-contracts and dApps.

Language: Rust - Size: 66.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 25 - Forks: 17

AgentOps-AI/tokencost

Easy token price estimates for 400+ LLMs. TokenOps.

Language: Python - Size: 1.83 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,651 - Forks: 74

ingwatson/SecureTradeToken

Secure Trade Token – A secure GUI-based application to encrypt and decrypt structured trade data using AES encryption and HMAC authentication. Built with PyQt5 and PyCryptodome.

Language: Python - Size: 24.4 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

12345far/metrics-calculation-precision-recall

Laboratory 7 - Information Retrieval

Size: 1.95 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Basis-Theory/developers.basistheory.com

Basis Theory Developer Documentation

Language: JavaScript - Size: 24.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 4

explosion/spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Language: Python - Size: 194 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 31,514 - Forks: 4,499

CMTA/CMTAT

Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.

Language: JavaScript - Size: 37.9 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 57 - Forks: 24

kawashiro-dev/Tokenizador-y-Lematizador

Word tokenizer and lemmatizer

Language: JavaScript - Size: 7.81 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

KatanaSword/gen-ai_cohort

Learn GenAI

Language: Python - Size: 13.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

RavenProject/Ravencoin

Ravencoin Core integration/staging tree

Language: C - Size: 461 MB - Last synced at: 3 days ago - Pushed at: 12 months ago - Stars: 1,096 - Forks: 694

ImadSaddik/Train_Your_Language_Model_Course

Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.

Language: Jupyter Notebook - Size: 27.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 83 - Forks: 54

trag1c/crossandra-rs

(WIP) A straightforward tokenization library for seamless text processing.

Language: Rust - Size: 681 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 8 - Forks: 1

securitybunker/databunker

Secure Vault for Customer PII/PHI/PCI/KYC Records

Language: Go - Size: 11.1 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,295 - Forks: 82

PolyCash/polycash

The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.

Language: PHP - Size: 32.2 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 47 - Forks: 38

bminixhofer/tokenkit

A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.

Language: Python - Size: 463 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 18 - Forks: 2

soubankhandwani/ai-paper-evaluation-model

This project is an intelligent web application that compares student answers from scanned or typed PDFs against teacher-provided answer PDFs using NLP techniques and machine learning. It performs OCR, text extraction, preprocessing, and semantic similarity scoring to generate marks for each question.

Language: HTML - Size: 195 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

Javas128282/Prediksi_Kalimat_negatif-positif

Predicts whether a sentence is positive or negative and adjusts TF-IDF weights with a MultinomialNB model

Language: Python - Size: 2.93 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

jayanthpotluri5513/Deceptive-news-sequencing-using-LSTM

A project on Fake news detection using ML and DL approaches

Language: Jupyter Notebook - Size: 7.22 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

MUHAMMADAKMAL137/GPU-Accelerated-Next-Word-Prediction-Using-LSTM-and-PyTorch

This repository implements a GPU-accelerated next-word prediction model using PyTorch and LSTM. It includes data preprocessing with NLTK, vocabulary creation, training on tokenized text, and generating text predictions, starting from a given input phrase.

Language: Jupyter Notebook - Size: 332 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

amr080/Smart-Contracts

XFT smart contract dictionary

Language: Solidity - Size: 147 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 1

mahnoorsheikh16/NLP-Approach-to-AI-Text-Classification Fork of andrew-jxhn/STT811_StatsProject

Language: Jupyter Notebook - Size: 40.7 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

0-Adz/AuthService_ExpenseTracker

This is one of the microservices for my Expense Tracker app.

Language: Java - Size: 61.5 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

Yashrajgithub/MathLang-Compiler

MathLang Compiler is an AI-powered web application that translates natural language mathematical expressions into executable JavaScript code. Built with React, TypeScript, and Vite, it enables seamless code generation and execution from plain English inputs, showcasing the power of language processing in computational logic.

Language: TypeScript - Size: 731 KB - Last synced at: about 8 hours ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

shivendrra/biosaic

Tokenizer for encoding/decoding DNA sequences

Language: Python - Size: 71.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

jshuadvd/LongRoPE

Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper

Language: Python - Size: 562 KB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 136 - Forks: 14

NVIDIA/Cosmos-Tokenizer 📦

A suite of image and video neural tokenizers

Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 1,621 - Forks: 78

KathyReid/token-wars-dataviz

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

Language: Python - Size: 131 KB - Last synced at: 1 day ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

nlp-uoregon/trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Language: Python - Size: 1.06 MB - Last synced at: about 15 hours ago - Pushed at: 7 months ago - Stars: 753 - Forks: 103

Ssalzi/awesome-digital-assets

Collection of high-quality resources on blockchain, tokenization, and DLT-based capital markets (EU-Focus)

Size: 236 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

VuduVations/GSHI

Strategic Feasibility Analysis for GSHI – A decentralized, compliance-ready crypto banking platform integrating DeFi, CeFi, DAO governance, and algorithmic intelligence - 2021.

Size: 4.88 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

spindle-health/carduus

PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.

Language: Python - Size: 1.67 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 13 - Forks: 1

Maria-Antony/Seq2Seq-NMT

This is a project for a sequence-to-sequence NLP task. We developed a custom model in PyTorch to understand how the task works. We also fine-tuned pre-trained transformer models to improve translation performance.

Language: Jupyter Notebook - Size: 13 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

Kaiten-dev/quita_mini

Quita Mini is a text analysis tool designed to calculate various linguistic metrics from text data. It processes a collection of text files, computes statistics such as Type-Token Ratio (TTR), entropy, average token and type lengths, hapax legomena percentages, and more. The results are then saved in an Excel file for further analysis.

Language: Go - Size: 3.53 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

thjbdvlt/solipCysme

spaCy pipeline for French focused on personal pronouns, fiction, and first-person point-of-view texts.

Language: Python - Size: 1.64 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

FerdiKurt/carbon-credits

These smart contracts provide a comprehensive system for carbon credit tokenization, issuance, trading, and retirement.

Language: Solidity - Size: 107 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

thearhamsharif/BSCS-UBIT-2k21

Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.

Language: C++ - Size: 12.8 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

tanerim/ts_tokenizer

TS Tokenizer is a hybrid (lexicon-based and rule-based) tokenizer designed specifically for tokenizing Turkish texts.

Language: Python - Size: 21.1 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 2 - Forks: 0

VKCOM/YouTokenToMe 📦

Unsupervised text tokenizer focused on computational efficiency

Language: C++ - Size: 192 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 966 - Forks: 103

Tharun007-TK/gpt2-custom

Custom Mini-GPT2 Model built using TensorFlow/Keras. It supports training on custom text data, saving weights to .h5, and generating new text from a prompt using a simple prediction loop.

Language: Python - Size: 9.77 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

divin3circle/NSEChainBridge

NSEChainBridge

Language: TypeScript - Size: 2.4 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

Quillhash/Real-World-Assets-RWA

This repository comprises the theoretical and technical aspects of tokenisation of real world assets.

Language: Solidity - Size: 1.39 MB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 49 - Forks: 10

Devansh-Seth-DEV/LexiC

LexiC is a simple and modular C project that converts source code into a stream of tokens. It handles token counting, segmentation, and full tokenization, forming the first stage of a compiler or interpreter pipeline.

Language: C - Size: 347 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

adobe/NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

Language: HTML - Size: 11.1 MB - Last synced at: 14 days ago - Pushed at: 6 months ago - Stars: 558 - Forks: 94

williamjsmail/Barracuda

Automatically analyze Cyber Threat Intelligence (CTI) reports using machine learning (ML) to identify MITRE ATT&CK techniques (T-Codes)

Language: HTML - Size: 6.71 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

struktapp/strukt-commons

Strukt Common Utilities

Language: PHP - Size: 159 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

Digital-World-App/RWA

This repository contains the source code of the Digital World Marketplace, a platform developed for digitizing real estate records and facilitating international real estate transactions through innovative technologies. 🔧 Contributions are very welcome!

Language: JavaScript - Size: 114 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

katerinaharana/chatbot

WIP: Building the Cornerstone of a Chatbot: Creating a Clustering-Based Intent Identification Engine

Language: Jupyter Notebook - Size: 178 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

NueLanguage/nue

The nue Programming Language

Language: C - Size: 127 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 3 - Forks: 0

TI-Toolkit/tokens

TI-BASIC token information XMLs for inclusion in other projects

Language: Python - Size: 408 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 8 - Forks: 0

georg-jung/FastBertTokenizer

Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.

Language: C# - Size: 19.2 MB - Last synced at: about 21 hours ago - Pushed at: 12 days ago - Stars: 49 - Forks: 10

gulcihanglmz/natural-language-processing

📚 Utilizes libraries such as NLTK, spaCy, Hugging Face Transformers, and Scikit-learn. Ideal for beginners and developers looking to dive into NLP applications and machine learning models.

Language: Jupyter Notebook - Size: 1.06 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

bithead123/parcel

A simple static language for parsing text information and retrieving any data.

Language: C++ - Size: 1.19 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

simonrueba/bpe-visualization

Interactive tool for exploring Byte Pair Encoding tokenization step-by-step.

Language: TypeScript - Size: 0 Bytes - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

Deed3Labs/Protocol-Contracts

The Deed Protocol Smart Contracts 📑

Language: Solidity - Size: 2.59 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

andreihar/taibun.js

Taiwanese Hokkien Transliterator and Tokeniser

Language: JavaScript - Size: 3.41 MB - Last synced at: 15 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

22P31A0512/Sentimental-Analysis

Build a model to classify text as positive, negative, or neutral. Apply NLP techniques for preprocessing and machine learning for classification. Aim for accurate sentiment prediction on various text formats.

Language: Jupyter Notebook - Size: 280 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

dakofler/simple_tokenizers

Tokenizers is a collection of tokenization implementations focused on transparency and readability

Language: Python - Size: 21.5 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

dhyanid13/Helpify-LSTM-based-approach-for-classifying-mental-health-issues

Employing NLP techniques to classify mental health issues into particular categories

Language: Jupyter Notebook - Size: 5.8 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

bhuvan2018/news_article_classification

This HACKATHON project implements automated news article classification using machine learning and NLP techniques. Built with Flet for the UI, it processes & classifies text-based news content using methods like tokenization, lemmatization, vectorization and BERT-based embeddings.

Language: Jupyter Notebook - Size: 6.43 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 1

clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split words, sentences and graphemes.

Language: Go - Size: 995 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 55 - Forks: 4

Relostar-Devil/CIS-509-Analytics-Unstructured-Data-Yelp-Data-Analysis

In Florida and Pennsylvania, Yelp reviews paint a vivid picture of dining experiences across American, Chinese, and Italian cuisines. Using sentiment analysis and topic modeling, we uncover key themes that shape customer satisfaction. From flavor and service to ambiance and value, one factor stands above all—food quality.

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

nlpcloud/nlpcloud-python

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...

Language: Python - Size: 61.5 KB - Last synced at: about 5 hours ago - Pushed at: 6 months ago - Stars: 82 - Forks: 8

TI-Toolkit/tivars_lib_py

A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files

Language: Python - Size: 3.61 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 19 - Forks: 1

ChocoWu/SeTok

Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Language: Python - Size: 2.1 MB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 54 - Forks: 0

anenbergb/BERT-from-scratch

Implementation of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Language: Python - Size: 475 KB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

verygoodsecurity/vgs-collect-ios

VGS Collect iOS SDK

Language: Swift - Size: 63.6 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 23 - Forks: 15

GoogleCloudPlatform/dlp-dataflow-deidentification

Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP

Language: Java - Size: 47.5 MB - Last synced at: 22 days ago - Pushed at: 9 months ago - Stars: 92 - Forks: 51

andreihar/taibun

Taiwanese Hokkien Transliterator and Tokeniser

Language: Python - Size: 4.57 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 34 - Forks: 2

Dexaran/TokenStandardConverter

Language: Solidity - Size: 95.7 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 1

eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

Language: Go - Size: 200 KB - Last synced at: 7 days ago - Pushed at: 8 months ago - Stars: 28 - Forks: 2

vaibhavdangar09/NER-WITH-BERT

The goal of this project is to develop a Named Entity Recognition (NER) system that can identify and classify named entities (such as names of people, organizations, locations, dates, etc.) in a given text using the BERT model from Hugging Face's Transformers library.

Language: Jupyter Notebook - Size: 19.5 KB - Last synced at: 24 days ago - Pushed at: 25 days ago - Stars: 4 - Forks: 1

cashtokens/cashtokens

A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.

Size: 621 KB - Last synced at: 11 days ago - Pushed at: almost 2 years ago - Stars: 48 - Forks: 32

flolu/mongo-search

Fuzzy Text Search And Autocompletion With MongoDB And Node.js

Language: TypeScript - Size: 554 KB - Last synced at: 14 days ago - Pushed at: over 2 years ago - Stars: 29 - Forks: 9

alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

Language: Go - Size: 734 KB - Last synced at: 12 days ago - Pushed at: 10 months ago - Stars: 576 - Forks: 21

natasha/razdel

Rule-based token and sentence segmentation for the Russian language

Language: Python - Size: 37.2 MB - Last synced at: 36 minutes ago - Pushed at: almost 2 years ago - Stars: 266 - Forks: 32

alexandermorgan/BatchBPE Fork of karpathy/minbpe

Lightweight batched implementation of the Byte Pair Encoding (BPE) algorithm for LLM tokenization.

Language: Python - Size: 1.63 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 4 - Forks: 0

ThalesGroup/CipherTrust_Application_Protection

Public code samples and resources for the Thales CipherTrust Application Protection products of the CipherTrust Data Security Platform

Language: Java - Size: 37.8 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 26 - Forks: 16

eklem/words-n-numbers

Tokenizing strings of text. Regex extracting arrays of words and optionally numbers, emojis, tags, usernames and email addresses from strings. For Node.js and the browser. When you need more than just [a-z] regular expressions.

Language: JavaScript - Size: 1.28 MB - Last synced at: 6 days ago - Pushed at: 8 months ago - Stars: 12 - Forks: 0

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Language: C++ - Size: 1.69 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 302 - Forks: 74

infinilabs/pizza-stemmers

🌍 Rust Snowball stemmers with stemming algorithms for 30+ languages, built for INFINI Pizza.

Language: Rust - Size: 875 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

levysoft/chatgpt-token-cost-analysis

Python script and HTML page to analyze token costs from ChatGPT export chats. Extracts messages, calculates token usage, and determines monthly costs. The Python script saves results to a CSV file, while the HTML page provides an interactive, local analysis tool with support for multiple models and ensures data privacy.

Language: HTML - Size: 1.38 MB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 4 - Forks: 0

rosette-api/java

Babel Street Analytics Client Library for Java

Language: Java - Size: 64.8 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 11 - Forks: 35

DavZim/rtiktoken

BPE Tokenizer for OpenAI's models

Language: R - Size: 21.9 MB - Last synced at: 5 days ago - Pushed at: 27 days ago - Stars: 12 - Forks: 1

LoopscaleLabs/rwa-token

The RWA Token Program is a wrapper and extension program for Solana Token Extensions that creates a uniform approach to permissions tokens on SVM blockchains.

Language: TypeScript - Size: 51.2 MB - Last synced at: 5 days ago - Pushed at: 7 months ago - Stars: 18 - Forks: 9

sinaahmadi/KurdishTokenization

Tokenization resources for Kurdish (Sorani & Kurmanji dialects)

Language: Lex - Size: 5.81 MB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 8 - Forks: 0