GitHub topics: tokenization
AndyFerns/Automated-Reasoning-Project
A project aiming to implement Automated Reasoning in First Order Logic using NLP
Language: Python - Size: 119 KB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 1 - Forks: 0

RAHEEM12344/content-recommendation-engine
A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.
Language: HTML - Size: 187 KB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 0 - Forks: 0

matiasrodlo/afiste
Blockchain based VC marketplace. Jump Chile semifinalist. (2019)
Language: PHP - Size: 470 MB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 1 - Forks: 0

sebastian2005-RP/GPU-Accelerated-Next-Word-Prediction-Using-LSTM-and-PyTorch
This repository implements a GPU-accelerated next-word prediction model using PyTorch and LSTM. It includes data preprocessing with NLTK, vocabulary creation, training on tokenized text, and generating text predictions, starting from a given input phrase.
Language: Jupyter Notebook - Size: 329 KB - Last synced at: about 7 hours ago - Pushed at: about 7 hours ago - Stars: 0 - Forks: 0

AgentOps-AI/tokencost
Easy token price estimates for 400+ LLMs. TokenOps.
Language: Python - Size: 1.89 MB - Last synced at: about 17 hours ago - Pushed at: about 18 hours ago - Stars: 1,718 - Forks: 82

mikiyadd/my-c-array
Dynamic array implementation in C with a modular, folder-based structure.
Language: C - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Basis-Theory/developers.basistheory.com
Basis Theory Developer Documentation
Language: JavaScript - Size: 24.9 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 6 - Forks: 4

chuckyLeeVIII/Bitcoin-BhE-NaS (fork of bitcoin/bips)
Bitcoin Improvement Proposals
Language: Wikitext - Size: 15.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 1

shaheennabi/Natural-Language-Processing-Practices-and-Mini-Projects
NLP Experiments: A hands-on collection of NLP experiments, featuring models like RNN, LSTM, and Attention Mechanism. Explore applications like text classification, sentiment analysis, and language generation. Continuously updated with new algorithms and research implementations!
Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

12345far/metrics-calculation-precision-recall
Laboratory 7 - Retrieval Information
Size: 1.95 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

IthavinduU/jwt-auth-service
JWT authentication microservice built with Ruby and Sinatra.
Language: Ruby - Size: 3.91 KB - Last synced at: 2 days ago - Pushed at: 29 days ago - Stars: 1 - Forks: 0

CompLin/nheengatu
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
Language: Python - Size: 36.5 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 4

3Dpass/3DP
The Implementation of The Ledger of Things Node. Layer 1 decentralized blockchain platform for the tokenization of objects. Proof of Scan protocol. Useful smart-contracts and dApps.
Language: Rust - Size: 54.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 25 - Forks: 19

AndresEspin1993/b2t-tokenizer
B2T - Tokenizer for the AI Systems.
Language: PowerShell - Size: 240 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

JuGecko/Tokenization-Visualizer
A web application illustrating tokenization methods when selecting certain LLMs.
Language: C# - Size: 7.78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

securitybunker/databunker
Secure Vault for Customer PII/PHI/PCI/KYC Records
Language: Go - Size: 11.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,305 - Forks: 83

gnatykdm/b2t-tokenizer
B2T Tokenizer - Brain-Inspired Multimodal Data Processor
Language: PowerShell - Size: 240 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

XDuch/aztec-network
A step by step guide on How to Install Aztec Network Sequencer on Testnet
Size: 16.6 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 1

chuckyLeeVIII/ai-hedge-fund (fork of virattt/ai-hedge-fund)
An AI Hedge managed by knox wallet
Language: Python - Size: 1.59 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

CMTA/CMTAT
Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.
Language: JavaScript - Size: 63.7 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 57 - Forks: 25

ImadSaddik/Train_Your_Language_Model_Course
Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.
Language: Jupyter Notebook - Size: 58.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 117 - Forks: 72

av/klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
Language: TeX - Size: 2.24 MB - Last synced at: 4 days ago - Pushed at: 9 months ago - Stars: 76 - Forks: 3

CLewisMessina/wolfstitch
Turn books into clean, fine-tuning-ready datasets (TXT/CSV). EPUB, PDF, and token-aware. Local, GUI-based, no cloud required.
Language: Python - Size: 309 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

anjalirj27/Llama4
Llama4 - Code from Scratch. This project is inspired by [vukrosic's courses repository](https://github.com/vukrosic/courses). Here, I've implemented the tokenizer logic from scratch using Python and Google Colab to better understand how LLMs handle text at the token level.
Language: Jupyter Notebook - Size: 11.7 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

mohansree14/Token-Classification
A Streamlit app for biomedical named entity recognition (NER) using BioBERT. Enter biomedical text and get instant, colorful token-level predictions for labels `O`, `B-AC`, `B-LF`, and `I-LF`. Includes graphical visualization and an interaction log.
Language: Jupyter Notebook - Size: 731 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

FerdiKurt/carbon-credits
These smart contracts provide a system for carbon credit tokenization, issuance, trading, and retirement.
Language: Solidity - Size: 111 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 1

delBull/saaspandoras (fork of nextify-limited/saasfly)
Acquire your right to participate in exclusive projects with Pandoras.
Language: TypeScript - Size: 22.5 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

NVIDIA/Cosmos-Tokenizer
A suite of image and video neural tokenizers
Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 1,637 - Forks: 78

verygoodsecurity/vgs-collect-ios
VGS Collect iOS SDK
Language: Swift - Size: 63.5 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 23 - Forks: 15

bermudaphp/tokenizer
PHP tokenizer for finding class, interface, trait, and enum declarations.
Language: PHP - Size: 68.4 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

shaitarAn/subword-evenness-crosslingual-transfer
Analysis of subword evenness as a predictor of cross-lingual transfer success in multilingual language models (mBERT, XLM-R, mT5)
Size: 9.77 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

DanTheAI/LLM-Middleware-Pipeline
A modular, configurable LLM middleware pipeline that transforms raw prompts into enterprise-ready microservices.
Language: Python - Size: 32.2 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

ITSLab-UAegean/ais-manipulation
A repo for working with vessel AIS data, including filtering, tokenization, and trip extraction.
Language: Python - Size: 8.45 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

possible-worlds-research/wikinlp
A package to download and preprocess a Wikipedia dump, in any language.
Language: Python - Size: 118 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 9 - Forks: 1

ChaitanyaK77/Building-a-Small-Language-Model-SLM-
This repository provides a Jupyter Notebook for building a small language model from scratch using the 'TinyStories' dataset. Covers data preprocessing, BPE tokenization, binary storage, GPU memory management, and training a Transformer in PyTorch. Generate sample stories to test your model. Ideal for learning NLP and PyTorch.
Size: 0 Bytes - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

jshuadvd/LongRoPE
Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper
Language: Python - Size: 562 KB - Last synced at: 1 day ago - Pushed at: 11 months ago - Stars: 137 - Forks: 14

johannschopplich/tokenx
Fast and lightweight token estimation for any LLM without requiring a full tokenizer
Language: TypeScript - Size: 390 KB - Last synced at: about 3 hours ago - Pushed at: 19 days ago - Stars: 27 - Forks: 1

h3ro-dev/Royal-RWA-Website
Royal RWA - Revolutionary DeFi platform bridging traditional assets with blockchain through a three-token ecosystem
Language: TypeScript - Size: 354 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

mysto/python-fpe
FPE - Format Preserving Encryption with FF3 in Python
Language: Python - Size: 144 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 102 - Forks: 18

jwalsh/boston-python-llm-tokenizer
Learn tokenization in the context of Large Language Models (LLMs)
Language: Jupyter Notebook - Size: 13.8 MB - Last synced at: 2 days ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Node0/llm-tools
My expanding collection of scripts and tools designed to aid in working with large language models, understanding their performance characteristics and context limitations.
Language: HTML - Size: 1.07 MB - Last synced at: 5 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

braingpt-lovelab/backwards
Source code for <Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies>
Language: Jupyter Notebook - Size: 51.1 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

Mattbusel/tokenviz
TokenViz - A CLI tool to visualize token usage in OpenAI prompts, helping developers optimize and understand prompt structure for better model performance.
Language: Python - Size: 0 Bytes - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

Networks-Learning/token-pricing
Repository for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", Arxiv 2025
Language: Jupyter Notebook - Size: 10.1 MB - Last synced at: 2 days ago - Pushed at: 23 days ago - Stars: 4 - Forks: 0

ppomes/TokenShield
PCI Compliance Gateway POC
Language: Go - Size: 48.8 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

zjukg/MyGO
[Paper][AAAI 2025] (MyGO) Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation
Language: Python - Size: 91 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 249 - Forks: 7

cbaziotis/ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, the latter with 330 million English tweets).
Language: Python - Size: 778 KB - Last synced at: 6 days ago - Pushed at: 19 days ago - Stars: 670 - Forks: 93

VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Language: C++ - Size: 192 KB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 968 - Forks: 105

ThalesGroup/CipherTrust_Application_Protection
Public code samples and resources for the Thales CipherTrust Application Protection products of the CipherTrust Data Security Platform
Language: Java - Size: 37.4 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 27 - Forks: 17

thjbdvlt/solipCysme
spaCy pipeline for French, focused on personal pronouns, fiction, and first-person point-of-view texts.
Language: Python - Size: 974 KB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

explosion/spaCy
Industrial-strength Natural Language Processing (NLP) in Python
Language: Python - Size: 194 MB - Last synced at: 18 days ago - Pushed at: 24 days ago - Stars: 31,699 - Forks: 4,508

WorksApplications/sudachi.rs
Sudachi in Rust and new generation of SudachiPy
Language: Rust - Size: 15 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 359 - Forks: 39

eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Language: Go - Size: 200 KB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 30 - Forks: 4

NueLanguage/nue
The Nue Programming Language
Language: C - Size: 134 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 6 - Forks: 0

daac-tools/vaporetto
Vaporetto: Very accelerated pointwise prediction based tokenizer
Language: Rust - Size: 3.99 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 238 - Forks: 10

CO2NEX/co2nex-architecture
Mermaid.js system diagram for the CO2NEX carbon credit climate platform infrastructure
Language: HTML - Size: 11.7 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

Jyonn/UnifiedTokenizer
A machine learning toolkit for tokenization and indexing
Language: Python - Size: 521 KB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 4 - Forks: 1

icelaterdc/Turk-NLP
A comprehensive open-source NLP (Natural Language Processing) library for Turkish.
Language: Python - Size: 20.5 KB - Last synced at: 12 days ago - Pushed at: 21 days ago - Stars: 2 - Forks: 0

GhostFireDigital/TokenUp.ai
TokenUp.ai is a modular AI-native token infrastructure protocol designed for next-gen Web3 builders. Includes minting, tokenomics, governance, and analytics modules. Built by GhostFire Digital.
Language: HTML - Size: 1.51 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

fahadabid545/POS-Tagging
Performed Part-of-Speech (POS) tagging using NLTK to label words with their grammatical roles in text data. Useful for NLP preprocessing and syntactic analysis.
Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
Language: Go - Size: 734 KB - Last synced at: 22 days ago - Pushed at: 12 months ago - Stars: 581 - Forks: 21

kensho-technologies/pathpiece
PathPiece tokenizer
Language: Rust - Size: 6.39 MB - Last synced at: 11 days ago - Pushed at: 7 months ago - Stars: 12 - Forks: 1

fattmerchantorg/Fattmerchant-iOS-SDK
Fattmerchant iOS SDK
Language: Swift - Size: 155 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 2

adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Language: Python - Size: 729 MB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 160 - Forks: 14

RavenProject/Ravencoin
Ravencoin Core integration/staging tree
Language: C - Size: 461 MB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 1,096 - Forks: 697

shivendrra/shredword
Fast & efficient BPE tokenizer written in C & Python for LLM training
Language: C++ - Size: 18.2 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

ayushedith/ethermint
Minimal ERC-20 token built with Solidity & Hardhat
Language: JavaScript - Size: 84 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

daac-tools/vibrato
vibrato: Viterbi-based accelerated tokenizer
Language: Rust - Size: 1.08 MB - Last synced at: 24 days ago - Pushed at: about 1 month ago - Stars: 360 - Forks: 15

mridulsaklani/My_Tokenizer
A small tokenizer model, of the kind used by GPT-style AI models, that converts characters into specific assigned tokens and handles encoding and decoding.
Language: Python - Size: 3.99 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

FoundationVision/OmniTokenizer
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
Language: Python - Size: 68.9 MB - Last synced at: 26 days ago - Pushed at: 12 months ago - Stars: 295 - Forks: 6

larsulbricht/awesome-digital-assets
Collection of high-quality resources on blockchain, tokenization, and DLT-based capital markets (EU-Focus)
Size: 307 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 6 - Forks: 0

PolyCash/polycash
The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.
Language: PHP - Size: 32.3 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 47 - Forks: 38

vipbondre/tokenization_xrpl
The Get Your Pass website showcases a secure and private ticketing system powered by the XRP Ledger (XRPL). This system ensures privacy, security, and efficiency throughout the ticket purchase process, abstracting complexities and protecting sensitive user information.
Language: JavaScript - Size: 1.26 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

KanishkNavale/Text-Mining-with-TF-IDF-and-Cosine-Similarity
A simple python repository for developing perceptron based text mining involving dataset linguistics preprocessing for text classification and extracting similar text for a given query.
Language: Jupyter Notebook - Size: 7.34 MB - Last synced at: 9 days ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 1

bminixhofer/zett
Code for Zero-Shot Tokenizer Transfer
Language: Python - Size: 1.04 MB - Last synced at: 20 days ago - Pushed at: 5 months ago - Stars: 128 - Forks: 11

Devansh-Seth-DEV/LexiC
LexiC is a simple and modular C project that converts source code into a stream of tokens. It handles token counting, segmentation, and full tokenization, forming the first stage of a compiler or interpreter pipeline.
Language: C - Size: 713 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

LoopscaleLabs/rwa-token
The RWA Token Program is a wrapper and extension program for Solana Token Extensions that creates a uniform approach to permissions tokens on SVM blockchains.
Language: TypeScript - Size: 51.2 MB - Last synced at: 13 days ago - Pushed at: 9 months ago - Stars: 19 - Forks: 9

spindle-health/carduus
PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.
Language: Python - Size: 1.32 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 15 - Forks: 1

gautierdag/bpeasy
Fast bare-bones BPE for modern tokenizer training
Language: Python - Size: 1.41 MB - Last synced at: 29 days ago - Pushed at: 3 months ago - Stars: 156 - Forks: 5

Basis-Theory/terraform-provider-basistheory
Terraform provider for Basis-Theory
Language: Go - Size: 148 KB - Last synced at: 21 days ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

CO2NEX/carbon-tokens
This repository contains the technical specification, tokenomics, and smart contract blueprints for C2NX tokens, the native digital asset of the CO2NEX platform used for governance, verification bounties, and transaction fees in the carbon offset market.
Size: 4.88 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

lunasec-io/lunasec
LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/
Language: TypeScript - Size: 293 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 1,448 - Forks: 169

AmoDinho/datacamp-python-data-science-track
All the slides, accompanying code, and exercises, stored in this repo.
Language: Python - Size: 74.1 MB - Last synced at: 29 days ago - Pushed at: almost 2 years ago - Stars: 837 - Forks: 522

Deed3Labs/Protocol-Contracts
The Deed Protocol Smart Contracts
Language: Solidity - Size: 3.37 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

ijazul-haq/nlpashto
Pashto Natural Language Processing Toolkit
Size: 62.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 12 - Forks: 0

izikeros/count_tokens
Count tokens in a text file.
Language: Python - Size: 137 KB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

natasha/razdel
Rule-based token, sentence segmentation for Russian language
Language: Python - Size: 37.2 MB - Last synced at: 28 days ago - Pushed at: almost 2 years ago - Stars: 267 - Forks: 32

Worklytics/psoxy
Serverless, pseudonymizing proxy between Worklytics and your workplace SaaS data sources' APIs. Data Loss Prevention (DLP) and compliance layer deployable to AWS Lambda or GCP Cloud Functions.
Language: Java - Size: 34.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 13 - Forks: 6

thearhamsharif/BSCS-UBIT-2k21
Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.
Language: C++ - Size: 13.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

cosmaadrian/strawberry-problem
Official repository for "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models"
Language: Python - Size: 56.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

ixopay/tokenex-ios
TokenEx iOS SDK
Language: Swift - Size: 74.2 KB - Last synced at: 28 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 1

zouharvi/tokenization-scorer
Simple-to-use scoring function for arbitrarily tokenized texts.
Language: Python - Size: 42 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 40 - Forks: 5

sourav200199/Whats-Insight
Get chat insights - all in one go!
Language: Python - Size: 2.95 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

abu14/nlp-assignment-abenezer_tesfaye
Formal submission for the NLP assignment
Size: 12.7 KB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

macmade/ClangKit
ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are actually implemented.
Language: C - Size: 15.2 MB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 365 - Forks: 45

jkrukowski/swift-sentencepiece
Use SentencePiece in Swift for tokenization and detokenization.
Language: Swift - Size: 2.43 MB - Last synced at: 28 days ago - Pushed at: 4 months ago - Stars: 9 - Forks: 2

TI-Toolkit/tivars_lib_py
A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files
Language: Python - Size: 3.79 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 19 - Forks: 1

prabhashj07/nepalikit
NepaliKit is a Python library for natural language processing (NLP) tasks in Nepali. It features tokenization (rule-based and SentencePiece), text preprocessing, stopword management, and sentence segmentation. Ideal for developers and researchers working with Nepali text data.
Language: Python - Size: 364 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 6 - Forks: 0

winkjs/wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Language: JavaScript - Size: 2.05 MB - Last synced at: 12 days ago - Pushed at: over 2 years ago - Stars: 62 - Forks: 12

shivendrra/biosaic
Tokenizer for encoding/decoding dna sequences
Language: Python - Size: 71.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
