An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: tokenization

RAHEEM12344/content-recommendation-engine

A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.

Language: HTML - Size: 187 KB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 0 - Forks: 0

Y3LLOWVESTS/rustyonions

RustyOnions is an experimental Rust-based P2P platform evolving into a decentralized Web3 network. It combines two data planes (a public overlay for chunk storage and a Tor-powered private layer for secure messaging) with bandwidth metering to promote responsible relay participation.

Language: Rust - Size: 8.18 MB - Last synced at: about 21 hours ago - Pushed at: about 21 hours ago - Stars: 2 - Forks: 0

12345far/metrics-calculation-precision-recall

Laboratory 7 - Retrieval Information

Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

Saffronduck5667/precision-r-comparison

Laboratory 8 - Retrieval Information

Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

JayalekshmiSharma/perl-yji

🔧 Simplify your Perl code with YJI, a lightweight tool for generating efficient and reusable code snippets.

Size: 1.29 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

giovannidia/tokenlens

Enhance AI applications with typed model metadata and context utilities for efficient decision-making and budget management.

Language: TypeScript - Size: 2.1 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

mikiyadd/my-c-array

Dynamic array implementation in C with a modular, folder-based structure.

Language: C - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

johannschopplich/toon

🎒 Token-Oriented Object Notation – JSON for LLMs at half the token cost

Language: TypeScript - Size: 543 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 4,226 - Forks: 87

fbkaragoz/durak

Durak is an open-source modular Turkish NLP preprocessing toolkit

Language: Python - Size: 1.34 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Aishwaraya-Dharmadhikari/NLP_Programs

All Natural Language Processing Programs

Size: 6.84 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

hacker193/cmtat-icma-tokenized-bonds

Showcase tokenized fixed income solutions with CMTAT and ICMA, featuring advanced trading and analytics for efficient market operations.

Language: TypeScript - Size: 1.58 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Vishwaksena94/tokenloom

Parse streamed text into structured events using TokenLoom, a TypeScript library designed for flexible handling of real-time data and custom tags.

Language: TypeScript - Size: 1.56 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

spiko-tech/contracts

Contracts for Spiko's tokenized securities.

Language: JavaScript - Size: 1.63 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 9 - Forks: 3

888abd8888/privacy-vault-

🛡️ Empower individuals and organizations to protect data privacy, ensure accountability, and build trust through transparent, open-source solutions.

Language: HTML - Size: 1.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

sunny262565/perl-yji

Size: 1.95 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

5zandorcvh3U/NFT

Example implementations of tokens to represent unique assets, such as collectibles or deeds, using the NEP-171 spec (similar to ERC-721)

Language: Rust - Size: 16.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

Eduleiteyg/youtube_vid_analyzer

📺 Analyze YouTube videos effortlessly. Extract key insights and engage with content through a simple Flask web interface.

Language: Python - Size: 1.3 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

Marcelleedit7272/genai-tokenizer

🧠 Explore tokenization with GenAi-Tokenizer, a user-friendly tool for decoding text, learning vocabulary, and visualizing token types effortlessly.

Language: TypeScript - Size: 4.66 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Slush1004/Pytorch-RNN-create-Q-A-Syste-

๐Ÿ™ PyTorch RNN-based Q&A system predicts answers from questions using a custom QA dataset. It tokenizes text, builds vocab, and uses embedding, RNN, and linear layers.

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

thearhamsharif/BSCS-UBIT-2k21

Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.

Language: Jupyter Notebook - Size: 37.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

shouldfeelright/rabbithole

‘And I’m not sure what I can do about it,’ she thought, ‘I’m afraid I’ve made a mistake, as I can’t get back to it. It’s like I’m going to go on and on, but I can’t. I don’t know what I am.’

Language: Python - Size: 85 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

AndresEspin1993/b2t-tokenizer

B2T - Tokenizer for the AI Systems.

Language: PowerShell - Size: 240 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 1

Worklytics/psoxy

serverless ☁️ 🚀, pseudonymizing proxy between Worklytics and your workplace 💼 SaaS data sources' APIs. Data Loss Prevention (DLP) 🛡️🔒 and compliance layer deployable to AWS Lambda or GCP Cloud Functions.

Language: Java - Size: 36.6 MB - Last synced at: about 2 hours ago - Pushed at: about 4 hours ago - Stars: 15 - Forks: 6

sytelus/nanuGPT

Simple, reliable, and well-tested training code for quick experiments with transformer-based models

Language: Python - Size: 4.14 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 12 - Forks: 1

SRWA-Cypherpunk/SRWA

Institutional-grade protocol for tokenizing Real-World Assets (RWAs) on Solana. Features: on-chain compliance, KYC/AML verification, and DeFi integrations for global markets.

Language: TypeScript - Size: 20.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 0

mensfeld/llm-docs-builder

Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.

Language: Ruby - Size: 1.67 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 0

Joe-Naz01/llm_basics

Introduced tokenization, decoding, and prompt-engineering fundamentals for text generation. Demonstrated temperature, top-k/top-p sampling, few-shot prompts, and instruction-based generation, laying the groundwork for efficient and controlled LLM inference.

Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

Language: Go - Size: 913 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 89 - Forks: 4

AhmedDawoud3/Tokenizer

Byte Pair Encoding tokenizer supporting Arabic text with full diacritical marks (تشكيل). Train, save, and deploy custom tokenizers.

Language: Python - Size: 17.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Deeptanshu-sankhwar/Datomacy

A privacy-first YouTube Data DAO built on Vana that empowers users to capture, own, and monetize their YouTube behavioral data. Features a Chrome extension for real-time data collection and a Next.js web app for DAO participation. Users maintain complete control over their viewing patterns, ad interactions, and engagement data while earning tokens.

Language: TypeScript - Size: 22 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 1

PolyCash/polycash

The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.

Language: PHP - Size: 33.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 47 - Forks: 38

explosion/spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Language: Python - Size: 194 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 32,703 - Forks: 4,609

Shubham64364/nlp-nltk-python

🚀 Explore NLP fundamentals with Python’s NLTK library through clear examples and hands-on tasks in tokenization, analysis, and classification.

Language: Python - Size: 2.99 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

Pacatro/gpoetry

A tiny GPT model to generate Spanish poetry

Language: Python - Size: 10.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

izsolnay/Ancient_NLP

Goal: Discover whether modern NLP tools and predictive algorithms can provide insights into ancient text corpora

Language: Jupyter Notebook - Size: 6.57 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

erikbatista42/Tiny-LLM

How to create a small LLM built with the transformer architecture in Python.

Language: Python - Size: 9.77 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

CompLin/nheengatu

Tools and resources for the computational processing of Nheengatu (Modern Tupi)

Language: Python - Size: 40.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 12 - Forks: 4

mhasegawa7045/Film_NLP_Sentimental_Analysis_Machine_Learning

[Tokenization, Topic Modeling, Sentiment Analysis, Network of Bigrams] This project tests whether text mining techniques can categorize movies from their Descriptions alone, ignoring the Genre column, using the dataset IMDB_movies.csv (stored in the data frame variable movies_desc). Tokenization (TF-DF) was used to analyze term frequencies in movie Descriptions efficiently, so that the conceptual theme of a movie franchise can be determined even by someone who has never watched any of the films. Topic Modeling finds the mixture of terms associated with each topic and the mixture of topics that distinguishes each document. Sentiment Analysis groups movies into sentiment clusters using the bing and NRC lexicons to see how sentiment affects Rating and Revenue. The network of bigrams summarizes how frequent word pairs in movie Descriptions relate to one another and connect across movies.

Language: HTML - Size: 7.41 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

securitybunker/databunker

Secure Vault for Customer PII/PHI/PCI/KYC Records

Language: Go - Size: 11.1 MB - Last synced at: 4 days ago - Pushed at: about 2 months ago - Stars: 1,336 - Forks: 87

johannschopplich/tokenx

Fast token estimation at 94% accuracy of a full tokenizer in a 2kB bundle

Language: TypeScript - Size: 536 KB - Last synced at: 7 days ago - Pushed at: 15 days ago - Stars: 47 - Forks: 3

delBull/saaspandoras Fork of nextify-limited/saasfly

Acquire your right to participate in exclusive projects with Pandoras.

Language: TypeScript - Size: 204 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

NVIDIA/Cosmos-Tokenizer 📦

A suite of image and video neural tokenizers

Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 1,674 - Forks: 83

tnqbao/gau-authorization-service

Authorization service written in Go, designed to manage authentication, token refresh, and user permissions.

Language: Go - Size: 94.7 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

GlitchedPolygons/l8w8jwt

Minimal, OpenSSL-less and super lightweight JWT library written in C.

Language: C - Size: 6.25 MB - Last synced at: 1 day ago - Pushed at: 3 months ago - Stars: 168 - Forks: 47

AmoDinho/datacamp-python-data-science-track

All the slides, accompanying code, and exercises, stored in this repo. 🎈

Language: Python - Size: 74.1 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 872 - Forks: 528

verygoodsecurity/vgs-collect-ios

VGS Collect iOS SDK

Language: Swift - Size: 64.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 23 - Forks: 18

Deed3Labs/Protocol-Contracts

The Deed Protocol Smart Contracts 📑

Language: TypeScript - Size: 4.11 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 1

FatimaALzahrani/Byte-Pair-Encoding-Demo

A minimal Python implementation of Byte Pair Encoding (BPE) with step-by-step visualization of merge operations and vocabulary updates.

Language: HTML - Size: 83 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

venkat-0706/Twalyze

Twitter sentiment analysis project using machine learning to classify tweets and understand audience mood, opinions, and behavior trends in real-time.

Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 10 - Forks: 1

dart-community/opal

Dart package with basic tokenization and syntax highlighting support for various programming languages and data formats.

Language: Dart - Size: 60.5 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

rth/vtext

Simple NLP in Rust with Python bindings

Language: Rust - Size: 273 KB - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 153 - Forks: 9

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

Language: Rust - Size: 3.99 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 245 - Forks: 10

kantkrishan0206-crypto/AlignGPT

“This project implements a mini LLM alignment pipeline using Reinforcement Learning from Human Feedback (RLHF). It includes training a reward model from human-annotated preference data, fine-tuning the language model via policy optimization, and performing ablation studies to evaluate robustness, fairness, and alignment trade-offs.”

Language: Jupyter Notebook - Size: 5.77 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

mahnoorsheikh16/NLP-Framework-for-Literature-Summarization-in-Law-and-Policy

Implementation of an interactive chatbot for summarizing legal and policy documents. Includes data preprocessing (cleaning, tokenization, chunking), extractive summarization baselines, and fine-tuned abstractive models (PEGASUS and LED). Integrates a retrieval layer for document relevance and uses ROUGE, BLEU, and cosine similarity for evaluation.

Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

LingAdeu/language-tokenization-and-embedding

LLMs split text into tokens before converting them to vector embeddings. This repo explains different tokenization strategies applied before embedding conversion.

Size: 5.23 MB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

alexandermorgan/BatchBPE Fork of karpathy/minbpe

Lightweight batched implementation of the Byte Pair Encoding (BPE) algorithm for LLM tokenization.

Language: Python - Size: 1.65 MB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 6 - Forks: 0

nuekkis/Turk-NLP

Comprehensive open-source NLP (Natural Language Processing) library for Turkish.

Language: Python - Size: 25.4 KB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 4 - Forks: 0

uw-swag/tokdrift

Repository for TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar.

Language: Python - Size: 11.5 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 4 - Forks: 0

Bratipah/sylvan-cap

SylvanCap is a revolutionary Web3 platform that bridges sustainable forestry and decentralized finance (DeFi) by tokenizing individual trees as real-world assets (RWAs)

Language: TypeScript - Size: 2.32 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

qlaxd/Large-Language-Diffusion-with-mAsking

Implementing Diffusion Models for Language Generation

Language: Python - Size: 429 MB - Last synced at: 14 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

nlpcloud/nlpcloud-go

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...

Language: Go - Size: 111 KB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 10 - Forks: 2

jparkerweb/llm-distillery

llm-distillery ⇢ use LLMs to run map-reduce summarization tasks on large documents until a target token size is met.

Language: JavaScript - Size: 287 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 11 - Forks: 1

explosion/spacy-streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

Language: Python - Size: 61.5 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 845 - Forks: 119

WorksApplications/sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy

Language: Rust - Size: 15.8 MB - Last synced at: 11 days ago - Pushed at: 4 months ago - Stars: 388 - Forks: 42

Mecanik/Tiny-BPE-Trainer

Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17. Produces HuggingFace-compatible vocabularies for transformers and integrates with Modern Text Tokenizer.

Language: C++ - Size: 33.2 KB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

Language: Go - Size: 200 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 37 - Forks: 8

KanishkNavale/Text-Mining-with-TF-IDF-and-Cosine-Similarity 📦

A simple Python repository for developing perceptron-based text mining, involving linguistic preprocessing of a dataset for text classification and extraction of similar text for a given query.

Language: Jupyter Notebook - Size: 7.34 MB - Last synced at: 7 days ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 1

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

Language: Rust - Size: 1.09 MB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 377 - Forks: 17

bastienbot/nlp-js-tools-french

POS tagger, lemmatizer, and stemmer for the French language in JavaScript

Language: JavaScript - Size: 1.04 MB - Last synced at: 20 days ago - Pushed at: about 8 years ago - Stars: 37 - Forks: 8

dracuxan/GoScout

GoScout: Fast, Efficient, Go-powered Search

Language: Go - Size: 13.7 KB - Last synced at: 10 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 0

dl-tokenf/contracts

On-chain RWA Tokenization Framework

Language: Solidity - Size: 1.05 MB - Last synced at: 20 days ago - Pushed at: 3 months ago - Stars: 59 - Forks: 16

VKCOM/YouTokenToMe 📦

Unsupervised text tokenizer focused on computational efficiency

Language: C++ - Size: 192 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 971 - Forks: 108

amr080/finance

Alex's Finance Repo

Size: 35.8 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

rishishanthan/lstm-sentiment-analysis

End-to-end sentiment analysis with a stacked LSTM in PyTorch: custom tokenization, embeddings, padding, class imbalance handling, and thorough evaluation.

Language: Jupyter Notebook - Size: 7.47 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

CMTA/CMTAT

Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.

Language: JavaScript - Size: 109 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 67 - Forks: 27

TI-Toolkit/tivars_lib_py

A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files

Language: Python - Size: 3.94 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 21 - Forks: 1

stefanwille/llm-tokens-playground

A demo that makes LLM tokenization more tangible.

Language: TypeScript - Size: 64.5 KB - Last synced at: 24 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

KathyReid/token-wars-dataviz

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

Language: Python - Size: 454 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

av/klmbr

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

Language: TeX - Size: 2.24 MB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 79 - Forks: 3

gaidardzhiev/shell

*nix command interpreter

Language: C - Size: 446 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

vsce-toolroom/vscode-textmate-languageservice

Language APIs and support features from Textmate tokenization in Visual Studio Code.

Language: TypeScript - Size: 1.27 MB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 21 - Forks: 0

Jathurshan0330/TFM-Tokenizer

Official Code Repository of "Tokenizing Single-Channel EEG with Time-Frequency Motif Learning". arXiv: https://arxiv.org/abs/2502.16060

Language: Jupyter Notebook - Size: 22 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 0

delveopers/Shredword

Fast & efficient BPE tokenizer written in C & Python for LLM training

Language: C++ - Size: 895 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

verygoodsecurity/vgs-collect-android

VGS Collect Android SDK

Language: Kotlin - Size: 14.1 MB - Last synced at: about 17 hours ago - Pushed at: 5 days ago - Stars: 8 - Forks: 9

saulmoralespa/subscription-wompi-woo

Subscription integration with Wompi for WooCommerce

Language: PHP - Size: 394 KB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

Networks-Learning/token-pricing

Repository for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", Arxiv 2025

Language: Jupyter Notebook - Size: 10.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

sanderland/script_bpe

Code for the paper "BPE stays on SCRIPT"

Language: Jupyter Notebook - Size: 652 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 15 - Forks: 3

JamiiDao/Krill

Institution-grade server for Solana on-chain tokenization, attestation, and stablecoins. Easy attestations, POAPs, tokenization, on-chain memberships, and Monetary Exchange from one server, all controlled by you

Language: Rust - Size: 458 KB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

jshuadvd/LongRoPE

Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper

Language: Python - Size: 562 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 14

mysto/python-fpe

FPE - Format Preserving Encryption with FF3 in Python

Language: Python - Size: 144 KB - Last synced at: 7 days ago - Pushed at: 5 months ago - Stars: 101 - Forks: 20

cosmaadrian/strawberry-problem

Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"

Language: Python - Size: 67.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

basit-afridi62/nlp-nltk-python

This repository is a hands-on guide to Natural Language Processing (NLP) with Python using NLTK. It includes scripts, explanations, and outputs for tokenization, stopwords, stemming, lemmatization, corpora, WordNet, feature extraction, sentiment analysis, and text classification with machine learning.

Language: Python - Size: 1.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

JanBremec/txc-compressor

TXC – High-performance token-based text and log compressor with superior compression ratios and competitive speed.

Language: Python - Size: 463 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

oshinrathor/ML-NLP-Projects

This repository contains a collection of Machine Learning and NLP projects, including sentiment analysis with NLTK, text preprocessing, and deep learning models. It covers techniques like tokenization, stopword removal, lemmatization, rule-based analysis, and transformer models like BERT for practical NLP applications.

Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

AndyFerns/Automated-Reasoning-Project

A project aiming to implement Automated Reasoning in First Order Logic using NLP

Language: Python - Size: 122 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

spindle-health/carduus

PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.

Language: Python - Size: 1.48 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 20 - Forks: 1

trag1c/crossandra-rs

(WIP) A straightforward tokenization library for seamless text processing.

Language: Rust - Size: 708 KB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 1

ImadSaddik/Train_Your_Language_Model_Course

Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.

Language: Jupyter Notebook - Size: 59.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 208 - Forks: 108

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Language: C++ - Size: 1.69 MB - Last synced at: 23 days ago - Pushed at: 7 months ago - Stars: 319 - Forks: 76

LeoMSgit/Personal-Lib---AI-ML-NLP-CV

Collection of Notes, Guides, and Examples for Artificial Intelligence, Machine Learning, Natural Language Processing and Computer Vision

Size: 186 KB - Last synced at: 25 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0