GitHub topics: tokenization

Repositories

RAHEEM12344/content-recommendation-engine

A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.

Language: HTML - Size: 187 KB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 0 - Forks: 0

Y3LLOWVESTS/rustyonions

RustyOnions is an experimental Rust-based P2P platform evolving into a decentralized Web3 network. It combines two data planes — a public overlay for chunk storage and a Tor-powered private layer for secure messaging — with bandwidth metering to promote responsible relay participation.

Language: Rust - Size: 8.18 MB - Last synced at: about 21 hours ago - Pushed at: about 21 hours ago - Stars: 2 - Forks: 0

12345far/metrics-calculation-precision-recall

Laboratory 7 - Retrieval Information

Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

Saffronduck5667/precision-r-comparison

Laboratory 8 - Retrieval Information

Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

JayalekshmiSharma/perl-yji

🔧 Simplify your Perl code with YJI, a lightweight tool for generating efficient, reusable code snippets.

Size: 1.29 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

giovannidia/tokenlens

🔍 Enhance AI applications with typed model metadata and context utilities for efficient decision-making and budget management.

Language: TypeScript - Size: 2.1 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

mikiyadd/my-c-array

Dynamic array implementation in C with a modular, folder-based structure.

Language: C - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

johannschopplich/toon

🎒 Token-Oriented Object Notation – JSON for LLMs at half the token cost

Language: TypeScript - Size: 543 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 4,226 - Forks: 87

fbkaragoz/durak

Durak is an open-source modular Turkish NLP preprocessing toolkit

Language: Python - Size: 1.34 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Aishwaraya-Dharmadhikari/NLP_Programs

All Natural Language Processing Programs

Size: 6.84 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

hacker193/cmtat-icma-tokenized-bonds

🌐 Showcase tokenized fixed income solutions with CMTAT and ICMA, featuring advanced trading and analytics for efficient market operations.

Language: TypeScript - Size: 1.58 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Vishwaksena94/tokenloom

🔍 Parse streamed text into structured events using TokenLoom, a TypeScript library designed for flexible handling of real-time data and custom tags.

Language: TypeScript - Size: 1.56 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

spiko-tech/contracts

Contracts for Spiko's tokenized securities.

Language: JavaScript - Size: 1.63 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 9 - Forks: 3

888abd8888/privacy-vault-

🛡️ Empower individuals and organizations to protect data privacy, ensure accountability, and build trust through transparent, open-source solutions.

Language: HTML - Size: 1.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

sunny262565/perl-yji

Size: 1.95 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

5zandorcvh3U/NFT

Example implementations of tokens to represent unique assets, such as collectibles or deeds, using the NEP-171 spec (similar to ERC-721)

Language: Rust - Size: 16.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

Eduleiteyg/youtube_vid_analyzer

📺 Analyze YouTube videos effortlessly. Extract key insights and engage with content through a simple Flask web interface.

Language: Python - Size: 1.3 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

Marcelleedit7272/genai-tokenizer

🧠 Explore tokenization with GenAi-Tokenizer, a user-friendly tool for decoding text, learning vocabulary, and visualizing token types effortlessly.

Language: TypeScript - Size: 4.66 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Slush1004/Pytorch-RNN-create-Q-A-Syste-

🐙 PyTorch RNN-based Q&A system predicts answers from questions using a custom QA dataset. It tokenizes text, builds vocab, and uses embedding, RNN, and linear layers.

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

thearhamsharif/BSCS-UBIT-2k21

Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.

Language: Jupyter Notebook - Size: 37.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

shouldfeelright/rabbithole

‘And I’m not sure what I can do about it,’ she thought, ‘I’m afraid I’ve made a mistake, as I can’t get back to it. It’s like I’m going to go on and on, but I can’t. I don’t know what I am.’

Language: Python - Size: 85 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

AndresEspin1993/b2t-tokenizer

B2T - Tokenizer for the AI Systems.

Language: PowerShell - Size: 240 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 1

Worklytics/psoxy

Serverless ☁️ 🚀 pseudonymizing proxy between Worklytics and your workplace 💼 SaaS data sources' APIs. Data Loss Prevention (DLP) 🛡️🔒 and compliance layer deployable to AWS Lambda or GCP Cloud Functions.

Language: Java - Size: 36.6 MB - Last synced at: about 2 hours ago - Pushed at: about 4 hours ago - Stars: 15 - Forks: 6

sytelus/nanuGPT

Simple, reliable and well tested training code for quick experiments with transformer based models

Language: Python - Size: 4.14 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 12 - Forks: 1

SRWA-Cypherpunk/SRWA

Institutional-grade protocol for tokenizing Real-World Assets (RWAs) on Solana. Features: on-chain compliance, KYC/AML verification, and DeFi integrations for global markets.

Language: TypeScript - Size: 20.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 0

mensfeld/llm-docs-builder

Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.

Language: Ruby - Size: 1.67 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 0

Joe-Naz01/llm_basics

Introduced tokenization, decoding, and prompt-engineering fundamentals for text generation. Demonstrated temperature, top-k/top-p sampling, few-shot prompts, and instruction-based generation, laying the groundwork for efficient and controlled LLM inference.

Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

clipperhouse/uax29

A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.

Language: Go - Size: 913 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 89 - Forks: 4

AhmedDawoud3/Tokenizer

Byte Pair Encoding tokenizer supporting Arabic text with full diacritical marks (تشكيل). Train, save, and deploy custom tokenizers.

Language: Python - Size: 17.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

Deeptanshu-sankhwar/Datomacy

A privacy-first YouTube Data DAO built on Vana that empowers users to capture, own, and monetize their YouTube behavioral data. Features a Chrome extension for real-time data collection and a Next.js web app for DAO participation. Users maintain complete control over their viewing patterns, ad interactions, and engagement data while earning tokens.

Language: TypeScript - Size: 22 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 1

PolyCash/polycash

The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.

Language: PHP - Size: 33.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 47 - Forks: 38

explosion/spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Language: Python - Size: 194 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 32,703 - Forks: 4,609

Shubham64364/nlp-nltk-python

🚀 Explore NLP fundamentals with Python’s NLTK library through clear examples and hands-on tasks in tokenization, analysis, and classification.

Language: Python - Size: 2.99 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

Pacatro/gpoetry

A tiny GPT model to generate Spanish poetry

Language: Python - Size: 10.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

izsolnay/Ancient_NLP

Goal: Discover whether modern NLP tools and predictive algorithms can provide insights into ancient text corpora

Language: Jupyter Notebook - Size: 6.57 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

erikbatista42/Tiny-LLM

How to create a small LLM built with the transformer architecture in Python.

Language: Python - Size: 9.77 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

CompLin/nheengatu

Tools and resources for the computational processing of Nheengatu (Modern Tupi)

Language: Python - Size: 40.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 12 - Forks: 4

mhasegawa7045/Film_NLP_Sentimental_Analysis_Machine_Learning

[Tokenization, Topic Modeling, Sentiment Analysis, Network of Bigrams] This project tests whether text mining techniques can categorize movies from their Descriptions alone, ignoring the Genre column of the dataset IMDB_movies.csv, which is loaded into the data frame variable movies_desc. Tokenization (TF-IDF) was used to analyze term frequencies in movie Descriptions efficiently, so that the conceptual theme of a movie franchise can be determined even by someone who has never watched any of the films. Topic Modeling finds the mixture of terms associated with each topic and the mixture of topics that distinguishes each document in IMDB_movies.csv. Sentiment Analysis clusters movies by sentiment using the bing and NRC lexicons to see how sentiment affects Rating and Revenue. The network of bigrams summarizes how frequent word-terms in movie Descriptions relate to one another and how they connect to other movies.

Language: HTML - Size: 7.41 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

securitybunker/databunker

Secure Vault for Customer PII/PHI/PCI/KYC Records

Language: Go - Size: 11.1 MB - Last synced at: 4 days ago - Pushed at: about 2 months ago - Stars: 1,336 - Forks: 87

johannschopplich/tokenx

📐 Fast token estimation at 94% accuracy of a full tokenizer in a 2kB bundle

Language: TypeScript - Size: 536 KB - Last synced at: 7 days ago - Pushed at: 15 days ago - Stars: 47 - Forks: 3

delBull/saaspandoras Fork of nextify-limited/saasfly

Acquire your right to participate in exclusive projects with Pandoras.

Language: TypeScript - Size: 204 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

NVIDIA/Cosmos-Tokenizer 📦

A suite of image and video neural tokenizers

Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 1,674 - Forks: 83

tnqbao/gau-authorization-service

Authorization service written in Go, designed to manage authentication, token refresh, and user permissions.

Language: Go - Size: 94.7 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

GlitchedPolygons/l8w8jwt

Minimal, OpenSSL-less and super lightweight JWT library written in C.

Language: C - Size: 6.25 MB - Last synced at: 1 day ago - Pushed at: 3 months ago - Stars: 168 - Forks: 47

AmoDinho/datacamp-python-data-science-track

All the slides, accompanying code, and exercises, stored in this repo. 🎈

Language: Python - Size: 74.1 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 872 - Forks: 528

verygoodsecurity/vgs-collect-ios

VGS Collect iOS SDK

Language: Swift - Size: 64.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 23 - Forks: 18

Deed3Labs/Protocol-Contracts

The Deed Protocol Smart Contracts 📑

Language: TypeScript - Size: 4.11 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 1

FatimaALzahrani/Byte-Pair-Encoding-Demo

A minimal Python implementation of Byte Pair Encoding (BPE) with step-by-step visualization of merge operations and vocabulary updates.

Language: HTML - Size: 83 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

venkat-0706/Twalyze

Twitter sentiment analysis project using machine learning to classify tweets and understand audience mood, opinions, and behavior trends in real-time.

Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 10 - Forks: 1

dart-community/opal

Dart package with basic tokenization and syntax highlighting support for various programming languages and data formats.

Language: Dart - Size: 60.5 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

rth/vtext

Simple NLP in Rust with Python bindings

Language: Rust - Size: 273 KB - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 153 - Forks: 9

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

Language: Rust - Size: 3.99 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 245 - Forks: 10

kantkrishan0206-crypto/AlignGPT

This project implements a mini LLM alignment pipeline using Reinforcement Learning from Human Feedback (RLHF). It includes training a reward model from human-annotated preference data, fine-tuning the language model via policy optimization, and performing ablation studies to evaluate robustness, fairness, and alignment trade-offs.

Language: Jupyter Notebook - Size: 5.77 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

mahnoorsheikh16/NLP-Framework-for-Literature-Summarization-in-Law-and-Policy

Implementation of an interactive chatbot for summarizing legal and policy documents. Includes data preprocessing (cleaning, tokenization, chunking), extractive summarization baselines, and fine-tuned abstractive models (PEGASUS and LED). Integrates a retrieval layer for document relevance and uses ROUGE, BLEU, and cosine similarity for evaluation.

Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

LingAdeu/language-tokenization-and-embedding

LLMs split text into tokens before converting them to vector embeddings. This repo explains different tokenization strategies applied before embedding conversion.

Size: 5.23 MB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

alexandermorgan/BatchBPE Fork of karpathy/minbpe

Lightweight batched implementation of the Byte Pair Encoding (BPE) algorithm for LLM tokenization.

Language: Python - Size: 1.65 MB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 6 - Forks: 0

nuekkis/Turk-NLP

Comprehensive open-source NLP (Natural Language Processing) library for Turkish.

Language: Python - Size: 25.4 KB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 4 - Forks: 0

uw-swag/tokdrift

Repository for TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar.

Language: Python - Size: 11.5 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 4 - Forks: 0

Bratipah/sylvan-cap

SylvanCap is a Web3 platform that bridges sustainable forestry and decentralized finance (DeFi) by tokenizing individual trees as real-world assets (RWAs).

Language: TypeScript - Size: 2.32 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

qlaxd/Large-Language-Diffusion-with-mAsking

Implementing Diffusion Models for Language Generation

Language: Python - Size: 429 MB - Last synced at: 14 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

nlpcloud/nlpcloud-go

NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...

Language: Go - Size: 111 KB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 10 - Forks: 2

jparkerweb/llm-distillery

🍶 llm-distillery ⇢ use LLMs to run map-reduce summarization tasks on large documents until a target token size is met.

Language: JavaScript - Size: 287 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 11 - Forks: 1

explosion/spacy-streamlit

👑 spaCy building blocks and visualizers for Streamlit apps

Language: Python - Size: 61.5 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 845 - Forks: 119

WorksApplications/sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy

Language: Rust - Size: 15.8 MB - Last synced at: 11 days ago - Pushed at: 4 months ago - Stars: 388 - Forks: 42

Mecanik/Tiny-BPE-Trainer

Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17. Produces HuggingFace-compatible vocabularies for transformers and integrates with Modern Text Tokenizer.

Language: C++ - Size: 33.2 KB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

Language: Go - Size: 200 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 37 - Forks: 8

KanishkNavale/Text-Mining-with-TF-IDF-and-Cosine-Similarity 📦

A simple Python repository for developing perceptron-based text mining, involving linguistic preprocessing of datasets for text classification and extraction of similar text for a given query.

Language: Jupyter Notebook - Size: 7.34 MB - Last synced at: 7 days ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 1

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

Language: Rust - Size: 1.09 MB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 377 - Forks: 17

bastienbot/nlp-js-tools-french

POS tagger, lemmatizer, and stemmer for the French language in JavaScript

Language: JavaScript - Size: 1.04 MB - Last synced at: 20 days ago - Pushed at: about 8 years ago - Stars: 37 - Forks: 8

dracuxan/GoScout

GoScout: Fast, Efficient, Go-powered Search

Language: Go - Size: 13.7 KB - Last synced at: 10 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 0

dl-tokenf/contracts

On-chain RWA Tokenization Framework

Language: Solidity - Size: 1.05 MB - Last synced at: 20 days ago - Pushed at: 3 months ago - Stars: 59 - Forks: 16

VKCOM/YouTokenToMe 📦

Unsupervised text tokenizer focused on computational efficiency

Language: C++ - Size: 192 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 971 - Forks: 108

amr080/finance

Alex's Finance Repo

Size: 35.8 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0

rishishanthan/lstm-sentiment-analysis

End-to-end sentiment analysis with a stacked LSTM in PyTorch — custom tokenization, embeddings, padding, class imbalance handling, and thorough evaluation.

Language: Jupyter Notebook - Size: 7.47 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

CMTA/CMTAT

Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.

Language: JavaScript - Size: 109 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 67 - Forks: 27

TI-Toolkit/tivars_lib_py

A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files

Language: Python - Size: 3.94 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 21 - Forks: 1

stefanwille/llm-tokens-playground

A demo that makes LLM tokenization more tangible.

Language: TypeScript - Size: 64.5 KB - Last synced at: 24 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

KathyReid/token-wars-dataviz

A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.

Language: Python - Size: 454 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

av/klmbr

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

Language: TeX - Size: 2.24 MB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 79 - Forks: 3

gaidardzhiev/shell

*nix command interpreter

Language: C - Size: 446 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

vsce-toolroom/vscode-textmate-languageservice

Language APIs and support features from Textmate tokenization in Visual Studio Code.

Language: TypeScript - Size: 1.27 MB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 21 - Forks: 0

Jathurshan0330/TFM-Tokenizer

Official Code Repository of "Tokenizing Single-Channel EEG with Time-Frequency Motif Learning". arXiv: https://arxiv.org/abs/2502.16060

Language: Jupyter Notebook - Size: 22 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 0

delveopers/Shredword

Fast & efficient BPE tokenizer written in C & Python for LLM training

Language: C++ - Size: 895 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

verygoodsecurity/vgs-collect-android

VGS Collect Android SDK

Language: Kotlin - Size: 14.1 MB - Last synced at: about 17 hours ago - Pushed at: 5 days ago - Stars: 8 - Forks: 9

saulmoralespa/subscription-wompi-woo

Subscription integration with Wompi for WooCommerce

Language: PHP - Size: 394 KB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

Networks-Learning/token-pricing

Repository for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", arXiv 2025

Language: Jupyter Notebook - Size: 10.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

sanderland/script_bpe

Code for the paper "BPE stays on SCRIPT"

Language: Jupyter Notebook - Size: 652 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 15 - Forks: 3

JamiiDao/Krill

Institution-grade server for Solana onchain tokenization, attestation, and stablecoins. Easy attestations, POAPs, tokenization, onchain memberships, and monetary exchange from one server, all controlled by you.

Language: Rust - Size: 458 KB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

jshuadvd/LongRoPE

Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"

Language: Python - Size: 562 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 14

mysto/python-fpe

FPE - Format Preserving Encryption with FF3 in Python

Language: Python - Size: 144 KB - Last synced at: 7 days ago - Pushed at: 5 months ago - Stars: 101 - Forks: 20

cosmaadrian/strawberry-problem

Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"

Language: Python - Size: 67.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

basit-afridi62/nlp-nltk-python

This repository is a hands-on guide to Natural Language Processing (NLP) with Python using NLTK. It includes scripts, explanations, and outputs for tokenization, stopwords, stemming, lemmatization, corpora, WordNet, feature extraction, sentiment analysis, and text classification with machine learning.

Language: Python - Size: 1.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

JanBremec/txc-compressor

TXC – High-performance token-based text and log compressor with superior compression ratios and competitive speed.

Language: Python - Size: 463 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

oshinrathor/ML-NLP-Projects

This repository contains a collection of Machine Learning and NLP projects, including sentiment analysis with NLTK, text preprocessing, and deep learning models. It covers techniques like tokenization, stopword removal, lemmatization, rule-based analysis, and transformer models like BERT for practical NLP applications.

Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

Related Keywords

tokenization (1,118), nlp (297), python (173), natural-language-processing (164), machine-learning (130), tokenizer (126), lemmatization (116), blockchain (85), stemming (84), sentiment-analysis (76), nltk (76), deep-learning (64), llm (54), nlp-machine-learning (43), ethereum (43), text-classification (42), smart-contracts (42), ai (41), tf-idf (39), web3 (37), token (36), spacy (35), transformers (32), pytorch (32), defi (32), bag-of-words (32), transformer (31), stopwords (31), python3 (30), solidity (30), named-entity-recognition (29), text-processing (28), pandas (28), nltk-python (27), rwa (27), parsing (25), bpe (25), tensorflow (25), lstm (25), information-retrieval (25), data-science (24), huggingface (23), text-mining (22), neural-network (22), tokens (22), word2vec (21), language-model (21), artificial-intelligence (20), crypto (20), embeddings (20), java (19), pos-tagging (19), vectorization (19), api (18), ner (18), javascript (18), golang (18), keras (18), rnn (17), logistic-regression (17), nodejs (17), regex (17), lexer (17), xft (17), preprocessing (17), lexical-analysis (17), security (17), nft (17), numpy (16), rust (16), cryptocurrency (16), naive-bayes-classifier (16), spacy-nlp (16), c (16), scikit-learn (16), text-analysis (15), bert (15), llms (15), gpt (15), cosine-similarity (15), text-generation (15), huggingface-transformers (15), lexical-analyzer (15), encryption (15), parser (14), byte-pair-encoding (14), tfidf-vectorizer (14), text-preprocessing (14), bert-model (14), go (14), machine-translation (14), chatbot (14), compiler (13), fintech (13), x-financial-technologies (13), cpp (13), matplotlib (13), jupyter-notebook (13), text (13), fine-tuning (13)