GitHub topics: tokenization
RAHEEM12344/content-recommendation-engine
A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.
Language: HTML - Size: 187 KB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 0 - Forks: 0
Y3LLOWVESTS/rustyonions
RustyOnions is an experimental Rust-based P2P platform evolving into a decentralized Web3 network. It combines two data planes (a public overlay for chunk storage and a Tor-powered private layer for secure messaging) with bandwidth metering to promote responsible relay participation.
Language: Rust - Size: 8.18 MB - Last synced at: about 21 hours ago - Pushed at: about 21 hours ago - Stars: 2 - Forks: 0
12345far/metrics-calculation-precision-recall
Laboratory 7 - Retrieval Information
Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
Saffronduck5667/precision-r-comparison
Laboratory 8 - Retrieval Information
Size: 1.95 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0
JayalekshmiSharma/perl-yji
Simplify your Perl code with YJI, a lightweight tool for efficiently generating reusable code snippets.
Size: 1.29 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
giovannidia/tokenlens
Enhance AI applications with typed model metadata and context utilities for efficient decision-making and budget management.
Language: TypeScript - Size: 2.1 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
mikiyadd/my-c-array
Dynamic array implementation in C with a modular, folder-based structure.
Language: C - Size: 12.7 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
johannschopplich/toon
Token-Oriented Object Notation: JSON for LLMs at half the token cost
Language: TypeScript - Size: 543 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 4,226 - Forks: 87
fbkaragoz/durak
Durak is an open-source modular Turkish NLP preprocessing toolkit
Language: Python - Size: 1.34 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
Aishwaraya-Dharmadhikari/NLP_Programs
All Natural Language Processing Programs
Size: 6.84 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
hacker193/cmtat-icma-tokenized-bonds
Showcase tokenized fixed income solutions with CMTAT and ICMA, featuring advanced trading and analytics for efficient market operations.
Language: TypeScript - Size: 1.58 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
Vishwaksena94/tokenloom
Parse streamed text into structured events using TokenLoom, a TypeScript library designed for flexible handling of real-time data and custom tags.
Language: TypeScript - Size: 1.56 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
spiko-tech/contracts
Contracts for Spiko's tokenized securities.
Language: JavaScript - Size: 1.63 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 9 - Forks: 3
888abd8888/privacy-vault-
Empower individuals and organizations to protect data privacy, ensure accountability, and build trust through transparent, open-source solutions.
Language: HTML - Size: 1.6 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
sunny262565/perl-yji
Size: 1.95 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0
5zandorcvh3U/NFT
Example implementations of tokens to represent unique assets, such as collectibles or deeds, using the NEP-171 spec (similar to ERC-721)
Language: Rust - Size: 16.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0
Eduleiteyg/youtube_vid_analyzer
Analyze YouTube videos effortlessly. Extract key insights and engage with content through a simple Flask web interface.
Language: Python - Size: 1.3 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0
Marcelleedit7272/genai-tokenizer
Explore tokenization with GenAi-Tokenizer, a user-friendly tool for decoding text, learning vocabulary, and visualizing token types effortlessly.
Language: TypeScript - Size: 4.66 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0
Slush1004/Pytorch-RNN-create-Q-A-Syste-
PyTorch RNN-based Q&A system that predicts answers from questions using a custom QA dataset. It tokenizes text, builds a vocabulary, and uses embedding, RNN, and linear layers.
Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0
thearhamsharif/BSCS-UBIT-2k21
Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.
Language: Jupyter Notebook - Size: 37.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0
shouldfeelright/rabbithole
"And I'm not sure what I can do about it," she thought, "I'm afraid I've made a mistake, as I can't get back to it. It's like I'm going to go on and on, but I can't. I don't know what I am."
Language: Python - Size: 85 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0
AndresEspin1993/b2t-tokenizer
B2T - Tokenizer for the AI Systems.
Language: PowerShell - Size: 240 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 1
Worklytics/psoxy
Serverless pseudonymizing proxy between Worklytics and your workplace SaaS data sources' APIs. Data Loss Prevention (DLP) and compliance layer deployable to AWS Lambda or GCP Cloud Functions.
Language: Java - Size: 36.6 MB - Last synced at: about 2 hours ago - Pushed at: about 4 hours ago - Stars: 15 - Forks: 6
sytelus/nanuGPT
Simple, reliable and well tested training code for quick experiments with transformer based models
Language: Python - Size: 4.14 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 12 - Forks: 1
SRWA-Cypherpunk/SRWA
Institutional-grade protocol for tokenizing Real-World Assets (RWAs) on Solana. Features: on-chain compliance, KYC/AML verification, and DeFi integrations for global markets.
Language: TypeScript - Size: 20.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 0
mensfeld/llm-docs-builder
Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.
Language: Ruby - Size: 1.67 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 8 - Forks: 0
Joe-Naz01/llm_basics
Introduced tokenization, decoding, and prompt-engineering fundamentals for text generation. Demonstrated temperature, top-k/top-p sampling, few-shot prompts, and instruction-based generation, laying the groundwork for efficient and controlled LLM inference.
Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0
clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
Language: Go - Size: 913 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 89 - Forks: 4
AhmedDawoud3/Tokenizer
Byte Pair Encoding tokenizer supporting Arabic text with full diacritical marks (تشكيل). Train, save, and deploy custom tokenizers.
Language: Python - Size: 17.6 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0
Deeptanshu-sankhwar/Datomacy
A privacy-first YouTube Data DAO built on Vana that empowers users to capture, own, and monetize their YouTube behavioral data. Features a Chrome extension for real-time data collection and a Next.js web app for DAO participation. Users maintain complete control over their viewing patterns, ad interactions, and engagement data while earning tokens.
Language: TypeScript - Size: 22 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 6 - Forks: 1
PolyCash/polycash
The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.
Language: PHP - Size: 33.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 47 - Forks: 38
explosion/spaCy
Industrial-strength Natural Language Processing (NLP) in Python
Language: Python - Size: 194 MB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 32,703 - Forks: 4,609
Shubham64364/nlp-nltk-python
Explore NLP fundamentals with Python's NLTK library through clear examples and hands-on tasks in tokenization, analysis, and classification.
Language: Python - Size: 2.99 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
Pacatro/gpoetry
A tiny GPT model to generate Spanish poetry
Language: Python - Size: 10.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0
izsolnay/Ancient_NLP
Goal: Discover whether modern NLP tools and predictive algorithms can provide insights into ancient text corpora
Language: Jupyter Notebook - Size: 6.57 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
erikbatista42/Tiny-LLM
How to create a small LLM built with the transformer architecture in Python.
Language: Python - Size: 9.77 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0
CompLin/nheengatu
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
Language: Python - Size: 40.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 12 - Forks: 4
mhasegawa7045/Film_NLP_Sentimental_Analysis_Machine_Learning
[Tokenization, Topic Modeling, Sentiment Analysis, Network of Bigrams] This project tests whether text mining techniques can help categorize movies using only their Descriptions, ignoring the Genre column of the dataset IMDB_movies.csv, which is stored in the data frame variable movies_desc. Tokenization (TF-DF) was used to analyze term frequencies in movie Descriptions more efficiently, so that the conceptual theme of a movie franchise can be determined even by someone who has never watched any of the films. Topic Modeling creates the mixtures of terms correlated with each topic and the mixture of topics that distinguishes each document in IMDB_movies.csv. Sentiment Analysis focuses on movies grouped into sentiment clusters, using the bing and NRC lexicons to see how sentiment affects Rating and Revenue. The network of bigrams summarizes how frequent word terms in movie Descriptions relate to one another and how they connect to other movies.
Language: HTML - Size: 7.41 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0
securitybunker/databunker
Secure Vault for Customer PII/PHI/PCI/KYC Records
Language: Go - Size: 11.1 MB - Last synced at: 4 days ago - Pushed at: about 2 months ago - Stars: 1,336 - Forks: 87
johannschopplich/tokenx
Fast token estimation at 94% accuracy of a full tokenizer in a 2kB bundle
Language: TypeScript - Size: 536 KB - Last synced at: 7 days ago - Pushed at: 15 days ago - Stars: 47 - Forks: 3
delBull/saaspandoras Fork of nextify-limited/saasfly
Acquire your right to participate in exclusive projects with Pandoras.
Language: TypeScript - Size: 204 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0
NVIDIA/Cosmos-Tokenizer
A suite of image and video neural tokenizers
Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 1,674 - Forks: 83
tnqbao/gau-authorization-service
Authorization service written in Go, designed to manage authentication, token refresh, and user permissions.
Language: Go - Size: 94.7 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0
GlitchedPolygons/l8w8jwt
Minimal, OpenSSL-less and super lightweight JWT library written in C.
Language: C - Size: 6.25 MB - Last synced at: 1 day ago - Pushed at: 3 months ago - Stars: 168 - Forks: 47
AmoDinho/datacamp-python-data-science-track
All the slides, accompanying code, and exercises, stored in this repo.
Language: Python - Size: 74.1 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 872 - Forks: 528
verygoodsecurity/vgs-collect-ios
VGS Collect iOS SDK
Language: Swift - Size: 64.8 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 23 - Forks: 18
Deed3Labs/Protocol-Contracts
The Deed Protocol Smart Contracts
Language: TypeScript - Size: 4.11 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 1
FatimaALzahrani/Byte-Pair-Encoding-Demo
A minimal Python implementation of Byte Pair Encoding (BPE) with step-by-step visualization of merge operations and vocabulary updates.
Language: HTML - Size: 83 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0
venkat-0706/Twalyze
Twitter sentiment analysis project using machine learning to classify tweets and understand audience mood, opinions, and behavior trends in real-time.
Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 5 days ago - Pushed at: 6 months ago - Stars: 10 - Forks: 1
dart-community/opal
Dart package with basic tokenization and syntax highlighting support for various programming languages and data formats.
Language: Dart - Size: 60.5 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0
rth/vtext
Simple NLP in Rust with Python bindings
Language: Rust - Size: 273 KB - Last synced at: 2 days ago - Pushed at: over 2 years ago - Stars: 153 - Forks: 9
daac-tools/vaporetto
Vaporetto: Very accelerated pointwise prediction based tokenizer
Language: Rust - Size: 3.99 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 245 - Forks: 10
kantkrishan0206-crypto/AlignGPT
This project implements a mini LLM alignment pipeline using Reinforcement Learning from Human Feedback (RLHF). It includes training a reward model from human-annotated preference data, fine-tuning the language model via policy optimization, and performing ablation studies to evaluate robustness, fairness, and alignment trade-offs.
Language: Jupyter Notebook - Size: 5.77 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0
mahnoorsheikh16/NLP-Framework-for-Literature-Summarization-in-Law-and-Policy
Implementation of an interactive chatbot for summarizing legal and policy documents. Includes data preprocessing (cleaning, tokenization, chunking), extractive summarization baselines, and fine-tuned abstractive models (PEGASUS and LED). Integrates a retrieval layer for document relevance and uses ROUGE, BLEU, and cosine similarity for evaluation.
Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0
LingAdeu/language-tokenization-and-embedding
LLMs split text into tokens before converting them to vector embeddings. This repo explains different tokenization strategies applied before embedding conversion.
Size: 5.23 MB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0
alexandermorgan/BatchBPE Fork of karpathy/minbpe
Lightweight batched implementation of the Byte Pair Encoding (BPE) algorithm for LLM tokenization.
Language: Python - Size: 1.65 MB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 6 - Forks: 0
nuekkis/Turk-NLP
Comprehensive open-source NLP (Natural Language Processing) library for Turkish.
Language: Python - Size: 25.4 KB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 4 - Forks: 0
uw-swag/tokdrift
Repository for TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar.
Language: Python - Size: 11.5 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 4 - Forks: 0
Bratipah/sylvan-cap
SylvanCap is a Web3 platform that bridges sustainable forestry and decentralized finance (DeFi) by tokenizing individual trees as real-world assets (RWAs).
Language: TypeScript - Size: 2.32 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0
qlaxd/Large-Language-Diffusion-with-mAsking
Implementing Diffusion Models for Language Generation
Language: Python - Size: 429 MB - Last synced at: 14 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0
nlpcloud/nlpcloud-go
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...
Language: Go - Size: 111 KB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 10 - Forks: 2
jparkerweb/llm-distillery
llm-distillery: use LLMs to run map-reduce summarization tasks on large documents until a target token size is met.
Language: JavaScript - Size: 287 KB - Last synced at: 16 days ago - Pushed at: 17 days ago - Stars: 11 - Forks: 1
explosion/spacy-streamlit
spaCy building blocks and visualizers for Streamlit apps
Language: Python - Size: 61.5 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 845 - Forks: 119
WorksApplications/sudachi.rs
Sudachi in Rust and new generation of SudachiPy
Language: Rust - Size: 15.8 MB - Last synced at: 11 days ago - Pushed at: 4 months ago - Stars: 388 - Forks: 42
Mecanik/Tiny-BPE-Trainer
Lightweight, header-only Byte Pair Encoding (BPE) trainer in modern C++17. Produces HuggingFace-compatible vocabularies for transformers and integrates with Modern Text Tokenizer.
Language: C++ - Size: 33.2 KB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0
eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Language: Go - Size: 200 KB - Last synced at: 5 days ago - Pushed at: about 1 year ago - Stars: 37 - Forks: 8
KanishkNavale/Text-Mining-with-TF-IDF-and-Cosine-Similarity
A simple Python repository for developing perceptron-based text mining, involving linguistic preprocessing of datasets for text classification and extraction of similar text for a given query.
Language: Jupyter Notebook - Size: 7.34 MB - Last synced at: 7 days ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 1
daac-tools/vibrato
vibrato: Viterbi-based accelerated tokenizer
Language: Rust - Size: 1.09 MB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 377 - Forks: 17
bastienbot/nlp-js-tools-french
POS tagger, lemmatizer and stemmer for the French language in JavaScript
Language: JavaScript - Size: 1.04 MB - Last synced at: 20 days ago - Pushed at: about 8 years ago - Stars: 37 - Forks: 8
dracuxan/GoScout
GoScout: Fast, Efficient, Go-powered Search
Language: Go - Size: 13.7 KB - Last synced at: 10 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 0
dl-tokenf/contracts
On-chain RWA Tokenization Framework
Language: Solidity - Size: 1.05 MB - Last synced at: 20 days ago - Pushed at: 3 months ago - Stars: 59 - Forks: 16
VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
Language: C++ - Size: 192 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 971 - Forks: 108
amr080/finance
Alex's Finance Repo
Size: 35.8 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 0
rishishanthan/lstm-sentiment-analysis
End-to-end sentiment analysis with a stacked LSTM in PyTorch: custom tokenization, embeddings, padding, class imbalance handling, and thorough evaluation.
Language: Jupyter Notebook - Size: 7.47 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0
CMTA/CMTAT
Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.
Language: JavaScript - Size: 109 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 67 - Forks: 27
TI-Toolkit/tivars_lib_py
A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files
Language: Python - Size: 3.94 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 21 - Forks: 1
stefanwille/llm-tokens-playground
A demo that makes LLM tokenization more tangible.
Language: TypeScript - Size: 64.5 KB - Last synced at: 24 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0
KathyReid/token-wars-dataviz
A data visualisation in `matplotlib` of the number of parameters in major LLMs as well as the number of tokens of text they were trained on.
Language: Python - Size: 454 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0
av/klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
Language: TeX - Size: 2.24 MB - Last synced at: 16 days ago - Pushed at: about 1 year ago - Stars: 79 - Forks: 3
gaidardzhiev/shell
*nix command interpreter
Language: C - Size: 446 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0
vsce-toolroom/vscode-textmate-languageservice
Language APIs and support features from Textmate tokenization in Visual Studio Code.
Language: TypeScript - Size: 1.27 MB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 21 - Forks: 0
Jathurshan0330/TFM-Tokenizer
Official Code Repository of "Tokenizing Single-Channel EEG with Time-Frequency Motif Learning". arXiv: https://arxiv.org/abs/2502.16060
Language: Jupyter Notebook - Size: 22 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 0
delveopers/Shredword
Fast & efficient BPE tokenizer written in C & Python for LLM training
Language: C++ - Size: 895 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0
verygoodsecurity/vgs-collect-android
VGS Collect Android SDK
Language: Kotlin - Size: 14.1 MB - Last synced at: about 17 hours ago - Pushed at: 5 days ago - Stars: 8 - Forks: 9
saulmoralespa/subscription-wompi-woo
Subscription integration with Wompi for WooCommerce
Language: PHP - Size: 394 KB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
Networks-Learning/token-pricing
Repository for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", arXiv 2025
Language: Jupyter Notebook - Size: 10.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0
sanderland/script_bpe
Code for the paper "BPE stays on SCRIPT"
Language: Jupyter Notebook - Size: 652 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 15 - Forks: 3
JamiiDao/Krill
Institution-grade server for Solana on-chain tokenization, attestation and stablecoins. Easy attestations, POAPs, tokenization, on-chain memberships and monetary exchange from one server, all controlled by you.
Language: Rust - Size: 458 KB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0
jshuadvd/LongRoPE
Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper
Language: Python - Size: 562 KB - Last synced at: 21 days ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 14
mysto/python-fpe
FPE - Format Preserving Encryption with FF3 in Python
Language: Python - Size: 144 KB - Last synced at: 7 days ago - Pushed at: 5 months ago - Stars: 101 - Forks: 20
cosmaadrian/strawberry-problem
Official repository for "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models"
Language: Python - Size: 67.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0
basit-afridi62/nlp-nltk-python
This repository is a hands-on guide to Natural Language Processing (NLP) with Python using NLTK. It includes scripts, explanations, and outputs for tokenization, stopwords, stemming, lemmatization, corpora, WordNet, feature extraction, sentiment analysis, and text classification with machine learning.
Language: Python - Size: 1.57 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0
JanBremec/txc-compressor
TXC: High-performance token-based text and log compressor with superior compression ratios and competitive speed.
Language: Python - Size: 463 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0
oshinrathor/ML-NLP-Projects
This repository contains a collection of Machine Learning and NLP projects, including sentiment analysis with NLTK, text preprocessing, and deep learning models. It covers techniques like tokenization, stopword removal, lemmatization, rule-based analysis, and transformer models like BERT for practical NLP applications.
Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0
AndyFerns/Automated-Reasoning-Project
A project aiming to implement Automated Reasoning in First Order Logic using NLP
Language: Python - Size: 122 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0
spindle-health/carduus
PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.
Language: Python - Size: 1.48 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 20 - Forks: 1
trag1c/crossandra-rs
(WIP) A straightforward tokenization library for seamless text processing.
Language: Rust - Size: 708 KB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 1
ImadSaddik/Train_Your_Language_Model_Course
Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.
Language: Jupyter Notebook - Size: 59.2 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 208 - Forks: 108
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Language: C++ - Size: 1.69 MB - Last synced at: 23 days ago - Pushed at: 7 months ago - Stars: 319 - Forks: 76
LeoMSgit/Personal-Lib---AI-ML-NLP-CV
Collection of Notes, Guides, and Examples for Artificial Intelligence, Machine Learning, Natural Language Processing and Computer Vision
Size: 186 KB - Last synced at: 25 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0