An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: tokenization

AndyFerns/Automated-Reasoning-Project

A project aiming to implement Automated Reasoning in First Order Logic using NLP

Language: Python - Size: 119 KB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 1 - Forks: 0

RAHEEM12344/content-recommendation-engine

A modern, responsive web application that delivers personalized content recommendations based on user preferences and behavior. This interactive recommendation system allows users to discover content tailored to their interests through category selection, tag filtering, and customizable content parameters.

Language: HTML - Size: 187 KB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 0 - Forks: 0

matiasrodlo/afiste

Blockchain-based VC marketplace. Jump Chile semifinalist. (2019)

Language: PHP - Size: 470 MB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 1 - Forks: 0

sebastian2005-RP/GPU-Accelerated-Next-Word-Prediction-Using-LSTM-and-PyTorch

This repository implements a GPU-accelerated next-word prediction model using PyTorch and LSTM. It includes data preprocessing with NLTK, vocabulary creation, training on tokenized text, and generating text predictions, starting from a given input phrase.

Language: Jupyter Notebook - Size: 329 KB - Last synced at: about 7 hours ago - Pushed at: about 7 hours ago - Stars: 0 - Forks: 0

AgentOps-AI/tokencost

Easy token price estimates for 400+ LLMs. TokenOps.

Language: Python - Size: 1.89 MB - Last synced at: about 17 hours ago - Pushed at: about 18 hours ago - Stars: 1,718 - Forks: 82

mikiyadd/my-c-array

Dynamic array implementation in C with a modular, folder-based structure.

Language: C - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Basis-Theory/developers.basistheory.com

Basis Theory Developer Documentation

Language: JavaScript - Size: 24.9 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 6 - Forks: 4

chuckyLeeVIII/Bitcoin-BhE-NaS (fork of bitcoin/bips)

Bitcoin Improvement Proposals

Language: Wikitext - Size: 15.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 1

shaheennabi/Natural-Language-Processing-Practices-and-Mini-Projects

🎇 NLP Experiments 🎆 A hands-on collection of NLP experiments 💬, featuring models like RNN, LSTM, and Attention Mechanism. 🚀 Explore applications like text classification, sentiment analysis, and language generation 🌍. Continuously updated with new algorithms and research implementations! 🔥

Language: Jupyter Notebook - Size: 24.4 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

12345far/metrics-calculation-precision-recall

Laboratory 7 - Information Retrieval

Size: 1.95 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

IthavinduU/jwt-auth-service

JWT authentication microservice built with Ruby and Sinatra.

Language: Ruby - Size: 3.91 KB - Last synced at: 2 days ago - Pushed at: 29 days ago - Stars: 1 - Forks: 0

CompLin/nheengatu

Tools and resources for the computational processing of Nheengatu (Modern Tupi)

Language: Python - Size: 36.5 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 8 - Forks: 4

3Dpass/3DP

The Implementation of The Ledger of Things Node. Layer 1 decentralized blockchain platform for the tokenization of objects. Proof of Scan protocol. Useful smart-contracts and dApps.

Language: Rust - Size: 54.4 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 25 - Forks: 19

AndresEspin1993/b2t-tokenizer

B2T - Tokenizer for AI systems.

Language: PowerShell - Size: 240 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

JuGecko/Tokenization-Visualizer

A web application illustrating tokenization methods when selecting certain LLMs.

Language: C# - Size: 7.78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

securitybunker/databunker

Secure Vault for Customer PII/PHI/PCI/KYC Records

Language: Go - Size: 11.1 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,305 - Forks: 83

gnatykdm/b2t-tokenizer

B2T Tokenizer: Brain-Inspired Multimodal Data Processor

Language: PowerShell - Size: 240 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

XDuch/aztec-network

A step-by-step guide on how to install the Aztec Network Sequencer on testnet

Size: 16.6 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 1

chuckyLeeVIII/ai-hedge-fund (fork of virattt/ai-hedge-fund)

An AI Hedge managed by knox wallet

Language: Python - Size: 1.59 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

CMTA/CMTAT

Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.

Language: JavaScript - Size: 63.7 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 57 - Forks: 25

ImadSaddik/Train_Your_Language_Model_Course

Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.

Language: Jupyter Notebook - Size: 58.9 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 117 - Forks: 72

av/klmbr

klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs

Language: TeX - Size: 2.24 MB - Last synced at: 4 days ago - Pushed at: 9 months ago - Stars: 76 - Forks: 3

CLewisMessina/wolfstitch

Turn books into clean, fine-tuning-ready datasets (TXT/CSV). EPUB, PDF, and token-aware. Local, GUI-based, no cloud required.

Language: Python - Size: 309 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

anjalirj27/Llama4

Llama4 – Code from Scratch. This project is inspired by [vukrosic’s courses repository](https://github.com/vukrosic/courses). Here, I’ve implemented the tokenizer logic from scratch using Python and Google Colab to better understand how LLMs handle text at the token level.

Language: Jupyter Notebook - Size: 11.7 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

mohansree14/Token-Classification

A Streamlit app for biomedical named entity recognition (NER) using BioBERT. Enter biomedical text and get instant, colorful token-level predictions for labels `O`, `B-AC`, `B-LF`, and `I-LF`. Includes graphical visualization and an interaction log.

Language: Jupyter Notebook - Size: 731 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

FerdiKurt/carbon-credits

These smart contracts provide a system for carbon credit tokenization, issuance, trading, and retirement.

Language: Solidity - Size: 111 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 1

delBull/saaspandoras (fork of nextify-limited/saasfly)

Acquire your right to participate in exclusive projects with Pandoras.

Language: TypeScript - Size: 22.5 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

NVIDIA/Cosmos-Tokenizer 📦

A suite of image and video neural tokenizers

Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 1,637 - Forks: 78

verygoodsecurity/vgs-collect-ios

VGS Collect iOS SDK

Language: Swift - Size: 63.5 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 23 - Forks: 15

bermudaphp/tokenizer

PHP tokenizer for finding class, interface, trait, and enum declarations.

Language: PHP - Size: 68.4 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

shaitarAn/subword-evenness-crosslingual-transfer

Analysis of subword evenness as a predictor of cross-lingual transfer success in multilingual language models (mBERT, XLM-R, mT5)

Size: 9.77 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

DanTheAI/LLM-Middleware-Pipeline

A modular, configurable LLM middleware pipeline that transforms raw prompts into enterprise-ready microservices.

Language: Python - Size: 32.2 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

ITSLab-UAegean/ais-manipulation

A repository for processing vessel AIS data, including filtering, tokenization, and trip extraction.

Language: Python - Size: 8.45 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

possible-worlds-research/wikinlp

A package to download and preprocess a Wikipedia dump, in any language.

Language: Python - Size: 118 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 9 - Forks: 1

ChaitanyaK77/Building-a-Small-Language-Model-SLM-

This repository provides a Jupyter Notebook for building a small language model from scratch using the 'TinyStories' dataset. Covers data preprocessing, BPE tokenization, binary storage, GPU memory management, and training a Transformer in PyTorch. Generate sample stories to test your model. Ideal for learning NLP and PyTorch.

Size: 0 Bytes - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

jshuadvd/LongRoPE

Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"

Language: Python - Size: 562 KB - Last synced at: 1 day ago - Pushed at: 11 months ago - Stars: 137 - Forks: 14

johannschopplich/tokenx

Fast and lightweight token estimation for any LLM without requiring a full tokenizer

Language: TypeScript - Size: 390 KB - Last synced at: about 3 hours ago - Pushed at: 19 days ago - Stars: 27 - Forks: 1

h3ro-dev/Royal-RWA-Website

Royal RWA - Revolutionary DeFi platform bridging traditional assets with blockchain through a three-token ecosystem

Language: TypeScript - Size: 354 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

mysto/python-fpe

FPE - Format Preserving Encryption with FF3 in Python

Language: Python - Size: 144 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 102 - Forks: 18

jwalsh/boston-python-llm-tokenizer

Learn tokenization in the context of Large Language Models (LLMs)

Language: Jupyter Notebook - Size: 13.8 MB - Last synced at: 2 days ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Node0/llm-tools

My expanding collection of scripts and tools designed to aid in working with large language models, understanding their performance characteristics and context limitations.

Language: HTML - Size: 1.07 MB - Last synced at: 5 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

braingpt-lovelab/backwards

Source code for <Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies>

Language: Jupyter Notebook - Size: 51.1 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

Mattbusel/tokenviz

TokenViz: A CLI tool to visualize token usage in OpenAI prompts, helping developers optimize and understand prompt structure for better model performance.

Language: Python - Size: 0 Bytes - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 1 - Forks: 0

Networks-Learning/token-pricing

Repository for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives", Arxiv 2025

Language: Jupyter Notebook - Size: 10.1 MB - Last synced at: 2 days ago - Pushed at: 23 days ago - Stars: 4 - Forks: 0

ppomes/TokenShield

PCI Compliance Gateway POC

Language: Go - Size: 48.8 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

zjukg/MyGO

[Paper][AAAI 2025] (MyGO) Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation

Language: Python - Size: 91 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 249 - Forks: 7

cbaziotis/ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two big corpora (English Wikipedia and Twitter, 330 million English tweets).

Language: Python - Size: 778 KB - Last synced at: 6 days ago - Pushed at: 19 days ago - Stars: 670 - Forks: 93

VKCOM/YouTokenToMe 📦

Unsupervised text tokenizer focused on computational efficiency

Language: C++ - Size: 192 KB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 968 - Forks: 105

ThalesGroup/CipherTrust_Application_Protection

Public code samples and resources for the Thales CipherTrust Application Protection products of the CipherTrust Data Security Platform

Language: Java - Size: 37.4 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 27 - Forks: 17

thjbdvlt/solipCysme

spaCy pipeline for French, focused on personal pronouns, fiction, and first-person point-of-view texts.

Language: Python - Size: 974 KB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 1

explosion/spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Language: Python - Size: 194 MB - Last synced at: 18 days ago - Pushed at: 24 days ago - Stars: 31,699 - Forks: 4,508

WorksApplications/sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy

Language: Rust - Size: 15 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 359 - Forks: 39

eliben/go-sentencepiece

Go implementation of the SentencePiece tokenizer

Language: Go - Size: 200 KB - Last synced at: 6 days ago - Pushed at: 10 months ago - Stars: 30 - Forks: 4

NueLanguage/nue

The Nue Programming Language

Language: C - Size: 134 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 6 - Forks: 0

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

Language: Rust - Size: 3.99 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 238 - Forks: 10

CO2NEX/co2nex-architecture

Mermaid.js system diagram for the CO2NEX carbon credit climate platform infrastructure

Language: HTML - Size: 11.7 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

Jyonn/UnifiedTokenizer

A machine learning toolkit for tokenization and indexing

Language: Python - Size: 521 KB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 4 - Forks: 1

icelaterdc/Turk-NLP

Comprehensive open-source NLP (Natural Language Processing) library for Turkish.

Language: Python - Size: 20.5 KB - Last synced at: 12 days ago - Pushed at: 21 days ago - Stars: 2 - Forks: 0

GhostFireDigital/TokenUp.ai

TokenUp.ai is a modular AI-native token infrastructure protocol designed for next-gen Web3 builders. Includes minting, tokenomics, governance, and analytics modules. Built by GhostFire Digital.

Language: HTML - Size: 1.51 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

fahadabid545/POS-Tagging

Performed Part-of-Speech (POS) tagging using NLTK to label words with their grammatical roles in text data. Useful for NLP preprocessing and syntactic analysis.

Language: Jupyter Notebook - Size: 5.86 KB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 0 - Forks: 0

alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

Language: Go - Size: 734 KB - Last synced at: 22 days ago - Pushed at: 12 months ago - Stars: 581 - Forks: 21

kensho-technologies/pathpiece

PathPiece tokenizer

Language: Rust - Size: 6.39 MB - Last synced at: 11 days ago - Pushed at: 7 months ago - Stars: 12 - Forks: 1

fattmerchantorg/Fattmerchant-iOS-SDK

Fattmerchant iOS SDK

Language: Swift - Size: 155 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 2

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language: Python - Size: 729 MB - Last synced at: 17 days ago - Pushed at: about 1 month ago - Stars: 160 - Forks: 14

RavenProject/Ravencoin

Ravencoin Core integration/staging tree

Language: C - Size: 461 MB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 1,096 - Forks: 697

shivendrra/shredword

Fast & efficient BPE tokenizer written in C & Python for LLM training

Language: C++ - Size: 18.2 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

ayushedith/ethermint

Minimal ERC-20 token built with Solidity & Hardhat

Language: JavaScript - Size: 84 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

daac-tools/vibrato

🎀 vibrato: Viterbi-based accelerated tokenizer

Language: Rust - Size: 1.08 MB - Last synced at: 24 days ago - Pushed at: about 1 month ago - Stars: 360 - Forks: 15

mridulsaklani/My_Tokenizer

A small tokenizer model, of the kind used by GPT-style AI models, that converts characters into assigned tokens and handles encoding and decoding.

Language: Python - Size: 3.99 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

FoundationVision/OmniTokenizer

[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.

Language: Python - Size: 68.9 MB - Last synced at: 26 days ago - Pushed at: 12 months ago - Stars: 295 - Forks: 6

larsulbricht/awesome-digital-assets

Collection of high-quality resources on blockchain, tokenization, and DLT-based capital markets (EU-Focus)

Size: 307 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 6 - Forks: 0

PolyCash/polycash

The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.

Language: PHP - Size: 32.3 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 47 - Forks: 38

vipbondre/tokenization_xrpl

The Get Your Pass website showcases a secure and private ticketing system powered by the XRP Ledger (XRPL). This system ensures privacy, security, and efficiency throughout the ticket purchase process, abstracting complexities and protecting sensitive user information

Language: JavaScript - Size: 1.26 MB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 1 - Forks: 0

KanishkNavale/Text-Mining-with-TF-IDF-and-Cosine-Similarity

A simple Python repository for developing perceptron-based text mining, involving linguistic preprocessing of datasets for text classification and extracting similar text for a given query.

Language: Jupyter Notebook - Size: 7.34 MB - Last synced at: 9 days ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 1

bminixhofer/zett

Code for Zero-Shot Tokenizer Transfer

Language: Python - Size: 1.04 MB - Last synced at: 20 days ago - Pushed at: 5 months ago - Stars: 128 - Forks: 11

Devansh-Seth-DEV/LexiC

LexiC is a simple and modular C project that converts source code into a stream of tokens. It handles token counting, segmentation, and full tokenization, forming the first stage of a compiler or interpreter pipeline.

Language: C - Size: 713 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

LoopscaleLabs/rwa-token

The RWA Token Program is a wrapper and extension program for Solana Token Extensions that creates a uniform approach to permissioned tokens on SVM blockchains.

Language: TypeScript - Size: 51.2 MB - Last synced at: 13 days ago - Pushed at: 9 months ago - Stars: 19 - Forks: 9

spindle-health/carduus

PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.

Language: Python - Size: 1.32 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 15 - Forks: 1

gautierdag/bpeasy

Fast bare-bones BPE for modern tokenizer training

Language: Python - Size: 1.41 MB - Last synced at: 29 days ago - Pushed at: 3 months ago - Stars: 156 - Forks: 5

Basis-Theory/terraform-provider-basistheory

Terraform provider for Basis-Theory

Language: Go - Size: 148 KB - Last synced at: 21 days ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

CO2NEX/carbon-tokens

This repository contains the technical specification, tokenomics, and smart contract blueprints for C2NX tokens, the native digital asset of the CO2NEX platform used for governance, verification bounties, and transaction fees in the carbon offset market.

Size: 4.88 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

lunasec-io/lunasec

LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/

Language: TypeScript - Size: 293 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 1,448 - Forks: 169

AmoDinho/datacamp-python-data-science-track

All the slides, accompanying code, and exercises, stored in this repo. 🎈

Language: Python - Size: 74.1 MB - Last synced at: 29 days ago - Pushed at: almost 2 years ago - Stars: 837 - Forks: 522

Deed3Labs/Protocol-Contracts

The Deed Protocol Smart Contracts

Language: Solidity - Size: 3.37 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 1 - Forks: 0

ijazul-haq/nlpashto

Pashto Natural Language Processing Toolkit

Size: 62.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 12 - Forks: 0

izikeros/count_tokens

Count tokens in a text file.

Language: Python - Size: 137 KB - Last synced at: 5 days ago - Pushed at: about 1 month ago - Stars: 7 - Forks: 0

natasha/razdel

Rule-based token and sentence segmentation for the Russian language

Language: Python - Size: 37.2 MB - Last synced at: 28 days ago - Pushed at: almost 2 years ago - Stars: 267 - Forks: 32

Worklytics/psoxy

serverless ☁️ 🚀, pseudonymizing proxy between Worklytics and your workplace 💼 SaaS data sources' APIs. Data Loss Prevention (DLP) 🛡️🔒 and compliance layer deployable to AWS Lambda or GCP Cloud Functions.

Language: Java - Size: 34.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 13 - Forks: 6

thearhamsharif/BSCS-UBIT-2k21

Includes coursework and lab materials for students enrolled in the Bachelor of Science in Computer Science degree at UBIT.

Language: C++ - Size: 13.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

cosmaadrian/strawberry-problem

Official repository for "The Strawberry Problem 🍓: Emergence of Character-level Understanding in Tokenized Language Models"

Language: Python - Size: 56.6 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

ixopay/tokenex-ios

TokenEx iOS SDK

Language: Swift - Size: 74.2 KB - Last synced at: 28 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 1

zouharvi/tokenization-scorer

Simple-to-use scoring function for arbitrarily tokenized texts.

Language: Python - Size: 42 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 40 - Forks: 5

sourav200199/Whats-Insight

Get chat insights - all in one go!

Language: Python - Size: 2.95 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

abu14/nlp-assignment-abenezer_tesfaye

Formal submission for the NLP assignment

Size: 12.7 KB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

macmade/ClangKit

ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics, and fix-its are currently implemented.

Language: C - Size: 15.2 MB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 365 - Forks: 45

jkrukowski/swift-sentencepiece

Use SentencePiece in Swift for tokenization and detokenization.

Language: Swift - Size: 2.43 MB - Last synced at: 28 days ago - Pushed at: 4 months ago - Stars: 9 - Forks: 2

TI-Toolkit/tivars_lib_py

A Python library for interacting with TI-(e)z80 (82/83/84 series) calculator files

Language: Python - Size: 3.79 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 19 - Forks: 1

prabhashj07/nepalikit

NepaliKit is a Python library for natural language processing (NLP) tasks in Nepali. It features tokenization (rule-based and SentencePiece), text preprocessing, stopword management, and sentence segmentation. Ideal for developers and researchers working with Nepali text data.

Language: Python - Size: 364 KB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 6 - Forks: 0

winkjs/wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

Language: JavaScript - Size: 2.05 MB - Last synced at: 12 days ago - Pushed at: over 2 years ago - Stars: 62 - Forks: 12

shivendrra/biosaic

Tokenizer for encoding/decoding DNA sequences

Language: Python - Size: 71.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0