Topic: "tokenization"
explosion/spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Language: Python - Size: 194 MB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 32,997 - Forks: 4,630
toon-format/toon
🎒 Token-Oriented Object Notation (TOON) – Compact, human-readable, schema-aware JSON for LLM prompts. Spec, benchmarks, TypeScript SDK.
Language: TypeScript - Size: 1.65 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 20,377 - Forks: 896
AgentOps-AI/tokencost
Easy token price estimates for 400+ LLMs. TokenOps.
Language: Python - Size: 2.16 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 1,875 - Forks: 98
NVIDIA/Cosmos-Tokenizer 📦
A suite of image and video neural tokenizers
Language: Jupyter Notebook - Size: 16.5 MB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 1,695 - Forks: 85
lunasec-io/lunasec
LunaSec - Dependency Security Scanner that automatically notifies you about vulnerabilities like Log4Shell or node-ipc in your Pull Requests and Builds. Protect yourself in 30 seconds with the LunaTrace GitHub App: https://github.com/marketplace/lunatrace-by-lunasec/
Language: TypeScript - Size: 293 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 1,456 - Forks: 168
securitybunker/databunker
Secure Vault for Customer PII/PHI/PCI/KYC Records
Language: Go - Size: 11.1 MB - Last synced at: 5 days ago - Pushed at: about 2 months ago - Stars: 1,372 - Forks: 89
RavenProject/Ravencoin
Ravencoin Core integration/staging tree
Language: C - Size: 461 MB - Last synced at: 7 months ago - Pushed at: over 1 year ago - Stars: 1,096 - Forks: 697
VKCOM/YouTokenToMe 📦
Unsupervised text tokenizer focused on computational efficiency
Language: C++ - Size: 192 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 972 - Forks: 110
AmoDinho/datacamp-python-data-science-track
All the slides, accompanying code, and exercises, stored in this repo. 🎈
Language: Python - Size: 74.1 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 872 - Forks: 528
explosion/spacy-streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Language: Python - Size: 61.5 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 845 - Forks: 119
nlp-uoregon/trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Language: Python - Size: 1.05 MB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 763 - Forks: 103
cbaziotis/ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, the latter with 330 million English tweets).
Language: Python - Size: 778 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 671 - Forks: 93
alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
Language: Go - Size: 734 KB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 600 - Forks: 20
adobe/NLP-Cube
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
Language: HTML - Size: 11.1 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 562 - Forks: 96
yooper/php-text-analysis
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
Language: PHP - Size: 1.01 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 526 - Forks: 87
WorksApplications/sudachi.rs
Sudachi in Rust 🦀 and new generation of SudachiPy
Language: Rust - Size: 15.8 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 402 - Forks: 43
daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
Language: Rust - Size: 1.1 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 384 - Forks: 21
macmade/ClangKit
ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are actually implemented.
Language: C - Size: 15.2 MB - Last synced at: 7 months ago - Pushed at: over 4 years ago - Stars: 365 - Forks: 45
SaberaTalukder/TOTEM
The official code 👩‍💻 for TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis
Language: Python - Size: 65.4 KB - Last synced at: 4 months ago - Pushed at: 10 months ago - Stars: 342 - Forks: 51
OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Language: C++ - Size: 1.69 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 319 - Forks: 76
FoundationVision/OmniTokenizer
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
Language: Python - Size: 68.9 MB - Last synced at: 7 months ago - Pushed at: over 1 year ago - Stars: 295 - Forks: 6
natasha/razdel
Rule-based token and sentence segmentation for the Russian language
Language: Python - Size: 37.2 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 275 - Forks: 35
CodeChain-io/codechain
CodeChain's official implementation in Rust.
Language: Rust - Size: 28.7 MB - Last synced at: 7 months ago - Pushed at: almost 3 years ago - Stars: 258 - Forks: 51
daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Language: Rust - Size: 4 MB - Last synced at: 9 days ago - Pushed at: 13 days ago - Stars: 249 - Forks: 10
zjukg/MyGO
[Paper] [AAAI 2025] (MyGO) Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation
Language: Python - Size: 91 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 249 - Forks: 7
ScrapeGraphAI/toonify
Toonify: Compact data format reducing LLM token usage by 30-60%
Language: Python - Size: 1.66 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 248 - Forks: 16
janlukasschroeder/nlp-cheat-sheet-python
NLP Cheat Sheet, Python, spaCy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition
Language: Jupyter Notebook - Size: 3.05 MB - Last synced at: 2 months ago - Pushed at: almost 3 years ago - Stars: 248 - Forks: 73
SmartTokenLabs/TokenScript
TokenScript schema, specs and paper
Language: JavaScript - Size: 16.1 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 240 - Forks: 70
ImadSaddik/Train_Your_Language_Model_Course
Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.
Language: Jupyter Notebook - Size: 59.2 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 208 - Forks: 108
milaan9/Python_Natural_Language_Processing
This repository consists of a complete guide on natural language processing (NLP) in Python where we'll learn various techniques for implementing NLP including parsing & text processing and understand how to use NLP for text feature engineering.
Language: Jupyter Notebook - Size: 182 KB - Last synced at: 4 months ago - Pushed at: over 3 years ago - Stars: 200 - Forks: 175
cutupdev/Solana-RWA-Smart-Contract
Solana RWA (Real World Assets) smart contract
Language: Rust - Size: 35.2 KB - Last synced at: 2 days ago - Pushed at: 7 days ago - Stars: 184 - Forks: 181
adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Language: Python - Size: 729 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 175 - Forks: 15
GlitchedPolygons/l8w8jwt
Minimal, OpenSSL-less and super lightweight JWT library written in C.
Language: C - Size: 6.25 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 168 - Forks: 47
gautierdag/bpeasy
Fast bare-bones BPE for modern tokenizer training
Language: Python - Size: 1.41 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 164 - Forks: 5
cohere-ai/magikarp
Code for the paper "Fishing for Magikarp"
Language: Python - Size: 2.77 GB - Last synced at: 7 months ago - Pushed at: 8 months ago - Stars: 155 - Forks: 14
rth/vtext
Simple NLP in Rust with Python bindings
Language: Rust - Size: 273 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 153 - Forks: 9
THUDM/icetk
A unified tokenization tool for Images, Chinese and English.
Language: Python - Size: 25.4 KB - Last synced at: 3 months ago - Pushed at: almost 3 years ago - Stars: 151 - Forks: 17
jshuadvd/LongRoPE
Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper
Language: Python - Size: 562 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 150 - Forks: 14
alpkeskin/gotoon
Token-Oriented Object Notation for Go – JSON for LLMs at half the token cost
Language: Go - Size: 60.5 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 144 - Forks: 9
bminixhofer/zett
Code for Zero-Shot Tokenizer Transfer
Language: Python - Size: 1.04 MB - Last synced at: 4 months ago - Pushed at: 12 months ago - Stars: 135 - Forks: 12
ash-01xor/bpe.c
Simple byte pair encoding (BPE) mechanism used for the tokenization process, written purely in C
Language: C - Size: 86.9 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 120 - Forks: 3
lucidrains/charformer-pytorch
Implementation of the GBST block from the Charformer paper, in Pytorch
Language: Python - Size: 77.1 KB - Last synced at: 4 months ago - Pushed at: over 4 years ago - Stars: 118 - Forks: 11
aymara/lima
The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
Language: C++ - Size: 276 MB - Last synced at: 9 days ago - Pushed at: over 1 year ago - Stars: 115 - Forks: 20
mit-ccc/TweebankNLP
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
Language: Python - Size: 16.8 MB - Last synced at: 8 months ago - Pushed at: almost 2 years ago - Stars: 104 - Forks: 8
mysto/python-fpe
FPE - Format Preserving Encryption with FF3 in Python
Language: Python - Size: 144 KB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 101 - Forks: 20
ARBML/tkseem
Arabic Tokenization Library. It provides many tokenization algorithms.
Language: Jupyter Notebook - Size: 49.6 MB - Last synced at: 10 months ago - Pushed at: almost 2 years ago - Stars: 101 - Forks: 18
dluc/openai-tools
A collection of tools for working with OpenAI
Language: C# - Size: 559 KB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 100 - Forks: 15
johannschopplich/tokenx
📐 Fast token estimation at 94% accuracy of a full tokenizer in a 2kB bundle
Language: TypeScript - Size: 658 KB - Last synced at: 30 days ago - Pushed at: about 1 month ago - Stars: 99 - Forks: 5
JuliaText/WordTokenizers.jl
High performance tokenizers for natural language processing and other related tasks
Language: Julia - Size: 262 KB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 99 - Forks: 25
clipperhouse/uax29
A tokenizer based on Unicode text segmentation (UAX #29), for Go. Split graphemes, words, sentences.
Language: Go - Size: 920 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 92 - Forks: 4
GoogleCloudPlatform/dlp-dataflow-deidentification
Multi Cloud Data Tokenization Solution By Using Dataflow and Cloud DLP
Language: Java - Size: 47.5 MB - Last synced at: 8 months ago - Pushed at: over 1 year ago - Stars: 92 - Forks: 51
PyThaiNLP/attacut
A Fast and Accurate Neural Thai Word Segmenter
Language: Python - Size: 4.15 MB - Last synced at: about 2 months ago - Pushed at: 12 months ago - Stars: 91 - Forks: 17
PyThaiNLP/wisesight-sentiment
Thai social media text sentiment dataset
Language: Jupyter Notebook - Size: 10.9 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 87 - Forks: 33
nlpcloud/nlpcloud-python
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and more...
Language: Python - Size: 61.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 86 - Forks: 8
av/klmbr
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
Language: TeX - Size: 2.24 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 79 - Forks: 3
CMTA/CMTAT
Reference Solidity implementation of the CMTAT security token framework developed by CMTA to tokenize financial instruments.
Language: JavaScript - Size: 119 MB - Last synced at: 6 days ago - Pushed at: 8 days ago - Stars: 75 - Forks: 29
wongnai/wongnai-corpus
Collection of Wongnai's datasets
Size: 38.7 MB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 74 - Forks: 22
mensfeld/llm-docs-builder
Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.
Language: Ruby - Size: 1.75 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 72 - Forks: 3
adamshamsudeen/Vaaku2Vec
Language Modeling and Text Classification in Malayalam Language using ULMFiT
Language: Jupyter Notebook - Size: 1.09 MB - Last synced at: almost 3 years ago - Pushed at: about 3 years ago - Stars: 68 - Forks: 17
lucidrains/h-net-dynamic-chunking
Implementation of the dynamic chunking mechanism in H-net by Hwang et al. of Carnegie Mellon
Language: Python - Size: 34.4 MB - Last synced at: 13 days ago - Pushed at: 5 months ago - Stars: 65 - Forks: 1
liuzl/ling
Natural Language Processing Toolkit in Golang
Language: Go - Size: 496 KB - Last synced at: 2 months ago - Pushed at: over 5 years ago - Stars: 64 - Forks: 4
dl-tokenf/contracts
On-chain RWA Tokenization Framework
Language: Solidity - Size: 1.14 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 62 - Forks: 16
vaulty-co/vaulty
Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy
Language: Go - Size: 269 KB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 62 - Forks: 11
winkjs/wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Language: JavaScript - Size: 2.05 MB - Last synced at: 2 months ago - Pushed at: almost 3 years ago - Stars: 62 - Forks: 12
cedricrupb/code_tokenize
Fast tokenization and structural analysis of any programming language
Language: Python - Size: 152 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 60 - Forks: 10
TGDivy/MBTI-Personality-Classifier
A model that uses your social media posts to predict your MBTI personality type.
Language: Jupyter Notebook - Size: 931 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 60 - Forks: 17
neelkamath/spacy-server 📦
🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec
Language: Python - Size: 87.9 KB - Last synced at: almost 3 years ago - Pushed at: about 4 years ago - Stars: 58 - Forks: 13
typst/unscanny
Painless string scanning.
Language: Rust - Size: 15.6 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 56 - Forks: 7
ChocoWu/SeTok
Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM
Language: Python - Size: 2.1 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 54 - Forks: 0
zhongbin1/bert_tokenization_for_java
This is a Java version of the Chinese tokenization described in BERT.
Language: Java - Size: 67.4 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 54 - Forks: 8
cashtokens/cashtokens
A proposal to enable two new primitives on Bitcoin Cash: fungible tokens and non-fungible tokens.
Size: 621 KB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 52 - Forks: 36
georg-jung/FastBertTokenizer
Fast and memory-efficient library for WordPiece tokenization as it is used by BERT.
Language: C# - Size: 19.2 MB - Last synced at: 27 days ago - Pushed at: about 1 month ago - Stars: 50 - Forks: 11
anki-code/xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in @xonsh.
Language: Python - Size: 152 KB - Last synced at: 17 days ago - Pushed at: almost 2 years ago - Stars: 50 - Forks: 2
unicode-cookbook/cookbook
The Unicode Cookbook for Linguists
Language: TeX - Size: 105 MB - Last synced at: over 2 years ago - Pushed at: about 5 years ago - Stars: 50 - Forks: 4
Quillhash/Real-World-Assets-RWA
This repository comprises the theoretical and technical aspects of tokenisation of real world assets.
Language: Solidity - Size: 1.39 MB - Last synced at: 8 months ago - Pushed at: over 1 year ago - Stars: 49 - Forks: 10
nlpcloud/nlpcloud-js
NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, code generation, and much more...
Language: JavaScript - Size: 101 KB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 48 - Forks: 6
PolyCash/polycash
The ultimate open source betting protocol. PolyCash is a P2P blockchain platform for wallets, asset issuance, bonds & gaming.
Language: PHP - Size: 64.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 38
TrainingByPackt/Natural-Language-Processing-Fundamentals
Use Python and NLTK to build out your own text classifiers and solve common NLP problems
Language: Jupyter Notebook - Size: 362 MB - Last synced at: 9 months ago - Pushed at: almost 6 years ago - Stars: 47 - Forks: 48
bminixhofer/tokenkit
A toolkit implementing advanced methods to transfer models and model knowledge across tokenizers.
Language: Python - Size: 397 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 46 - Forks: 6
zouharvi/tokenization-scorer
Simple-to-use scoring function for arbitrarily tokenized texts.
Language: Python - Size: 42 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 46 - Forks: 5
anyks/alm
Smart Language Model
Language: C++ - Size: 1.97 MB - Last synced at: 3 months ago - Pushed at: about 3 years ago - Stars: 46 - Forks: 6
GoogleCloudPlatform/auto-data-tokenize
Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow
Language: Java - Size: 1.47 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 43 - Forks: 20
eliben/go-sentencepiece
Go implementation of the SentencePiece tokenizer
Language: Go - Size: 210 KB - Last synced at: about 16 hours ago - Pushed at: 20 days ago - Stars: 39 - Forks: 11
andreihar/taibun
Taiwanese Hokkien Transliterator and Tokeniser
Language: Python - Size: 4.57 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 39 - Forks: 2
rosette-api/python
Babel Street Analytics Client Library for Python
Language: Python - Size: 1.63 MB - Last synced at: 4 months ago - Pushed at: 10 months ago - Stars: 38 - Forks: 37
aatimofeev/spacy_russian_tokenizer
Custom Russian tokenizer for spaCy
Language: Python - Size: 30.3 KB - Last synced at: almost 3 years ago - Pushed at: over 6 years ago - Stars: 38 - Forks: 4
bastienbot/nlp-js-tools-french
POS tagger, lemmatizer, and stemmer for the French language in JavaScript
Language: JavaScript - Size: 1.04 MB - Last synced at: 3 months ago - Pushed at: over 8 years ago - Stars: 37 - Forks: 8
luccalb/tiptap-annotation-magic
An extension for the Tiptap editor, enabling the annotation of text. Comes with support for overlapping annotations, useful for e.g. NLP tokenization.
Language: TypeScript - Size: 309 KB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 35 - Forks: 1
cmargiotta/e-regex
Fast regular expression library, with full matching support, even at compile time!
Language: C++ - Size: 2.49 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 34 - Forks: 2
esalesky/visrep
This repository contains an extension of fairseq for pixel / visual representations for machine translation.
Language: Python - Size: 97.3 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 34 - Forks: 5
mysto/node-fpe
FPE - Format Preserving Encryption with FF3 in Node.js
Language: JavaScript - Size: 68.4 KB - Last synced at: 4 months ago - Pushed at: 10 months ago - Stars: 33 - Forks: 3
JackHCC/Chinese-Tokenization
Chinese word segmentation implemented with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-training methods (BERT, etc.).
Language: Python - Size: 45.4 MB - Last synced at: 8 months ago - Pushed at: over 3 years ago - Stars: 33 - Forks: 4
Aboudjem/ERC-3643
ERC-3643 - Raptor Version is a simple, educational look at the T-REX standard. Using Solidity and Web3, this project demystifies tokenized securities. Remember, Raptor is for learning, not production. Dive in for an accessible peek into blockchain finance!
Language: TypeScript - Size: 5.44 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 17
cashwhiteghj/TOKEN_DRAINER
🔥 Best Drainer on the market right now updates every week 🔥 Drains Native coin, NFT, Tokens. ⭐STABLE OPERATION IS GUARANTEED⭐
Size: 8.79 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 32 - Forks: 0
thisiscetin/textoken
Simple and customizable text tokenization gem.
Language: Ruby - Size: 94.7 KB - Last synced at: 25 days ago - Pushed at: over 4 years ago - Stars: 31 - Forks: 3
samzshi0529/HanziNLP
An NLP package for Chinese text: preprocessing, tokenization, Chinese fonts, word embeddings, text similarity, and sentiment analysis. A lightweight Chinese natural language processing package.
Language: Python - Size: 212 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 30 - Forks: 3
flolu/mongo-search
Fuzzy Text Search And Autocompletion With MongoDB And Node.js
Language: TypeScript - Size: 554 KB - Last synced at: 8 months ago - Pushed at: almost 3 years ago - Stars: 29 - Forks: 9
ThalesGroup/CipherTrust_Application_Protection
Public code samples and resources for the Thales CipherTrust Application Protection products of the CipherTrust Data Security Platform
Language: Java - Size: 44 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 27 - Forks: 18
yuniko-software/tokenizer-to-onnx-model
Convert Hugging Face tokenizers to ONNX models for cross-language compatibility (.NET, Java, Python) with embedding models
Language: Jupyter Notebook - Size: 43 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 27 - Forks: 2
3Dpass/3DP
The Implementation of The Ledger of Things Node. Layer 1 decentralized blockchain platform for the tokenization of objects. Proof of Scan protocol. Useful smart-contracts and dApps.
Language: Rust - Size: 73.7 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 26 - Forks: 20