Topic: "tokenizer"
theseer/tokenizer
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Language: PHP - Size: 83 KB - Last synced at: 9 days ago - Pushed at: about 1 year ago - Stars: 5,188 - Forks: 22

Chevrotain/chevrotain
Parser Building Toolkit for JavaScript
Language: TypeScript - Size: 36.5 MB - Last synced at: 2 days ago - Pushed at: 4 days ago - Stars: 2,599 - Forks: 212

roshan-research/hazm
Persian NLP Toolkit
Language: Python - Size: 25.5 MB - Last synced at: 11 days ago - Pushed at: 9 months ago - Stars: 1,270 - Forks: 186

natasha/natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
Language: Python - Size: 35.7 MB - Last synced at: 11 days ago - Pushed at: 6 months ago - Stars: 1,242 - Forks: 109

dqbd/tiktokenizer
Online playground for OpenAI tokenizers
Language: TypeScript - Size: 709 KB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 1,104 - Forks: 129

lovit/soynlp
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Language: Python - Size: 34.1 MB - Last synced at: 11 days ago - Pushed at: 2 months ago - Stars: 959 - Forks: 185

ikawaha/kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Language: Go - Size: 711 MB - Last synced at: 7 days ago - Pushed at: 19 days ago - Stars: 857 - Forks: 56

no-context/moo
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Language: JavaScript - Size: 770 KB - Last synced at: 9 days ago - Pushed at: almost 2 years ago - Stars: 848 - Forks: 68

BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Language: Python - Size: 81.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 694 - Forks: 91

mathewsanders/Mustard
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Language: Swift - Size: 137 KB - Last synced at: 5 months ago - Pushed at: almost 7 years ago - Stars: 689 - Forks: 18

wangfenjin/simple
A SQLite3 FTS5 full-text search extension (tokenizer) that supports Chinese and Pinyin
Language: C++ - Size: 967 KB - Last synced at: 6 days ago - Pushed at: 11 days ago - Stars: 669 - Forks: 96

risesoft-y9/Data-Labeling
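Data-Labeling is a tool dedicated to processing and annotating text data. Through a streamlined annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a base, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any drift, which greatly improves annotation accuracy and efficiency. Data-Labeling relies on the open-source digital foundation platform for personnel and position management.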
Language: Java - Size: 1.77 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 669 - Forks: 95

cbaziotis/ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
Language: Python - Size: 659 KB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 667 - Forks: 92

open-korean-text/open-korean-text
Open Korean Text Processor - An Open-source Korean Text Processor
Language: Scala - Size: 32.7 MB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 624 - Forks: 98

smoothnlp/SmoothNLP 📦
Focused on explainable NLP technology | An NLP Toolset With A Focus on Explainable Inference
Language: Java - Size: 6.71 MB - Last synced at: 5 months ago - Pushed at: about 4 years ago - Stars: 624 - Forks: 112

jflex-de/jflex
The fast scanner generator for Java™ with full Unicode support
Language: Java - Size: 22.1 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 602 - Forks: 117

alasdairforsythe/tokenmonster
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
Language: Go - Size: 734 KB - Last synced at: 27 days ago - Pushed at: 10 months ago - Stars: 575 - Forks: 20

niieani/gpt-tokenizer
The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.
Language: TypeScript - Size: 13.2 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 557 - Forks: 38

glayzzle/php-parser
🌿 NodeJS PHP Parser - extract AST or tokens
Language: JavaScript - Size: 29.5 MB - Last synced at: 10 days ago - Pushed at: 4 months ago - Stars: 542 - Forks: 71

lydell/js-tokens
Tiny JavaScript tokenizer.
Language: JavaScript - Size: 896 KB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 518 - Forks: 34

lionsoul2014/friso
High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Language: C - Size: 3.07 MB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 497 - Forks: 91

hplt-project/sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Language: Python - Size: 724 KB - Last synced at: 3 days ago - Pushed at: 11 months ago - Stars: 493 - Forks: 59

leodevbro/vscode-blockman
VSCode extension to highlight nested code blocks
Language: TypeScript - Size: 66.5 MB - Last synced at: 17 days ago - Pushed at: 7 months ago - Stars: 476 - Forks: 17

CogComp/cogcomp-nlp
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Language: Java - Size: 85.5 MB - Last synced at: 24 days ago - Pushed at: almost 2 years ago - Stars: 475 - Forks: 144

neurosnap/sentences
A multilingual command line sentence tokenizer in Golang
Language: Go - Size: 15.3 MB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 448 - Forks: 38

lindera/lindera
A multilingual morphological analysis library.
Language: Rust - Size: 178 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 444 - Forks: 42

polm/fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Language: C++ - Size: 472 KB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 440 - Forks: 38

timtadh/lexmachine
Lex machinery for Go.
Language: Go - Size: 296 KB - Last synced at: 5 months ago - Pushed at: almost 3 years ago - Stars: 406 - Forks: 28

taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
Language: Python - Size: 39.4 MB - Last synced at: 14 days ago - Pushed at: 10 months ago - Stars: 397 - Forks: 23

ku-nlp/jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Language: C++ - Size: 3.78 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 387 - Forks: 44

daac-tools/vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
Language: Rust - Size: 1.08 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 353 - Forks: 15

belladoreai/llama-tokenizer-js
JS tokenizer for LLaMA 1 and 2
Language: JavaScript - Size: 3.07 MB - Last synced at: 7 days ago - Pushed at: 10 months ago - Stars: 351 - Forks: 23

guillaume-be/rust-tokenizers
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
Language: Rust - Size: 1.12 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 307 - Forks: 27

OpenNMT/Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Language: C++ - Size: 1.69 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 302 - Forks: 74

NLPOptimize/flash-tokenizer
EFFICIENT AND OPTIMIZED TOKENIZER ENGINE FOR LLM INFERENCE SERVING
Language: C++ - Size: 195 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 301 - Forks: 3

zurawiki/tiktoken-rs
Ready-made tokenizer library for working with GPT and tiktoken
Language: Rust - Size: 3.69 MB - Last synced at: 7 days ago - Pushed at: 20 days ago - Stars: 300 - Forks: 53

artitw/text2text
Text2Text Language Modeling Toolkit
Language: Python - Size: 870 KB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 300 - Forks: 38

bitextor/bitextor
Bitextor generates translation memories from multilingual websites
Language: Python - Size: 177 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 290 - Forks: 43

FoundationVision/UniTok
A Unified Tokenizer for Visual Generation and Understanding
Language: Python - Size: 29.9 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 247 - Forks: 5

mediacloud/sentence-splitter
Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.
Language: Python - Size: 45.9 KB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 244 - Forks: 30

daac-tools/vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Language: Rust - Size: 3.96 MB - Last synced at: 8 days ago - Pushed at: 11 days ago - Stars: 237 - Forks: 10

dmitry-brazhenko/SharpToken
SharpToken is a C# library for tokenizing natural language text. It's based on the tiktoken Python library and designed to be fast and accurate.
Language: C# - Size: 3.6 MB - Last synced at: 3 days ago - Pushed at: 11 months ago - Stars: 232 - Forks: 16

bnosac/udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Language: C++ - Size: 5.74 MB - Last synced at: 16 days ago - Pushed at: about 2 years ago - Stars: 214 - Forks: 33

netgen/query-translator
Query Translator is a search query translator with AST representation
Language: PHP - Size: 506 KB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 206 - Forks: 10

zhenye234/xcodec
AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Language: Python - Size: 1.77 MB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 200 - Forks: 12

fnl/syntok
Text tokenization and sentence segmentation (segtok v2)
Language: Python - Size: 203 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 193 - Forks: 34

Dadmatech/DadmaTools
DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.
Language: Python - Size: 92.6 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 191 - Forks: 42

microsoft/Tokenizer
Typescript and .NET implementation of BPE tokenizer for OpenAI LLMs.
Language: C# - Size: 1.98 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 186 - Forks: 34

ropensci/tokenizers
Fast, Consistent Tokenization of Natural Language Text
Language: R - Size: 1.24 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 186 - Forks: 25

sugarme/tokenizer
NLP tokenizers written in Go language
Language: Go - Size: 1.48 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 183 - Forks: 30

mck89/peast
JavaScript parser written in PHP that generates AST from your code according to ECMAScript specification
Language: PHP - Size: 1.72 MB - Last synced at: about 9 hours ago - Pushed at: about 1 month ago - Stars: 181 - Forks: 20

botisan-ai/gpt3-tokenizer
Isomorphic JavaScript/TypeScript Tokenizer for GPT-3 and Codex Models by OpenAI.
Language: TypeScript - Size: 2.06 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 170 - Forks: 16

untitaker/html5gum
A WHATWG-compliant HTML5 tokenizer and tag soup parser
Language: Rust - Size: 576 KB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 160 - Forks: 10

xinjli/transphone
phoneme tokenizer and grapheme-to-phoneme model for 8k languages
Language: Python - Size: 342 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 156 - Forks: 15

adbar/simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Language: Python - Size: 729 MB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 154 - Forks: 12

gautierdag/bpeasy
Fast bare-bones BPE for modern tokenizer training
Language: Python - Size: 1.41 MB - Last synced at: 9 days ago - Pushed at: 18 days ago - Stars: 152 - Forks: 3

garvys-org/rustfst
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Language: Rust - Size: 6.6 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 152 - Forks: 17

howl-anderson/MicroTokenizer
A lightweight, full-featured Chinese tokenizer that helps students understand how tokenizers work. MicroTokenizer: A lightweight Chinese tokenizer designed for educational and research purposes. Provides a practical, hands-on approach to understanding NLP concepts, featuring multiple tokenization algorithms and customizable models. Ideal for students, researchers, and NLP enthusiasts.
Language: Python - Size: 174 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 150 - Forks: 22

tsproisl/SoMaJo
A tokenizer and sentence splitter for German and English web and social media texts.
Language: Python - Size: 1.35 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 140 - Forks: 21

nette/tokenizer 📦
[DISCONTINUED] Source code tokenizer
Language: PHP - Size: 104 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 140 - Forks: 23

foonathan/lex 📦
Replaced by foonathan/lexy
Language: C++ - Size: 308 KB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 138 - Forks: 8

Kensuke-Mitsuzawa/JapaneseTokenizers
Aims to make Japanese tokenizers as easy to use as possible
Language: Python - Size: 271 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 138 - Forks: 21

mykolaharmash/works-for-me
Collection of developer toolkits
Language: JavaScript - Size: 14.8 MB - Last synced at: 12 days ago - Pushed at: almost 7 years ago - Stars: 130 - Forks: 7

LorettaDevs/Loretta
A C# Lua, GLua and Luau parser, code analysis, transformation and generation library.
Language: C# - Size: 9.66 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 124 - Forks: 12

MagedSaeed/farasapy
A Python implementation of Farasa toolkit
Language: Python - Size: 265 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 123 - Forks: 21

GerHobbelt/jison (fork of zaach/jison)
bison / YACC / LEX in JavaScript (LALR(1), SLR(1), etc. lexer/parser generator)
Language: JavaScript - Size: 32.2 MB - Last synced at: 2 days ago - Pushed at: about 4 years ago - Stars: 122 - Forks: 20

kakaobrain/kortok
The code and models for "An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks" (AACL-IJCNLP 2020)
Language: Python - Size: 5.6 MB - Last synced at: 17 days ago - Pushed at: over 4 years ago - Stars: 118 - Forks: 10

kyegomez/MambaByte
Implementation of MambaByte in "MambaByte: Token-free Selective State Space Model" in Pytorch and Zeta
Language: Python - Size: 2.16 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 115 - Forks: 7

ropensci/hunspell
High-Performance Stemmer, Tokenizer, and Spell Checker for R
Language: C++ - Size: 4.45 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 113 - Forks: 44

clipperhouse/jargon
Tokenizers and lemmatizers for Go
Language: Go - Size: 1.1 MB - Last synced at: 11 days ago - Pushed at: 11 months ago - Stars: 109 - Forks: 1

belladoreai/llama3-tokenizer-js
JS tokenizer for LLaMA 3 and LLaMA 3.1
Language: JavaScript - Size: 7.22 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 108 - Forks: 7

bevacqua/megamark
😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer
Language: JavaScript - Size: 2.28 MB - Last synced at: 4 days ago - Pushed at: about 4 years ago - Stars: 106 - Forks: 7

Cledev-Limited/Cledev.OpenAI
.NET 7 SDK for OpenAI with a Blazor Server playground
Language: C# - Size: 511 KB - Last synced at: 11 months ago - Pushed at: almost 2 years ago - Stars: 105 - Forks: 17

JuliaLang/Tokenize.jl
Tokenization for Julia source code
Language: Julia - Size: 472 KB - Last synced at: 5 days ago - Pushed at: 12 months ago - Stars: 104 - Forks: 27

AmrDeveloper/FileQL
A tool that allows you to run SQL-like queries on local files instead of database files using the GitQL SDK.
Language: Rust - Size: 820 KB - Last synced at: 9 days ago - Pushed at: 14 days ago - Stars: 101 - Forks: 3

tlaceby/guide-to-interpreters-series
Contains source code for viewers following along with my Beginner's Guide To Building Interpreters series on my YouTube channel.
Language: TypeScript - Size: 65.4 KB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 100 - Forks: 15

chriskonnertz/string-calc
PHP calculator library for mathematical terms (expressions) passed as strings
Language: PHP - Size: 307 KB - Last synced at: 15 days ago - Pushed at: almost 3 years ago - Stars: 100 - Forks: 19

dluc/openai-tools
A collection of tools for working with OpenAI
Language: C# - Size: 559 KB - Last synced at: about 23 hours ago - Pushed at: almost 2 years ago - Stars: 99 - Forks: 15

togatoga/kanpyo
Japanese Morphological Analyzer written in Rust
Language: Rust - Size: 10.4 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 98 - Forks: 1

explosion/spacy-experimental
🧪 Cutting-edge experimental spaCy components and features
Language: Python - Size: 1.33 MB - Last synced at: 14 days ago - Pushed at: 12 months ago - Stars: 98 - Forks: 19

yishn/chinese-tokenizer
Tokenizes Chinese texts into words.
Language: JavaScript - Size: 11.2 MB - Last synced at: 13 days ago - Pushed at: over 2 years ago - Stars: 96 - Forks: 25

bzick/tokenizer
Tokenizer (lexer) for golang
Language: Go - Size: 103 KB - Last synced at: 9 months ago - Pushed at: about 1 year ago - Stars: 89 - Forks: 5

colindembovsky/cols-agent-tasks
Colin's ALM Corner Custom Build Tasks
Language: PowerShell - Size: 2.79 MB - Last synced at: 16 days ago - Pushed at: 5 months ago - Stars: 84 - Forks: 71

alfianlosari/GPTEncoder
Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for OpenAI ChatGPT API.
Language: Swift - Size: 554 KB - Last synced at: 10 days ago - Pushed at: about 2 years ago - Stars: 83 - Forks: 20

DCjanus/cang-jie
Chinese tokenizer for tantivy, based on jieba-rs
Language: Rust - Size: 36.1 KB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 80 - Forks: 23

HippoPHP/Hippo
PHP standards checker.
Language: PHP - Size: 458 KB - Last synced at: 11 months ago - Pushed at: over 7 years ago - Stars: 80 - Forks: 0

samber/go-gpt-3-encoder
Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3
Language: Go - Size: 558 KB - Last synced at: about 11 hours ago - Pushed at: 5 months ago - Stars: 79 - Forks: 21

venturachrisdev/djurl
Simple yet helpful library for writing Django URLs in an easy, short and intuitive way.
Language: Python - Size: 48.8 KB - Last synced at: 4 days ago - Pushed at: over 6 years ago - Stars: 79 - Forks: 3

AayushSameerShah/Neural-Net-Zero-to-Hero-with-Andrej
A collection of exploratory notebooks written in pure Python and in plain, human-readable language. It attempts to compile all the lectures from Andrej Karpathy's 💎 playlist on neural networks, which culminates in building GPT.
Language: Jupyter Notebook - Size: 191 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 77 - Forks: 10

ikskuh/parser-toolkit
A toolkit that makes it easier to write recursive-descent parsers in Zig.
Language: Zig - Size: 1.09 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 75 - Forks: 8

TangXiaoLv/Android-Sqlite-Fts5-Tokenizer
SQLite3 source code with an integrated FTS5 Chinese tokenizer
Language: C++ - Size: 11.7 MB - Last synced at: 17 days ago - Pushed at: over 7 years ago - Stars: 75 - Forks: 16

openshieldai/openshield
OpenShield is a new generation security layer for AI models
Language: Go - Size: 2.26 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 74 - Forks: 7

textgain/grasp
Essential NLP & ML, short & fast pure Python code
Language: Python - Size: 58.8 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 74 - Forks: 19

csstools/tokenizer
Tokenize CSS according to the CSS Syntax
Language: TypeScript - Size: 1.62 MB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 70 - Forks: 5

janlelis/wirb
Ruby Object Inspection for IRB
Language: Ruby - Size: 204 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 70 - Forks: 9

OpenPecha/Botok
🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
Language: Python - Size: 30.8 MB - Last synced at: 12 days ago - Pushed at: about 1 month ago - Stars: 65 - Forks: 16

mideind/GreynirServer
The greynir.is Icelandic natural language processing API and website.
Language: Python - Size: 40.1 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 65 - Forks: 17

thautwarm/EBNFParser 📦
Convenient parser generator for Python (check out https://github.com/thautwarm/RBNF for an advanced version).
Language: Python - Size: 896 KB - Last synced at: 11 days ago - Pushed at: about 7 years ago - Stars: 65 - Forks: 6

Kyubyong/neural_tokenizer
Tokenize English sentences using neural networks.
Language: Python - Size: 177 KB - Last synced at: 18 days ago - Pushed at: over 7 years ago - Stars: 63 - Forks: 9

winkjs/wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Language: JavaScript - Size: 2.05 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 61 - Forks: 12
