An open API service providing repository metadata for many open source software ecosystems.

Topic: "tokenizer"

theseer/tokenizer

A small library for converting tokenized PHP source code into XML (and potentially other formats)

Language: PHP - Size: 83 KB - Last synced at: 9 days ago - Pushed at: about 1 year ago - Stars: 5,188 - Forks: 22

Chevrotain/chevrotain

Parser Building Toolkit for JavaScript

Language: TypeScript - Size: 36.5 MB - Last synced at: 2 days ago - Pushed at: 4 days ago - Stars: 2,599 - Forks: 212

roshan-research/hazm

Persian NLP Toolkit

Language: Python - Size: 25.5 MB - Last synced at: 11 days ago - Pushed at: 9 months ago - Stars: 1,270 - Forks: 186

natasha/natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

Language: Python - Size: 35.7 MB - Last synced at: 11 days ago - Pushed at: 6 months ago - Stars: 1,242 - Forks: 109

dqbd/tiktokenizer

Online playground for OpenAI tokenizers

Language: TypeScript - Size: 709 KB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 1,104 - Forks: 129

lovit/soynlp

A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

Language: Python - Size: 34.1 MB - Last synced at: 11 days ago - Pushed at: 2 months ago - Stars: 959 - Forks: 185

ikawaha/kagome

Self-contained Japanese Morphological Analyzer written in pure Go

Language: Go - Size: 711 MB - Last synced at: 7 days ago - Pushed at: 19 days ago - Stars: 857 - Forks: 56

no-context/moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

Language: JavaScript - Size: 770 KB - Last synced at: 9 days ago - Pushed at: almost 2 years ago - Stars: 848 - Forks: 68

BLKSerene/Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

Language: Python - Size: 81.9 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 694 - Forks: 91

mathewsanders/Mustard

🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Language: Swift - Size: 137 KB - Last synced at: 5 months ago - Pushed at: almost 7 years ago - Stars: 689 - Forks: 18

wangfenjin/simple

A SQLite3 FTS5 full-text search extension and tokenizer that supports Chinese and Pinyin

Language: C++ - Size: 967 KB - Last synced at: 6 days ago - Pushed at: 11 days ago - Stars: 669 - Forks: 96

risesoft-y9/Data-Labeling

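Data Labeling is a tool dedicated to processing and annotating text data. With a streamlined annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual labeling to build a baseline, automatic labeling then feeds back into manual labeling, and manual labeling finally corrects errors, which substantially improves annotation accuracy and efficiency. Data Labeling depends on the open-source digital base for personnel and position management.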

Language: Java - Size: 1.77 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 669 - Forks: 95

cbaziotis/ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).

Language: Python - Size: 659 KB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 667 - Forks: 92

open-korean-text/open-korean-text

Open Korean Text Processor - An Open-source Korean Text Processor

Language: Scala - Size: 32.7 MB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 624 - Forks: 98

smoothnlp/SmoothNLP 📦

Focused on explainable NLP techniques. An NLP toolset with a focus on explainable inference.

Language: Java - Size: 6.71 MB - Last synced at: 5 months ago - Pushed at: about 4 years ago - Stars: 624 - Forks: 112

jflex-de/jflex

The fast scanner generator for Java™ with full Unicode support

Language: Java - Size: 22.1 MB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 602 - Forks: 117

alasdairforsythe/tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

Language: Go - Size: 734 KB - Last synced at: 27 days ago - Pushed at: 10 months ago - Stars: 575 - Forks: 20

niieani/gpt-tokenizer

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.

Language: TypeScript - Size: 13.2 MB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 557 - Forks: 38

glayzzle/php-parser

🌿 NodeJS PHP Parser - extract AST or tokens

Language: JavaScript - Size: 29.5 MB - Last synced at: 10 days ago - Pushed at: 4 months ago - Stars: 542 - Forks: 71

lydell/js-tokens

Tiny JavaScript tokenizer.

Language: JavaScript - Size: 896 KB - Last synced at: 5 days ago - Pushed at: 16 days ago - Stars: 518 - Forks: 34

lionsoul2014/friso

High-performance Chinese tokenizer with GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular and easy to embed in other programs such as MySQL, PostgreSQL, and PHP.

Language: C - Size: 3.07 MB - Last synced at: 16 days ago - Pushed at: over 1 year ago - Stars: 497 - Forks: 91

hplt-project/sacremoses

Python port of Moses tokenizer, truecaser and normalizer

Language: Python - Size: 724 KB - Last synced at: 3 days ago - Pushed at: 11 months ago - Stars: 493 - Forks: 59

leodevbro/vscode-blockman

VSCode extension to highlight nested code blocks

Language: TypeScript - Size: 66.5 MB - Last synced at: 17 days ago - Pushed at: 7 months ago - Stars: 476 - Forks: 17

CogComp/cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

Language: Java - Size: 85.5 MB - Last synced at: 24 days ago - Pushed at: almost 2 years ago - Stars: 475 - Forks: 144

neurosnap/sentences

A multilingual command line sentence tokenizer in Golang

Language: Go - Size: 15.3 MB - Last synced at: 14 days ago - Pushed at: about 1 year ago - Stars: 448 - Forks: 38

lindera/lindera

A multilingual morphological analysis library.

Language: Rust - Size: 178 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 444 - Forks: 42

polm/fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Language: C++ - Size: 472 KB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 440 - Forks: 38

timtadh/lexmachine

Lex machinery for Go.

Language: Go - Size: 296 KB - Last synced at: 5 months ago - Pushed at: almost 3 years ago - Stars: 406 - Forks: 28

taishi-i/nagisa

A Japanese tokenizer based on recurrent neural networks

Language: Python - Size: 39.4 MB - Last synced at: 14 days ago - Pushed at: 10 months ago - Stars: 397 - Forks: 23

ku-nlp/jumanpp

Juman++ (a Morphological Analyzer Toolkit)

Language: C++ - Size: 3.78 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 387 - Forks: 44

daac-tools/vibrato

🎤 vibrato: Viterbi-based accelerated tokenizer

Language: Rust - Size: 1.08 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 353 - Forks: 15

belladoreai/llama-tokenizer-js

JS tokenizer for LLaMA 1 and 2

Language: JavaScript - Size: 3.07 MB - Last synced at: 7 days ago - Pushed at: 10 months ago - Stars: 351 - Forks: 23

guillaume-be/rust-tokenizers

Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

Language: Rust - Size: 1.12 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 307 - Forks: 27

OpenNMT/Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Language: C++ - Size: 1.69 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 302 - Forks: 74

NLPOptimize/flash-tokenizer

Efficient and optimized tokenizer engine for LLM inference serving

Language: C++ - Size: 195 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 301 - Forks: 3

zurawiki/tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken

Language: Rust - Size: 3.69 MB - Last synced at: 7 days ago - Pushed at: 20 days ago - Stars: 300 - Forks: 53

artitw/text2text

Text2Text Language Modeling Toolkit

Language: Python - Size: 870 KB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 300 - Forks: 38

bitextor/bitextor

Bitextor generates translation memories from multilingual websites

Language: Python - Size: 177 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 290 - Forks: 43

FoundationVision/UniTok

A Unified Tokenizer for Visual Generation and Understanding

Language: Python - Size: 29.9 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 247 - Forks: 5

mediacloud/sentence-splitter

Text to sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder.

Language: Python - Size: 45.9 KB - Last synced at: 3 days ago - Pushed at: over 2 years ago - Stars: 244 - Forks: 30

daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

Language: Rust - Size: 3.96 MB - Last synced at: 8 days ago - Pushed at: 11 days ago - Stars: 237 - Forks: 10

dmitry-brazhenko/SharpToken

SharpToken is a C# library for tokenizing natural language text. It's based on the tiktoken Python library and designed to be fast and accurate.

Language: C# - Size: 3.6 MB - Last synced at: 3 days ago - Pushed at: 11 months ago - Stars: 232 - Forks: 16

bnosac/udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Language: C++ - Size: 5.74 MB - Last synced at: 16 days ago - Pushed at: about 2 years ago - Stars: 214 - Forks: 33

netgen/query-translator

Query Translator is a search query translator with AST representation

Language: PHP - Size: 506 KB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 206 - Forks: 10

zhenye234/xcodec

AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Language: Python - Size: 1.77 MB - Last synced at: about 12 hours ago - Pushed at: about 13 hours ago - Stars: 200 - Forks: 12

fnl/syntok

Text tokenization and sentence segmentation (segtok v2)

Language: Python - Size: 203 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 193 - Forks: 34

Dadmatech/DadmaTools

DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.

Language: Python - Size: 92.6 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 191 - Forks: 42

microsoft/Tokenizer

TypeScript and .NET implementation of BPE tokenizer for OpenAI LLMs.

Language: C# - Size: 1.98 MB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 186 - Forks: 34

ropensci/tokenizers

Fast, Consistent Tokenization of Natural Language Text

Language: R - Size: 1.24 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 186 - Forks: 25

sugarme/tokenizer

NLP tokenizers written in Go language

Language: Go - Size: 1.48 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 183 - Forks: 30

mck89/peast

JavaScript parser written in PHP that generates AST from your code according to ECMAScript specification

Language: PHP - Size: 1.72 MB - Last synced at: about 9 hours ago - Pushed at: about 1 month ago - Stars: 181 - Forks: 20

botisan-ai/gpt3-tokenizer

Isomorphic JavaScript/TypeScript Tokenizer for GPT-3 and Codex Models by OpenAI.

Language: TypeScript - Size: 2.06 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 170 - Forks: 16

untitaker/html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser

Language: Rust - Size: 576 KB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 160 - Forks: 10

xinjli/transphone

Phoneme tokenizer and grapheme-to-phoneme model for 8k languages

Language: Python - Size: 342 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 156 - Forks: 15

adbar/simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Language: Python - Size: 729 MB - Last synced at: 9 days ago - Pushed at: 5 months ago - Stars: 154 - Forks: 12

gautierdag/bpeasy

Fast bare-bones BPE for modern tokenizer training

Language: Python - Size: 1.41 MB - Last synced at: 9 days ago - Pushed at: 18 days ago - Stars: 152 - Forks: 3

garvys-org/rustfst

Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.

Language: Rust - Size: 6.6 MB - Last synced at: 18 days ago - Pushed at: 8 months ago - Stars: 152 - Forks: 17

howl-anderson/MicroTokenizer

A lightweight, full-featured Chinese tokenizer that helps students understand how tokenizers work. MicroTokenizer: A lightweight Chinese tokenizer designed for educational and research purposes. Provides a practical, hands-on approach to understanding NLP concepts, featuring multiple tokenization algorithms and customizable models. Ideal for students, researchers, and NLP enthusiasts.

Language: Python - Size: 174 MB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 150 - Forks: 22

tsproisl/SoMaJo

A tokenizer and sentence splitter for German and English web and social media texts.

Language: Python - Size: 1.35 MB - Last synced at: 5 days ago - Pushed at: 4 months ago - Stars: 140 - Forks: 21

nette/tokenizer 📦

[DISCONTINUED] Source code tokenizer

Language: PHP - Size: 104 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 140 - Forks: 23

foonathan/lex 📦

Replaced by foonathan/lexy

Language: C++ - Size: 308 KB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 138 - Forks: 8

Kensuke-Mitsuzawa/JapaneseTokenizers

Aims to make Japanese tokenizers as easy to use as possible

Language: Python - Size: 271 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 138 - Forks: 21

mykolaharmash/works-for-me

Collection of developer toolkits

Language: JavaScript - Size: 14.8 MB - Last synced at: 12 days ago - Pushed at: almost 7 years ago - Stars: 130 - Forks: 7

LorettaDevs/Loretta

A C# Lua, GLua and Luau parser, code analysis, transformation and generation library.

Language: C# - Size: 9.66 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 124 - Forks: 12

MagedSaeed/farasapy

A Python implementation of Farasa toolkit

Language: Python - Size: 265 MB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 123 - Forks: 21

GerHobbelt/jison (fork of zaach/jison)

bison / YACC / LEX in JavaScript (LALR(1), SLR(1), etc. lexer/parser generator)

Language: JavaScript - Size: 32.2 MB - Last synced at: 2 days ago - Pushed at: about 4 years ago - Stars: 122 - Forks: 20

kakaobrain/kortok

The code and models for "An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks" (AACL-IJCNLP 2020)

Language: Python - Size: 5.6 MB - Last synced at: 17 days ago - Pushed at: over 4 years ago - Stars: 118 - Forks: 10

kyegomez/MambaByte

Implementation of MambaByte in "MambaByte: Token-free Selective State Space Model" in PyTorch and Zeta

Language: Python - Size: 2.16 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 115 - Forks: 7

ropensci/hunspell

High-Performance Stemmer, Tokenizer, and Spell Checker for R

Language: C++ - Size: 4.45 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 113 - Forks: 44

clipperhouse/jargon

Tokenizers and lemmatizers for Go

Language: Go - Size: 1.1 MB - Last synced at: 11 days ago - Pushed at: 11 months ago - Stars: 109 - Forks: 1

belladoreai/llama3-tokenizer-js

JS tokenizer for LLaMA 3 and LLaMA 3.1

Language: JavaScript - Size: 7.22 MB - Last synced at: 8 days ago - Pushed at: about 1 month ago - Stars: 108 - Forks: 7

bevacqua/megamark

😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer

Language: JavaScript - Size: 2.28 MB - Last synced at: 4 days ago - Pushed at: about 4 years ago - Stars: 106 - Forks: 7

Cledev-Limited/Cledev.OpenAI

.NET 7 SDK for OpenAI with a Blazor Server playground

Language: C# - Size: 511 KB - Last synced at: 11 months ago - Pushed at: almost 2 years ago - Stars: 105 - Forks: 17

JuliaLang/Tokenize.jl

Tokenization for Julia source code

Language: Julia - Size: 472 KB - Last synced at: 5 days ago - Pushed at: 12 months ago - Stars: 104 - Forks: 27

AmrDeveloper/FileQL

A tool that allows you to run SQL-like queries on local files instead of database files, using the GitQL SDK.

Language: Rust - Size: 820 KB - Last synced at: 9 days ago - Pushed at: 14 days ago - Stars: 101 - Forks: 3

tlaceby/guide-to-interpreters-series

Contains source code for viewers following along with my Beginners Guide To Building Interpreters series on my YouTube channel.

Language: TypeScript - Size: 65.4 KB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 100 - Forks: 15

chriskonnertz/string-calc

PHP calculator library for mathematical terms (expressions) passed as strings

Language: PHP - Size: 307 KB - Last synced at: 15 days ago - Pushed at: almost 3 years ago - Stars: 100 - Forks: 19

dluc/openai-tools

A collection of tools for working with OpenAI

Language: C# - Size: 559 KB - Last synced at: about 23 hours ago - Pushed at: almost 2 years ago - Stars: 99 - Forks: 15

togatoga/kanpyo

Japanese Morphological Analyzer written in Rust

Language: Rust - Size: 10.4 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 98 - Forks: 1

explosion/spacy-experimental

🧪 Cutting-edge experimental spaCy components and features

Language: Python - Size: 1.33 MB - Last synced at: 14 days ago - Pushed at: 12 months ago - Stars: 98 - Forks: 19

yishn/chinese-tokenizer

Tokenizes Chinese texts into words.

Language: JavaScript - Size: 11.2 MB - Last synced at: 13 days ago - Pushed at: over 2 years ago - Stars: 96 - Forks: 25

bzick/tokenizer

Tokenizer (lexer) for golang

Language: Go - Size: 103 KB - Last synced at: 9 months ago - Pushed at: about 1 year ago - Stars: 89 - Forks: 5

colindembovsky/cols-agent-tasks

Colin's ALM Corner Custom Build Tasks

Language: PowerShell - Size: 2.79 MB - Last synced at: 16 days ago - Pushed at: 5 months ago - Stars: 84 - Forks: 71

alfianlosari/GPTEncoder

Swift BPE Encoder/Decoder for OpenAI GPT Models. A programmatic interface for tokenizing text for OpenAI ChatGPT API.

Language: Swift - Size: 554 KB - Last synced at: 10 days ago - Pushed at: about 2 years ago - Stars: 83 - Forks: 20

DCjanus/cang-jie

Chinese tokenizer for tantivy, based on jieba-rs

Language: Rust - Size: 36.1 KB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 80 - Forks: 23

HippoPHP/Hippo

PHP standards checker.

Language: PHP - Size: 458 KB - Last synced at: 11 months ago - Pushed at: over 7 years ago - Stars: 80 - Forks: 0

samber/go-gpt-3-encoder

Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3

Language: Go - Size: 558 KB - Last synced at: about 11 hours ago - Pushed at: 5 months ago - Stars: 79 - Forks: 21

venturachrisdev/djurl

Simple yet helpful library for writing Django URLs in an easy, short, and intuitive way.

Language: Python - Size: 48.8 KB - Last synced at: 4 days ago - Pushed at: over 6 years ago - Stars: 79 - Forks: 3

AayushSameerShah/Neural-Net-Zero-to-Hero-with-Andrej

This repository contains a collection of exploratory notebooks written in pure Python and in language that we humans can read. It attempts to compile all lectures from Andrej Karpathy's 💎 playlist on neural networks, which ends with building GPT.

Language: Jupyter Notebook - Size: 191 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 77 - Forks: 10

ikskuh/parser-toolkit

A toolkit that makes it easier to write recursive-descent parsers in Zig.

Language: Zig - Size: 1.09 MB - Last synced at: 13 days ago - Pushed at: about 1 month ago - Stars: 75 - Forks: 8

TangXiaoLv/Android-Sqlite-Fts5-Tokenizer

SQLite3 source code with an integrated FTS5 Chinese tokenizer

Language: C++ - Size: 11.7 MB - Last synced at: 17 days ago - Pushed at: over 7 years ago - Stars: 75 - Forks: 16

openshieldai/openshield

OpenShield is a new generation security layer for AI models

Language: Go - Size: 2.26 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 74 - Forks: 7

textgain/grasp

Essential NLP & ML, short & fast pure Python code

Language: Python - Size: 58.8 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 74 - Forks: 19

csstools/tokenizer

Tokenize CSS according to the CSS Syntax

Language: TypeScript - Size: 1.62 MB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 70 - Forks: 5

janlelis/wirb

Ruby Object Inspection for IRB

Language: Ruby - Size: 204 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 70 - Forks: 9

OpenPecha/Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

Language: Python - Size: 30.8 MB - Last synced at: 12 days ago - Pushed at: about 1 month ago - Stars: 65 - Forks: 16

mideind/GreynirServer

The greynir.is Icelandic natural language processing API and website.

Language: Python - Size: 40.1 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 65 - Forks: 17

thautwarm/EBNFParser 📦

Convenient parser generator for Python (check out https://github.com/thautwarm/RBNF for an advanced version).

Language: Python - Size: 896 KB - Last synced at: 11 days ago - Pushed at: about 7 years ago - Stars: 65 - Forks: 6

Kyubyong/neural_tokenizer

Tokenize English sentences using neural networks.

Language: Python - Size: 177 KB - Last synced at: 18 days ago - Pushed at: over 7 years ago - Stars: 63 - Forks: 9

winkjs/wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

Language: JavaScript - Size: 2.05 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 61 - Forks: 12