ucto

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits text into sentences. It also offers other basic preprocessing steps, such as case conversion, that make text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be extended to support others. It is used to tokenise Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
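
As a rough illustration of what "separating words from punctuation" and "splitting sentences" mean in practice, the following is a minimal Python sketch. It is not ucto's rule set or API: the regular expression, the `tokenize` function, and the `lowercase` parameter are hypothetical names chosen for this example, and real tokenisers handle abbreviations, URLs, and language-specific cases that this sketch ignores.

```python
import re

# Hypothetical illustration only: this is NOT ucto's rule set or API.
# A token is either a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Assumption: these three marks always end a sentence (no abbreviation handling).
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token.lower() if lowercase else token)
        if token in SENTENCE_END:
            # Close the current sentence at sentence-final punctuation.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that lack sentence-final punctuation.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Hello, world! How are you?"):
        print(" ".join(sentence))
```

Each sentence comes back as its own list of tokens, which is the shape downstream steps such as part-of-speech tagging typically expect.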

JSON API: http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2Fucto
PURL: pkg:github/LanguageMachines/ucto

Stars: 70
Forks: 14
Open issues: 12

License: gpl-3.0
Language: C++
Size: 6.14 MB
Dependencies parsed at: Pending

Created at: over 12 years ago
Updated at: 15 days ago
Pushed at: about 1 month ago
Last synced at: about 5 hours ago

Commit Stats

Commits: 1360
Authors: 8
Mean commits per author: 170.0
Development Distribution Score: 0.491
More commit stats: https://commits.ecosyste.ms/hosts/GitHub/repositories/LanguageMachines/ucto
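
The derived figures above can be checked with a short Python sketch. The mean follows directly from the listed counts (1360 commits, 8 authors). The Development Distribution Score formula used here, one minus the top author's share of commits, is an assumption about how the score is defined, and the function names are hypothetical; the page does not list per-author commit counts, so only the implied share is computed.

```python
def mean_commits_per_author(commits, authors):
    """Average number of commits per author."""
    return commits / authors


def development_distribution_score(top_author_commits, total_commits):
    """Assumed definition: 1 - (top author's commits / total commits)."""
    return 1 - top_author_commits / total_commits


if __name__ == "__main__":
    # Figures taken from the commit stats listed above.
    print(mean_commits_per_author(1360, 8))
    # Under the assumed definition, a score of 0.491 implies that the most
    # active author made roughly this share of all commits.
    print(round(1 - 0.491, 3))
```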

Topics: computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser
