Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub / LanguageMachines / ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, which you can use to prepare your text for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated into Frog, our Dutch morpho-syntactic processor, for tokenizing Dutch text. http://ilk.uvt.nl/ucto

JSON API: https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2Fucto
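The JSON API URL above percent-encodes the slash in the repository's full name (`LanguageMachines%2Fucto`). A minimal sketch of building that URL and fetching the record follows, using only the Python standard library. The helper names `repository_url` and `fetch_repository` are hypothetical, and the URL pattern is inferred from the single link shown above rather than from the service's documentation. The shape of the returned JSON is not assumed here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the "JSON API" link on this page.
API_ROOT = "https://repos.ecosyste.ms/api/v1"


def repository_url(host: str, full_name: str) -> str:
    """Build the repository endpoint URL (hypothetical helper).

    safe="" makes quote() encode "/" as %2F, matching the link above.
    """
    return (
        f"{API_ROOT}/hosts/{quote(host, safe='')}"
        f"/repositories/{quote(full_name, safe='')}"
    )


def fetch_repository(host: str, full_name: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON record (requires network access)."""
    with urlopen(repository_url(host, full_name), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the URL; call fetch_repository() to actually query the API.
    print(repository_url("GitHub", "LanguageMachines/ucto"))
```

Encoding the full name with `safe=""` matters because an unencoded slash would be read as an extra path segment.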

Stars: 62
Forks: 13
Open Issues: 12

License: gpl-3.0
Language: C++
Repo Size: 6.25 MB
Dependencies: 5

Created: about 11 years ago
Updated: 1 day ago
Last pushed: 1 day ago
Last synced: 1 day ago

Commit Stats

Commits: 1360
Authors: 8
Mean commits per author: 170.0
Development Distribution Score: 0.491
More commit stats: https://commits.ecosyste.ms/hosts/GitHub/repositories/LanguageMachines/ucto
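The commit figures above can be cross-checked with a little arithmetic. The sketch below recomputes the mean commits per author from the listed totals. It also assumes that the Development Distribution Score is defined as 1 minus the top author's share of all commits; that definition is an assumption not stated on this page, so the implied share is illustrative only. The function names are hypothetical.

```python
def mean_commits_per_author(commits: int, authors: int) -> float:
    """Total commits divided by the number of distinct authors."""
    return commits / authors


def implied_top_author_share(dds: float) -> float:
    """Share of commits by the most active author.

    ASSUMPTION: DDS = 1 - (top author's commits / total commits).
    """
    return 1.0 - dds


if __name__ == "__main__":
    # Values copied from the Commit Stats listing above.
    print(mean_commits_per_author(1360, 8))
    print(round(implied_top_author_share(0.491), 3))
```

Under that assumed definition, a score near 0.5 would mean roughly half of all commits come from a single author.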

Topics: computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser

Dependencies
    Dockerfile (docker)