Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub / LanguageMachines / ucto
Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that you can use to prepare your text for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
JSON API: https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LanguageMachines%2Fucto
Stars: 62
Forks: 13
Open Issues: 12
License: gpl-3.0
Language: C++
Repo Size: 6.25 MB
Dependencies: 5
Created: about 11 years ago
Updated: 1 day ago
Last pushed: 1 day ago
Last synced: 1 day ago
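The JSON API URL listed above percent-encodes the slash in the repository's full name (LanguageMachines%2Fucto). A minimal sketch of building that URL and fetching the record, using only the Python standard library, could look like the following. The helper names repo_api_url and fetch_repo_metadata are hypothetical, and the shape of the returned JSON is not described on this page, so the sketch does not assume any particular field names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Root taken from the JSON API URL shown on this page.
API_ROOT = "https://repos.ecosyste.ms/api/v1"


def repo_api_url(host: str, full_name: str) -> str:
    """Build the repository URL, percent-encoding "/" in owner/name as %2F."""
    # safe="" makes quote() encode the slash as well.
    return (
        f"{API_ROOT}/hosts/{quote(host, safe='')}"
        f"/repositories/{quote(full_name, safe='')}"
    )


def fetch_repo_metadata(host: str, full_name: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON record (hypothetical helper; needs network access)."""
    with urlopen(repo_api_url(host, full_name), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the URL only; call fetch_repo_metadata() to retrieve the record.
    print(repo_api_url("GitHub", "LanguageMachines/ucto"))
```

Encoding the full name with safe="" matters here: left unencoded, the slash would be read as an extra path segment rather than as part of the repository name.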
Commit Stats
Commits: 1360
Authors: 8
Mean commits per author: 170.0
Development Distribution Score: 0.491
More commit stats: https://commits.ecosyste.ms/hosts/GitHub/repositories/LanguageMachines/ucto
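The mean shown above follows directly from the two counts: 1360 commits across 8 authors. The sketch below reproduces that figure. It also includes a Development Distribution Score helper under the assumption that the score is defined as 1 minus the top author's share of all commits; this page does not state the formula, and the function names are hypothetical.

```python
def mean_commits_per_author(total_commits: int, authors: int) -> float:
    """Average number of commits per author, rounded to one decimal place."""
    return round(total_commits / authors, 1)


def development_distribution_score(top_author_commits: int, total_commits: int) -> float:
    """ASSUMED definition: 1 - (top author's commits / total commits)."""
    return round(1 - top_author_commits / total_commits, 3)


if __name__ == "__main__":
    # Figures from this page: 1360 commits, 8 authors.
    print(mean_commits_per_author(1360, 8))  # 170.0
```

Under the assumed definition, a score near 0 means one author made almost all commits, while a score closer to 1 means the work is spread more evenly. The top author's commit count is not listed on this page, so the score itself is not recomputed here.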
Topics: computational-linguistics, folia, language, natural-language-processing, nlp, punctuation, tokeniser
Dependencies
- Gottox/irc-message-action v2 composite
- actions/checkout v2 composite
- styfle/cancel-workflow-action 0.11.0 composite
- alpine latest build
- Mattraks/delete-workflow-runs v2 composite