Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: corpus

Repositories

kscanne/gaelg

NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine

Language: Perl - Size: 11.3 MB - Last synced: about 3 hours ago - Pushed: about 5 hours ago - Stars: 3 - Forks: 1

s-lilo/brat-peek

Framework for working with brat-annotated .ann files

Language: Python - Size: 146 KB - Last synced: about 10 hours ago - Pushed: about 12 hours ago - Stars: 7 - Forks: 2

luciamariaalvarezcrespo/GalMisoCorpus2023

Galician corpus for misogyny detection

Language: Python - Size: 4.22 MB - Last synced: about 11 hours ago - Pushed: about 13 hours ago - Stars: 6 - Forks: 0

endymecy/awesome-deeplearning-resources

Deep learning and deep reinforcement learning research papers and some code

Size: 290 MB - Last synced: about 4 hours ago - Pushed: 2 months ago - Stars: 2,824 - Forks: 664

sagesolar/Corpus-of-Taylor-Swift

This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.

Size: 5.93 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 4 - Forks: 0

DFKI-NLP/product-corpus

This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the relation CompanyProvidesProduct.

Size: 22 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 12 - Forks: 2

open-discourse/open-discourse

Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).

Language: Python - Size: 1.68 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 81 - Forks: 7

dracor-org/gerdracor

German Drama Corpus

Language: CSS - Size: 133 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 8 - Forks: 10

centre-for-educational-technology/evkk

ELLE - Estonian language learning and analysis environment for learners, educators and linguists

Language: JavaScript - Size: 109 MB - Last synced: about 9 hours ago - Pushed: 1 day ago - Stars: 1 - Forks: 3

saferwall/malware-souk

Collaborative malware exchange repository.

Language: Python - Size: 55.8 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 27 - Forks: 7

amir-zeldes/gum

Repository for the Georgetown University Multilayer Corpus (GUM)

Language: Python - Size: 1.06 GB - Last synced: 3 days ago - Pushed: 5 days ago - Stars: 86 - Forks: 51

fendouai/Awesome-Chatbot

Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:

Language: Python - Size: 12.7 KB - Last synced: 3 days ago - Pushed: 4 days ago - Stars: 2,006 - Forks: 406

quanteda/quanteda

An R package for the Quantitative Analysis of Textual Data

Language: R - Size: 741 MB - Last synced: about 20 hours ago - Pushed: 2 days ago - Stars: 825 - Forks: 186

mhbashari/awesome-persian-nlp-ir

Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources

Size: 192 KB - Last synced: about 7 hours ago - Pushed: 6 months ago - Stars: 700 - Forks: 113

khashashin/chechen_corpora

This repository contains the source code for the Chechen Language Corpora website.

Language: TypeScript - Size: 4.43 MB - Last synced: 4 days ago - Pushed: 4 days ago - Stars: 6 - Forks: 2

esteeschwarz/SPUND-LX

Linguistics essays

Language: HTML - Size: 74.6 MB - Last synced: 3 days ago - Pushed: 4 days ago - Stars: 0 - Forks: 0

UniversalDependencies/UD_Portuguese-Bosque

The Universal Dependencies (UD) Portuguese treebank.

Language: Common Lisp - Size: 209 MB - Last synced: 4 days ago - Pushed: 4 days ago - Stars: 46 - Forks: 11

PyThaiNLP/thaigov-v2-corpus

Thai News Dataset from Thai government website.

Language: Jupyter Notebook - Size: 89.3 MB - Last synced: 8 days ago - Pushed: 8 days ago - Stars: 11 - Forks: 1

ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions appropriate for use in training models to reverse distortions and recover to original denoised documents.

Language: Jupyter Notebook - Size: 84.2 MB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 42 - Forks: 5

ko-nlp/Korpora

Korean corpus repository

Language: Python - Size: 3.32 MB - Last synced: 5 days ago - Pushed: over 1 year ago - Stars: 651 - Forks: 77

franciellevargas/HausaHate

HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated according to a binary class (offensive and non-offensive) and hate speech targets (race, gender and none).

Size: 894 KB - Last synced: 6 days ago - Pushed: 7 days ago - Stars: 0 - Forks: 0

spottolaq/corpus-spotted-2020

This repository houses a comprehensive collection of 14,701 Instagram posts authored by Italian university students between January 2020 and December 2020. These posts offer invaluable insights into the experiences and reflections of students during the challenging period of the COVID-19 lockdown in Italy.

Size: 16.6 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 0 - Forks: 0

oroszgy/awesome-hungarian-nlp

A curated list of NLP resources for Hungarian

Size: 110 KB - Last synced: 6 days ago - Pushed: 6 months ago - Stars: 207 - Forks: 18

gauravcodepro/pubmed-abstract-fetcher

This function prepares the abstract and ID information for all the PubMed articles that you want to read and cite. I coded this using a web scraping approach; it is blazing fast and parses better than NCBI eutils.

Language: Python - Size: 27.3 KB - Last synced: 2 days ago - Pushed: 8 days ago - Stars: 1 - Forks: 0

OYE93/Chinese-NLP-Corpus

Collections of Chinese NLP corpus

Language: Python - Size: 7.14 MB - Last synced: 5 days ago - Pushed: over 3 years ago - Stars: 848 - Forks: 207

dkalpakchi/awesome-swedish-nlp

A curated list of resources for natural language processing (NLP) in Swedish

Size: 25.4 KB - Last synced: about 6 hours ago - Pushed: over 1 year ago - Stars: 19 - Forks: 1

qundao/corpus

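Collection of corpus data and lexicons: Chinese and English stop words, sentiment analysis, classification dictionaries, and sensitive-word lists (banned words, censored words)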

Size: 19.6 MB - Last synced: 7 days ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

EmilStenstrom/suc_to_iob

Convert the SUC 3.0 corpus from a custom format to IOB2 for use in training NER applications

Language: Python - Size: 25.4 KB - Last synced: 8 days ago - Pushed: over 2 years ago - Stars: 5 - Forks: 2

kan-bayashi/VCTKCorpusFullContextLabel

Full context label for VCTK Corpus.

Size: 54.3 MB - Last synced: 8 days ago - Pushed: about 4 years ago - Stars: 2 - Forks: 0

kamaravichow/text-summariser-python

Simple text summariser using NLTK in python

Language: Jupyter Notebook - Size: 15.6 KB - Last synced: 8 days ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

yilunzhu/ontogum

Repository for the OntoGUM Corpus

Language: Python - Size: 8.36 MB - Last synced: 7 days ago - Pushed: 8 days ago - Stars: 6 - Forks: 0

bertvandepoel/snelSLiM

A linguistic set of tools in Go and web interface in PHP to do quick Stable Lexical Marker Analysis

Language: JavaScript - Size: 4.2 MB - Last synced: 8 days ago - Pushed: almost 3 years ago - Stars: 3 - Forks: 0

mikahama/SemFi

Semantic relations for Finnish words

Language: HTML - Size: 79.1 KB - Last synced: 8 days ago - Pushed: 5 months ago - Stars: 2 - Forks: 2

rcarmo/newsfeed-corpus

A Dockerized RSS feed fetcher for NLP work, using asyncio

Language: JavaScript - Size: 794 KB - Last synced: 8 days ago - Pushed: over 1 year ago - Stars: 20 - Forks: 2

dariusk/corpora

A collection of small corpuses of interesting data for the creation of bots and similar stuff.

Language: JavaScript - Size: 2.94 MB - Last synced: 7 days ago - Pushed: 3 months ago - Stars: 4,852 - Forks: 1,299

franciellevargas/SentiAspect-pt

The SentiAspect-pt comprises 180 product reviews annotated according to implicit and explicit fine-grained opinions, which were hierarchically organized for aspect-based sentiment analysis and opinion summarization applications.

Size: 1.48 MB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 5 - Forks: 1

mesolitica/malaysian-dataset

We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/

Language: Jupyter Notebook - Size: 1.32 GB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 282 - Forks: 101

quasilyte/phpcorpus

A collection of various PHP code; useful for PHP tool writers to get some insight into what "real-world" PHP code looks like

Size: 10.7 KB - Last synced: 10 days ago - Pushed: about 2 years ago - Stars: 1 - Forks: 1

quasilyte/eldb

Emacs Lisp corpus. Code collected from many-many projects for you to query it!

Language: Emacs Lisp - Size: 347 KB - Last synced: 10 days ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0

proycon/colibri-core

Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool ``colibri-patternmodeller``, which allows you to build, view, manipulate and query pattern models.

Language: C++ - Size: 10.1 MB - Last synced: 3 days ago - Pushed: 6 months ago - Stars: 123 - Forks: 19

Taiwan-Social-Media-Corpus/blacklab-demo

A repo that demonstrates how to build a BlackLab corpus via Docker and Nginx.

Language: Shell - Size: 184 KB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 0 - Forks: 0

ThinamXx/WordFrequency_using_NLTK

In this repository, I have used NLP to determine: What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?

Language: HTML - Size: 1.04 MB - Last synced: 12 days ago - Pushed: over 3 years ago - Stars: 1 - Forks: 0

seanpm2001/DroppedText_Corpus

A text corpus collection for the DroppedText language.

Size: 2.14 MB - Last synced: 12 days ago - Pushed: over 1 year ago - Stars: 3 - Forks: 2

german-asr/megs

A merged version of multiple open-source German speech datasets.

Language: Jupyter Notebook - Size: 235 KB - Last synced: 6 days ago - Pushed: 6 days ago - Stars: 28 - Forks: 3

kgjerde/corporaexplorer

An R package for dynamic exploration of text collections

Language: R - Size: 5.89 MB - Last synced: 7 days ago - Pushed: 11 months ago - Stars: 63 - Forks: 4

innerNULL/mia

My Implementations' Archive

Language: Python - Size: 1.6 MB - Last synced: 15 days ago - Pushed: 15 days ago - Stars: 0 - Forks: 0

julienijs/keep_V-ing

The grammaticalization of keep

Language: R - Size: 5 MB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 0 - Forks: 0

clarin-eric/ParlaMint

ParlaMint: Comparable Parliamentary Corpora

Language: XSLT - Size: 2.1 GB - Last synced: 21 days ago - Pushed: 23 days ago - Stars: 37 - Forks: 50

grammarly/ua-gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Language: Macaulay2 - Size: 18 MB - Last synced: about 16 hours ago - Pushed: 3 months ago - Stars: 254 - Forks: 21

writecrow/crow_backend

The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing

Language: PHP - Size: 2.72 MB - Last synced: 16 days ago - Pushed: 17 days ago - Stars: 1 - Forks: 0

KurdishBLARK/KTC

Kurdish Textbooks Corpus

Size: 1.92 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 6 - Forks: 0

crownpku/Small-Chinese-Corpus

Some useful Chinese corpus datasets 中文语料小数据

Size: 92.4 MB - Last synced: 8 days ago - Pushed: about 4 years ago - Stars: 526 - Forks: 161

kunansy/RNC

API for Russian National Corpus

Language: Python - Size: 745 KB - Last synced: 6 days ago - Pushed: 10 months ago - Stars: 7 - Forks: 0

alexeykosh/lingcorpora.py

API for corpora

Language: Python - Size: 146 KB - Last synced: 17 days ago - Pushed: almost 5 years ago - Stars: 8 - Forks: 11

NiuTrans/Classical-Modern

A very comprehensive parallel corpus of Classical Chinese (ancient texts) and modern Chinese

Language: Python - Size: 400 MB - Last synced: 18 days ago - Pushed: 18 days ago - Stars: 891 - Forks: 194

adliska/parallel_text_cleaning

Code for my BSc thesis: Cleaning of Parallel Texts for Machine Translation

Language: Java - Size: 15.6 KB - Last synced: 19 days ago - Pushed: about 8 years ago - Stars: 0 - Forks: 0

several27/FakeNewsCorpus

A dataset of millions of news articles scraped from a curated list of data sources.

Size: 442 KB - Last synced: 20 days ago - Pushed: over 4 years ago - Stars: 373 - Forks: 96

philipperemy/japanese-words-to-vectors

Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.

Language: Python - Size: 333 KB - Last synced: 8 days ago - Pushed: over 2 years ago - Stars: 83 - Forks: 19

agaraman0/Fundamentals-Of-NLP

Natural language Processing

Language: Jupyter Notebook - Size: 34.2 KB - Last synced: 22 days ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0

chatopera/efaqa-corpus-zh

❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus

Language: Python - Size: 204 KB - Last synced: 22 days ago - Pushed: 4 months ago - Stars: 547 - Forks: 80

christos-c/bible-corpus

A multilingual parallel corpus created from translations of the Bible.

Size: 138 MB - Last synced: 21 days ago - Pushed: about 2 months ago - Stars: 163 - Forks: 45

PenguinCabinet/mama-katu-DM-corpus

A corpus of Japanese "Mama Katu" invitation spam messages.

Language: Python - Size: 44.9 KB - Last synced: 22 days ago - Pushed: 12 months ago - Stars: 42 - Forks: 12

744189447/tfidf

A Go library for Chinese and English tag extraction. It uses Jieba for Chinese word segmentation, extracts corpus tags by TF-IDF weight, and stores the corpus set in BoltDB.

Language: Go - Size: 13.7 KB - Last synced: 22 days ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

gambolputty/german-nouns

A list of ~100,000 German nouns and their grammatical properties compiled from WiktionaryDE as CSV file. Plus a module to look up the data and parse compound words.

Language: Python - Size: 21 MB - Last synced: 20 days ago - Pushed: about 2 months ago - Stars: 127 - Forks: 18

techiaith/corpws-meincnodi-rhannau-ymadrodd

Corpws ar gyfer meincnodi tagwyr rhannau ymadrodd Cymraeg | A corpus for benchmarking Welsh part-of-speech taggers

Size: 104 KB - Last synced: 23 days ago - Pushed: about 2 years ago - Stars: 0 - Forks: 1

sorinmarti/fruechtekorb

This is a text corpus management system for the German linguistics department of the University of Basel.

Language: PHP - Size: 531 KB - Last synced: 23 days ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0

Superar/Puntuguese

Language: Python - Size: 3.61 MB - Last synced: 22 days ago - Pushed: 22 days ago - Stars: 3 - Forks: 0

aquilax/bg-words-dict

A list of words in the Bulgarian language.

Language: Shell - Size: 7.82 MB - Last synced: 23 days ago - Pushed: about 3 years ago - Stars: 5 - Forks: 2

NCBI-Hackathons/ClusterDuck

Disease Clustering from Literature Based on Minimal Training Data

Language: Python - Size: 274 KB - Last synced: 23 days ago - Pushed: over 1 year ago - Stars: 7 - Forks: 6

hugovk/everyfinnishword

Every Finnish word

Size: 1.01 MB - Last synced: 8 days ago - Pushed: almost 9 years ago - Stars: 28 - Forks: 1

GateNLP/corpusconversion-tiger

Tool to convert the German Tiger corpus and other corpora in Tiger format to GATE

Language: Groovy - Size: 6.84 KB - Last synced: 24 days ago - Pushed: over 7 years ago - Stars: 0 - Forks: 0

fbkarsdorp/concy

Simple Concordance Tool

Language: Python - Size: 1.95 KB - Last synced: 24 days ago - Pushed: over 6 years ago - Stars: 2 - Forks: 0

tensorlayer/seq2seq-chatbot

Chatbot in 200 lines of code using TensorLayer

Language: Python - Size: 14.7 MB - Last synced: 24 days ago - Pushed: over 2 years ago - Stars: 836 - Forks: 316

linhd-postdata/disco

Diachronic Spanish Sonnet Corpus. Canonical and minor authors in Spanish (Europe and America): 15th to 19th century

Size: 12.2 MB - Last synced: 24 days ago - Pushed: almost 6 years ago - Stars: 4 - Forks: 0

lxs602/Chinese-Mandarin-Dictionaries

中文词典 / 中文詞典。Chinese / Chinese-English dictionaries.

Language: HTML - Size: 7.49 GB - Last synced: 24 days ago - Pushed: 24 days ago - Stars: 109 - Forks: 18

adbar/trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Language: Python - Size: 23.2 MB - Last synced: 27 days ago - Pushed: 27 days ago - Stars: 2,688 - Forks: 205

PyThaiNLP/Thai-Lao-Parallel-Corpus

Thai Lao Parallel corpus

Size: 524 KB - Last synced: 8 days ago - Pushed: over 2 years ago - Stars: 5 - Forks: 1

MiMoText/roman18

Collection de romans français du dix-huitième siècle (1751-1800) / Collection of Eighteenth-Century French Novels (1751-1800)

Language: HTML - Size: 440 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 17 - Forks: 7

INL/corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

Language: TypeScript - Size: 13.3 MB - Last synced: 27 days ago - Pushed: 28 days ago - Stars: 15 - Forks: 6

INL/BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene

Language: Java - Size: 25.2 MB - Last synced: 28 days ago - Pushed: 29 days ago - Stars: 97 - Forks: 51

AMS21/DLXEmu-Corpus

Corpus storage for DLXEmu fuzzers

Size: 1.74 GB - Last synced: 26 days ago - Pushed: 26 days ago - Stars: 0 - Forks: 0

erc-dharma/tfb-daksinakosala-epigraphy

DHARMA project Task Force B, Dakṣiṇa Kosala epigraphic corpus being prepared by Natasja Bosma.

Language: HTML - Size: 1.55 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 0 - Forks: 0

flairNLP/fundus

A very simple news crawler with a funny name

Language: Python - Size: 14.4 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 38 - Forks: 5

CanCLID/canto-filter

粵文語料篩選器 Cantonese text filter

Language: Python - Size: 21.5 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 29 - Forks: 2

zjunlp/IEPile

IEPile: A Large-Scale Information Extraction Corpus

Language: Python - Size: 2.07 MB - Last synced: 28 days ago - Pushed: 29 days ago - Stars: 61 - Forks: 4

lucasjinreal/weibo_terminater

The final Weibo crawler: scrape anything from Weibo, including comments, post contents, and followers. The Terminator.

Language: Python - Size: 162 KB - Last synced: 26 days ago - Pushed: over 4 years ago - Stars: 2,312 - Forks: 460

howl-anderson/MITIE_Chinese_Wikipedia_corpus

Pre-trained Wikipedia corpus by MITIE

Size: 5.86 KB - Last synced: 8 days ago - Pushed: over 5 years ago - Stars: 52 - Forks: 9

unendin/Trump_Campaign_Corpus

Corpus of campaign speeches, interviews, debates, statements and tweets by Donald Trump

Size: 22.1 MB - Last synced: 29 days ago - Pushed: almost 7 years ago - Stars: 14 - Forks: 6

SMIL-SPCRAS/DAVIS

Official repo for "Audio-Visual Speech Recognition In-the-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-based Method" in ICASSP 2024

Language: JavaScript - Size: 5.82 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 6 - Forks: 0

chatopera/insuranceqa-corpus-zh

Insurance industry corpus, chatbot

Language: Python - Size: 533 MB - Last synced: 26 days ago - Pushed: 6 months ago - Stars: 989 - Forks: 338

islamAndAi/QURAN-NLP

Quran, Hadith, Translations, Tafaseer, Corpus Linguistics. Everything for NLP

Language: Jupyter Notebook - Size: 105 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 42 - Forks: 9

gunthercox/chatterbot-corpus

A multilingual dialog corpus

Language: Python - Size: 536 KB - Last synced: 27 days ago - Pushed: 3 months ago - Stars: 1,339 - Forks: 1,149

CBLUEbenchmark/CBLUE

中文医疗信息处理基准CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

Language: Python - Size: 1.61 MB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 667 - Forks: 118

nevenjovanovic/croatiae-auctores-latini-textus

XML texts of Croatian Latin authors (published as CroALa digital collection)

Language: XQuery - Size: 39.4 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 6 - Forks: 4

FLAGlab/SimCorp

This corpus contains different datasets of behaviorally equivalent C/C++ programs to evaluate their semantic similitude. The datasets: 6 Type-4 scenarios extracted from the BigCloneBench; 10 programs for sorting, aggregation, and search algorithms; 566 programs extracted from CodeForces solving 5 different problems.

Language: C++ - Size: 9.06 MB - Last synced: 23 days ago - Pushed: almost 2 years ago - Stars: 2 - Forks: 0

Related Keywords

corpus (771), nlp (205), dataset (109), natural-language-processing (88), python (78), corpus-linguistics (72), linguistics (59), machine-learning (57), corpora (52), corpus-data (52), language (27), named-entity-recognition (24), chinese (22), sentiment-analysis (22), deep-learning (21), nltk (21), corpus-tools (21), datasets (21), chatbot (19), nlp-machine-learning (18), language-model (17), bert (17), digital-humanities (17), data (17), machine-translation (16), corpus-processing (16), dialogue (16), text (16), news (15), text-mining (15), word2vec (15), japanese (15), arabic (14), ner (14), wikipedia (14), twitter (14), arabic-nlp (13), english (13), text-classification (13), xml (12), crawler (12), parallel-corpus (12), information-extraction (12), search-engine (12), words (11), speech (11), annotation (11), data-science (11), scraper (11), ngrams (10), computational-linguistics (10), asr (10), python3 (10), translation (9), tensorflow (9), web-scraping (9), treebank (9), classification (9), corpus-builder (9), thai-language (9), speech-recognition (9), chinese-nlp (8), information-retrieval (8), korean (8), nlp-datasets (8), kurdish (8), r (8), literature (8), dictionary (8), spacy (8), dialogues (8), tf-idf (8), chinese-corpus (8), parsing (8), natural-language-understanding (7), tei (7), text-processing (7), japanese-language (7), conversation (7), corpus-generator (7), indonesian-language (7), fuzzing (7), artificial-intelligence (7), text-corpus (7), tei-xml (7), stopwords (7), embeddings (7), spanish (7), summarization (7), lstm (7), pos-tagging (7), semantics (7), vocabulary (7), german (6), topic-modeling (6), docker (6), database (6), text-analysis (6), api (6), search (6)