Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: corpus-linguistics

Repositories

mshakirDr/MFTE

MFTE (Multi Feature Tagger of English) Python is a Python version of Le Foll's MFTE, which was written in Perl. It is extended to include semantic tags from Biber (2006) and Biber et al. (1999), as well as other specific tags.

Language: Python - Size: 23.7 MB - Last synced: 36 minutes ago - Pushed: about 2 hours ago - Stars: 14 - Forks: 2

BLKSerene/Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

Language: Python - Size: 66.7 MB - Last synced: about 15 hours ago - Pushed: about 17 hours ago - Stars: 672 - Forks: 88

tanloong/interlaced.nvim

Neovim plugin for aligning bilingual parallel texts

Language: Lua - Size: 41 KB - Last synced: about 21 hours ago - Pushed: about 22 hours ago - Stars: 5 - Forks: 0

kuhumcst/texton

Text Tonsorium - a toolbox that automatically arranges NLP tools in workflows and enacts them with the user's input

Language: PHP - Size: 7.72 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 4 - Forks: 0

tomachalek/vertigo

A corpus vertical file parser

Language: Go - Size: 80.1 KB - Last synced: 2 days ago - Pushed: 3 days ago - Stars: 1 - Forks: 0

complexico/anger-mad-coca

A repository of R code and data for a paper titled "Exploring grammatical and semantic profiles of ANGRY and MAD: A corpus-based study". The paper uses data from the Corpus of Contemporary American English (COCA) as part of the undergraduate thesis project by Ida Ayu Saskara Tranggana Suari, supervised by Prof. I N. Sudipa and Gede Rajeg, PhD.

Language: R - Size: 2.09 MB - Last synced: 1 day ago - Pushed: 2 days ago - Stars: 0 - Forks: 0

CambridgeSemiticsLab/BH_time_collocations

Data for PhD Thesis: A Collocational Analysis of Biblical Hebrew Time Phrases

Language: Jupyter Notebook - Size: 486 MB - Last synced: 3 days ago - Pushed: 4 days ago - Stars: 6 - Forks: 0

engisalor/sketch-grammar-explorer

A Python package for the Sketch Engine API

Language: Python - Size: 293 KB - Last synced: 6 days ago - Pushed: 6 days ago - Stars: 5 - Forks: 0

fau-klue/pandas-association-measures

Statistical association measures for Python pandas

Language: Python - Size: 923 KB - Last synced: 7 days ago - Pushed: 5 months ago - Stars: 8 - Forks: 1

kirralabs/indonesian-NLP-resources

Data resources for Indonesian NLP

Size: 7.81 KB - Last synced: about 9 hours ago - Pushed: over 3 years ago - Stars: 221 - Forks: 50

czcorpus/kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

Language: TypeScript - Size: 37 MB - Last synced: 7 days ago - Pushed: 8 days ago - Stars: 59 - Forks: 22

suomela/types3

types3: Type accumulation curves

Language: Rust - Size: 1.75 MB - Last synced: 8 days ago - Pushed: 8 days ago - Stars: 1 - Forks: 0

oroszgy/awesome-hungarian-nlp

A curated list of NLP resources for Hungarian

Size: 110 KB - Last synced: 3 days ago - Pushed: 7 months ago - Stars: 208 - Forks: 18

complexico/verb-noun-assoc-corpus-experiment

Repository of data and results for an undergraduate thesis titled "A Corpus-Based Study to Triangulating Experimental Evidence Regarding Verb-Noun Association for Action Verbs" by I Gede Semara Dharma Putra.

Size: 787 KB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 0 - Forks: 0

c0ntradicti0n/CorpusCookApp

App and scripts that work with the corpus builder CorpusCook to keep a corpus updated with corrections of wrong predictions

Language: Python - Size: 181 MB - Last synced: 9 days ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0

elenlefoll/TextbookMDA

Repository of the online supplements to "Textbook English: A Multi-Dimensional Approach" (Le Foll, to appear).

Language: TeX - Size: 289 MB - Last synced: 10 days ago - Pushed: 10 days ago - Stars: 0 - Forks: 0

adbar/German-NLP

Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

Size: 103 KB - Last synced: 10 days ago - Pushed: 7 months ago - Stars: 405 - Forks: 57

oscar-project/ungoliant

🕷️ The pipeline for the OSCAR corpus

Language: Rust - Size: 4.72 MB - Last synced: about 5 hours ago - Pushed: 5 months ago - Stars: 150 - Forks: 14

hermanpetrov/Keyword_search

An extractor of keywords for Estonian texts.

Language: Python - Size: 50.6 MB - Last synced: 13 days ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0

complexico/dipscorling2024

Repository for materials to be delivered at the Diponegoro Summer Course in Corpus Linguistics (DipSCORLING 2024) (22 - 27 July 2024).

Size: 2.28 MB - Last synced: 13 days ago - Pushed: 14 days ago - Stars: 0 - Forks: 0

CompLin/nheengatu

Tools and resources for the computational processing of Nheengatu (Modern Tupi)

Language: Python - Size: 6.04 MB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 6 - Forks: 1

Ohara124c41/MLNDT-Beta-Plagiarism_Detection

First machine learning project for beta testing (Udacity MLND-T). The students will be utilizing n-grams and associations to detect plagiarized essay submissions.

Language: Jupyter Notebook - Size: 448 KB - Last synced: 16 days ago - Pushed: over 5 years ago - Stars: 1 - Forks: 0

RenanKummer/ufrgs-exatolp-webapi

Web APIs for corpus linguistic research in Brazilian Portuguese

Language: C# - Size: 1020 KB - Last synced: 16 days ago - Pushed: 17 days ago - Stars: 0 - Forks: 0

esbudylin/rusrime

a terminal tool for searching rhymes within the Russian National Corpus

Language: Python - Size: 43 KB - Last synced: 19 days ago - Pushed: 19 days ago - Stars: 0 - Forks: 0

notesjor/CorpusExplorer.Terminal.Console

Allows other programs/programming languages to access analyses/data of CorpusExplorer v2.0

Language: C# - Size: 596 KB - Last synced: 21 days ago - Pushed: 21 days ago - Stars: 7 - Forks: 0

julienijs/keep_V-ing

The grammaticalization of keep

Language: R - Size: 5 MB - Last synced: 22 days ago - Pushed: 23 days ago - Stars: 0 - Forks: 0

google/corpuscrawler

Crawler for linguistic corpora

Language: Python - Size: 487 KB - Last synced: 9 days ago - Pushed: 5 months ago - Stars: 181 - Forks: 56

writecrow/crow_backend

The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing

Language: PHP - Size: 2.72 MB - Last synced: 23 days ago - Pushed: 24 days ago - Stars: 1 - Forks: 0

KurdishBLARK/KTC

Kurdish Textbooks Corpus

Size: 1.92 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 6 - Forks: 0

jacorread/jacorread.github.io

Alejandro Correa

Language: SCSS - Size: 115 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 0 - Forks: 0

lin-380-s24/lin-380-s24.github.io

Course site

Language: JavaScript - Size: 623 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 0 - Forks: 0

ispasic/idiometry

An idiom search engine

Language: JavaScript - Size: 776 KB - Last synced: 27 days ago - Pushed: over 1 year ago - Stars: 6 - Forks: 1

IngoKl/PyXMLConc

A very simple concordancer with XML support.

Language: Python - Size: 35.2 KB - Last synced: 29 days ago - Pushed: over 3 years ago - Stars: 0 - Forks: 1

sorinmarti/fruechtekorb

A text corpus management system for the German linguistics department of the University of Basel.

Language: PHP - Size: 531 KB - Last synced: about 1 month ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0

Superar/Puntuguese

Language: Python - Size: 3.61 MB - Last synced: 30 days ago - Pushed: about 1 month ago - Stars: 3 - Forks: 0

kmkurn/id-nlp-resource

A list of Indonesian NLP resources.

Size: 38.1 KB - Last synced: 26 days ago - Pushed: over 2 years ago - Stars: 266 - Forks: 48

fbkarsdorp/concy

Simple Concordance Tool

Language: Python - Size: 1.95 KB - Last synced: about 1 month ago - Pushed: over 6 years ago - Stars: 2 - Forks: 0

UUDigitalHumanitieslab/I-analyzer

The great textmining tool that obviates all others

Language: Python - Size: 48.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 6 - Forks: 1

snizio/italian-wiktionary-parser

This repository contains a Python script for parsing an XML dump of the Italian Wiktionary (Wikizionario); it also contains the parsed dictionary in a JSON file and an ONLI (Italian database of neologisms) scraper with the scraped data in a CSV file

Language: Python - Size: 137 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 5 - Forks: 0

acqdiv/acqdiv

Pipeline for the ACQDIV Corpus Database

Language: Python - Size: 2.59 MB - Last synced: 3 days ago - Pushed: over 3 years ago - Stars: 2 - Forks: 3

natasha/nerus

Large silver-standard Russian corpus with NER, morphology, and syntax markup

Language: Python - Size: 9.62 MB - Last synced: 22 days ago - Pushed: 10 months ago - Stars: 58 - Forks: 9

cboulanger/ltkg-tools

Legal Theory Knowledge Graph Project - Tools and Resources

Size: 114 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

islamAndAi/QURAN-NLP

Quran, Hadith, Translations, Tafaseer, Corpus Linguistics. Everything for NLP

Language: Jupyter Notebook - Size: 105 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 42 - Forks: 9

JorgeFCS/multimodal-annotation-distance

A tool for determining distances between multimodal annotations.

Language: Python - Size: 485 KB - Last synced: about 1 month ago - Pushed: 7 months ago - Stars: 0 - Forks: 1

louisowen6/NLP_bahasa_resources

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Size: 258 KB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 427 - Forks: 118

isabel-mm/stylo-r-novels

R+Python code for stylometric analysis on a corpus of Anglophone novels.

Language: Python - Size: 18.4 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 1 - Forks: 0

CLARIAH/wp6-missieven

General Missives in Text-Fabric

Language: Jupyter Notebook - Size: 279 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 5 - Forks: 2

faktorovich/Attribution

Computational-Linguistics Attribution Data

Size: 150 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 2 - Forks: 0

OpenCorpora/opencorpora

A web-based engine for creating and annotating textual corpora

Language: PHP - Size: 5.93 MB - Last synced: 29 days ago - Pushed: 9 months ago - Stars: 237 - Forks: 23

nikhil-iyer-97/Language-Identifier

Language identification toolkit for identifying what language a document is written in

Language: Python - Size: 7.65 MB - Last synced: about 2 months ago - Pushed: almost 3 years ago - Stars: 5 - Forks: 1

notesjor/corpusexplorer2.0

Corpus linguistics has never been so easy...

Language: C# - Size: 32.1 MB - Last synced: 17 days ago - Pushed: 3 months ago - Stars: 19 - Forks: 2

sylvainloiseau/igtcorpus

Tools for IGT (interlinear glossed texts).

Language: Python - Size: 373 KB - Last synced: about 2 months ago - Pushed: 11 months ago - Stars: 1 - Forks: 0

pasmod/simurg

A Dataset for Training and Testing Abstractive Summarizers

Language: Python - Size: 9.25 MB - Last synced: about 2 months ago - Pushed: about 7 years ago - Stars: 3 - Forks: 1

bdar-lab/heb_architecture_corpus

Cleaned, parsed, and analyzed Hebrew textual corpus of documents pertaining to construction, planning, and architecture.

Size: 3.35 GB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 4 - Forks: 0

magizbox/scraper

Scraper

Language: Python - Size: 74.8 MB - Last synced: 2 months ago - Pushed: over 5 years ago - Stars: 13 - Forks: 7

OliverHellwig/sanskrit

Data for the quantitative study of (Vedic) Sanskrit

Language: Python - Size: 1010 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 103 - Forks: 41

interrogator/conll-df

CONLL-U to Pandas DataFrame

Language: Python - Size: 14.6 KB - Last synced: 18 days ago - Pushed: over 6 years ago - Stars: 29 - Forks: 9

johnwdubois/rezonator

Rezonator: Dynamics of human engagement

Language: Yacc - Size: 4.36 GB - Last synced: 2 months ago - Pushed: about 1 year ago - Stars: 34 - Forks: 1

KMCS-NII/AASC

AASC: ACL Anthology Sentence Corpus

Language: Perl - Size: 17.6 KB - Last synced: 17 days ago - Pushed: over 3 years ago - Stars: 21 - Forks: 2

PyThaiNLP/thai-law

Thai Law Dataset (Act of Parliament)

Language: Jupyter Notebook - Size: 11.3 MB - Last synced: 16 days ago - Pushed: almost 3 years ago - Stars: 14 - Forks: 4

engisalor/quartz

An app for visualizing Sketch Engine API data

Language: Python - Size: 715 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

uma-pi1/OPIEC-pipeline

Language: Java - Size: 59.3 MB - Last synced: 10 days ago - Pushed: about 2 years ago - Stars: 14 - Forks: 2

Alex-bzh/compuling

Resources for learning how to manage corpora with Python.

Language: Jupyter Notebook - Size: 15.8 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 2 - Forks: 0

gederajeg/diatesis-bahasa-indonesia

Repository of data and programming code for the book chapter titled "Kajian korpus kuantitatif terhadap aspek-aspek diatesis dalam bahasa Indonesia" ("A quantitative corpus study of aspects of voice in Indonesian"), which forms part of the corpus-based book Tatabahasa Bahasa Indonesia Kontemporer (TBIK). The full manuscript can be accessed online 👇.

Language: HTML - Size: 4.64 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

IngoKl/textdirectory

TextDirectory allows you to filter, transform, and combine multiple text files into one aggregated file.

Language: Python - Size: 5.9 MB - Last synced: 11 days ago - Pushed: over 1 year ago - Stars: 11 - Forks: 2

roverbird/corpus_utils

Semantic word relations analysis and visualization for corpus linguistics and NLP

Language: R - Size: 28.4 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

timarkh/tsakorpus

Yet another search platform for linguistic corpora.

Language: Python - Size: 3.28 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 16 - Forks: 12

scriptin/kanji-frequency

Kanji usage frequency data collected from various sources

Language: Astro - Size: 2.1 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 116 - Forks: 17

lennes/spect

SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/

Language: HTML - Size: 275 KB - Last synced: 3 months ago - Pushed: 9 months ago - Stars: 50 - Forks: 12

oscar-project/goclassy 📦

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

Language: Go - Size: 377 KB - Last synced: 28 days ago - Pushed: about 3 years ago - Stars: 84 - Forks: 6

stcoats/zipf_explorer

A tool for the visualization of word frequency differences.

Language: Python - Size: 34.1 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

praaline/Praaline

Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora

Language: C - Size: 147 MB - Last synced: 5 months ago - Pushed: over 1 year ago - Stars: 26 - Forks: 4

timarkh/vk-texts-harvester

Harvest texts from vk.com through API.

Language: Python - Size: 11.7 KB - Last synced: 5 months ago - Pushed: about 4 years ago - Stars: 1 - Forks: 0

ITSec-Uni-Munster/Bilingual-Longitudinal-Analysis-of-Privacy-Policies

This repository contains the code for the PETS 2024.2 paper titled "A Bilingual Longitudinal Analysis of Privacy Policies Measuring the Impacts of the GDPR and the CCPA/CPRA"

Language: Python - Size: 47.9 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

quadrama/DramaAnalysis

An R package for analysis of dramatic texts

Language: R - Size: 36.3 MB - Last synced: 27 days ago - Pushed: over 1 year ago - Stars: 15 - Forks: 2

digitallinguistics/data-explorer

The DLx portal for viewing, searching, and aggregating data

Language: JavaScript - Size: 7.66 MB - Last synced: about 1 month ago - Pushed: 10 months ago - Stars: 3 - Forks: 0

partigabor/phd-thesis

Ph.D. thesis of Gábor Parti, 2023

Language: TeX - Size: 441 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 2 - Forks: 0

gowribhat/sms-corpus-keyword-analysis

Language: Jupyter Notebook - Size: 144 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

jenniferwagner18/telenovela-transcripts

Analyze language used in Spanish-language novelas using corpus linguistics tools

Language: Python - Size: 5.86 KB - Last synced: 5 months ago - Pushed: almost 2 years ago - Stars: 2 - Forks: 0

alex-raw/imsdb_parse

Parse movie scripts for linguistic analysis

Language: Python - Size: 161 KB - Last synced: 5 months ago - Pushed: about 2 years ago - Stars: 2 - Forks: 0

LanguageMachines/PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Language: Python - Size: 4.26 MB - Last synced: 7 days ago - Pushed: over 1 year ago - Stars: 46 - Forks: 6

gederajeg/applicative-buy

R Notebook and Dataset for a corpus-based study of Indonesian BUY verbs in applicative construction with -KAN (published in NUSA special issue on Applicative Construction)

Language: HTML - Size: 1.89 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

stewieboomhauer/IVK-Ler-Corpus

The data and code located in this repository introduce an international preparatory class learner corpus and its complexity analyses.

Language: R - Size: 31.1 MB - Last synced: 6 months ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

AustinZuniga/Filipino-wordlist

Filipino wordlist word-level

Language: Python - Size: 41.9 MB - Last synced: 6 months ago - Pushed: over 5 years ago - Stars: 6 - Forks: 0

julienijs/Linguistic_Complexity_and_Gender

Does linguistic complexity correlate with gender?

Language: R - Size: 1.97 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

JonathanReeve/corpus-db

A textual corpus database for the digital humanities.

Language: Jupyter Notebook - Size: 26 MB - Last synced: 15 days ago - Pushed: almost 4 years ago - Stars: 57 - Forks: 8

JiashuWu/Books

My book list

Size: 4.36 GB - Last synced: 6 months ago - Pushed: almost 2 years ago - Stars: 295 - Forks: 222

digitallinguistics/DFT

Discourse Functional Transcription

Size: 23.4 KB - Last synced: about 1 month ago - Pushed: about 4 years ago - Stars: 2 - Forks: 1

dimboump/compare-clefts-ukmp

Code for final assignment for Corpus Studies course at the University of Antwerp (2022)

Language: Jupyter Notebook - Size: 1.65 MB - Last synced: 7 months ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0

hexatomic/hexatomic

Hexatomic is an extensible software for deep multi-layer annotation of linguistic corpora

Language: Java - Size: 22.8 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 12 - Forks: 6

seanbethard/corpuswork

Corpuswork

Language: Jupyter Notebook - Size: 2.09 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 1 - Forks: 0

STRZGR/Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit

My solutions to selected exercises to "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.

Language: Jupyter Notebook - Size: 9.75 MB - Last synced: 7 months ago - Pushed: over 4 years ago - Stars: 43 - Forks: 34

kbatsuren/CogNet

CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates

Size: 88.7 MB - Last synced: 7 months ago - Pushed: 11 months ago - Stars: 32 - Forks: 8

gederajeg/happyr

The accompanying R package for Rajeg's (2019) PhD thesis titled "Metaphorical profiles and near-synonyms: A corpus-based study of Indonesian words for Happiness"

Language: R - Size: 3.16 MB - Last synced: 7 months ago - Pushed: over 2 years ago - Stars: 3 - Forks: 0

dterg/biomedical_corpora

Table compiling the list of biomedically-related corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.

Size: 21.5 KB - Last synced: 7 months ago - Pushed: about 6 years ago - Stars: 17 - Forks: 4

Related Keywords

corpus-linguistics (295), corpus (72), linguistics (63), nlp (60), natural-language-processing (45), corpus-tools (29), corpora (27), corpus-data (27), python (25), computational-linguistics (19), digital-humanities (18), corpus-processing (16), text-mining (15), nltk (13), r (12), indonesian-language (12), corpus-builder (9), dataset (8), language (8), text-analysis (7), construction-grammar (7), dlx (7), text-processing (7), sentiment-analysis (7), annotation (7), digital-linguistics (7), nlp-machine-learning (7), python3 (7), indonesian (6), udayana-university (6), linguistik-korpus (6), syntax (6), language-documentation (6), learner-corpus (6), corpus-analysis (6), cognitive-linguistics (6), indonesian-linguistics (6), usage-based-linguistics (6), data-science (6), languages-of-russia (5), natural-language-understanding (5), machine-learning (5), corpus-generator (5), tagger (5), open-science (5), quantitative-corpus-linguistics (5), crawler (5), translation (4), twitter (4), frequency (4), text-corpus (4), machine-translation (4), ngrams (4), topic-modeling (4), corpus-search (4), frequency-lists (4), cwb (4), data (4), nlp-datasets (4), linguistic-analysis (4), named-entity-recognition (4), linguistic-complexity (4), text-classification (4), leipzig-corpora-collection (4), information-retrieval (4), data-mining (4), word2vec (3), historical-linguistics (3), n-grams (3), bigrams (3), deep-learning (3), language-model (3), wikipedia-corpus (3), common-crawl (3), wikipedia (3), database (3), linguistik-kognitif (3), russian (3), morphology (3), typology (3), parallel-corpora (3), visualization (3), descriptive-linguistics (3), brown-corpus (3), interlinear-gloss (3), neo-aramaic (3), distinctive-collexeme-analysis (3), bahasa-indonesia (3), happiness-metaphors (3), scraper (3), emotion-metaphors (3), speech (3), spacy (3), literature (3), stoplist (3), lexicography (3), r-programming (3), english (3), stopwords (3), complexico (3)