Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: corpus-linguistics
mshakirDr/MFTE
MFTE (Multi Feature Tagger of English) Python is the Python version based on Le Foll's MFTE written in Perl. It is extended to include semantic tags from Biber (2006) and Biber et al. (1999), including other specific tags.
Language: Python - Size: 23.7 MB - Last synced: 36 minutes ago - Pushed: about 2 hours ago - Stars: 14 - Forks: 2
BLKSerene/Wordless
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Language: Python - Size: 66.7 MB - Last synced: about 15 hours ago - Pushed: about 17 hours ago - Stars: 672 - Forks: 88
tanloong/interlaced.nvim
Neovim plugin for aligning bilingual parallel texts
Language: Lua - Size: 41 KB - Last synced: about 21 hours ago - Pushed: about 22 hours ago - Stars: 5 - Forks: 0
kuhumcst/texton
Text Tonsorium - a toolbox that automatically arranges NLP tools in workflows and enacts them with the user's input
Language: PHP - Size: 7.72 MB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 4 - Forks: 0
tomachalek/vertigo
A corpus vertical file parser
Language: Go - Size: 80.1 KB - Last synced: 2 days ago - Pushed: 3 days ago - Stars: 1 - Forks: 0
complexico/anger-mad-coca
A repository of R code and data for a paper titled "Exploring grammatical and semantic profiles of ANGRY and MAD: A corpus-based study". The paper uses data from the Corpus of Contemporary American English (COCA) as part of the undergraduate thesis project by Ida Ayu Saskara Tranggana Suari, supervised by Prof. I N. Sudipa and Gede Rajeg, PhD.
Language: R - Size: 2.09 MB - Last synced: 1 day ago - Pushed: 2 days ago - Stars: 0 - Forks: 0
CambridgeSemiticsLab/BH_time_collocations
Data for PhD Thesis: A Collocational Analysis of Biblical Hebrew Time Phrases
Language: Jupyter Notebook - Size: 486 MB - Last synced: 3 days ago - Pushed: 4 days ago - Stars: 6 - Forks: 0
engisalor/sketch-grammar-explorer
A Python package for the Sketch Engine API
Language: Python - Size: 293 KB - Last synced: 6 days ago - Pushed: 6 days ago - Stars: 5 - Forks: 0
fau-klue/pandas-association-measures
Statistical association measures for Python pandas
Language: Python - Size: 923 KB - Last synced: 7 days ago - Pushed: 5 months ago - Stars: 8 - Forks: 1
kirralabs/indonesian-NLP-resources
Data resources for Indonesian NLP
Size: 7.81 KB - Last synced: about 9 hours ago - Pushed: over 3 years ago - Stars: 221 - Forks: 50
czcorpus/kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
Language: TypeScript - Size: 37 MB - Last synced: 7 days ago - Pushed: 8 days ago - Stars: 59 - Forks: 22
suomela/types3
types3: Type accumulation curves
Language: Rust - Size: 1.75 MB - Last synced: 8 days ago - Pushed: 8 days ago - Stars: 1 - Forks: 0
oroszgy/awesome-hungarian-nlp
A curated list of NLP resources for Hungarian
Size: 110 KB - Last synced: 3 days ago - Pushed: 7 months ago - Stars: 208 - Forks: 18
complexico/verb-noun-assoc-corpus-experiment
Repository of data and results for an undergraduate thesis titled "A Corpus-Based Study to Triangulating Experimental Evidence Regarding Verb-Noun Association for Action Verbs" by I Gede Semara Dharma Putra.
Size: 787 KB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 0 - Forks: 0
c0ntradicti0n/CorpusCookApp
App and scripts that work with the corpus builder CorpusCook to update a corpus with corrections of wrong predictions
Language: Python - Size: 181 MB - Last synced: 9 days ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0
elenlefoll/TextbookMDA
Repository of the online supplements to "Textbook English: A Multi-Dimensional Approach" (Le Foll, to appear).
Language: TeX - Size: 289 MB - Last synced: 10 days ago - Pushed: 10 days ago - Stars: 0 - Forks: 0
adbar/German-NLP
Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German
Size: 103 KB - Last synced: 10 days ago - Pushed: 7 months ago - Stars: 405 - Forks: 57
oscar-project/ungoliant
The pipeline for the OSCAR corpus
Language: Rust - Size: 4.72 MB - Last synced: about 5 hours ago - Pushed: 5 months ago - Stars: 150 - Forks: 14
hermanpetrov/Keyword_search
An extractor of keywords for Estonian texts.
Language: Python - Size: 50.6 MB - Last synced: 13 days ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0
complexico/dipscorling2024
Repository for materials to be delivered at the Diponegoro Summer Course in Corpus Linguistics (DipSCORLING 2024) (22 - 27 July 2024).
Size: 2.28 MB - Last synced: 13 days ago - Pushed: 14 days ago - Stars: 0 - Forks: 0
CompLin/nheengatu
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
Language: Python - Size: 6.04 MB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 6 - Forks: 1
Ohara124c41/MLNDT-Beta-Plagiarism_Detection
First machine learning project for beta testing (Udacity MLND-T). The students will be utilizing n-grams and associations to detect plagiarized essay submissions.
Language: Jupyter Notebook - Size: 448 KB - Last synced: 16 days ago - Pushed: over 5 years ago - Stars: 1 - Forks: 0
RenanKummer/ufrgs-exatolp-webapi
Web APIs for corpus linguistic research in Brazilian Portuguese
Language: C# - Size: 1020 KB - Last synced: 16 days ago - Pushed: 17 days ago - Stars: 0 - Forks: 0
esbudylin/rusrime
a terminal tool for searching rhymes within the Russian National Corpus
Language: Python - Size: 43 KB - Last synced: 19 days ago - Pushed: 19 days ago - Stars: 0 - Forks: 0
notesjor/CorpusExplorer.Terminal.Console
Allows other programs/programming languages to access the analyses/data of CorpusExplorer v2.0
Language: C# - Size: 596 KB - Last synced: 21 days ago - Pushed: 21 days ago - Stars: 7 - Forks: 0
julienijs/keep_V-ing
The grammaticalization of keep
Language: R - Size: 5 MB - Last synced: 22 days ago - Pushed: 23 days ago - Stars: 0 - Forks: 0
google/corpuscrawler
Crawler for linguistic corpora
Language: Python - Size: 487 KB - Last synced: 9 days ago - Pushed: 5 months ago - Stars: 181 - Forks: 56
writecrow/crow_backend
The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing
Language: PHP - Size: 2.72 MB - Last synced: 23 days ago - Pushed: 24 days ago - Stars: 1 - Forks: 0
KurdishBLARK/KTC
Kurdish Textbooks Corpus
Size: 1.92 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 6 - Forks: 0
jacorread/jacorread.github.io
Alejandro Correa
Language: SCSS - Size: 115 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 0 - Forks: 0
lin-380-s24/lin-380-s24.github.io
Course site
Language: JavaScript - Size: 623 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 0 - Forks: 0
ispasic/idiometry
An idiom search engine
Language: JavaScript - Size: 776 KB - Last synced: 27 days ago - Pushed: over 1 year ago - Stars: 6 - Forks: 1
IngoKl/PyXMLConc
A very simple concordancer with XML support.
Language: Python - Size: 35.2 KB - Last synced: 29 days ago - Pushed: over 3 years ago - Stars: 0 - Forks: 1
sorinmarti/fruechtekorb
This is a text corpus management system for the German linguistics department of the University of Basel.
Language: PHP - Size: 531 KB - Last synced: about 1 month ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0
Superar/Puntuguese
Language: Python - Size: 3.61 MB - Last synced: 30 days ago - Pushed: about 1 month ago - Stars: 3 - Forks: 0
kmkurn/id-nlp-resource
A list of Indonesian NLP resources.
Size: 38.1 KB - Last synced: 26 days ago - Pushed: over 2 years ago - Stars: 266 - Forks: 48
fbkarsdorp/concy
Simple Concordance Tool
Language: Python - Size: 1.95 KB - Last synced: about 1 month ago - Pushed: over 6 years ago - Stars: 2 - Forks: 0
UUDigitalHumanitieslab/I-analyzer
The great textmining tool that obviates all others
Language: Python - Size: 48.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 6 - Forks: 1
snizio/italian-wiktionary-parser
This repository contains a Python script for parsing an XML dump of the Italian Wiktionary (Wikizionario); it also contains the parsed dictionary in a JSON file and an ONLI (Italian database of neologisms) scraper with the scraped data in a CSV file
Language: Python - Size: 137 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 5 - Forks: 0
acqdiv/acqdiv
Pipeline for the ACQDIV Corpus Database
Language: Python - Size: 2.59 MB - Last synced: 3 days ago - Pushed: over 3 years ago - Stars: 2 - Forks: 3
natasha/nerus
Large silver-standard Russian corpus with NER, morphology and syntax markup
Language: Python - Size: 9.62 MB - Last synced: 22 days ago - Pushed: 10 months ago - Stars: 58 - Forks: 9
cboulanger/ltkg-tools
Legal Theory Knowledge Graph Project - Tools and Resources
Size: 114 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0
islamAndAi/QURAN-NLP
Quran, Hadith, Translations, Tafaseer, Corpus Linguistics. Everything for NLP
Language: Jupyter Notebook - Size: 105 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 42 - Forks: 9
JorgeFCS/multimodal-annotation-distance
A tool for determining distances between multimodal annotations.
Language: Python - Size: 485 KB - Last synced: about 1 month ago - Pushed: 7 months ago - Stars: 0 - Forks: 1
louisowen6/NLP_bahasa_resources
A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia
Size: 258 KB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 427 - Forks: 118
isabel-mm/stylo-r-novels
R+Python code for stylometric analysis on a corpus of Anglophone novels.
Language: Python - Size: 18.4 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 1 - Forks: 0
CLARIAH/wp6-missieven
General Missives in Text-Fabric
Language: Jupyter Notebook - Size: 279 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 5 - Forks: 2
faktorovich/Attribution
Computational-Linguistics Attribution Data
Size: 150 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 2 - Forks: 0
OpenCorpora/opencorpora
A web-based engine for creating and annotating textual corpora
Language: PHP - Size: 5.93 MB - Last synced: 29 days ago - Pushed: 9 months ago - Stars: 237 - Forks: 23
nikhil-iyer-97/Language-Identifier
Language identification toolkit for identifying what language a document is written in
Language: Python - Size: 7.65 MB - Last synced: about 2 months ago - Pushed: almost 3 years ago - Stars: 5 - Forks: 1
notesjor/corpusexplorer2.0
Corpus linguistics has never been this easy...
Language: C# - Size: 32.1 MB - Last synced: 17 days ago - Pushed: 3 months ago - Stars: 19 - Forks: 2
sylvainloiseau/igtcorpus
Tools for IGT (interlinear glossed texts).
Language: Python - Size: 373 KB - Last synced: about 2 months ago - Pushed: 11 months ago - Stars: 1 - Forks: 0
pasmod/simurg
A Dataset for Training and Testing Abstractive Summarizers
Language: Python - Size: 9.25 MB - Last synced: about 2 months ago - Pushed: about 7 years ago - Stars: 3 - Forks: 1
bdar-lab/heb_architecture_corpus
Cleaned, parsed, and analyzed Hebrew textual corpus of documents pertaining to construction, planning, and architecture.
Size: 3.35 GB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 4 - Forks: 0
magizbox/scraper
Scraper
Language: Python - Size: 74.8 MB - Last synced: 2 months ago - Pushed: over 5 years ago - Stars: 13 - Forks: 7
OliverHellwig/sanskrit
Data for the quantitative study of (Vedic) Sanskrit
Language: Python - Size: 1010 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 103 - Forks: 41
interrogator/conll-df
CONLL-U to Pandas DataFrame
Language: Python - Size: 14.6 KB - Last synced: 18 days ago - Pushed: over 6 years ago - Stars: 29 - Forks: 9
johnwdubois/rezonator
Rezonator: Dynamics of human engagement
Language: Yacc - Size: 4.36 GB - Last synced: 2 months ago - Pushed: about 1 year ago - Stars: 34 - Forks: 1
KMCS-NII/AASC
AASC: ACL Anthology Sentence Corpus
Language: Perl - Size: 17.6 KB - Last synced: 17 days ago - Pushed: over 3 years ago - Stars: 21 - Forks: 2
PyThaiNLP/thai-law
Thai Law Dataset (Act of Parliament)
Language: Jupyter Notebook - Size: 11.3 MB - Last synced: 16 days ago - Pushed: almost 3 years ago - Stars: 14 - Forks: 4
engisalor/quartz
An app for visualizing Sketch Engine API data
Language: Python - Size: 715 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0
uma-pi1/OPIEC-pipeline
Language: Java - Size: 59.3 MB - Last synced: 10 days ago - Pushed: about 2 years ago - Stars: 14 - Forks: 2
Alex-bzh/compuling
Resources to learn how to manage corpora with Python.
Language: Jupyter Notebook - Size: 15.8 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 2 - Forks: 0
gederajeg/diatesis-bahasa-indonesia
Repository of data and programming code for the book chapter titled "Kajian korpus kuantitatif terhadap aspek-aspek diatesis dalam bahasa Indonesia" (A quantitative corpus study of aspects of voice in Indonesian), which is part of the corpus-based book Tatabahasa Bahasa Indonesia Kontemporer (TBIK). The full manuscript can be accessed online.
Language: HTML - Size: 4.64 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0
IngoKl/textdirectory
TextDirectory allows you to filter, transform, and combine multiple text files into one aggregated file.
Language: Python - Size: 5.9 MB - Last synced: 11 days ago - Pushed: over 1 year ago - Stars: 11 - Forks: 2
roverbird/corpus_utils
Semantic word relations analysis and visualization for corpus linguistics and NLP
Language: R - Size: 28.4 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0
timarkh/tsakorpus
Yet another search platform for linguistic corpora.
Language: Python - Size: 3.28 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 16 - Forks: 12
scriptin/kanji-frequency
Kanji usage frequency data collected from various sources
Language: Astro - Size: 2.1 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 116 - Forks: 17
lennes/spect
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
Language: HTML - Size: 275 KB - Last synced: 3 months ago - Pushed: 9 months ago - Stars: 50 - Forks: 12
oscar-project/goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Language: Go - Size: 377 KB - Last synced: 28 days ago - Pushed: about 3 years ago - Stars: 84 - Forks: 6
stcoats/zipf_explorer
A tool for the visualization of word frequency differences.
Language: Python - Size: 34.1 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0
praaline/Praaline
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
Language: C - Size: 147 MB - Last synced: 5 months ago - Pushed: over 1 year ago - Stars: 26 - Forks: 4
timarkh/vk-texts-harvester
Harvest texts from vk.com through API.
Language: Python - Size: 11.7 KB - Last synced: 5 months ago - Pushed: about 4 years ago - Stars: 1 - Forks: 0
ITSec-Uni-Munster/Bilingual-Longitudinal-Analysis-of-Privacy-Policies
This repository contains the code of the PETS 2024.2 paper titled "A Bilingual Longitudinal Analysis of Privacy Policies Measuring the Impacts of the GDPR and the CCPA/CPRA"
Language: Python - Size: 47.9 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0
quadrama/DramaAnalysis
An R package for analysis of dramatic texts
Language: R - Size: 36.3 MB - Last synced: 27 days ago - Pushed: over 1 year ago - Stars: 15 - Forks: 2
digitallinguistics/data-explorer
The DLx portal for viewing, searching, and aggregating data
Language: JavaScript - Size: 7.66 MB - Last synced: about 1 month ago - Pushed: 10 months ago - Stars: 3 - Forks: 0
partigabor/phd-thesis
Ph.D. thesis of Gábor Parti, 2023
Language: TeX - Size: 441 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 2 - Forks: 0
gowribhat/sms-corpus-keyword-analysis
Language: Jupyter Notebook - Size: 144 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0
jenniferwagner18/telenovela-transcripts
Analyze language used in Spanish-language novelas using corpus linguistics tools
Language: Python - Size: 5.86 KB - Last synced: 5 months ago - Pushed: almost 2 years ago - Stars: 2 - Forks: 0
alex-raw/imsdb_parse
Parse movie scripts for linguistic analysis
Language: Python - Size: 161 KB - Last synced: 5 months ago - Pushed: about 2 years ago - Stars: 2 - Forks: 0
LanguageMachines/PICCL
A set of workflows for corpus building through OCR, post-correction and normalisation
Language: Python - Size: 4.26 MB - Last synced: 7 days ago - Pushed: over 1 year ago - Stars: 46 - Forks: 6
gederajeg/applicative-buy
R Notebook and Dataset for a corpus-based study of Indonesian BUY verbs in applicative construction with -KAN (published in NUSA special issue on Applicative Construction)
Language: HTML - Size: 1.89 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0
stewieboomhauer/IVK-Ler-Corpus
The data and code located in this repository introduce an international preparatory class learner corpus and its complexity analyses.
Language: R - Size: 31.1 MB - Last synced: 6 months ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0
AustinZuniga/Filipino-wordlist
Filipino wordlist word-level
Language: Python - Size: 41.9 MB - Last synced: 6 months ago - Pushed: over 5 years ago - Stars: 6 - Forks: 0
julienijs/Linguistic_Complexity_and_Gender
Does linguistic complexity correlate with gender?
Language: R - Size: 1.97 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0
JonathanReeve/corpus-db
A textual corpus database for the digital humanities.
Language: Jupyter Notebook - Size: 26 MB - Last synced: 15 days ago - Pushed: almost 4 years ago - Stars: 57 - Forks: 8
JiashuWu/Books
My book list
Size: 4.36 GB - Last synced: 6 months ago - Pushed: almost 2 years ago - Stars: 295 - Forks: 222
digitallinguistics/DFT
Discourse Functional Transcription
Size: 23.4 KB - Last synced: about 1 month ago - Pushed: about 4 years ago - Stars: 2 - Forks: 1
dimboump/compare-clefts-ukmp
Code for final assignment for Corpus Studies course at the University of Antwerp (2022)
Language: Jupyter Notebook - Size: 1.65 MB - Last synced: 7 months ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0
hexatomic/hexatomic
Hexatomic is an extensible software for deep multi-layer annotation of linguistic corpora
Language: Java - Size: 22.8 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 12 - Forks: 6
seanbethard/corpuswork
Corpuswork
Language: Jupyter Notebook - Size: 2.09 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 1 - Forks: 0
STRZGR/Natural-Language-Processing-with-Python-Analyzing-Text-with-the-Natural-Language-Toolkit
My solutions to selected exercises to "Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper.
Language: Jupyter Notebook - Size: 9.75 MB - Last synced: 7 months ago - Pushed: over 4 years ago - Stars: 43 - Forks: 34
kbatsuren/CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
Size: 88.7 MB - Last synced: 7 months ago - Pushed: 11 months ago - Stars: 32 - Forks: 8
gederajeg/happyr
The accompanying R package for Rajeg's (2019) PhD thesis titled "Metaphorical profiles and near-synonyms: A corpus-based study of Indonesian words for Happiness"
Language: R - Size: 3.16 MB - Last synced: 7 months ago - Pushed: over 2 years ago - Stars: 3 - Forks: 0
dterg/biomedical_corpora
Table compiling the list of biomedically-related corpora available for named entity recognition (and some also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.
Size: 21.5 KB - Last synced: 7 months ago - Pushed: about 6 years ago - Stars: 17 - Forks: 4
Digital-Pushkin-Lab/RuAdapt
A Parallel Russian-Simple Russian Dataset
Size: 4.99 MB - Last synced: 7 months ago - Pushed: about 1 year ago - Stars: 6 - Forks: 2
ssciwr/argumentation-management
Annotator combining different NLP pipelines.
Language: Python - Size: 3.68 MB - Last synced: 5 days ago - Pushed: 7 months ago - Stars: 0 - Forks: 1
partigabor/corpus
A corpus and computational linguistic workspace
Language: Jupyter Notebook - Size: 228 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 0 - Forks: 0
keeleleek/votic-corpora
Votic language corpora
Language: XQuery - Size: 951 KB - Last synced: 8 months ago - Pushed: over 5 years ago - Stars: 2 - Forks: 0
fau-klue/docker-corpus-tool
Docker Images for IMS Open Corpus Workbench and UCS Toolkit
Language: Dockerfile - Size: 16.6 KB - Last synced: 8 months ago - Pushed: 12 months ago - Stars: 4 - Forks: 1