An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: corpus-generator

johentsch/ms3

A parser for annotated MuseScore 3 files.

Language: Python - Size: 105 MB - Last synced at: 2 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 3

user1342/AutoCorpus

AutoCorpus is a tool backed by a large language model (LLM) for automatically generating corpus files for fuzzing.

Language: Python - Size: 390 KB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 70 - Forks: 10

writecrow/crow_backend

The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing

Language: PHP - Size: 2.41 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

biomedicalinformaticsgroup/cadmus

A full-text article retrieval pipeline for biomedical literature.

Language: Python - Size: 271 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 20 - Forks: 2

bitextor/bitextor

Bitextor generates translation memories from multilingual websites

Language: Python - Size: 177 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 290 - Forks: 43

patasmith/corpusmaker

Create a corpus for fine-tuning an OpenAI model

Language: Python - Size: 201 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

uma-pi1/OPIEC-pipeline

Language: Java - Size: 59.3 MB - Last synced at: 6 days ago - Pushed at: about 3 years ago - Stars: 14 - Forks: 2

mohabmes/Sinai-corpus

A clean Fusha Arabic tagged corpus.

Language: Python - Size: 53.9 MB - Last synced at: 20 days ago - Pushed at: over 4 years ago - Stars: 10 - Forks: 0

Pendulun/WebCrawler

Language: Python - Size: 648 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

apple-fritter/weechat.driftwood

Natively log WeeChat channel and private messages, CTCP, and notices, in the driftwood standard. Written in Python.

Language: Python - Size: 31.3 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

apple-fritter/scrimshaw

Scrimshaw parses IRC logs stored in the driftwood format for quotes attributable to a given user. Written in Rust.

Language: Rust - Size: 112 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

phueb/MissingAdjunct

Generate pseudo-English sentences for research in semantic composition

Language: Python - Size: 18.7 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

FerreroJeremy/Plagiarized-Corpus-Generator

A corpus builder for evaluation of plagiarism detection tools

Language: PHP - Size: 242 KB - Last synced at: about 2 years ago - Pushed at: over 8 years ago - Stars: 2 - Forks: 0

felipetovarhenao/exquisitecorpus

A set of corpus-based sampling & analysis M4L devices

Language: Max - Size: 22.8 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 7 - Forks: 1

ibnmalik/golden-corpus-arabic

Golden Arabic corpus built for testing Assem's arabicstemmer and other Arabic stemmers

Language: Python - Size: 17.6 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 6 - Forks: 8

saganoren/ukr-twi-corpus

A corpus of Ukrainian Twitter texts + instructions for downloading and filtering texts.

Language: Jupyter Notebook - Size: 49.5 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 11 - Forks: 3

thecsw/katya-dev

Katya, or The Liberated Corpus: a text corpus that allows you to request and scrape any web resource!

Language: Go - Size: 22.8 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 4 - Forks: 0

phueb/GroundedLang

A prototype for generating language in a grounded simulation of a simple hunter-gatherer world

Language: Python - Size: 139 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

ishalyminov/babi_tools

Augmentation scripts for the bAbI Dialog Tasks dataset

Language: Python - Size: 101 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 14 - Forks: 5