An open API service providing repository metadata for many open source software ecosystems.

Topic: "parallel-corpus"

NiuTrans/Classical-Modern

A very comprehensive Classical Chinese (ancient text) to Modern Chinese parallel corpus

Language: Python - Size: 400 MB - Last synced at: about 23 hours ago - Pushed at: about 1 year ago - Stars: 1,331 - Forks: 306

kirralabs/indonesian-NLP-resources

Data resources for Indonesian NLP

Size: 7.81 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 228 - Forks: 50

csebuetnlp/banglanmt

This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.

Language: Python - Size: 2.05 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 144 - Forks: 45

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

Language: Python - Size: 4.32 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 104 - Forks: 20

bfsujason/bertalign

Multilingual sentence alignment using sentence embeddings

Language: Python - Size: 296 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 97 - Forks: 44

FerreroJeremy/Cross-Language-Dataset

A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection

Size: 657 MB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 60 - Forks: 25

tsuruoka-lab/BSD

The Business Scene Dialogue corpus

Size: 2.91 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 55 - Forks: 6

matbahasa/TALPCo

TUFS Asian Language Parallel Corpus

Language: TeX - Size: 1.57 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 48 - Forks: 13

sharad461/nepali-translator

Neural Machine Translation on the Nepali-English language pair

Language: Python - Size: 3.85 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 47 - Forks: 16

ShathaTm/LK-Hadith-Corpus

Leeds University and King Saud University (LK) Hadith Corpus

Language: Python - Size: 13.4 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 38 - Forks: 18

asmelashteka/HornMT

Machine translation (MT) benchmark dataset for languages in the Horn of Africa.

Size: 3.91 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 37 - Forks: 11

Caucasus-Rosetta/Lingua-Corpus

Multilingual and monolingual corpora focused on Caucasus languages, for Natural Language Processing (NLP)

Language: Python - Size: 25.2 GB - Last synced at: 8 days ago - Pushed at: 6 months ago - Stars: 35 - Forks: 6

Kartikaggarwal98/Indian_ParallelCorpus

Curated list of publicly available parallel corpora for Indian languages

Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 28 - Forks: 1

BramVanroy/astred

An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. Useful, for instance, for comparing a translation with the original text, finding differences and similarities between two translations, or seeing how a machine translation differs from a reference translation.

Language: Python - Size: 257 KB - Last synced at: 7 days ago - Pushed at: over 3 years ago - Stars: 22 - Forks: 0

korenyoni/opus-api

OPUS (opus.nlpl.eu) Python3 API

Language: Python - Size: 117 KB - Last synced at: 29 days ago - Pushed at: 6 months ago - Stars: 18 - Forks: 5

priyanshu2103/Sanskrit-Hindi-Machine-Translation

Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning

Language: Jupyter Notebook - Size: 9.69 MB - Last synced at: 10 months ago - Pushed at: over 4 years ago - Stars: 16 - Forks: 12

KurdishBLARK/InterdialectCorpus

A parallel corpus of Sorani, Kurmanji and English

Size: 21.7 MB - Last synced at: 16 days ago - Pushed at: over 4 years ago - Stars: 13 - Forks: 3

cfiltnlp/IITB-English-Hindi-PC

The IIT Bombay English-Hindi Parallel Corpus

Language: Jupyter Notebook - Size: 501 KB - Last synced at: almost 2 years ago - Pushed at: about 3 years ago - Stars: 11 - Forks: 2

YerevaNN/PARASITE

🪱 PARASITE || A parallel sentence data preprocessing toolkit. Originally developed as a part of the `en-ru` winner submission of WMT20 Biomedical Translation Task.

Language: Python - Size: 426 KB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 11 - Forks: 6

tanloong/interlaced.nvim

Neovim plugin for aligning bilingual parallel texts

Language: Lua - Size: 250 KB - Last synced at: 16 days ago - Pushed at: 28 days ago - Stars: 10 - Forks: 0

spraakbanken/swell-editor

Editor for normalising learner texts (error annotation and tagging).

Language: TypeScript - Size: 4.18 MB - Last synced at: 23 days ago - Pushed at: almost 2 years ago - Stars: 9 - Forks: 3

soumendrak/MTEnglish2Odia 📦

Machine Translation from English to Odia language.

Language: Jupyter Notebook - Size: 62.6 MB - Last synced at: 10 months ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 7

michmech/irish-sentence-bank

4,500 sentences in Irish, tokenized, manually lemmatized, translated into English.

Size: 400 KB - Last synced at: 3 months ago - Pushed at: about 7 years ago - Stars: 9 - Forks: 1

stibiumghost/tajik-to-persian-transliteration

Tajik-to-Persian transliteration project

Language: Jupyter Notebook - Size: 45.6 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 8 - Forks: 1

farshadjafari/parallel_corpus_generator

Python application that generates parallel corpora for any language pair; can be used for training NMT (Neural Machine Translation) systems

Language: Python - Size: 33.8 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

UUDigitalHumanitieslab/timealign

Parallel corpus annotation and visualization

Language: Python - Size: 6.16 MB - Last synced at: 17 days ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

UUDigitalHumanitieslab/perfectextractor

Extracting present perfects (and related forms) from parallel corpora

Language: Python - Size: 1.26 MB - Last synced at: 17 days ago - Pushed at: over 2 years ago - Stars: 7 - Forks: 2

rggdmonk/hadal

A simple and efficient tool for mining and aligning sentences with pre-trained models.

Language: Python - Size: 680 KB - Last synced at: 19 days ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 0

shashanksiripragada/pib-crawl

Code to extract multilingual parallel corpus from Press Information Bureau (PIB) website.

Language: Python - Size: 1.03 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 4

x39826/Pali_Tripitaka

Pali Buddhist scriptures of 15 countries and their parallel corpus

Language: Python - Size: 56.6 KB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 6 - Forks: 4

PyThaiNLP/Thai-Lao-Parallel-Corpus

Thai Lao Parallel corpus

Size: 524 KB - Last synced at: 3 months ago - Pushed at: over 3 years ago - Stars: 5 - Forks: 1

tldr-pages/tldr-translation-pairs-gen

Generates a structured dataset in various formats derived from tldr-pages.

Language: TypeScript - Size: 365 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 4 - Forks: 3

Deeptiman/php-dom-parser-translation-tool

A simple DOM parser and translation tool using PHP, HTML, and MySQL. The translation model supports English to Odia. A built-in dictionary supports the translation.

Language: PHP - Size: 4.62 MB - Last synced at: about 2 months ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

OdiaWikimedia/English-Odia 📦

English to Odia/Oriya parallel corpus of phrases

Language: Jupyter Notebook - Size: 4 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 4 - Forks: 4

x39826/Multilang_Translator_For_Pali_Tripitaka

Parallel corpus and multilingual machine translation system for the Pali Buddhist scriptures (Pali Tripitaka) of 15 countries

Language: Python - Size: 43.7 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 1

time-in-translation/preprocess-corpora

Creating (parallel) corpora from scratch using Uplug tooling

Language: Python - Size: 770 KB - Last synced at: 5 days ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 1

gederajeg/constructional-equivalence

Repository of supplementary materials and RStudio project for the paper on corpus-based approach to measuring constructional equivalence.

Language: TeX - Size: 2.53 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

tm4roon/jawikinews-headline-dataset

A parallel corpus of article-headline pairs obtained from Japanese Wikinews.

Language: Jupyter Notebook - Size: 7.9 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

davidwarrior22/machine-translation-for-african-languages

This repository focuses on developing machine translation and NLP tools specifically for African languages. Join us in addressing the challenges and opportunities in this vital area of language technology! 🛠️🌍

Language: TeX - Size: 177 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

AlexsandroANP/heiwukong-multilingual-corpora

"Hei Wukong" (Black Wukong) multilingual (parallel) corpus, including the implementation code and a Streamlit search project

Language: Python - Size: 5.52 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/5310000-Groups-Chinese-Germany-Parallel-Corpus-Data

5310000-Groups-Chinese-Germany-Parallel-Corpus-Data

Size: 1.95 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/9830000-Groups-Chinese-Japanese-Parallel-Corpus-Data

Chinese-Japanese Parallel Corpus Data

Size: 1.08 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/6020000-Groups-Chinese-French-Parallel-Corpus-Data

6020000-Groups-Chinese-French-Parallel-Corpus-Data

Size: 1.15 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1000000-Groups-Chinese-Russian-Parallel-Corpus-Data

1000000-Groups-Chinese-Russian-Parallel-Corpus-Data

Size: 1.16 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1080000-Groups-English-Russian-Parallel-Corpus-Data

English and Russian parallel corpus

Size: 168 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/7440000-Groups-Chinese-Hindi-Parallel-Corpus-Data

7440000-Groups-Chinese-Hindi-Parallel-Corpus-Data

Size: 189 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/10030000-Groups-Chinese-Portuguese-Parallel-Corpus-Data

10030000-Groups-Chinese-Portuguese-Parallel-Corpus-Data

Size: 235 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/7290000-Groups--Chinese--Vietnamese-Parallel-Corpus-Data

7290000-Groups-Chinese-Vietnamese-Parallel-Corpus-Data

Size: 219 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/750000-Groups-Chinese-Burmese-Parallel-Corpus-Data

750000-Groups-Chinese-Burmese-Parallel-Corpus-Data

Size: 128 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/4720000-Groups-Chinese-Uighur-Parallel-Corpus-Data

4720000-Groups-Chinese-Uighur-Parallel-Corpus-Data

Size: 153 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/850000-Groups-English-Japanese-Parallel-Corpus-Data

English Japanese Parallel Corpus Data

Size: 187 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/3140000-Groups-Chinese-Spanish-Parallel-Corpus-Data

3140000-Groups-Chinese-Spanish-Parallel-Corpus-Data

Size: 223 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/12820000-Groups-Chinese-Korean-Parallel-Corpus-Data

Chinese-Korean-Parallel-Corpus-Data

Size: 147 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1140000-Groups-Chinese-Hebrew-Parallel-Corpus-Data

1140000-Groups-Chinese-Hebrew-Parallel-Corpus-Data

Size: 158 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/5010000-Groups-Chinese-Tibetan-Parallel-Corpus-Data

5010000-Groups-Chinese-Tibetan-Parallel-Corpus-Data

Size: 142 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/APY230328001_980000-Groups-Chinese-Urdu-Parallel-Corpus-Data

APY230328001_980000-Groups-Chinese-Urdu-Parallel-Corpus-Data

Size: 186 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1980000-Groups-Chinese-Polish-Parallel-Corpus-Data

1980000-Groups-Chinese-Polish-Parallel-Corpus-Data

Size: 257 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/3060000-Groups-Chinese-English-Parallel-Corpus-Data

Chinese-English-Parallel-Corpus-Data

Size: 141 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/100000-Groups-Chinese-Uighur-Parallel-Corpus-Data

100000-Groups-Chinese-Uighur-Parallel-Corpus-Data

Size: 1000 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/380000-Groups-Japanese-English-Parallel-Corpus-Data

Japanese and English parallel corpus

Size: 1.57 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1340000-Groups-English-Korean-Parallel-Corpus-Data

English and Korean parallel corpus

Size: 161 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Nexdata-AI/380000-Groups-Uighur-Chinese-Parallel-Corpus-Data

380000-Groups-Uighur-Chinese-Parallel-Corpus-Data

Size: 1020 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

umoqnier/Esquite (fork of ElotlMX/Esquite)

Framework for parallel corpora

Language: Python - Size: 6.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

xavier-gz/SLI_Galician_Corpora

Corpora for Galician language

Size: 365 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

time-in-translation/ParTy2OPUS

ParTy2OPUS converts documents from the ParTy corpus to the OPUS format

Language: Python - Size: 3.91 KB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

ikros98/ULMFiT-for-Italian

ULMFiT model for the Italian language / creation of a parallel corpus

Language: Jupyter Notebook - Size: 13.7 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

OdiaNLP/wikipedia-corpus 📦

Odia Wikipedia monolingual corpus extraction

Language: Python - Size: 1.02 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

mrsumitbd/SOParallelCorpusReplication

Replication package for SO processing for bitext

Language: Python - Size: 434 KB - Last synced at: almost 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 1

CasparChou/srt2corpus

Helps you convert SRT files into a CN-? parallel corpus

Language: JavaScript - Size: 10.7 KB - Last synced at: about 1 month ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0

erayyildiz/parallel-sentence-quality-filter

Parallel sentence quality filter based on text classification methods

Language: Perl - Size: 1.21 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0

kscanne/ccgb

Backend DB for Corpas Comhthreomhar Gaeilge-Béarla, an Irish-English parallel corpus

Language: HTML - Size: 70.3 KB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 0 - Forks: 0

Related Topics
machine-translation (36), nlp (29), natural-language-processing (14), corpus (12), text (7), dataset (6), parallel-corpora (6), neural-machine-translation (5), machine-learning (4), odia-language (4), uighur (3), english (3), odia (3), alignment (3), corpus-linguistics (3), sentence-alignment (3), low-resource-machine-translation (2), vietnamese (2), machine-translation-data-processing (2), low-resource-languages (2), deep-learning (2), preprocessing (2), english-translation (2), corpus-tools (2), japanese (2), tigrinya (2), monolingual-corpora (2), statistical-machine-translation (2), corpora (2), indonesian (2), named-entity-recognition (2), sentiment-analysis (2), africa (2), pali-tripitaka (2), gaeilge (2), irish (2), translation (2), hindi (2), burmese (2), ethiopia (2), horn-of-africa (2), hebrew (1), tldr-pages (1), spanish-translation (1), russian (1), french (1), germany (1), lemmatization (1), nlp-library (1), parallel-sentence-mining (1), apache (1), corpus-tool (1), dom-parser (1), persian (1), second-language-acquisition (1), sla (1), swell (1), swell-editor (1), construction-grammar (1), constructionist-approach (1), english-indonesian-translation (1), open-code (1), open-data (1), open-science (1), open-subtitle (1), quantitative-linguistics (1), r-programming (1), r-programming-projects (1), translation-equivalence (1), translation-studies (1), udayana-university (1), universitas-udayana (1), verbal-near-synonyms (1), language-translation (1), tibetan (1), urdu (1), hacktoberfest (1), tldr (1), nvim-plugin (1), linguistics (1), parsing (1), spacy (1), stanza (1), corpus-processing (1), api (1), corporate (1), language-model (1), opus (1), python (1), african-languages (1), datascience (1), deep-neural-networks (1), javascript (1), jupyter-notebook (1), kaggle-dataset (1), multilingaul (1), nueral-machine-translation (1), transfer-learning (1), wmt2022 (1), linguist (1)