GitHub topics: parallel-corpus

Repositories

NiuTrans/Classical-Modern

A very comprehensive parallel corpus of Classical Chinese (ancient prose) and Modern Chinese

Language: Python - Size: 400 MB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 1,326 - Forks: 304

tldr-pages/tldr-translation-pairs-gen

Generates a structured dataset in various formats derived from tldr-pages.

Language: TypeScript - Size: 365 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 4 - Forks: 3

KurdishBLARK/InterdialectCorpus

A parallel corpus of Sorani, Kurmanji and English

Size: 21.7 MB - Last synced at: 10 days ago - Pushed at: over 4 years ago - Stars: 13 - Forks: 3

tanloong/interlaced.nvim

Neovim plugin for aligning bilingual parallel texts

Language: Lua - Size: 250 KB - Last synced at: 10 days ago - Pushed at: 22 days ago - Stars: 10 - Forks: 0

Helsinki-NLP/OpusFilter

OpusFilter - Parallel corpus processing toolkit

Language: Python - Size: 4.32 MB - Last synced at: 27 days ago - Pushed at: about 2 months ago - Stars: 104 - Forks: 20

rggdmonk/hadal

A simple and efficient tool for mining and aligning sentences with pre-trained models.

Language: Python - Size: 680 KB - Last synced at: 12 days ago - Pushed at: 12 months ago - Stars: 6 - Forks: 0

kirralabs/indonesian-NLP-resources

Data resources for Indonesian-language NLP

Size: 7.81 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 228 - Forks: 50

stibiumghost/tajik-to-persian-transliteration

Tajik-to-Persian transliteration project

Language: Jupyter Notebook - Size: 45.6 MB - Last synced at: 28 days ago - Pushed at: 10 months ago - Stars: 8 - Forks: 1

Caucasus-Rosetta/Lingua-Corpus

Multilingual and monolingual corpora focused on Caucasus languages, for Natural Language Processing (NLP)

Language: Python - Size: 25.2 GB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 35 - Forks: 6

korenyoni/opus-api

OPUS (opus.nlpl.eu) Python3 API

Language: Python - Size: 117 KB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 18 - Forks: 5

UUDigitalHumanitieslab/perfectextractor

Extracting present perfects (and related forms) from parallel corpora

Language: Python - Size: 1.26 MB - Last synced at: 10 days ago - Pushed at: over 2 years ago - Stars: 7 - Forks: 2

An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For instance useful for comparing a translation with the original text, to find differences and similarities between two different translations, or to see how a machine translation differs from a reference translation.

Language: Python - Size: 257 KB - Last synced at: about 13 hours ago - Pushed at: over 3 years ago - Stars: 22 - Forks: 0

bfsujason/bertalign

Multilingual sentence alignment using sentence embeddings

Language: Python - Size: 296 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 97 - Forks: 44

AlexsandroANP/heiwukong-multilingual-corpora

"Hei Wukong" (Black Wukong) multilingual (parallel) corpus, including the implementation code and a Streamlit search project

Language: Python - Size: 5.52 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

sharad461/nepali-translator

Neural Machine Translation on the Nepali-English language pair

Language: Python - Size: 3.85 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 47 - Forks: 16

Deeptiman/php-dom-parser-translation-tool

A simple DOM parser and translation tool using PHP, HTML, and MySQL. The translation model supports English to Odia, and a built-in dictionary supports the translation.

Language: PHP - Size: 4.62 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

Nexdata-AI/5310000-Groups-Chinese-Germany-Parallel-Corpus-Data

5310000-Groups-Chinese-Germany-Parallel-Corpus-Data

Size: 1.95 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/9830000-Groups-Chinese-Japanese-Parallel-Corpus-Data

Chinese-Japanese Parallel Corpus Data

Size: 1.08 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/6020000-Groups-Chinese-French-Parallel-Corpus-Data

6020000-Groups-Chinese-French-Parallel-Corpus-Data

Size: 1.15 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1000000-Groups-Chinese-Russian-Parallel-Corpus-Data

1000000-Groups-Chinese-Russian-Parallel-Corpus-Data

Size: 1.16 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1080000-Groups-English-Russian-Parallel-Corpus-Data

English and Russian parallel corpus

Size: 168 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/7440000-Groups-Chinese-Hindi-Parallel-Corpus-Data

7440000-Groups-Chinese-Hindi-Parallel-Corpus-Data

Size: 189 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/10030000-Groups-Chinese-Portuguese-Parallel-Corpus-Data

10030000-Groups-Chinese-Portuguese-Parallel-Corpus-Data

Size: 235 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/7290000-Groups--Chinese--Vietnamese-Parallel-Corpus-Data

7290000-Groups-Chinese-Vietnamese-Parallel-Corpus-Data

Size: 219 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/750000-Groups-Chinese-Burmese-Parallel-Corpus-Data

750000-Groups-Chinese-Burmese-Parallel-Corpus-Data

Size: 128 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/4720000-Groups-Chinese-Uighur-Parallel-Corpus-Data

4720000-Groups-Chinese-Uighur-Parallel-Corpus-Data

Size: 153 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/850000-Groups-English-Japanese-Parallel-Corpus-Data

English Japanese Parallel Corpus Data

Size: 187 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/3140000-Groups-Chinese-Spanish-Parallel-Corpus-Data

3140000-Groups-Chinese-Spanish-Parallel-Corpus-Data

Size: 223 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/12820000-Groups-Chinese-Korean-Parallel-Corpus-Data

Chinese-Korean-Parallel-Corpus-Data

Size: 147 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1140000-Groups-Chinese-Hebrew-Parallel-Corpus-Data

1140000-Groups-Chinese-Hebrew-Parallel-Corpus-Data

Size: 158 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/5010000-Groups-Chinese-Tibetan-Parallel-Corpus-Data

5010000-Groups-Chinese-Tibetan-Parallel-Corpus-Data

Size: 142 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/APY230328001_980000-Groups-Chinese-Urdu-Parallel-Corpus-Data

APY230328001_980000-Groups-Chinese-Urdu-Parallel-Corpus-Data

Size: 186 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1980000-Groups-Chinese-Polish-Parallel-Corpus-Data

1980000-Groups-Chinese-Polish-Parallel-Corpus-Data

Size: 257 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/3060000-Groups-Chinese-English-Parallel-Corpus-Data

Chinese-English-Parallel-Corpus-Data

Size: 141 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/100000-Groups-Chinese-Uighur-Parallel-Corpus-Data

100000-Groups-Chinese-Uighur-Parallel-Corpus-Data

Size: 1000 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/380000-Groups-Japanese-English-Parallel-Corpus-Data

Japanese and English parallel corpus

Size: 1.57 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1340000-Groups-English-Korean-Parallel-Corpus-Data

English and Korean parallel corpus

Size: 161 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/380000-Groups-Uighur-Chinese-Parallel-Corpus-Data

380000-Groups-Uighur-Chinese-Parallel-Corpus-Data

Size: 1020 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

michmech/irish-sentence-bank

4,500 sentences in Irish, tokenized, manually lemmatized, translated into English.

Size: 400 KB - Last synced at: 3 months ago - Pushed at: about 7 years ago - Stars: 9 - Forks: 1

PyThaiNLP/Thai-Lao-Parallel-Corpus

Thai Lao Parallel corpus

Size: 524 KB - Last synced at: 2 months ago - Pushed at: over 3 years ago - Stars: 5 - Forks: 1

matbahasa/TALPCo

TUFS Asian Language Parallel Corpus

Language: TeX - Size: 1.57 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 48 - Forks: 13

csebuetnlp/banglanmt

This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.

Language: Python - Size: 2.05 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 144 - Forks: 45

tm4roon/jawikinews-headline-dataset

A parallel corpus of article-headline pairs obtained from Japanese Wikinews.

Language: Jupyter Notebook - Size: 7.9 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

priyanshu2103/Sanskrit-Hindi-Machine-Translation

Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning

Language: Jupyter Notebook - Size: 9.69 MB - Last synced at: 10 months ago - Pushed at: over 4 years ago - Stars: 16 - Forks: 12

CasparChou/srt2corpus

Helps you convert SRT files into a CN-? parallel corpus

Language: JavaScript - Size: 10.7 KB - Last synced at: 24 days ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0

OdiaWikimedia/English-Odia 📦

English to Odia/Oriya parallel corpus of phrases

Language: Jupyter Notebook - Size: 4 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 4 - Forks: 4

soumendrak/MTEnglish2Odia 📦

Machine Translation from English to Odia language.

Language: Jupyter Notebook - Size: 62.6 MB - Last synced at: 10 months ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 7

shashanksiripragada/pib-crawl

Code to extract multilingual parallel corpus from Press Information Bureau (PIB) website.

Language: Python - Size: 1.03 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 4

YerevaNN/PARASITE

🪱 PARASITE || A parallel sentence data preprocessing toolkit. Originally developed as a part of the `en-ru` winner submission of WMT20 Biomedical Translation Task.

Language: Python - Size: 426 KB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 11 - Forks: 6

UUDigitalHumanitieslab/timealign

Parallel corpus annotation and visualization

Language: Python - Size: 6.16 MB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

time-in-translation/preprocess-corpora

Creating (parallel) corpora from scratch using Uplug tooling

Language: Python - Size: 770 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 1

gederajeg/constructional-equivalence

Repository of supplementary materials and RStudio project for the paper on corpus-based approach to measuring constructional equivalence.

Language: TeX - Size: 2.53 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

asmelashteka/HornMT

Machine translation (MT) benchmark dataset for languages in the Horn of Africa.

Size: 3.91 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 37 - Forks: 11

spraakbanken/swell-editor

Editor for normalising learner texts (error annotation and tagging).

Language: TypeScript - Size: 4.18 MB - Last synced at: 17 days ago - Pushed at: almost 2 years ago - Stars: 9 - Forks: 3

ShathaTm/LK-Hadith-Corpus

Leeds University and King Saud University (LK) Hadith Corpus

Language: Python - Size: 13.4 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 38 - Forks: 18

cfiltnlp/IITB-English-Hindi-PC

The IIT Bombay English-Hindi Parallel Corpus

Language: Jupyter Notebook - Size: 501 KB - Last synced at: almost 2 years ago - Pushed at: about 3 years ago - Stars: 11 - Forks: 2

xavier-gz/SLI_Galician_Corpora

Corpora for Galician language

Size: 365 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

tsuruoka-lab/BSD

The Business Scene Dialogue corpus

Size: 2.91 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 55 - Forks: 6

OdiaNLP/wikipedia-corpus 📦

Odia wikipedia monolingual corpus extraction

Language: Python - Size: 1.02 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

FerreroJeremy/Cross-Language-Dataset

A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection

Size: 657 MB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 60 - Forks: 25

Kartikaggarwal98/Indian_ParallelCorpus

Curated list of publicly available parallel corpora for Indian languages

Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 28 - Forks: 1

farshadjafari/parallel_corpus_generator

Python application that generates parallel corpora for any language pair; can be used for training NMT (neural machine translation) systems

Language: Python - Size: 33.8 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

x39826/Pali_Tripitaka

Pali Buddhist scriptures of 15 countries and their parallel corpus

Language: Python - Size: 56.6 KB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 6 - Forks: 4

x39826/Multilang_Translator_For_Pali_Tripitaka

Parallel corpus and multilingual machine translation system for the Pali Buddhist scriptures of 15 countries

Language: Python - Size: 43.7 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 1

umoqnier/Esquite (fork of ElotlMX/Esquite)

Framework for parallel corpora

Language: Python - Size: 6.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

Related Keywords

parallel-corpus 70 machine-translation 35 nlp 28 natural-language-processing 14 corpus 12 text 7 parallel-corpora 6 neural-machine-translation 5 dataset 5 odia-language 4 machine-learning 4 odia 3 english 3 uighur 3 alignment 3 sentence-alignment 3 corpus-linguistics 3 corpora 2 translation 2 statistical-machine-translation 2 hindi 2 vietnamese 2 burmese 2 deep-learning 2 gaeilge 2 irish 2 preprocessing 2 japanese 2 english-translation 2 monolingual-corpora 2 low-resource-machine-translation 2 low-resource-languages 2 corpus-tools 2 sentiment-analysis 2 indonesian 2 machine-translation-data-processing 2 named-entity-recognition 2 pali-tripitaka 2 stackoverflow 1 django-application 1 visualization 1 nlp-machine-learning 1 construction-grammar 1 italian 1 constructionist-approach 1 english-indonesian-translation 1 open-code 1 open-data 1 open-science 1 crawling-python 1 open-subtitle 1 quantitative-linguistics 1 r-programming 1 r-programming-projects 1 low-resource-nlp 1 distant-supervision 1 headline-generation 1 summarization 1 fasttext-embeddings 1 sanskrit 1 sanskrit-english 1 conversion 1 traditional-and-simplified-chinese 1 indic-languages 1 ulmfit 1 python3 1 indian-language 1 multilingual-nmt 1 alignments 1 crosslingual 1 sentiment-classification 1 annotation 1 arabic-english 1 hadeeth 1 hadith 1 e-texts 1 galician 1 supervised-machine-learning 1 tagging 1 training-data 1 annotated-corpora 1 document-aligned 1 data 1 data-analysis 1 cross-language-dataset 1 multi-granularity-dataset 1 computational-linguistics 1 multi-language-dataset 1 indian-languages 1 machinetranslation 1 multilingual-translation 1 translation-equivalence 1 translation-studies 1 udayana-university 1 elasticsearch 1 universitas-udayana 1 verbal-near-synonyms 1 afar 1 africa 1 amharic 1