GitHub topics: parallel-corpus
NiuTrans/Classical-Modern
A very comprehensive Classical Chinese (ancient text) to Modern Chinese parallel corpus
Language: Python - Size: 400 MB - Last synced at: 8 days ago - Pushed at: about 1 year ago - Stars: 1,326 - Forks: 304

tldr-pages/tldr-translation-pairs-gen
Generates a structured dataset in various formats derived from tldr-pages.
Language: TypeScript - Size: 365 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 4 - Forks: 3

KurdishBLARK/InterdialectCorpus
A parallel corpus of Sorani, Kurmanji and English
Size: 21.7 MB - Last synced at: 10 days ago - Pushed at: over 4 years ago - Stars: 13 - Forks: 3

tanloong/interlaced.nvim
Neovim plugin for aligning bilingual parallel texts
Language: Lua - Size: 250 KB - Last synced at: 10 days ago - Pushed at: 22 days ago - Stars: 10 - Forks: 0

Helsinki-NLP/OpusFilter
OpusFilter - Parallel corpus processing toolkit
Language: Python - Size: 4.32 MB - Last synced at: 27 days ago - Pushed at: about 2 months ago - Stars: 104 - Forks: 20

rggdmonk/hadal
A simple and efficient tool for mining and aligning sentences with pre-trained models.
Language: Python - Size: 680 KB - Last synced at: 12 days ago - Pushed at: 12 months ago - Stars: 6 - Forks: 0

kirralabs/indonesian-NLP-resources
Data resources for Indonesian NLP
Size: 7.81 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 228 - Forks: 50

stibiumghost/tajik-to-persian-transliteration
Tajik-to-Persian transliteration project
Language: Jupyter Notebook - Size: 45.6 MB - Last synced at: 28 days ago - Pushed at: 10 months ago - Stars: 8 - Forks: 1

Caucasus-Rosetta/Lingua-Corpus
Multilingual and monolingual corpora focused on Caucasus languages, for Natural Language Processing (NLP)
Language: Python - Size: 25.2 GB - Last synced at: 1 day ago - Pushed at: 6 months ago - Stars: 35 - Forks: 6

korenyoni/opus-api
OPUS (opus.nlpl.eu) Python3 API
Language: Python - Size: 117 KB - Last synced at: 23 days ago - Pushed at: 6 months ago - Stars: 18 - Forks: 5

UUDigitalHumanitieslab/perfectextractor
Extracting present perfects (and related forms) from parallel corpora
Language: Python - Size: 1.26 MB - Last synced at: 10 days ago - Pushed at: over 2 years ago - Stars: 7 - Forks: 2

BramVanroy/astred
An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For instance useful for comparing a translation with the original text, to find differences and similarities between two different translations, or to see how a machine translation differs from a reference translation.
Language: Python - Size: 257 KB - Last synced at: about 13 hours ago - Pushed at: over 3 years ago - Stars: 22 - Forks: 0

bfsujason/bertalign
Multilingual sentence alignment using sentence embeddings
Language: Python - Size: 296 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 97 - Forks: 44

AlexsandroANP/heiwukong-multilingual-corpora
"Black Wukong" (黑悟空) multilingual (parallel) corpus, including implementation code and a Streamlit search project
Language: Python - Size: 5.52 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

sharad461/nepali-translator
Neural Machine Translation on the Nepali-English language pair
Language: Python - Size: 3.85 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 47 - Forks: 16

Deeptiman/php-dom-parser-translation-tool
A simple DOM parser and translation tool using PHP, HTML, and MySQL. The translation model supports English-to-Odia translation, backed by a built-in dictionary.
Language: PHP - Size: 4.62 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

Nexdata-AI/5310000-Groups-Chinese-Germany-Parallel-Corpus-Data
5310000-Groups-Chinese-Germany-Parallel-Corpus-Data
Size: 1.95 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/9830000-Groups-Chinese-Japanese-Parallel-Corpus-Data
Chinese-Japanese Parallel Corpus Data
Size: 1.08 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/6020000-Groups-Chinese-French-Parallel-Corpus-Data
6020000-Groups-Chinese-French-Parallel-Corpus-Data
Size: 1.15 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1000000-Groups-Chinese-Russian-Parallel-Corpus-Data
1000000-Groups-Chinese-Russian-Parallel-Corpus-Data
Size: 1.16 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1080000-Groups-English-Russian-Parallel-Corpus-Data
English and Russian parallel corpus
Size: 168 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/7440000-Groups-Chinese-Hindi-Parallel-Corpus-Data
7440000-Groups-Chinese-Hindi-Parallel-Corpus-Data
Size: 189 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/10030000-Groups-Chinese-Portuguese-Parallel-Corpus-Data
10030000-Groups-Chinese-Portuguese-Parallel-Corpus-Data
Size: 235 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/7290000-Groups--Chinese--Vietnamese-Parallel-Corpus-Data
7290000-Groups-Chinese-Vietnamese-Parallel-Corpus-Data
Size: 219 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/750000-Groups-Chinese-Burmese-Parallel-Corpus-Data
750000-Groups-Chinese-Burmese-Parallel-Corpus-Data
Size: 128 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/4720000-Groups-Chinese-Uighur-Parallel-Corpus-Data
4720000-Groups-Chinese-Uighur-Parallel-Corpus-Data
Size: 153 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/850000-Groups-English-Japanese-Parallel-Corpus-Data
English Japanese Parallel Corpus Data
Size: 187 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/3140000-Groups-Chinese-Spanish-Parallel-Corpus-Data
3140000-Groups-Chinese-Spanish-Parallel-Corpus-Data
Size: 223 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/12820000-Groups-Chinese-Korean-Parallel-Corpus-Data
Chinese-Korean-Parallel-Corpus-Data
Size: 147 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1140000-Groups-Chinese-Hebrew-Parallel-Corpus-Data
1140000-Groups-Chinese-Hebrew-Parallel-Corpus-Data
Size: 158 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/5010000-Groups-Chinese-Tibetan-Parallel-Corpus-Data
5010000-Groups-Chinese-Tibetan-Parallel-Corpus-Data
Size: 142 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/APY230328001_980000-Groups-Chinese-Urdu-Parallel-Corpus-Data
APY230328001_980000-Groups-Chinese-Urdu-Parallel-Corpus-Data
Size: 186 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1980000-Groups-Chinese-Polish-Parallel-Corpus-Data
1980000-Groups-Chinese-Polish-Parallel-Corpus-Data
Size: 257 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/3060000-Groups-Chinese-English-Parallel-Corpus-Data
Chinese-English-Parallel-Corpus-Data
Size: 141 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/100000-Groups-Chinese-Uighur-Parallel-Corpus-Data
100000-Groups-Chinese-Uighur-Parallel-Corpus-Data
Size: 1000 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/380000-Groups-Japanese-English-Parallel-Corpus-Data
Japanese and English parallel corpus
Size: 1.57 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/1340000-Groups-English-Korean-Parallel-Corpus-Data
English and Korean parallel corpus
Size: 161 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Nexdata-AI/380000-Groups-Uighur-Chinese-Parallel-Corpus-Data
380000-Groups-Uighur-Chinese-Parallel-Corpus-Data
Size: 1020 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

michmech/irish-sentence-bank
4,500 sentences in Irish, tokenized, manually lemmatized, translated into English.
Size: 400 KB - Last synced at: 3 months ago - Pushed at: about 7 years ago - Stars: 9 - Forks: 1

PyThaiNLP/Thai-Lao-Parallel-Corpus
Thai Lao Parallel corpus
Size: 524 KB - Last synced at: 2 months ago - Pushed at: over 3 years ago - Stars: 5 - Forks: 1

matbahasa/TALPCo
TUFS Asian Language Parallel Corpus
Language: TeX - Size: 1.57 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 48 - Forks: 13

csebuetnlp/banglanmt
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
Language: Python - Size: 2.05 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 144 - Forks: 45

tm4roon/jawikinews-headline-dataset
A parallel corpus of article-headline pairs obtained from Japanese Wikinews.
Language: Jupyter Notebook - Size: 7.9 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

priyanshu2103/Sanskrit-Hindi-Machine-Translation
Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning
Language: Jupyter Notebook - Size: 9.69 MB - Last synced at: 10 months ago - Pushed at: over 4 years ago - Stars: 16 - Forks: 12

CasparChou/srt2corpus
Helps convert an srt file into a CN-? parallel corpus
Language: JavaScript - Size: 10.7 KB - Last synced at: 24 days ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0

OdiaWikimedia/English-Odia 📦
English to Odia/Oriya parallel corpus of phrases
Language: Jupyter Notebook - Size: 4 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 4 - Forks: 4

soumendrak/MTEnglish2Odia 📦
Machine Translation from English to Odia language.
Language: Jupyter Notebook - Size: 62.6 MB - Last synced at: 10 months ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 7

shashanksiripragada/pib-crawl
Code to extract multilingual parallel corpus from Press Information Bureau (PIB) website.
Language: Python - Size: 1.03 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 4

YerevaNN/PARASITE
🪱 PARASITE || A parallel sentence data preprocessing toolkit. Originally developed as a part of the `en-ru` winner submission of WMT20 Biomedical Translation Task.
Language: Python - Size: 426 KB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 11 - Forks: 6

UUDigitalHumanitieslab/timealign
Parallel corpus annotation and visualization
Language: Python - Size: 6.16 MB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

time-in-translation/preprocess-corpora
Creating (parallel) corpora from scratch using Uplug tooling
Language: Python - Size: 770 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 1

gederajeg/constructional-equivalence
Repository of supplementary materials and RStudio project for the paper on a corpus-based approach to measuring constructional equivalence.
Language: TeX - Size: 2.53 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

asmelashteka/HornMT
Machine translation (MT) benchmark dataset for languages in the Horn of Africa.
Size: 3.91 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 37 - Forks: 11

spraakbanken/swell-editor
Editor for normalising learner texts (error annotation and tagging).
Language: TypeScript - Size: 4.18 MB - Last synced at: 17 days ago - Pushed at: almost 2 years ago - Stars: 9 - Forks: 3

ShathaTm/LK-Hadith-Corpus
Leeds University and King Saud University (LK) Hadith Corpus
Language: Python - Size: 13.4 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 38 - Forks: 18

cfiltnlp/IITB-English-Hindi-PC
The IIT Bombay English-Hindi Parallel Corpus
Language: Jupyter Notebook - Size: 501 KB - Last synced at: almost 2 years ago - Pushed at: about 3 years ago - Stars: 11 - Forks: 2

xavier-gz/SLI_Galician_Corpora
Corpora for Galician language
Size: 365 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

tsuruoka-lab/BSD
The Business Scene Dialogue corpus
Size: 2.91 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 55 - Forks: 6

OdiaNLP/wikipedia-corpus 📦
Odia wikipedia monolingual corpus extraction
Language: Python - Size: 1.02 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

FerreroJeremy/Cross-Language-Dataset
A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection
Size: 657 MB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 60 - Forks: 25

Kartikaggarwal98/Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 28 - Forks: 1

farshadjafari/parallel_corpus_generator
Python application that generates parallel corpora for any language pair; can be used for training NMT (neural machine translation) systems
Language: Python - Size: 33.8 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

x39826/Pali_Tripitaka
Pali Buddhist scriptures of 15 countries and their parallel corpus
Language: Python - Size: 56.6 KB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 6 - Forks: 4

x39826/Multilang_Translator_For_Pali_Tripitaka
Parallel corpus and multilingual machine translation system of the Pali Buddhist scriptures in 15 countries
Language: Python - Size: 43.7 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 1

umoqnier/Esquite (fork of ElotlMX/Esquite)
Framework for parallel corpora
Language: Python - Size: 6.9 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

ikros98/ULMFiT-for-Italian
ULMFit model for the Italian language / creation of a parallel corpus
Language: Jupyter Notebook - Size: 13.7 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

time-in-translation/ParTy2OPUS
ParTy2OPUS converts documents from the ParTy corpus to the OPUS format
Language: Python - Size: 3.91 KB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

kscanne/ccgb
Backend DB for Corpas Comhthreomhar Gaeilge-Béarla, an Irish-English parallel corpus
Language: HTML - Size: 70.3 KB - Last synced at: about 2 years ago - Pushed at: almost 8 years ago - Stars: 0 - Forks: 0

mrsumitbd/SOParallelCorpusReplication
Replication package for SO processing for bitext
Language: Python - Size: 434 KB - Last synced at: almost 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 1

erayyildiz/parallel-sentence-quality-filter
Parallel sentence quality filter based on text classification methods
Language: Perl - Size: 1.21 MB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0
