low-resource-languages | Topic | Ecosyste.ms: Repos

Topic: "low-resource-languages"

RichardLitt/low-resource-languages

Resources for conservation, development, and documentation of low-resource (human) languages.

Language: TeX - Size: 1.33 MB - Last synced at: about 23 hours ago - Pushed at: about 2 months ago - Stars: 419 - Forks: 59

csebuetnlp/xl-sum

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.

Language: Python - Size: 5.41 MB - Last synced at: 10 months ago - Pushed at: about 1 year ago - Stars: 249 - Forks: 42

csebuetnlp/banglanmt

This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.

Language: Python - Size: 2.05 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 144 - Forks: 45

cisnlp/GlotLID

💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023

Language: Python - Size: 409 KB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 128 - Forks: 8

Andrews2017/africanlp-public-datasets

A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.

Size: 146 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 81 - Forks: 19

ljvmiranda921/calamanCy

NLP pipelines for Tagalog using spaCy

Language: Python - Size: 978 KB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 56 - Forks: 3

jcblaisecruz02/Filipino-Text-Benchmarks

Open-source benchmark datasets and pretrained transformer models in the Filipino language.

Language: Python - Size: 565 KB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 46 - Forks: 6

emotion-analysis-project/SemEval2025-Task11

SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection

Language: Jupyter Notebook - Size: 86.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 41 - Forks: 5

EveryVoiceTTS/EveryVoice

The EveryVoice TTS Toolkit - Text To Speech for your language

Language: Python - Size: 9.92 MB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 34 - Forks: 2

cdli-gh/Semi-Supervised-NMT-for-Sumerian-English

Exploring the Limits of Low-Resource Neural Machine Translation

Language: Jupyter Notebook - Size: 156 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 34 - Forks: 10

hausanlp/NaijaSenti Fork of shmuhammadd/NaijaSenti

This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.

Size: 29.7 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 24

kbatsuren/CogNet

CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates

Size: 88.7 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 32 - Forks: 8

Kartikaggarwal98/Indian_ParallelCorpus

Curated list of publicly available parallel corpora for Indian languages

Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 28 - Forks: 1

csikasote/BembaSpeech

This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.

Size: 2.41 GB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 27 - Forks: 2

alexandra-chron/relm_unmt

Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".

Language: Python - Size: 2.24 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 27 - Forks: 2

luciusssss/ZhuangBench

[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly

Language: Python - Size: 3.24 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 21 - Forks: 0

Rumeysakeskin/Turkish-Text-to-Speech

Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN

Language: Python - Size: 8.84 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 3

luciusssss/mc2_corpus

[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)

Language: Python - Size: 602 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 20 - Forks: 2

charlesliucn/LanMIT Fork of kaldi-asr/kaldi

📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.

Language: C++ - Size: 139 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 20 - Forks: 0

RichardLitt/thesis

My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University

Language: TeX - Size: 36.7 MB - Last synced at: 2 months ago - Pushed at: almost 7 years ago - Stars: 20 - Forks: 4

khuangaf/CONCRETE

Official implementation of "CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval" (COLING'22)

Language: Python - Size: 137 KB - Last synced at: 29 days ago - Pushed at: over 2 years ago - Stars: 17 - Forks: 1

Aditi138/EntityTargetedActiveLearning

Language: Python - Size: 45.9 KB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 17 - Forks: 3

CoEDL/vad-sli-asr

A pipeline to isolate and transcribe one language in mixed-language speech

Language: Python - Size: 350 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 16 - Forks: 3

alecokas/BiLatticeRNN-Confidence

Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264

Language: Python - Size: 614 KB - Last synced at: about 2 months ago - Pushed at: about 5 years ago - Stars: 16 - Forks: 4

jcblaisecruz02/Tagalog-fake-news

Fake news detection in Filipino via Multitask Transfer Learning

Size: 1.24 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 13 - Forks: 2

Rumeysakeskin/Turkish-Speech-to-Text

Fine-tuning for automatic speech recognition on low-resource languages with character-based CTC model

Language: Jupyter Notebook - Size: 48.4 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 13 - Forks: 1

cisnlp/GlotWeb

🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction.

Language: Python - Size: 1.59 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 12 - Forks: 0

wannaphong/Awesome-Lao-NLP

Awesome Lao Natural Language Processing

Size: 11.7 KB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 12 - Forks: 0

dmatekenya/Chichewa-Speech2Text

Automated Speech Recognition for Chichewa.

Language: Jupyter Notebook - Size: 57.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 12 - Forks: 1

jhdeov/interlingual-MFA

Workflow for forced alignment between languages

Language: Python - Size: 260 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 1

unza-speech-lab/zambezi-voice

Repository for multilingual speech data resources for native languages of Zambia.

Size: 5.77 GB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 4

clefourrier/CopperMT

[ACL 2021, Findings] Cognate Prediction Per Machine Translation

Language: JavaScript - Size: 37.2 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 10 - Forks: 0

LeeLanguageLab/HokkienTranslation

Educational language-learning app for Hokkien, a low-resource language, featuring flashcards, quizzes, and generative AI!

Language: JavaScript - Size: 51.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 9 - Forks: 2

ofdn/OpenSpeaks-Before-AI

A set of frameworks for creating the AI/ML building blocks for low-resource languages.

Size: 14.6 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 1

AsifulNobel/Metsys

Chatbot Solution for Resource-Poor Languages. Contains code and data for Journal Article 'Focused domain contextual AI chatbot framework for resource poor languages'.

Language: Python - Size: 56.1 MB - Last synced at: 6 months ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 6

fajri91/minangNLP

Minangkabau NLP corpus. PACLIC 2020

Language: Python - Size: 6.22 MB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 3

alecokas/swahili-text-gcn

Graph Convolutional Network for Swahili News Classification: https://arxiv.org/abs/2103.09325

Language: Jupyter Notebook - Size: 521 KB - Last synced at: about 19 hours ago - Pushed at: almost 4 years ago - Stars: 8 - Forks: 4

ruoyuxie/noisy_parallel_data_alignment

Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685

Language: Python - Size: 245 KB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 7 - Forks: 1

IgnatiusEzeani/IGBONLP

This is a repository for the IGBONLP Project.

Language: Modula-3 - Size: 127 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 7 - Forks: 4

tafseer-nayeem/BengaliReadability

Code and dataset for our work "Simple or Complex? Learning to Predict Readability of Bengali Texts", accepted at AAAI 2021.

Language: Python - Size: 9.52 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 7 - Forks: 5

mokha/verdd

Veʹrdd is an open-source dictionary editing framework with the focus on low-resourced and endangered languages. The framework is mainly built to facilitate collecting, importing, editing and exporting dictionaries while allowing the involvement of the native speakers to contribute easily to the preservation of the language and construction of the dictionary.

Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 1

NN-Project-2/Emotion-TTS-Emebddings

This project enhances Text-to-Speech systems by integrating advanced emotion embeddings, allowing for more expressive and human-like speech synthesis. By capturing the nuances of human emotions, our approach aims to create synthetic voices that resonate with listeners, enabling effective emotional expression in speech generation.

Language: Python - Size: 13.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

mmaguero/josa-corpus

Jopara (Guarani-dominant mixed with Spanish) sentiment analysis corpus

Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 6 - Forks: 0

tafseer-nayeem/BengaliSummarization

Code and dataset for our work "Unsupervised Abstractive Summarization of Bengali Text Documents", accepted at EACL 2021.

Language: Python - Size: 2.63 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 6 - Forks: 6

harmanpreet93/low-resource-machine-translation

Low resource machine translation using Transformers and Iterative Back translation

Language: Python - Size: 7.34 MB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 6 - Forks: 1

Michael-Beukman/NERTransfer

[IJCNLP-AACL 2023] Investigating transfer learning in low-resourced languages, specifically in a named entity recognition (NER) task. http://arxiv.org/abs/2309.05311

Language: Python - Size: 242 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 0

njallskarp/finetune-qa-powerset

Finetuning BERT models on a powerset of different linguistic domains

Language: Python - Size: 198 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 0

alexandra-chron/umt-lmu-wmt2020

Unsupervised MT systems of LMU Munich submitted to WMT 2020 Unsupervised Machine Translation Shared Task.

Language: Python - Size: 438 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 5 - Forks: 0

rockerritesh/maithili_lipi_AI_proj

This project was prepared in partial fulfilment of the requirements for the bachelor's degree in Electronics and Communication Engineering. It contains vowel letter detection for Tirhuta Lipi.

Language: Jupyter Notebook - Size: 27.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

worldbank/LLMs-Practical-Guide

A practical introduction to Generative AI and LLMs, equipping professionals with essential skills to apply Gen AI in workflows, data processes, and tool development through hands-on labs and case studies.

Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 4 - Forks: 3

ToluClassics/LowResourceOCR 📦

This work adapts a CNN+Transformer architecture to train text recognition models for the Yorùbá and Igbo languages.

Language: Python - Size: 1.15 GB - Last synced at: 10 months ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 1

csikasote/bembaspeech-exps

Bemba ASR model obtained by fine-tuning a well-performing pretrained English DeepSpeech model.

Language: Jupyter Notebook - Size: 3.2 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 2

cbelth/ATP-morphology

Code for "The Greedy and Recursive Search for Morphological Productivity." Ready for use on new data.

Language: Jupyter Notebook - Size: 23.5 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

MahtaFetrat/Mana-Forced-Aligner

A robust forced alignment tool for low-resource languages using multiple ASR models and CER-based matching. Built for noisy data and imperfect transcripts.

Language: Jupyter Notebook - Size: 4.86 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3 - Forks: 0

devrimcavusoglu/nonwestlit

NONWESTLIT Project Codebase

Language: Python - Size: 239 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

zabir-nabil/bangla-multilingual-llm-eval

Evaluation of Open and Closed-Source Multi-lingual LLMs for Low-Resource Bangla Language

Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

kenza-ily/mt_hallucination_detection

Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models | EMNLP Findings 2024

Language: Python - Size: 3.91 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 3 - Forks: 0

cisnlp/GlotStoryBook

Children's storybooks for 180 languages.

Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

victoriapedlar/isizulu-text-generation

Open-Ended Text Generation in isiZulu: Decoding Strategies for a Morphologically Rich Low-Resource Language

Language: Python - Size: 1.5 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 3 - Forks: 0

Rumeysakeskin/ASR-fine-tuning-for-low-resource-languages

Transfer learning for ASR with subword encoding CTC model (NVIDIA NeMo Citrinet) on low-resource languages

Language: Jupyter Notebook - Size: 455 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

rexshijaku/alb-fake-news-corpus

The First Ever Albanian Fake News Corpus

Size: 6.63 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 1

elerdg/ASR-for-low-resource-languages

Fine-tune wav2vec2-xls-r on data from low-resource languages

Language: Jupyter Notebook - Size: 5.76 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 1

andrea-cavallo-98/Low-resource-Machine-Translation

Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Language: Jupyter Notebook - Size: 4.32 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 2

hmar-lang/hmar-bible-dataset

A dataset featuring English to Hmar translations of the Bible, designed for use in linguistic research, cultural preservation, and machine learning applications.

Size: 5.98 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

jo-valer/machine-translation-ladin-fascian

Repository of our paper "Nesciun Lengaz Lascià Endò: Machine Translation for Fassa Ladin".

Language: Jupyter Notebook - Size: 271 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

Rui0828/Learning-From-Mistakes-Prompting

LoResMT@ACL 2024: Learning-From-Mistakes Prompting for Indigenous Language Translation – A feedback-driven approach to enhance low-resource translation.

Language: Python - Size: 4.92 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

kashubian-translator/pl-csb-model

Model training and BLEU calculation tools for a Polish-Kashubian translator.

Language: Python - Size: 55.7 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

ddindidu/K-OMG

Example dataset and prompt design of Korean Offensive language Machine Generation (K-OMG), published at IJCNLP-AACL 2023.

Language: Python - Size: 3.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

agustinbrusco/tokens-thorugh-lang

Analysis of LLM token representation of texts in different languages

Language: Jupyter Notebook - Size: 7.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

fokhruli/CM-seti-anlysis

Implementation for the paper titled "Data-Augmentation for Bangla-English Code-Mixed Sentiment Analysis: Enhancing Cross Linguistic Contextual Understanding", IEEE Access, 2023

Language: Python - Size: 2.33 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

strickvl/balochi-tokenizer

A custom tokenizer for the Balochi language.

Language: Jupyter Notebook - Size: 2.04 MB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

HenningBuhl/low-resource-machine-translation

This repository is an open-source collection of various low-resource machine translation experiments.

Language: Python - Size: 428 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 1

frankl1/Word2vec-For-NER-In-Low-Resource-Languages

An efficient word representation for named entity recognition in low-resource languages.

Language: Jupyter Notebook - Size: 33.3 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

ljvmiranda921/ud-tagalog-spacy

Training a POS Tagger and Dependency Parser for a Low-Resource Language (Tagalog)

Language: Python - Size: 52.2 MB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

uds-lsv/transfer-distant-transformer-african

Code + data for the EMNLP'20 publication "Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages"

Language: Python - Size: 1.47 MB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 4

ogunlao/low_res_speech_project

Accompanying code for research work on Weakly Supervised Learning of Speech features for Low resource languages

Language: Python - Size: 4.02 MB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 1

Llamacha/Churana

Size: 26.4 KB - Last synced at: almost 2 years ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 1

OmeshThokchom/N7speech

Manipuri ASR – A state-of-the-art, low-latency speech-to-text library with advanced voice activity detection and real-time transcription, purpose-built for the Manipuri language.

Language: Python - Size: 269 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

mrapacz/interlinear-translation

Morphology-enhanced neural models for Ancient Greek interlinear translation, achieving 35-38% BLEU improvements for English and Polish translations. Includes custom T5 implementations and training code. [LoResLM@COLING2025]

Language: Python - Size: 889 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 0

SaraikiNLP/SaraikiNLP

SaraikiNLP | Natural Language Processing for Saraiki Language | NLP Toolkit | Saraiki NLP

Language: Jupyter Notebook - Size: 304 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Miagao-Valley/kulasisi

A web application that helps communities preserve and revitalize their languages.

Language: TypeScript - Size: 21.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

NN-Project-1/dis-Vector-Embedding

The Dis-Vector project enhances voice conversion and synthesis through disentangled embeddings, allowing for high-quality, zero-shot voice cloning across multiple languages. This model leverages separate encoders for content, pitch, rhythm, and timbre, enabling precise control over synthesized voice characteristics.

Language: Python - Size: 11.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0