Topic: "low-resource-languages"
RichardLitt/low-resource-languages
Resources for conservation, development, and documentation of low resource (human) languages.
Language: TeX - Size: 1.33 MB - Last synced at: about 23 hours ago - Pushed at: about 2 months ago - Stars: 419 - Forks: 59

csebuetnlp/xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
Language: Python - Size: 5.41 MB - Last synced at: 10 months ago - Pushed at: about 1 year ago - Stars: 249 - Forks: 42

csebuetnlp/banglanmt
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
Language: Python - Size: 2.05 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 144 - Forks: 45

cisnlp/GlotLID
💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
Language: Python - Size: 409 KB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 128 - Forks: 8

Andrews2017/africanlp-public-datasets
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
Size: 146 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 81 - Forks: 19

ljvmiranda921/calamanCy
NLP pipelines for Tagalog using spaCy
Language: Python - Size: 978 KB - Last synced at: 4 days ago - Pushed at: 4 months ago - Stars: 56 - Forks: 3

jcblaisecruz02/Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Language: Python - Size: 565 KB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 46 - Forks: 6

emotion-analysis-project/SemEval2025-Task11
SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Language: Jupyter Notebook - Size: 86.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 41 - Forks: 5

EveryVoiceTTS/EveryVoice
The EveryVoice TTS Toolkit - Text To Speech for your language
Language: Python - Size: 9.92 MB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 34 - Forks: 2

cdli-gh/Semi-Supervised-NMT-for-Sumerian-English
Exploring the Limits of Low-Resource Neural Machine Translation
Language: Jupyter Notebook - Size: 156 MB - Last synced at: 7 days ago - Pushed at: over 2 years ago - Stars: 34 - Forks: 10

hausanlp/NaijaSenti Fork of shmuhammadd/NaijaSenti
This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
Size: 29.7 MB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 24

kbatsuren/CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
Size: 88.7 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 32 - Forks: 8

Kartikaggarwal98/Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 28 - Forks: 1

csikasote/BembaSpeech
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
Size: 2.41 GB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 27 - Forks: 2

alexandra-chron/relm_unmt
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
Language: Python - Size: 2.24 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 27 - Forks: 2

luciusssss/ZhuangBench
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
Language: Python - Size: 3.24 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 21 - Forks: 0

Rumeysakeskin/Turkish-Text-to-Speech
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
Language: Python - Size: 8.84 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 3

luciusssss/mc2_corpus
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
Language: Python - Size: 602 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 20 - Forks: 2

charlesliucn/LanMIT Fork of kaldi-asr/kaldi
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
Language: C++ - Size: 139 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 20 - Forks: 0

RichardLitt/thesis
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
Language: TeX - Size: 36.7 MB - Last synced at: 2 months ago - Pushed at: almost 7 years ago - Stars: 20 - Forks: 4

khuangaf/CONCRETE
Official implementation of "CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval" (COLING'22)
Language: Python - Size: 137 KB - Last synced at: 29 days ago - Pushed at: over 2 years ago - Stars: 17 - Forks: 1

Aditi138/EntityTargetedActiveLearning
Language: Python - Size: 45.9 KB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 17 - Forks: 3

CoEDL/vad-sli-asr
A pipeline to isolate and transcribe one language in mixed-language speech
Language: Python - Size: 350 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 16 - Forks: 3

alecokas/BiLatticeRNN-Confidence
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264
Language: Python - Size: 614 KB - Last synced at: about 2 months ago - Pushed at: about 5 years ago - Stars: 16 - Forks: 4

jcblaisecruz02/Tagalog-fake-news
Fake news detection in Filipino via Multitask Transfer Learning
Size: 1.24 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 13 - Forks: 2

Rumeysakeskin/Turkish-Speech-to-Text
Fine-tuning for automatic speech recognition on low-resource languages with character-based CTC model
Language: Jupyter Notebook - Size: 48.4 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 13 - Forks: 1

cisnlp/GlotWeb
🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
Language: Python - Size: 1.59 MB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 12 - Forks: 0

wannaphong/Awesome-Lao-NLP
Awesome Lao Natural Language Processing
Size: 11.7 KB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 12 - Forks: 0

dmatekenya/Chichewa-Speech2Text
Automated Speech Recognition for Chichewa.
Language: Jupyter Notebook - Size: 57.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 12 - Forks: 1

jhdeov/interlingual-MFA
Workflow for forced alignment between languages
Language: Python - Size: 260 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 1

unza-speech-lab/zambezi-voice
Repository for multilingual speech data resources for native languages of Zambia.
Size: 5.77 GB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 10 - Forks: 4

clefourrier/CopperMT
[ACL 2021, Findings] Cognate Prediction Per Machine Translation
Language: JavaScript - Size: 37.2 MB - Last synced at: about 1 month ago - Pushed at: almost 3 years ago - Stars: 10 - Forks: 0

LeeLanguageLab/HokkienTranslation
Educational language-learning app for Hokkien, a low-resource language, featuring flashcards, quizzes, and generative AI!
Language: JavaScript - Size: 51.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 9 - Forks: 2

ofdn/OpenSpeaks-Before-AI
A set of frameworks for creating the AI/ML building blocks for low-resource languages.
Size: 14.6 MB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 9 - Forks: 1

AsifulNobel/Metsys
Chatbot Solution for Resource-Poor Languages. Contains code and data for Journal Article 'Focused domain contextual AI chatbot framework for resource poor languages'.
Language: Python - Size: 56.1 MB - Last synced at: 6 months ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 6

fajri91/minangNLP
Minangkabau NLP corpus. PACLIC 2020
Language: Python - Size: 6.22 MB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 9 - Forks: 3

alecokas/swahili-text-gcn
Graph Convolutional Network for Swahili News Classification: https://arxiv.org/abs/2103.09325
Language: Jupyter Notebook - Size: 521 KB - Last synced at: about 19 hours ago - Pushed at: almost 4 years ago - Stars: 8 - Forks: 4

ruoyuxie/noisy_parallel_data_alignment
Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685
Language: Python - Size: 245 KB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 7 - Forks: 1

IgnatiusEzeani/IGBONLP
This is a repository for the IGBONLP Project.
Language: Modula-3 - Size: 127 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 7 - Forks: 4

tafseer-nayeem/BengaliReadability
Code and Dataset of our work, Simple or Complex? Learning to Predict Readability of Bengali Texts accepted at AAAI 2021.
Language: Python - Size: 9.52 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 7 - Forks: 5

mokha/verdd
Veʹrdd is an open-source dictionary editing framework with the focus on low-resourced and endangered languages. The framework is mainly built to facilitate collecting, importing, editing and exporting dictionaries while allowing the involvement of the native speakers to contribute easily to the preservation of the language and construction of the dictionary.
Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 1

NN-Project-2/Emotion-TTS-Emebddings
This project enhances Text-to-Speech systems by integrating advanced emotion embeddings, allowing for more expressive and human-like speech synthesis. By capturing the nuances of human emotions, our approach aims to create synthetic voices that resonate with listeners, enabling effective emotional expression in speech generation.
Language: Python - Size: 13.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

mmaguero/josa-corpus
Jopara (Guarani-dominant mixed with Spanish) sentiment analysis corpus
Size: 8.79 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 6 - Forks: 0

tafseer-nayeem/BengaliSummarization
Code and Dataset of our work, Unsupervised Abstractive Summarization of Bengali Text Documents accepted at EACL 2021.
Language: Python - Size: 2.63 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 6 - Forks: 6

harmanpreet93/low-resource-machine-translation
Low resource machine translation using Transformers and Iterative Back translation
Language: Python - Size: 7.34 MB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 6 - Forks: 1

Michael-Beukman/NERTransfer
[IJCNLP-AACL 2023] Investigating transfer learning in low-resourced languages, specifically in a named entity recognition (NER) task. http://arxiv.org/abs/2309.05311
Language: Python - Size: 242 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 0

njallskarp/finetune-qa-powerset
Finetuning BERT models on a powerset of different linguistic domains
Language: Python - Size: 198 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 0

alexandra-chron/umt-lmu-wmt2020
Unsupervised MT systems of LMU Munich submitted to WMT 2020 Unsupervised Machine Translation Shared Task.
Language: Python - Size: 438 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 5 - Forks: 0

rockerritesh/maithili_lipi_AI_proj
This project was prepared in partial fulfilment of the requirements for the bachelor’s degree in Electronics and Communication Engineering. It contains vowel letter detection for Tirhuta Lipi.
Language: Jupyter Notebook - Size: 27.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 0

worldbank/LLMs-Practical-Guide
A practical introduction to Generative AI and LLMs, equipping professionals with essential skills to apply Gen AI in workflows, data processes, and tool development through hands-on labs and case studies.
Language: Jupyter Notebook - Size: 11.3 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 4 - Forks: 3

ToluClassics/LowResourceOCR 📦
This work is an adaptation of CNN+Transformer architecture to training text recognition models for Yorùbá & Igbo Languages
Language: Python - Size: 1.15 GB - Last synced at: 10 months ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 1

csikasote/bembaspeech-exps
Bemba ASR model obtained by fine-tuning a well-performing pretrained English DeepSpeech model.
Language: Jupyter Notebook - Size: 3.2 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 2

cbelth/ATP-morphology
Code for "The Greedy and Recursive Search for Morphological Productivity." Ready for use on new data.
Language: Jupyter Notebook - Size: 23.5 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

MahtaFetrat/Mana-Forced-Aligner
A robust forced alignment tool for low-resource languages using multiple ASR models and CER-based matching. Built for noisy data and imperfect transcripts.
Language: Jupyter Notebook - Size: 4.86 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 3 - Forks: 0

devrimcavusoglu/nonwestlit
NONWESTLIT Project Codebase
Language: Python - Size: 239 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

zabir-nabil/bangla-multilingual-llm-eval
Evaluation of Open and Closed-Source Multi-lingual LLMs for Low-Resource Bangla Language
Language: Jupyter Notebook - Size: 29.3 MB - Last synced at: about 2 months ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

kenza-ily/mt_hallucination_detection
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models | EMNLP Findings 2024
Language: Python - Size: 3.91 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 3 - Forks: 0

cisnlp/GlotStoryBook
Children's storybooks for 180 languages.
Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

victoriapedlar/isizulu-text-generation
Open-Ended Text Generation in isiZulu: Decoding Strategies for a Morphologically Rich Low-Resource Language
Language: Python - Size: 1.5 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 3 - Forks: 0

Rumeysakeskin/ASR-fine-tuning-for-low-resource-languages
Transfer learning for ASR with subword encoding CTC model (NVIDIA NeMo Citrinet) on low-resource languages
Language: Jupyter Notebook - Size: 455 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

rexshijaku/alb-fake-news-corpus
The First Ever Albanian Fake News Corpus
Size: 6.63 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 1

elerdg/ASR-for-low-resource-languages
Fine-tune wav2vec2-xls-r on data from low-resource-languages
Language: Jupyter Notebook - Size: 5.76 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 1

andrea-cavallo-98/Low-resource-Machine-Translation
Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.
Language: Jupyter Notebook - Size: 4.32 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 2

hmar-lang/hmar-bible-dataset
A dataset featuring English to Hmar translations of the Bible, designed for use in linguistic research, cultural preservation, and machine learning applications.
Size: 5.98 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

jo-valer/machine-translation-ladin-fascian
Repository of our paper Nesciun Lengaz Lascià Endò: Machine Translation for Fassa Ladin.
Language: Jupyter Notebook - Size: 271 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

Rui0828/Learning-From-Mistakes-Prompting
LoResMT@ACL 2024: Learning-From-Mistakes Prompting for Indigenous Language Translation – A feedback-driven approach to enhance low-resource translation.
Language: Python - Size: 4.92 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

kashubian-translator/pl-csb-model
Model training and BLEU calculation tools for a Polish-Kashubian translator.
Language: Python - Size: 55.7 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

ddindidu/K-OMG
Example dataset and prompt design of Korean Offensive language Machine Generation (K-OMG), published at IJCNLP-AACL 2023.
Language: Python - Size: 3.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

agustinbrusco/tokens-thorugh-lang
Analysis of LLM token representation of texts in different languages
Language: Jupyter Notebook - Size: 7.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

fokhruli/CM-seti-anlysis
Implementation for the paper titled "Data-Augmentation for Bangla-English Code-Mixed Sentiment Analysis: Enhancing Cross Linguistic Contextual Understanding", IEEE Access, 2023
Language: Python - Size: 2.33 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

strickvl/balochi-tokenizer
A custom tokenizer for the Balochi language.
Language: Jupyter Notebook - Size: 2.04 MB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

HenningBuhl/low-resource-machine-translation
This repository is an open-source collection of various low-resource machine translation experiments.
Language: Python - Size: 428 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 1

frankl1/Word2vec-For-NER-In-Low-Resource-Languages
An efficient word representation for named entity recognition in low-resource languages.
Language: Jupyter Notebook - Size: 33.3 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

ljvmiranda921/ud-tagalog-spacy
Training a POS Tagger and Dependency Parser for a Low-Resource Language (Tagalog)
Language: Python - Size: 52.2 MB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

uds-lsv/transfer-distant-transformer-african
Code + data for the EMNLP'20 publication "Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages"
Language: Python - Size: 1.47 MB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 4

ogunlao/low_res_speech_project
Accompanying code for research work on Weakly Supervised Learning of Speech features for Low resource languages
Language: Python - Size: 4.02 MB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 1

Llamacha/Churana
Size: 26.4 KB - Last synced at: almost 2 years ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 1

OmeshThokchom/N7speech
Manipuri ASR – A state-of-the-art, low-latency speech-to-text library with advanced voice activity detection and real-time transcription, purpose-built for the Manipuri language.
Language: Python - Size: 269 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

mrapacz/interlinear-translation
Morphology-enhanced neural models for Ancient Greek interlinear translation, achieving 35-38% BLEU improvements for English and Polish translations. Includes custom T5 implementations and training code. [LoResLM@COLING2025]
Language: Python - Size: 889 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 0

SaraikiNLP/SaraikiNLP
SaraikiNLP | Natural Language Processing for Saraiki Language | NLP Toolkit | Saraiki NLP
Language: Jupyter Notebook - Size: 304 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Miagao-Valley/kulasisi
A web application that helps communities preserve and revitalize their languages.
Language: TypeScript - Size: 21.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

NN-Project-1/dis-Vector-Embedding
The Dis-Vector project enhances voice conversion and synthesis through disentangled embeddings, allowing for high-quality, zero-shot voice cloning across multiple languages. This model leverages separate encoders for content, pitch, rhythm, and timbre, enabling precise control over synthesized voice characteristics.
Language: Python - Size: 11.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

andrianllmm/wika-data
Philippine language resources.
Language: Python - Size: 5.83 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

saedeht/language-understanding-chatgpt
repo for the Adapted few-shot prompting...
Language: Jupyter Notebook - Size: 58.6 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

haturusinghe/subasa-plm
A framework for adapting Pretrained Language Models (XLM-R, BERT etc.) for Low-Resourced Offensive Language Detection in Sinhala using pretrained models and intermediate tasks.
Language: Python - Size: 3.73 MB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

alexeyev/tratreetra
simple syntactic transfer based on the treebank translation
Language: Python - Size: 42 KB - Last synced at: 13 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

kyrgyznlp/kyrgyznlp.github.io
KyrgyzNLP: Browsing, Bibliography, and Scientometrics
Language: HTML - Size: 21.4 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

BrianMsane/siSwati-Datasets
Repository for siSwati NLP datasets which I have worked on in my research. Anything from sentiment analysis, named entity recognition, fill-masking, and more!
Size: 153 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

BrianMsane/sswcleaner
Natural language processing tool for siSwati
Language: Python - Size: 173 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

VinAIResearch/UniBridge
UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages (ACL 2024)
Language: Python - Size: 485 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

ndamulelonemakh/our-stopwords
Auto-generated stopwords for South African Bantu Languages
Language: Python - Size: 23.4 KB - Last synced at: 3 days ago - Pushed at: 12 months ago - Stars: 1 - Forks: 0

generalpurposelab/instruct-global
Repo associated with the forthcoming paper 'Instruct-global: aligning language models to follow instructions in low-resource languages'. Instruct-global automates the process of generating instruction datasets in low-resource languages (LRLs).
Language: Python - Size: 7.73 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

dsfsi/zabantu-beta
ZaBantu is a fleet of light-weight Masked Language Models for Southern Bantu Languages
Language: Python - Size: 3.12 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

GGLAB-KU/turkish-plu
Code for AACL23 paper "Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish"
Language: Python - Size: 146 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

mdm-code/manx
Fine-tune LLM for early Middle English lemmatization with data from LAEME.
Language: Python - Size: 157 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

umar1997/propaganda-codeswitched-text
[EMNLP 2023] Official repository of paper titled "Detecting Propaganda Techniques in Code-Switched Social Media Text"
Language: Jupyter Notebook - Size: 47 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 1

IParraMartin/Eu2Vec-Models
Eu2Vec Models
Language: Jupyter Notebook - Size: 69.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

dsfsi/embedding-eval-data
Embedding Evaluation Data for South African Languages
Size: 8.79 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

greenw0lf/MSc-VT-Thesis
Scripts and files I used throughout my Master thesis.
Language: Jupyter Notebook - Size: 2.05 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

IParraMartin/BasqueNews-Dataset
A folder containing news headers and descriptions from Basque newspapers
Size: 77.1 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0
