GitHub topics: nlp-datasets
StonyBrookNLP/appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper.
Language: Python - Size: 4.81 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 179 - Forks: 17

irfnrdh/Awesome-Indonesia-NLP
Resource NLP & Bahasa
Size: 52.7 KB - Last synced at: 3 days ago - Pushed at: over 5 years ago - Stars: 270 - Forks: 67

StonyBrookNLP/appworld-leaderboard
🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL2024
Language: Python - Size: 127 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 1

uma-pi1/OPIEC
Reading the data from OPIEC - an Open Information Extraction corpus
Language: Java - Size: 237 KB - Last synced at: 4 days ago - Pushed at: almost 6 years ago - Stars: 37 - Forks: 6

liutiedong/goat
a Fine-tuned LLaMA that is Good at Arithmetic Tasks
Language: Jupyter Notebook - Size: 863 KB - Last synced at: 1 day ago - Pushed at: over 1 year ago - Stars: 177 - Forks: 17

Niger-Volta-LTI/yoruba-text
Yorùbá language training text for NLP, ASR and TTS tasks
Language: Python - Size: 76.2 MB - Last synced at: 8 days ago - Pushed at: about 2 years ago - Stars: 76 - Forks: 26

Pzoom522/HistSumm
Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)
Language: Jupyter Notebook - Size: 237 KB - Last synced at: 21 days ago - Pushed at: almost 4 years ago - Stars: 72 - Forks: 9

grammarly/ua-gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Language: Macaulay2 - Size: 18 MB - Last synced at: 15 days ago - Pushed at: about 1 year ago - Stars: 259 - Forks: 22

mihail911/nlp-library
Curated collection of papers for the NLP practitioner 📖👩‍🔬
Size: 63.5 KB - Last synced at: 18 days ago - Pushed at: over 4 years ago - Stars: 1,075 - Forks: 91

BigToothDev/pet-project-nlp
Natural language processing pet project. It includes data web scraping, lemmatizing, stemming, and working with related words (hyponyms, hypernyms, meronyms, holonyms). This specific code gathers all data from chosen pages of the Suspilne (Суспільне) webpage. Next, the data is manipulated and processed for future analysis
Language: Python - Size: 48.5 MB - Last synced at: 9 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

aajanki/finnish-nlp-datasets
Open Finnish NLP datasets
Size: 30.3 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 14 - Forks: 1

Robert-Morabito/STOP
Repository for the paper STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions (EMNLP 2024)
Language: Python - Size: 375 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

AndyTheFactory/romanian-nlp-datasets
A list of Romanian NLP Datasets
Size: 190 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 39 - Forks: 7

abdelazizharane/Corpus-Chadian-Languages
Corpus of local Chadian languages
Size: 7.9 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

JadynHax/scpscraper
A Python library designed for scraping data from the SCP wiki.
Language: Python - Size: 216 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 15 - Forks: 4

Jpzinn654/qa-portuguese-v1
A 500,000-row split of a Portuguese dataset from Hugging Face, for training NLP models for question answering
Language: Python - Size: 4.88 KB - Last synced at: 2 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

RaiBP/incidental-bilingualism
Python program for detecting unintentional bilingual and translation instances in NLP datasets.
Language: Python - Size: 266 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

fido-ai/ua-datasets
A collection of datasets for Ukrainian language
Language: Python - Size: 2.08 MB - Last synced at: 10 days ago - Pushed at: 9 months ago - Stars: 58 - Forks: 2

LIAAD/PT-Pump-Up
Hub for the Portuguese language NLP Resources
Language: PHP - Size: 8.37 MB - Last synced at: 18 days ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 0

selimfirat/bilkent-turkish-writings-dataset
Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.
Language: Jupyter Notebook - Size: 25.7 MB - Last synced at: 2 months ago - Pushed at: about 7 years ago - Stars: 49 - Forks: 2

christosojan/MSA_in_Indian_Languages
Implementation of Dense Fusion Network with Multimodal Residual (DFMR) for Multi-modal Sentiment Analysis (MSA) in native Indian languages such as Malayalam, integrating multi-modal information from multimedia. The model processes the textual, visual, and auditory modalities of the video to classify the sentiment into five categories.
Language: Jupyter Notebook - Size: 31.3 MB - Last synced at: 10 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

aryashah2k/SASBitathon-WinningSolution
1st Place solution for the SAS | GIM Bitathon, an annual Data Science Hackathon organized by SAS and Goa Institute of Management. The dataset worked on is the subset of the consumer complaints database provided by www.consumerfinance.gov
Language: Jupyter Notebook - Size: 39.6 MB - Last synced at: about 4 hours ago - Pushed at: over 3 years ago - Stars: 11 - Forks: 1

cybermatt/russian-names
Library for generating Russian names
Language: Python - Size: 628 KB - Last synced at: 6 days ago - Pushed at: almost 6 years ago - Stars: 24 - Forks: 2

Dibyakanti/AutoTNLI-code
This repository contains the official code for the paper: Realistic Data Augmentation Framework for Enhancing Tabular Reasoning.
Language: HTML - Size: 3.99 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 6 - Forks: 1

Blacksujit/Sentiment-Analysis
This project is a sentiment analysis model built to classify IMDB movie reviews as either positive or negative using the IMDB dataset. It uses various machine learning models and deep learning techniques to handle the text data.
Language: Jupyter Notebook - Size: 42.9 MB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

divkakwani/webcorpus
Generate large textual corpora for almost any language by crawling the web
Language: Python - Size: 44.9 MB - Last synced at: 28 days ago - Pushed at: over 3 years ago - Stars: 8 - Forks: 11

bothub-it/bothub
Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages
Language: Makefile - Size: 1.17 MB - Last synced at: 21 days ago - Pushed at: over 2 years ago - Stars: 35 - Forks: 5

PritamDeb68/Public-Datasets
All datasets related to machine learning, deep learning, NLP, and data science will be uploaded here.
Size: 273 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

AtheerAlzhrani/nlp_projects
NLP projects I worked on using different natural language processing libraries.
Language: Jupyter Notebook - Size: 104 KB - Last synced at: 29 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

ElizaLo/Question-Answering-based-on-SQuAD Fork of gauthierdmn/question_answering
Question Answering System using BiDAF Model on SQuAD v2.0
Language: Python - Size: 7.27 MB - Last synced at: 3 months ago - Pushed at: over 4 years ago - Stars: 25 - Forks: 27

anirudhsom/CAPP-Dataset
Official repository for "Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning".
Language: Jupyter Notebook - Size: 7.78 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

maryamesh/Emotion-Detection-and-Sentiment-Analysis-using-NLP
This project applies BERT for emotion detection and sentiment analysis, utilizing a dataset of annotated documents to classify various emotions from text. The main.ipynb file contains the complete code, outputs, and results.
Language: Jupyter Notebook - Size: 458 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

griff4692/clin-sum
Analysis of Hospital-Course Summaries
Language: Python - Size: 336 KB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 4 - Forks: 1

AYUSHSURYAVANSHI/Fake-and-Real-News-NLP-Project
In today's digital age, misinformation spreads rapidly, significantly impacting public perception and decision-making. This project employs word embeddings using spaCy to effectively distinguish between fake and real news, enhancing the accuracy of information verification and contributing to the fight against misinformation. 📰🔍💡
Language: Jupyter Notebook - Size: 11.1 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

gpt-tester/ChatGPT-test-dataset-01
a small test dataset for use with OpenAI's ChatGPT
Size: 47.9 KB - Last synced at: 9 months ago - Pushed at: over 2 years ago - Stars: 34 - Forks: 11

AYUSHSURYAVANSHI/SMS-Spam-Collection-NLP-Project
The SMS Spam Collection v.1 📱 is a curated dataset consisting of 5,574 SMS messages in English, meticulously categorized as either legitimate (ham) or spam. This corpus serves as a valuable resource for research in SMS spam detection and filtering.🔍💬
Language: Python - Size: 209 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

ogpetrov/sakha-nlp
Various tools and data for Sakha language NLP.
Language: HTML - Size: 17.6 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

D3struf/NLP-Chatbot
A Virtual Assistant integrated in TUP Website using NLP and Naive Bayes Algorithm
Language: Jupyter Notebook - Size: 6.07 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 1

trisongz/pylines
Simplifying parsing of large jsonline files in NLP Workflows
Language: Python - Size: 244 KB - Last synced at: 8 days ago - Pushed at: about 3 years ago - Stars: 12 - Forks: 1

Karan-Malik/WordEmbeddings
Creating Word Embeddings using Keras
Language: Jupyter Notebook - Size: 24.5 MB - Last synced at: about 2 months ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

afrisenti-semeval/afrisent-semeval-2023
AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages : https://afrisenti-semeval.github.io/
Language: Jupyter Notebook - Size: 33 MB - Last synced at: 11 months ago - Pushed at: over 1 year ago - Stars: 38 - Forks: 38

saakolch/procedure_of_extracting_data
Data preprocessing and training on Drug Review Dataset using Hugging Face library
Language: Jupyter Notebook - Size: 38.8 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

fzehracetin/turkish-question-answering
We extracted 5,000 question-answer pairs from Turkish Wikipedia and fine-tuned Turkish BERT, ALBERT, ELECTRA for the question-answering task.
Language: Jupyter Notebook - Size: 1.52 MB - Last synced at: 10 months ago - Pushed at: over 3 years ago - Stars: 6 - Forks: 1

gkiril/benchie
Comprehensive evaluation framework for Open Information Extraction.
Language: Python - Size: 340 KB - Last synced at: 10 months ago - Pushed at: almost 3 years ago - Stars: 38 - Forks: 8

SamDineshSD777/Sentiment-Analysis-on-Product-Reviews
Sentiment Analysis on Product Reviews ( Project Associated with Zummit Infolabs ).
Size: 2.4 MB - Last synced at: 12 months ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 1

ArmanBehnam/NLP
Natural language processing, including datasets, Farsi NLP, Automated Essay Scoring, Automatic Speech Recognition, etc.
Language: Jupyter Notebook - Size: 512 KB - Last synced at: 12 months ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 0

tagtog/BBC-News-Dataset
🍃BBC-News-Dataset in anndoc (tagtog) format
Language: HTML - Size: 2.81 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

shikhirsingh/PersonNameRecognizer4j
Used to identify if the string contains a name of a human
Language: Java - Size: 22 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

josherich/nlp-dataset-explorer
NLP datasets explorer
Language: Vue - Size: 119 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

hicte/moin
A dataset of Moin Persian 🇮🇷 dictionary 📖 words.
Size: 265 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

mehrdad-dev/Battle-of-the-Wordsmiths
Official GitHub repository: Battle of the Wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard (dataset)
Language: Python - Size: 614 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

matt-seb-ho/WikiWhy
WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.
Language: Python - Size: 28.2 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 41 - Forks: 1

JasonShao55/Chinese_Metaphor_Explanation
An annotated Chinese metaphor dataset
Language: Python - Size: 71.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 0

uma-pi1/OPIEC-pipeline
Language: Java - Size: 59.3 MB - Last synced at: 4 days ago - Pushed at: about 3 years ago - Stars: 14 - Forks: 2

BrunoGianetti/MyNLPProjects
My project storage in NLP
Language: Jupyter Notebook - Size: 30.9 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

hellohaptik/multi-task-NLP
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Language: Python - Size: 7.46 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 358 - Forks: 54

vaibhav0000patel/Topical-Sentiment-Analysis
ML model that recognizes how closely a text relates to the particular topic the model was trained on. The modular structure of the code makes it easier to understand and modify. Here, the model classifies whether the text is crime-related or not.
Language: Python - Size: 483 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

INK-USC/CommonGen
A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning
Language: Python - Size: 107 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 136 - Forks: 23

guhhhhaa/4675-scifi
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. A Chinese science fiction NLP corpus, text corpus, and text database.
Size: 113 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 277 - Forks: 50

quincyliang/nlp-public-dataset
Chinese and English NER datasets, English-Chinese machine translation datasets, and Chinese word segmentation datasets.
Language: Python - Size: 12.9 MB - Last synced at: over 1 year ago - Pushed at: about 4 years ago - Stars: 320 - Forks: 75

dkulagin/kartaslov
Open linguistic datasets: the Russian sentiment lexicon KartaSlovSent (КартаСловСент), a semantics dataset, an association graph, and a dataset of spelling errors and typos.
Size: 20.1 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 346 - Forks: 50

gcunhase/AMICorpusXML
Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus
Language: Python - Size: 9.48 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 52 - Forks: 29

mtala3t/Identify-the-Sentiments-AV-NLP-Contest
This project is a Python implementation submitted to the Analytics Vidhya contest "Identify the Sentiments". I enjoyed taking part in this competition and its whole process. The submitted solution ranked 118 on the public leaderboard.
Language: Python - Size: 7.61 MB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 7 - Forks: 2

INK-USC/XCSR
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"
Language: Python - Size: 60.7 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 20 - Forks: 2

jrgpulido/pd18is5d
Language: Roff - Size: 11.1 MB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 2 - Forks: 13

codemugger/BERT-Model-for-Fake-News-Classification-
Obtaining news online has become the new normal for many Singaporeans in the information age. The ease of discovering and sharing news with different news sources battling to control the narrative has led to a drastic increase in misinformation. Therefore we need to differentiate between real and fake news, and this project aims to design an application to prevent the spreading of fake news by employing the pre-trained BERT model to detect fake news.
Language: Jupyter Notebook - Size: 82.1 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

aman-17/BERT-Semantic-Similarity-Flask-App
Flask app for Semantic Similarity of sentences using BERT model.
Language: CSS - Size: 6.06 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

cedspam/text_dataset_streaming
Language: Python - Size: 72.3 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

dellison/NLIDatasets.jl
Julia interface to datasets for natural language inference
Language: Julia - Size: 48.8 KB - Last synced at: about 2 months ago - Pushed at: about 5 years ago - Stars: 4 - Forks: 0

claudiu1989/Synonyms-detection
Experiments with word2vec embeddings for synonyms detection, for the Romanian language.
Language: Python - Size: 137 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

INK-USC/TriggerNER
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
Language: Python - Size: 2.22 MB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 170 - Forks: 19

utahnlp/infotabs-code
Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.
Language: Python - Size: 127 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 17 - Forks: 7

NakulLakhotia/Text-Classification-of-SMS-as-Spam-or-Non-Spam
Classifying an SMS as spam or non-spam using Natural Language Processing (NLP) and Machine Learning
Language: Python - Size: 295 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

Delta-Sigma/urdu-stopwords
A list containing Urdu stopwords.
Size: 25.4 KB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 4 - Forks: 11

gcunhase/ArXivAbsTitleDataset
Extract Abstract and Title Dataset from arXiv articles
Language: Python - Size: 14 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 5 - Forks: 0

Text-Mining/Ferdowsi-Annotated-Academic-Linguistic-Corpus
Two linguistic corpora built from the collection of articles of Ferdowsi University of Mashhad
Size: 57.6 MB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 1

DravidianNLP/Datasets
This repository hosts all the datasets published in Dravidian Languages.
Size: 11.7 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

cjiang2/VDCNN
Implementation of Very Deep Convolutional Neural Network for Text Classification
Language: Python - Size: 42 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 171 - Forks: 40

turkish-nlp-suite/Vitamins-Supplements-Reviews
Repo for Turkish sentiment analysis dataset, "Vitamins and Supplements Customer Reviews"
Size: 7.48 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

turkish-nlp-suite/Vitamins-Supplements-NER-dataset
Repo for Turkish Vitamins and Supplements NER dataset.
Size: 558 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 1

Abdullahw72/E-Commerce-Chatbot
Chatbot for E-Commerce Related Questions
Language: Jupyter Notebook - Size: 1.24 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 1

xtea/chinese_medical_words
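Manually curated corpus of medical-industry vocabulary and terminology. Can be used to train various NLP models, such as speech recognition and dialogue systems.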
Size: 1.33 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 85 - Forks: 31

U-11-Agar/timeseries-analysis
Time series analysis on real-time data and CSV files
Language: Jupyter Notebook - Size: 73 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

sammitjain/loksabha-questions
Questions asked in the Lok Sabha - collection and analysis of trends. Creating the dataset from scratch.
Language: Jupyter Notebook - Size: 80.7 MB - Last synced at: 8 days ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 1

Yu-billie/NLP-Project-CUAI-1H23
NLP Projects in CUAI 1H23
Language: Jupyter Notebook - Size: 3.09 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

apple-fritter/ploop.rs
➿Loop through a TSV file and pass columns of data to an external program. Written in Rust.
Language: Rust - Size: 55.7 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

HuynhXuanLam-IT44/BERT-Covid-Sentiment-Classification
Applying and Understanding an Advanced, Novel Deep Learning Approach
Language: Jupyter Notebook - Size: 2.55 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

MiSaengg/gunhee-RnD-space
R&D for datasets for book genres
Language: Jupyter Notebook - Size: 17.1 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

poethan/AlphaMWE
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations
Size: 265 KB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 3 - Forks: 2

jamesohortle/loanwords_gairaigo
English loanwords in Japanese
Language: Python - Size: 17 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 1

guhhhhaa/wula-scifi
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula (乌拉) science fiction website. A Chinese science fiction NLP corpus, text corpus, and text database.
Size: 199 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 49 - Forks: 9

turkish-nlp-suite/BeyazPerde-Movie-Reviews
Repo for Turkish movie reviews dataset.
Size: 10.5 MB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

secsilm/zi-dataset
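Chinese character dataset, including related information for each character such as stroke count, radical, pinyin, and English definitions/synonyms.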
Size: 1.57 MB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 50 - Forks: 8

Bohdan-Khomtchouk/NERO-nlp
NERO-nlp is a PyPI package for biomedical Named Entity (Recognition) Ontology
Language: Python - Size: 29.9 MB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 4 - Forks: 1

anudeepvanjavakam1/disaster_response_NLP
This project is part of the Data Science Nanodegree Program by Udacity in collaboration with Figure Eight. The dataset contains pre-labelled tweets and messages from real-life disaster events. The project's aim is to build a Natural Language Processing (NLP) model to categorize messages in real time.
Language: Jupyter Notebook - Size: 7.67 MB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

kelvin-jiang/FreebaseQA
The release of the FreebaseQA data set (NAACL 2019).
Size: 7.8 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 59 - Forks: 1

PranavNV/Nationality-Prejudice-in-Text-Generation
This project focuses on the analysis of text generation models such as GPT-2 to identify and understand populistic behaviors or biases against various nationalities.
Size: 20.1 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

maxent-ai/Datasets 📦
datasets with text data for use in NLP, Text analysis, information extraction, ML research.
Language: Jupyter Notebook - Size: 45.7 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 15 - Forks: 3

INK-USC/RiddleSense
RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge
Language: Python - Size: 16.3 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 7 - Forks: 1

mzhukovaucsb/emoji_gestures
Research project “Gesture Emoji Twitter Corpus”. Project description, data collection pipeline (tweepy), data preprocessing functions (regex, nltk), 2 datasets for Russian and English published in open access.
Language: Jupyter Notebook - Size: 125 MB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0
