GitHub topics: nlp-datasets

Repositories

StonyBrookNLP/appworld

🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper.

Language: Python - Size: 5.16 MB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 212 - Forks: 22

uma-pi1/OPIEC

Reading the data from OPIEC - an Open Information Extraction corpus

Language: Java - Size: 237 KB - Last synced at: 4 days ago - Pushed at: about 6 years ago - Stars: 38 - Forks: 6

irfnrdh/Awesome-Indonesia-NLP

Resource NLP & Bahasa

Size: 52.7 KB - Last synced at: 10 days ago - Pushed at: over 5 years ago - Stars: 269 - Forks: 67

selimfirat/bilkent-turkish-writings-dataset

Compilation of Turkish writings dataset that promotes creativity, content, composition, grammar, spelling and punctuation.

Language: Python - Size: 41.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 50 - Forks: 2

fido-ai/ua-datasets

A collection of datasets for Ukrainian language

Language: Python - Size: 2.08 MB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 57 - Forks: 2

grammarly/ua-gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Language: Macaulay2 - Size: 18 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 261 - Forks: 22

AndyTheFactory/romanian-nlp-datasets

A list of Romanian NLP Datasets

Size: 215 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 48 - Forks: 8

brooks-code/fuzzy-carnival

dailyword: discover a new word, once a day, straight from your terminal.

Language: Shell - Size: 1000 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

StonyBrookNLP/appworld-leaderboard

🌍 Leaderboard Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL2024

Language: Python - Size: 127 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 3 - Forks: 1

liutiedong/goat

a Fine-tuned LLaMA that is Good at Arithmetic Tasks

Language: Jupyter Notebook - Size: 863 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 177 - Forks: 17

Niger-Volta-LTI/yoruba-text

Yorùbá language training text for NLP, ASR and TTS tasks

Language: Python - Size: 76.2 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 76 - Forks: 26

Pzoom522/HistSumm

Code and data for "Summarising Historical Text in Modern Languages" (EACL 2021)

Language: Jupyter Notebook - Size: 237 KB - Last synced at: 3 months ago - Pushed at: about 4 years ago - Stars: 72 - Forks: 9

mihail911/nlp-library

curated collection of papers for the nlp practitioner 📖👩‍🔬

Size: 63.5 KB - Last synced at: 3 months ago - Pushed at: almost 5 years ago - Stars: 1,075 - Forks: 91

Natural language processing pet project. It includes data web scraping, lemmatizing, stemming, and working with related words (hyponyms, hypernyms, meronyms, holonyms). This specific code gathers all data from chosen pages of the Suspilne (Суспільне) webpage. Next, the data is manipulated and processed for future analysis

Language: Python - Size: 48.5 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

aajanki/finnish-nlp-datasets

Open Finnish NLP datasets

Size: 30.3 KB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 14 - Forks: 1

Robert-Morabito/STOP

Repository for the paper STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions (EMNLP 2024)

Language: Python - Size: 375 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

abdelazizharane/Corpus-Chadian-Languages

Corpus of Chadian languages - Corpus des langues locales tchadiennes

Size: 7.9 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

JadynHax/scpscraper

A Python library designed for scraping data from the SCP wiki.

Language: Python - Size: 216 KB - Last synced at: about 1 month ago - Pushed at: over 4 years ago - Stars: 15 - Forks: 4

Jpzinn654/qa-portuguese-v1

This is a 500-thousand-row split of a Hugging Face dataset in Portuguese, intended for training NLP models for question answering

Language: Python - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: 7 months ago - Stars: 3 - Forks: 0

RaiBP/incidental-bilingualism

Python program for detecting unintentional bilingual and translation instances in NLP datasets.

Language: Python - Size: 266 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

LIAAD/PT-Pump-Up

Hub for the Portuguese language NLP Resources

Language: PHP - Size: 8.37 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 0

christosojan/MSA_in_Indian_Languages

Implementation of Dense Fusion Network with Multimodal Residual (DFMR) for Multi-modal Sentiment Analysis (MSA) in native Indian languages such as Malayalam, integrating multi-modal information from multimedia. The model processes the textual, visual, and auditory modalities of the video to classify the sentiment into five categories.

Language: Jupyter Notebook - Size: 31.3 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

aryashah2k/SASBitathon-WinningSolution

1st Place solution for the SAS | GIM Bitathon, an annual Data Science Hackathon organized by SAS and Goa Institute of Management. The dataset worked on is the subset of the consumer complaints database provided by www.consumerfinance.gov

Language: Jupyter Notebook - Size: 39.6 MB - Last synced at: 2 days ago - Pushed at: over 3 years ago - Stars: 11 - Forks: 1

cybermatt/russian-names

Library for generating Russian names

Language: Python - Size: 628 KB - Last synced at: 6 days ago - Pushed at: about 6 years ago - Stars: 24 - Forks: 2

Dibyakanti/AutoTNLI-code

This repository contains the official code for the paper "Realistic Data Augmentation Framework for Enhancing Tabular Reasoning".

Language: HTML - Size: 3.99 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 6 - Forks: 1

Blacksujit/Sentiment-Analysis

This project is a sentiment analysis model built to classify IMDB movie reviews as either positive or negative using the IMDB dataset. It uses various machine learning models and deep learning techniques to handle the text data.

Language: Jupyter Notebook - Size: 42.9 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

divkakwani/webcorpus

Generate large textual corpora for almost any language by crawling the web

Language: Python - Size: 44.9 MB - Last synced at: 24 days ago - Pushed at: over 3 years ago - Stars: 8 - Forks: 11

bothub-it/bothub

Bothub is an open platform for predicting, training and sharing NLP datasets in multiple languages

Language: Makefile - Size: 1.17 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 35 - Forks: 5

PritamDeb68/Public-Datasets

All datasets related to machine learning, deep learning, NLP, and data science will be uploaded here.

Size: 273 KB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

AtheerAlzhrani/nlp_projects

NLP projects that I worked on using different natural language processing libraries.

Language: Jupyter Notebook - Size: 104 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

ElizaLo/Question-Answering-based-on-SQuAD (fork of gauthierdmn/question_answering)

Question Answering System using BiDAF Model on SQuAD v2.0

Language: Python - Size: 7.27 MB - Last synced at: about 2 months ago - Pushed at: almost 5 years ago - Stars: 25 - Forks: 27

anirudhsom/CAPP-Dataset

Official repository for "Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning".

Language: Jupyter Notebook - Size: 7.78 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

maryamesh/Emotion-Detection-and-Sentiment-Analysis-using-NLP

This project applies BERT for emotion detection and sentiment analysis, utilizing a dataset of annotated documents to classify various emotions from text. The main.ipynb file contains the complete code, outputs, and results.

Language: Jupyter Notebook - Size: 458 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

griff4692/clin-sum

Analysis of Hospital-Course Summaries

Language: Python - Size: 336 KB - Last synced at: 4 months ago - Pushed at: about 4 years ago - Stars: 4 - Forks: 1

AYUSHSURYAVANSHI/Fake-and-Real-News-NLP-Project

In today's digital age, misinformation spreads rapidly, significantly impacting public perception and decision-making. This project employs word embeddings using spaCy to effectively distinguish between fake and real news, enhancing the accuracy of information verification and contributing to the fight against misinformation. 📰🔍💡

Language: Jupyter Notebook - Size: 11.1 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

gpt-tester/ChatGPT-test-dataset-01

a small test dataset for use with OpenAI's ChatGPT

Size: 47.9 KB - Last synced at: 11 months ago - Pushed at: over 2 years ago - Stars: 34 - Forks: 11

AYUSHSURYAVANSHI/SMS-Spam-Collection-NLP-Project

The SMS Spam Collection v.1 📱 is a curated dataset consisting of 5,574 SMS messages in English, meticulously categorized as either legitimate (ham) or spam. This corpus serves as a valuable resource for research in SMS spam detection and filtering.🔍💬

Language: Python - Size: 209 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

ogpetrov/sakha-nlp

Various tools and data for Sakha language NLP.

Language: HTML - Size: 17.6 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

D3struf/NLP-Chatbot

A Virtual Assistant integrated in TUP Website using NLP and Naive Bayes Algorithm

Language: Jupyter Notebook - Size: 6.07 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 1

trisongz/pylines

Simplifying parsing of large jsonline files in NLP Workflows

Language: Python - Size: 244 KB - Last synced at: 8 days ago - Pushed at: over 3 years ago - Stars: 12 - Forks: 1

Karan-Malik/WordEmbeddings

Creating Word Embeddings using Keras

Language: Jupyter Notebook - Size: 24.5 MB - Last synced at: 4 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

afrisenti-semeval/afrisent-semeval-2023

AfriSenti-SemEval Shared Task 12: Sentiment Analysis for African languages: https://afrisenti-semeval.github.io/

Language: Jupyter Notebook - Size: 33 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 38 - Forks: 38

saakolch/procedure_of_extracting_data

Data preprocessing and training on Drug Review Dataset using Hugging Face library

Language: Jupyter Notebook - Size: 38.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

fzehracetin/turkish-question-answering

We extracted 5,000 question-answer pairs from Turkish Wikipedia and fine-tuned Turkish BERT, ALBERT, ELECTRA for the question-answering task.

Language: Jupyter Notebook - Size: 1.52 MB - Last synced at: 12 months ago - Pushed at: almost 4 years ago - Stars: 6 - Forks: 1

gkiril/benchie

Comprehensive evaluation framework for Open Information Extraction.

Language: Python - Size: 340 KB - Last synced at: 12 months ago - Pushed at: about 3 years ago - Stars: 38 - Forks: 8

SamDineshSD777/Sentiment-Analysis-on-Product-Reviews

Sentiment Analysis on Product Reviews (project associated with Zummit Infolabs).

Size: 2.4 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 1

ArmanBehnam/NLP

Natural language processing, including datasets, Farsi NLP, Automated Essay Scoring, Automatic Speech Recognition, etc.

Language: Jupyter Notebook - Size: 512 KB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 0

tagtog/BBC-News-Dataset

🍃BBC-News-Dataset in anndoc (tagtog) format

Language: HTML - Size: 2.81 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

josherich/nlp-dataset-explorer

NLP datasets explorer

Language: Vue - Size: 119 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

hicte/moin

A dataset of Moin Persian 🇮🇷 dictionary 📖 words.

Size: 265 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

mehrdad-dev/Battle-of-the-Wordsmiths

Official GitHub repository: Battle of the Wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard (dataset)

Language: Python - Size: 614 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

matt-seb-ho/WikiWhy

WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000+ "why" question-answer-rationale triplets.

Language: Python - Size: 28.2 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 41 - Forks: 1

JasonShao55/Chinese_Metaphor_Explanation

An annotated Chinese metaphor dataset

Language: Python - Size: 71.8 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 0

uma-pi1/OPIEC-pipeline

Language: Java - Size: 59.3 MB - Last synced at: 4 days ago - Pushed at: over 3 years ago - Stars: 14 - Forks: 2

BrunoGianetti/MyNLPProjects

My project storage in NLP

Language: Jupyter Notebook - Size: 30.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

hellohaptik/multi-task-NLP

multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.

Language: Python - Size: 7.46 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 358 - Forks: 54

vaibhav0000patel/Topical-Sentiment-Analysis

ML model that recognizes how closely a text relates to the particular topic the model was trained on. The modular structure of the code makes it easier to understand and modify. Here, the model classifies whether the text is crime-related or not.

Language: Python - Size: 483 KB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

INK-USC/CommonGen

A Constrained Text Generation Challenge Towards Generative Commonsense Reasoning

Language: Python - Size: 107 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 136 - Forks: 23

guhhhhaa/4675-scifi

Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. A Chinese science fiction NLP corpus, text corpus, and text database.

Size: 113 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 277 - Forks: 50

quincyliang/nlp-public-dataset

Chinese and English NER datasets, English-Chinese machine translation datasets, and a Chinese word segmentation dataset.

Language: Python - Size: 12.9 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 320 - Forks: 75

dkulagin/kartaslov

Open linguistic datasets: the Russian-language sentiment lexicon КартаСловСент, a semantics dataset, an association graph, and a dataset of spelling errors and typos.

Size: 20.1 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 346 - Forks: 50

gcunhase/AMICorpusXML

Extracts Transcript and Summary (Abstractive and Extractive) from the AMI Meeting Corpus

Language: Python - Size: 9.48 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 52 - Forks: 29

mtala3t/Identify-the-Sentiments-AV-NLP-Contest

This project was submitted as a Python implementation to the Analytics Vidhya contest "Identify the Sentiments". I enjoyed taking part in the competition and its whole process. The submitted solution ranked 118th on the public leaderboard.

Language: Python - Size: 7.61 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 7 - Forks: 2

INK-USC/XCSR

Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

Language: Python - Size: 60.7 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 20 - Forks: 2

jrgpulido/pd18is5d

Language: Roff - Size: 11.1 MB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 2 - Forks: 13

codemugger/BERT-Model-for-Fake-News-Classification-

Obtaining news online has become the new normal for many Singaporeans in the information age. The ease of discovering and sharing news with different news sources battling to control the narrative has led to a drastic increase in misinformation. Therefore we need to differentiate between real and fake news, and this project aims to design an application to prevent the spreading of fake news by employing the pre-trained BERT model to detect fake news.

Language: Jupyter Notebook - Size: 82.1 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

aman-17/BERT-Semantic-Similarity-Flask-App

Flask app for Semantic Similarity of sentences using BERT model.

Language: CSS - Size: 6.06 MB - Last synced at: 19 days ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

cedspam/text_dataset_streaming

Language: Python - Size: 72.3 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

dellison/NLIDatasets.jl

Julia interface to datasets for natural language inference

Language: Julia - Size: 48.8 KB - Last synced at: 18 days ago - Pushed at: over 5 years ago - Stars: 4 - Forks: 0

claudiu1989/Synonyms-detection

Experiments with word2vec embeddings for synonyms detection, for the Romanian language.

Language: Python - Size: 137 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

INK-USC/TriggerNER

TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)

Language: Python - Size: 2.22 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 170 - Forks: 19

utahnlp/infotabs-code

Implementation of the semi-structured inference model in our ACL 2020 paper, INFOTABS: Inference on Tables as Semi-structured Data.

Language: Python - Size: 127 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 17 - Forks: 7

NakulLakhotia/Text-Classification-of-SMS-as-Spam-or-Non-Spam

Classifying an SMS as spam or non-spam using Natural Language Processing (NLP) and Machine Learning

Language: Python - Size: 295 KB - Last synced at: almost 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

Delta-Sigma/urdu-stopwords

A list containing Urdu stopwords.

Size: 25.4 KB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 4 - Forks: 11

gcunhase/ArXivAbsTitleDataset

Extract Abstract and Title Dataset from arXiv articles

Language: Python - Size: 14 MB - Last synced at: almost 2 years ago - Pushed at: over 5 years ago - Stars: 5 - Forks: 0

Text-Mining/Ferdowsi-Annotated-Academic-Linguistic-Corpus

Two linguistic corpora built from the collection of articles of Ferdowsi University of Mashhad

Size: 57.6 MB - Last synced at: almost 2 years ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 1

DravidianNLP/Datasets

This repository hosts all the datasets published in Dravidian Languages.

Size: 11.7 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

cjiang2/VDCNN

Implementation of Very Deep Convolutional Neural Network for Text Classification

Language: Python - Size: 42 KB - Last synced at: over 1 year ago - Pushed at: almost 3 years ago - Stars: 171 - Forks: 40

turkish-nlp-suite/Vitamins-Supplements-Reviews

Repo for Turkish sentiment analysis dataset, "Vitamins and Supplements Customer Reviews"

Size: 7.48 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

turkish-nlp-suite/Vitamins-Supplements-NER-dataset

Repo for Turkish Vitamins and Supplements NER dataset.

Size: 558 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 1

Abdullahw72/E-Commerce-Chatbot

Chatbot for E-Commerce Related Questions

Language: Jupyter Notebook - Size: 1.24 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 1

xtea/chinese_medical_words

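Manually curated corpus of medical-industry vocabulary, terminology, and related material. Can be used to train various NLP models, such as speech recognition and dialogue systems.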

Size: 1.33 MB - Last synced at: almost 2 years ago - Pushed at: about 5 years ago - Stars: 85 - Forks: 31

U-11-Agar/timeseries-analysis

Time series data analysis on real-time data and CSV files

Language: Jupyter Notebook - Size: 73 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

sammitjain/loksabha-questions

Questions asked in the Lok Sabha - collection and analysis of trends. Creating the dataset from scratch.

Language: Jupyter Notebook - Size: 80.7 MB - Last synced at: 2 months ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 1

Yu-billie/NLP-Project-CUAI-1H23

NLP Projects in CUAI 1H23

Language: Jupyter Notebook - Size: 3.09 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

apple-fritter/ploop.rs

➿Loop through a TSV file and pass columns of data to an external program. Written in Rust.

Language: Rust - Size: 55.7 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

HuynhXuanLam-IT44/BERT-Covid-Sentiment-Classification

Applying and Understanding an Advanced, Novel Deep Learning Approach

Language: Jupyter Notebook - Size: 2.55 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

MiSaengg/gunhee-RnD-space

R&D for datasets for book genres

Language: Jupyter Notebook - Size: 17.1 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

poethan/AlphaMWE

AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations

Size: 265 KB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 3 - Forks: 2

jamesohortle/loanwords_gairaigo

English loanwords in Japanese

Language: Python - Size: 17 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 1

guhhhhaa/wula-scifi

Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website. A Chinese science fiction NLP corpus, text corpus, and text database.

Size: 199 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 49 - Forks: 9

turkish-nlp-suite/BeyazPerde-Movie-Reviews

Repo for Turkish movie reviews dataset.

Size: 10.5 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

secsilm/zi-dataset

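Chinese character dataset, including information about each character such as stroke count, radical, pinyin, and English definitions/synonyms.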

Size: 1.57 MB - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 50 - Forks: 8

Bohdan-Khomtchouk/NERO-nlp

NERO-nlp is a PyPI package for biomedical Named Entity (Recognition) Ontology

Language: Python - Size: 29.9 MB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 4 - Forks: 1

anudeepvanjavakam1/disaster_response_NLP

This project is part of the Data Science Nanodegree Program by Udacity in collaboration with Figure Eight. The dataset contains pre-labelled tweets and messages from real-life disaster events. The project's aim is to build a Natural Language Processing (NLP) model to categorize messages in real time.

Language: Jupyter Notebook - Size: 7.67 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

kelvin-jiang/FreebaseQA

The release of the FreebaseQA data set (NAACL 2019).

Size: 7.8 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 59 - Forks: 1

PranavNV/Nationality-Prejudice-in-Text-Generation

This project focuses on the analysis of text generation models such as GPT-2 to identify and understand populistic behaviors or biases against various nationalities.

Size: 20.1 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

maxent-ai/Datasets 📦

datasets with text data for use in NLP, Text analysis, information extraction, ML research.

Language: Jupyter Notebook - Size: 45.7 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 15 - Forks: 3

INK-USC/RiddleSense

RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge

Language: Python - Size: 16.3 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 7 - Forks: 1

mzhukovaucsb/emoji_gestures

Research project “Gesture Emoji Twitter Corpus”. Project description, data collection pipeline (tweepy), data preprocessing functions (regex, nltk), 2 datasets for Russian and English published in open access.

Language: Jupyter Notebook - Size: 125 MB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 2 - Forks: 0

Related Keywords

nlp-datasets (142), nlp (82), nlp-machine-learning (37), natural-language-processing (26), dataset (24), nlp-resources (20), python (17), machine-learning (15), python3 (12), datasets (12), deep-learning (11), nlp-library (9), corpus (9), data-science (8), sentiment-analysis (8), pytorch (7), bert (7), natural-language-understanding (7), sentiment-classification (7), corpus-data (7), text-classification (6), question-answering (6), information-extraction (5), corpus-linguistics (5), bert-model (5), llm (5), wikipedia (5), data (4), nlp-keywords-extraction (4), chinese-nlp (4), text-processing (4), sp-es (4), transformer (4), nli (4), nltk (4), machinelearning (3), turkish-nlp (3), roberta (3), news (3), flask (3), ai (3), natural-language-generation (3), keras (3), nltk-python (3), acl2020 (3), turkce-veriseti (3), language (3), language-model (3), large-language-models (3), acl (3), text-mining (3), tables (3), acl-2024 (3), corpus-tools (3), corpus-processing (3), deep-neural-networks (3), nlg-dataset (3), chatgpt (3), open-information-extraction (3), chatbot (3), pypi (2), bert-embeddings (2), gpt (2), linguistics (2), java (2), naive-bayes-classifier (2), semi-structured-data (2), rust (2), openai (2), sentiment-analysis-dataset (2), dataset-generation (2), medical-nlp (2), chatgpt-api (2), artificial-intelligence (2), nlp-parsing (2), gpt-3 (2), linguistic-analysis (2), infotabs (2), challenge (2), inference (2), lstm-neural-networks (2), india (2), text-generation (2), twitter (2), database (2), sentiment (2), tensorflow (2), preprocessing (2), keras-tensorflow (2), jupyter-notebook (2), natural-language-inference (2), dravidian-languages (2), stopwords (2), portuguese-language (2), turkce-nlp (2), turkish-nlp-dataset (2), llms (2), intent-classification (2), language-learning (2), named-entity-recognition (2)