An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: text-cleaning

adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Language: Python - Size: 33.8 MB - Last synced at: about 21 hours ago - Pushed at: about 1 month ago - Stars: 4,151 - Forks: 289

HandcartCactus/obsidian-remove-newlines

A plugin for Obsidian.md which removes newlines and blank lines from selected or pasted text.

Language: HTML - Size: 7.59 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 9 - Forks: 0

jfilter/clean-text

🧹 Python package for text cleaning

Language: Python - Size: 157 KB - Last synced at: 7 days ago - Pushed at: almost 2 years ago - Stars: 975 - Forks: 79

wisupai/e2m

E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.

Language: Jupyter Notebook - Size: 21.7 MB - Last synced at: 8 days ago - Pushed at: 8 months ago - Stars: 1,065 - Forks: 53

blmoistawinde/HarvestText

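Text mining and preprocessing toolkit (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.) using unsupervised or weakly supervised methods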

Language: Python - Size: 4.27 MB - Last synced at: 10 days ago - Pushed at: 11 months ago - Stars: 2,504 - Forks: 336

hscspring/pnlp

NLP pre-/post-processing tools.

Language: Python - Size: 106 KB - Last synced at: 13 days ago - Pushed at: 22 days ago - Stars: 29 - Forks: 6

reZach/grammarify

Grammarify is an npm package that safely cleans up text containing misspellings, improper capitalization, and lexical illusions, among other things.

Language: JavaScript - Size: 412 KB - Last synced at: 5 days ago - Pushed at: over 2 years ago - Stars: 73 - Forks: 10

rhnfzl/SqueakyCleanText

Clean your text for statistical ML and language models

Language: Python - Size: 1.04 MB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 4 - Forks: 0

johnjago/deformat

Remove extra whitespace from text.

Language: HTML - Size: 163 KB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 6 - Forks: 1

sharejing/Takin

A Python toolkit for file processing, text cleaning and data splitting.

Language: Python - Size: 2.42 MB - Last synced at: 11 days ago - Pushed at: over 2 years ago - Stars: 32 - Forks: 7

trinker/textclean

Tools for cleaning and normalizing text data

Language: R - Size: 23.8 MB - Last synced at: 17 days ago - Pushed at: over 3 years ago - Stars: 248 - Forks: 26

Youssef155/Sentiment_Analysis

Sentiment Analysis For Restaurant Reviews

Language: Jupyter Notebook - Size: 83 KB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

AndyTheFactory/article-extraction-dataset

Article title, authors, date and body extraction dataset.

Language: HTML - Size: 31.9 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

mim-solutions/mim_nlp

A Python package with ready-to-use models for various NLP tasks and text preprocessing utilities. The implementation allows fine-tuning.

Language: Jupyter Notebook - Size: 413 KB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 2 - Forks: 0

dewanakl/aman

A simple profanity filter using regex.

Language: PHP - Size: 71.3 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 0 - Forks: 2

umapornp/textprepro

👀 Everything Everyway All At Once Text Preprocessing for Natural Language Processing.

Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

mrqadeer/text_prettifier

Python library designed to clean and preprocess text data by removing unwanted elements such as HTML tags, URLs, numbers, special characters, emojis, contractions, and stopwords. It offers flexible functionality, including options to return text in lowercase and as a list of tokens.

Language: Python - Size: 13.7 KB - Last synced at: 29 days ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

alinapetukhova/textcl

Text preprocessing package for use in NLP tasks https://pypi.org/project/textcl/

Language: Python - Size: 913 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 11 - Forks: 4

Shanmukhi1920/Text_Classification

Developed an NLP system using Gradio and Hugging Face to classify disaster tweets with both machine learning (ML) and deep learning (DL) models.

Language: Jupyter Notebook - Size: 8.23 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Ankur3107/nlp_preprocessing

Text preprocessing package that includes cleaning, tokenization, dataset preparation, etc.

Language: JavaScript - Size: 5.19 MB - Last synced at: 8 months ago - Pushed at: over 4 years ago - Stars: 17 - Forks: 7

Infinitode/ValX

ValX is an open-source Python package for text cleaning tasks, including profanity detection and removal. It now also includes sensitive information detection and removal.

Language: Python - Size: 36.1 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

bhattbhavesh91/clean-text-demo

Tutorial on Clean-Text which is a Python package for text cleaning

Language: Jupyter Notebook - Size: 19.5 KB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 1

currentslab/extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

Language: HTML - Size: 421 MB - Last synced at: 11 months ago - Pushed at: over 1 year ago - Stars: 181 - Forks: 19

Gopalkholade/Language-Detection

Language-Detection

Language: Jupyter Notebook - Size: 1.56 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

NaquibAlam/NLP-with-Disaster-Tweets-Kaggle

Contains the code for this competition, https://www.kaggle.com/c/nlp-getting-started/, hosted on Kaggle

Size: 600 KB - Last synced at: 12 months ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 0

net-wizard/end2end-nlp

End-to-end NLP project with Python

Language: Jupyter Notebook - Size: 3.18 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

dataiku/dss-plugin-nlp-preparation

Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼

Language: Python - Size: 17.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 23 - Forks: 6

Aalaa4444/Text_Processing-and-Unique_Word_Extraction_fromHTML

Extract text content from an HTML page, process it, and extract unique words from the processed text. This notebook utilizes various text processing techniques including cleaning, normalization, tokenization, lemmatization or stemming, and stop words removal.

Language: Jupyter Notebook - Size: 12.7 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

enginestein/CleanPhi

A natural language processing framework to clean sentences and texts.

Language: Python - Size: 139 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

lprtk/pyTCTK

Python Text Cleaning ToolKit library (pyTCTK)

Language: Python - Size: 21.5 KB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 1 - Forks: 0

doorooful/CapstoneProject

Capstone Design Project (Senior at Seoultech, ITM)

Language: JavaScript - Size: 76.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

lprtk/nlp-amazon-customer-reviews

Sentiment analysis, text mining, topic modeling & sentiment prediction

Language: Jupyter Notebook - Size: 6.5 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 2

SaurabhPoman96/Resume_Screening_with_NLP

A recommender that suggests suitable candidates to recruiters for a job application, based on candidates' personal information and job preferences. It implements a recommender system using filtering techniques and natural language processing to recommend top jobs based on similarity.

Language: Jupyter Notebook - Size: 1.29 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

jradha11/sentiment-analysis-nlp

Sentiment Analysis of Restaurant Reviews using NLP

Language: Jupyter Notebook - Size: 59.6 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 3 - Forks: 6

1994nikunj/nlp-toolkit-desktop-app

The code is a collection of NLP analyses, including text cleaning, most common words, n-grams generation, co-occurrence matrix generation, wordcloud generation, topic modeling (using Latent Dirichlet Allocation), and general text statistics.

Language: Python - Size: 251 KB - Last synced at: 5 months ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

Aayushpatel007/topicrankpy

A Python package to get useful information from documents using TopicRank Algorithm.

Language: Python - Size: 72.3 KB - Last synced at: about 17 hours ago - Pushed at: almost 2 years ago - Stars: 16 - Forks: 3

SayamAlt/News-Category-Classification

A news category classification model built on fine-tuned BERT that classifies news text into its respective category, i.e., Politics, Business, Technology, or Entertainment.

Language: Jupyter Notebook - Size: 3.69 MB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

odeibarredo/Text-Mining-LOTR-movies-dialogue-

Analysis of the dialogue from the Lord of the Rings movie trilogy.

Language: R - Size: 5.76 MB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 1

sharmaroshan/Text-Classification

A project assignment on classifying texts using clustering techniques. It combines natural language processing with clustering, using K-means to implement the solution.

Language: HTML - Size: 88.9 KB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 1

cwwdaniel/invoice-text-classification

Semantic Enrichment, Data Augmentation and Deep Learning for Boosting Invoice Text Classification Performance: A Novel Natural Language Processing Strategy

Size: 68.4 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

hrushikesh-dhumal/nlp

Boilerplate natural language processing

Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 1

ketchley/tdm-workshop

Workshop materials for 'Fundamentals of Text and Data Mining'

Size: 33 MB - Last synced at: over 1 year ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 1

YashSDholam/Tripadvisor-Hotel-Review-Sentiment-Analysis-using-LSTM-Neural-Network

In this project, I utilized the TripAdvisor Hotel Review dataset from Kaggle to perform sentiment analysis on hotel reviews. The main objective was to build a predictive model using LSTM (Long Short-Term Memory) neural networks to classify hotel reviews as positive or negative based on their textual content.

Language: Jupyter Notebook - Size: 6.48 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

I-Am-Timothy-Williams/RNN-in-NLP

Repo with basic start on Recurrent Neural Networks, Word2Vec, Doc2Vec, TFIDF vectors and NLP basics

Language: Python - Size: 364 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ilos-vigil/scl-2020-product-detection

4th place (top 1%) solution for Shopee Code League 2020 - Product Detection

Language: Jupyter Notebook - Size: 13.3 MB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 7 - Forks: 7

ArbazAnalytics/Cleaning_Text_Data_NLP

Text cleaning steps in natural language processing | One of my college assignments

Language: Jupyter Notebook - Size: 1.95 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

showmik/TidyText

🖹 Offline Text Cleaner and Formatter

Language: C# - Size: 293 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

MD-Ryhan/NLP-Preprocesing

This repository contains code for preprocessing natural language data for use in NLP applications.

Language: Jupyter Notebook - Size: 10.7 KB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

Shawn91/DocTor

A tabular/list/plain text cleaner

Language: JavaScript - Size: 2.22 MB - Last synced at: 6 days ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

ecomp-shONgit/text-normalisation

JS / Python3 / PHP library for working with UTF-8 polytonic Greek and Latin

Language: JavaScript - Size: 330 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 10 - Forks: 1

sagepublishing/text_cleaning

Corpora and scripts for cleaning political science texts. Scripts are translated into transformations that support SAGE Texti.

Language: Python - Size: 30.4 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 5 - Forks: 1

Rumeysakeskin/Preprocessing-Turkish-Text-Data

Preprocessing Turkish text data with cleaning (punctuation, special, accented, and Unicode characters) and normalizing (numbers, abbreviations)

Language: Jupyter Notebook - Size: 14.6 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

Abhayparashar31/crazytext

A simple, easy-to-use text cleaning package for NLP, built in Python. It can clean and analyze your text data in one line of code.

Language: Python - Size: 48.8 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

amansrivastava17/text-preprocess-python

Text preprocessing tools in python.

Language: Python - Size: 39.1 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 24 - Forks: 7

ternaus/ternaus-cleantext

Cleans text as in the CLIP model

Language: Python - Size: 4.88 KB - Last synced at: 9 days ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 1

Bonniface/Text-CLeaning-And-Classification

Text classification is a widely used natural language processing task in different business problems. Given a statement or document, the task involves assigning to it an appropriate category from a pre-defined set of categories. The dataset of choice determines the set of categories. Text classification has applications in emotion classification, n

Language: Jupyter Notebook - Size: 8.34 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

fernandosola/textpp-ptbr

Common Text Pre-Processing for Portuguese

Language: Python - Size: 64.5 KB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 5 - Forks: 1

YongWookHa/kor-text-preprocess

Korean text data preprocess toolkit for NLP

Language: Python - Size: 39.1 KB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 16 - Forks: 2

seroetr/preprocess_seroetr

Preprocess Package for https://bit.ly/intro_nlp (Text cleaning and preprocessing example)

Language: Python - Size: 11.7 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

Jasani-Parth/Emotion-Detection-Form-Text

Language: Jupyter Notebook - Size: 83 KB - Last synced at: almost 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

Nikoletos-K/Offensive-Comment-Classifier

😈😇🗨️ Multiple ways to classify comments as insults or neutral

Language: Jupyter Notebook - Size: 22.4 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 1 - Forks: 0

ilos-vigil/indonesian-document-clustering

Indonesian News and Article Clustering with K-Means++

Language: Jupyter Notebook - Size: 43.3 MB - Last synced at: almost 2 years ago - Pushed at: almost 5 years ago - Stars: 4 - Forks: 0

krisograbek/text-preprocessing

Text preprocessing in Python. Libs include string, re, nltk, spacy, gensim, textblob, unidecode, autocorrect, pyspellchecker

Language: Jupyter Notebook - Size: 81.1 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 1

ilos-vigil/scl-2020-sentiment-analysis

12th place (top 4%) solution for Shopee Code League 2020 - Sentiment Analysis

Language: Jupyter Notebook - Size: 117 KB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 3

garthmortensen/past_present_future

Past, Present, Future work.

Language: Jupyter Notebook - Size: 45.1 MB - Last synced at: almost 2 years ago - Pushed at: almost 4 years ago - Stars: 1 - Forks: 0

chris-bbrs/pdf-merging-and-scraping

PDF merging and scraping for NLP use

Language: Jupyter Notebook - Size: 8.79 KB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

NijatZeynalov/Cleaning-Text-NLTK

Cleaning Text Manually and with NLTK.

Language: Jupyter Notebook - Size: 36.1 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0

mengjie514/Twitter-Sentiment-Analysis-with-Python---Part-II

PhD project - part II

Language: Jupyter Notebook - Size: 20.5 KB - Last synced at: 5 months ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

Related Keywords
text-cleaning (68), nlp (30), text-preprocessing (14), python (13), natural-language-processing (13), text-processing (10), text-mining (9), machine-learning (7), text-classification (6), sentiment-analysis (5), text-normalization (5), text-extraction (5), preprocessing (4), deep-learning (4), web-scraping (4), nlp-machine-learning (3), scraping (3), data-visualization (3), text-cleaner (3), regex (3), text (3), nltk (3), lemmatization (3), nlp-library (3), named-entity-recognition (3), tf-idf (3), bert-embeddings (2), news (2), data-science (2), datasets (2), stemming (2), jupyter-notebook (2), text-analysis (2), profanity-filter (2), nltk-library (2), python3 (2), tokenization (2), bag-of-words (2), language-detection (2), exploratory-data-analysis (2), data-preprocessing (2), tfidf (2), bert (2), feature-engineering (2), topic-modeling (2), sentiment-classification (2), text-tokenization (2), html-to-markdown (2), readability (2), news-crawler (2), news-aggregator (2), llm (2), article-extractor (2), html2text (2), corpus-tools (2), corpus-builder (2), user-generated-content (2), short-text (1), semantic-enrichment (1), multi-class-classification (1), linear-svm (1), glove-embeddings (1), bi-lstm (1), crawler (1), pandas (1), numpy (1), text-data-augmentation (1), wordnet-library (1), python-script (1), digitalhumanities (1), fundamentals (1), historical-data (1), ocr (1), tdm (1), textanalysis (1), lstm-neural-networks (1), prediction (1), rag (1), classification (1), data-analysis (1), n-grams (1), network-visualization (1), wordcloud-generator (1), email-parsing (1), graph-algorithms (1), hierarchical-clustering (1), keyphrase-extraction (1), keywords-extraction (1), network-x (1), pagerank-python (1), phone-parse (1), spacy (1), textrank (1), topicrank (1), fine-tuning-bert (1), model-evaluation (1), frequent-word (1), lordoftherings (1), lotr (1), wordcloud (1)