GitHub topics: extract-text
lifegpc/msg-tool
Tools for export and import scripts
Language: Rust - Size: 130 KB - Last synced at: about 14 hours ago - Pushed at: about 15 hours ago - Stars: 0 - Forks: 0

loganlinn/copy-text-of-selected-area-shortcut
Apple Shortcut to copy text of selected area (screenshot) to clipboard
Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

KINGPIN707/PDF-Highlight-Extractor
A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.
Language: Python - Size: 81.1 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Language: HTML - Size: 930 KB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 320 - Forks: 38

saidsef/tika-document-to-text
Apache Tika to extract text and metadata from various file types, deployable via Docker, Kubernetes or ArgoCD
Language: JavaScript - Size: 611 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5 - Forks: 3

dbashford/textract
node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
Language: HTML - Size: 5.09 MB - Last synced at: about 2 hours ago - Pushed at: over 2 years ago - Stars: 1,669 - Forks: 194

BitMiracle/Docotic.Pdf.Samples
C# and VB.NET samples for Docotic.Pdf library
Language: Visual Basic .NET - Size: 53.6 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 78 - Forks: 39

shelfio/tika-text-extract
Extract text from a document by Apache Tika
Language: TypeScript - Size: 354 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 17 - Forks: 6

datalogics/pdf-rest-api-samples
pdfRest API Toolkit is a REST API service for processing PDF documents, made by developers, for developers. Rapidly integrate PDF workflows with your existing projects and applications, simply and seamlessly. Get started for free in seconds.
Language: Java - Size: 13.7 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 26 - Forks: 10

devmehq/extract-text
node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
Language: HTML - Size: 6.53 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 20 - Forks: 1

rlayers/pawpaw
Text Processing & Segmentation Framework
Language: Python - Size: 2.52 MB - Last synced at: 18 days ago - Pushed at: 3 months ago - Stars: 22 - Forks: 3

BaseMax/ExtractWord
Extract word(s) from the lines of the file.
Language: PHP - Size: 23.4 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 4 - Forks: 1

Zoltanar/Happy-Reader
VNDB explorer and VNR-like text hooker.
Language: C# - Size: 64.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 24 - Forks: 1

sgerwk/pdftoroff
view pdf on X11 and the Linux framebuffer; resize pdf; convert pdf to text, html, TeX, groff
Language: C - Size: 1.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 20 - Forks: 1

ApryseSDK/pdftron-document-search
Build search across multiple documents client-side in your file storage
Language: JavaScript - Size: 73.8 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 46 - Forks: 12

KevM/tikaondotnet
Use the Java Tika text extraction library on the .NET platform
Language: Rich Text Format - Size: 155 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 206 - Forks: 76

opensemanticsearch/open-semantic-etl
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction & named entity recognition) & data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index & linked data graph database
Language: Python - Size: 615 KB - Last synced at: 24 days ago - Pushed at: over 2 years ago - Stars: 268 - Forks: 72

lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
Language: Go - Size: 215 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 98 - Forks: 16

ropensci/rtika
R Interface to Apache Tika
Language: R - Size: 133 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 54 - Forks: 8

zetahernandez/pdf-to-text
Read PDF files in JavaScript
Language: JavaScript - Size: 36.1 KB - Last synced at: 19 days ago - Pushed at: over 5 years ago - Stars: 79 - Forks: 32

ropensci-archive/fulltext 📦
⚠️ ARCHIVED ⚠️ Search across and get full text for OA & closed journals
Language: R - Size: 6.1 MB - Last synced at: 23 days ago - Pushed at: almost 3 years ago - Stars: 271 - Forks: 47

bhattbhavesh91/google-vision-api-for-ocr-demo
Repo containing a small demo that extracts text from images via OCR using the Google Vision API in Python
Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 25 - Forks: 54

ahmedkhemiri95/PDFs-TextExtract
Multiple and Large PDF Documents Text Extraction.
Language: Python - Size: 11.3 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 128 - Forks: 65

islom-pardaboyev/image_to_text_converter
A React-based web app that extracts text from images using Tesseract.js. Upload an image, and the app will process it automatically. Supports manual text extraction as well. 🚀
Language: TypeScript - Size: 36.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

djsudduth/joplin-plugin-paragraph-extractor
Extract specific paragraphs out of Joplin notes using keywords, hashtags or custom tags similar to Logseq block references. Also, refresh extracted notes if source notes change.
Language: TypeScript - Size: 443 KB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

maxim2266/OCR
A collection of tools for OCR (optical character recognition).
Language: C - Size: 72.3 KB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 30 - Forks: 0

Zeeshanahmad4/NLP-Pdf-Minning-Extracting-text-from-pdf
NLP PDF mining: extracting text from PDF
Language: Python - Size: 2.86 MB - Last synced at: 2 months ago - Pushed at: about 5 years ago - Stars: 3 - Forks: 1

OpenJarbas/simple_NER
simple rule based named entity recognition
Language: Python - Size: 2.1 MB - Last synced at: 16 days ago - Pushed at: over 3 years ago - Stars: 43 - Forks: 9

ropensci/unrtf
Wrapper for 'unrtf' utility to extract text from RTF documents
Language: C - Size: 125 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 15 - Forks: 0

nojimage/twitter-text-php (fork of ngnpope/twitter-text-php)
Twitter text processing library (auto linking and extraction of usernames, lists and hashtags). Based on the Ruby and Java implementations by Matt Sanford
Language: PHP - Size: 518 KB - Last synced at: 8 months ago - Pushed at: almost 2 years ago - Stars: 117 - Forks: 21

ropensci/antiword
R wrapper for antiword utility
Language: C - Size: 299 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 55 - Forks: 4

hstanleycrow/EasyPHPArticleExtractor
Free PHP library to extract the main content from an article post or news post, including images and HTML
Language: PHP - Size: 39.1 KB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 2

andreysaf/JS-PDFTextDiff
Extract text from two PDFs and compare any of the differences
Language: CSS - Size: 123 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 1

majd-kontar/pdf-highlight-extractor
Language: Python - Size: 30.3 KB - Last synced at: 3 months ago - Pushed at: about 3 years ago - Stars: 3 - Forks: 0

greed2411/tokyo
tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.
Language: Clojure - Size: 19.5 KB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 18 - Forks: 0

xChivalrouSx/PdfToText
An application to extract text from pdf files
Language: C# - Size: 89.8 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

TwistAtom/ZWSP-Tool
ZWSP-Tool is a powerful toolkit for manipulating zero-width spaces quickly and easily. In particular, it can detect, clean, hide, extract and brute-force text containing zero-width spaces.
Language: Python - Size: 147 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 11 - Forks: 0

AllanCameron/PDFR
An R package to extract text from pdf.
Language: C++ - Size: 15.4 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 33 - Forks: 1

Nanamare/ocr-android
Sample OCR using OpenCV (just a toy project), since 2017.
Language: Java - Size: 150 MB - Last synced at: 4 months ago - Pushed at: over 6 years ago - Stars: 3 - Forks: 1

jalal246/corename
Automatically extracts packages root name for monorepos
Language: JavaScript - Size: 111 KB - Last synced at: 12 days ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

emmeryn/hocr-turtletext
A gem that parses positional text from hOCR output and provides convenience methods to find text.
Language: Ruby - Size: 17.6 KB - Last synced at: 24 days ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

vestiment-sh/zebrapad-scripts
Scripts engineered for R&D to extract text from audio, video, and websites necessary to improve their 'Unfold' app algorithm
Language: Python - Size: 7.81 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

thatcherclough/StegEmbed
A steganography program that can embed text into, and extract it from, the pixels of an image.
Language: Java - Size: 1.77 MB - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 5 - Forks: 2

Boadzie/textman
Optical Character Recognition app that extracts text from images, built with FastAPI, Tailwind CSS and Pytesseract
Language: HTML - Size: 2.28 MB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 7 - Forks: 1

docongminh/scan-document
Extract text from image using Pytesseract
Language: Python - Size: 106 MB - Last synced at: 25 days ago - Pushed at: about 5 years ago - Stars: 2 - Forks: 0

groupdocs-free-consulting/Extract-Text-from-Microsoft-PowerPoint-Presentation-using-JAVA-SDK-of-GroupDocs.Parser-REST-API
This is a GroupDocs free consulting project that helps you extract text from Microsoft PowerPoint presentations (PPTX/PPT) using the GroupDocs.Parser Cloud SDK for Java. https://www.groupdocs.cloud
Language: Java - Size: 260 KB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

dulekang1025/My-Web-Crawler
My web crawling tool using Scrapy.
Language: Python - Size: 4.88 KB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

BaseMax/SmartFilter
A smart filter to keep or remove characters or words in a text. (SOON)
Language: PHP - Size: 102 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 2

golovers/xtract
extract text from html
Language: Go - Size: 3.91 KB - Last synced at: 12 months ago - Pushed at: almost 7 years ago - Stars: 1 - Forks: 0
