An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: extract-text

lifegpc/msg-tool

Tools for export and import scripts

Language: Rust - Size: 130 KB - Last synced at: about 14 hours ago - Pushed at: about 15 hours ago - Stars: 0 - Forks: 0

loganlinn/copy-text-of-selected-area-shortcut

Apple Shortcut to copy text of selected area (screenshot) to clipboard

Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

KINGPIN707/PDF-Highlight-Extractor

A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.

Language: Python - Size: 81.1 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Language: HTML - Size: 930 KB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 320 - Forks: 38

saidsef/tika-document-to-text

Apache Tika to extract text and metadata from various file types, deployable via Docker, Kubernetes or ArgoCD

Language: JavaScript - Size: 611 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 5 - Forks: 3

dbashford/textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

Language: HTML - Size: 5.09 MB - Last synced at: about 2 hours ago - Pushed at: over 2 years ago - Stars: 1,669 - Forks: 194

BitMiracle/Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

Language: Visual Basic .NET - Size: 53.6 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 78 - Forks: 39

shelfio/tika-text-extract

Extract text from a document by Apache Tika

Language: TypeScript - Size: 354 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 17 - Forks: 6

datalogics/pdf-rest-api-samples

pdfRest API Toolkit is a REST API service for processing PDF documents, made by developers, for developers. Rapidly integrate PDF workflows with your existing projects and applications, simply and seamlessly. Get started for free in seconds.

Language: Java - Size: 13.7 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 26 - Forks: 10

devmehq/extract-text

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

Language: HTML - Size: 6.53 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 20 - Forks: 1

rlayers/pawpaw

Text Processing & Segmentation Framework

Language: Python - Size: 2.52 MB - Last synced at: 18 days ago - Pushed at: 3 months ago - Stars: 22 - Forks: 3

BaseMax/ExtractWord

Extract word(s) from the lines of the file.

Language: PHP - Size: 23.4 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 4 - Forks: 1

Zoltanar/Happy-Reader

VNDB explorer and VNR-like text hooker.

Language: C# - Size: 64.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 24 - Forks: 1

sgerwk/pdftoroff

view pdf on X11 and the Linux framebuffer; resize pdf; convert pdf to text, html, TeX, groff

Language: C - Size: 1.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 20 - Forks: 1

ApryseSDK/pdftron-document-search

Build search across multiple documents client-side in your file storage

Language: JavaScript - Size: 73.8 MB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 46 - Forks: 12

KevM/tikaondotnet

Use the Java Tika text extraction library on the .NET platform

Language: Rich Text Format - Size: 155 MB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 206 - Forks: 76

opensemanticsearch/open-semantic-etl

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, with ingestors to a Solr or Elasticsearch index and a linked data graph database

Language: Python - Size: 615 KB - Last synced at: 24 days ago - Pushed at: over 2 years ago - Stars: 268 - Forks: 72

lu4p/cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

Language: Go - Size: 215 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 98 - Forks: 16

ropensci/rtika

R Interface to Apache Tika

Language: R - Size: 133 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 54 - Forks: 8

zetahernandez/pdf-to-text

Read PDF files in JavaScript

Language: JavaScript - Size: 36.1 KB - Last synced at: 19 days ago - Pushed at: over 5 years ago - Stars: 79 - Forks: 32

ropensci-archive/fulltext 📦

⚠️ ARCHIVED ⚠️ Search across and get full text for OA & closed journals

Language: R - Size: 6.1 MB - Last synced at: 23 days ago - Pushed at: almost 3 years ago - Stars: 271 - Forks: 47

bhattbhavesh91/google-vision-api-for-ocr-demo

A small demo that extracts text from images via OCR using the Google Vision API in Python

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: about 2 months ago - Pushed at: almost 4 years ago - Stars: 25 - Forks: 54

ahmedkhemiri95/PDFs-TextExtract

Multiple and Large PDF Documents Text Extraction.

Language: Python - Size: 11.3 MB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 128 - Forks: 65

islom-pardaboyev/image_to_text_converter

A React-based web app that extracts text from images using Tesseract.js. Upload an image, and the app will process it automatically. Supports manual text extraction as well. 🚀

Language: TypeScript - Size: 36.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

djsudduth/joplin-plugin-paragraph-extractor

Extract specific paragraphs out of Joplin notes using keywords, hashtags or custom tags similar to Logseq block references. Also, refresh extracted notes if source notes change.

Language: TypeScript - Size: 443 KB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

maxim2266/OCR

A collection of tools for OCR (optical character recognition).

Language: C - Size: 72.3 KB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 30 - Forks: 0

Zeeshanahmad4/NLP-Pdf-Minning-Extracting-text-from-pdf

NLP PDF mining: extracting text from PDF files

Language: Python - Size: 2.86 MB - Last synced at: 2 months ago - Pushed at: about 5 years ago - Stars: 3 - Forks: 1

OpenJarbas/simple_NER

simple rule based named entity recognition

Language: Python - Size: 2.1 MB - Last synced at: 16 days ago - Pushed at: over 3 years ago - Stars: 43 - Forks: 9

ropensci/unrtf

Wrapper for 'unrtf' utility to extract text from RTF documents

Language: C - Size: 125 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 15 - Forks: 0

nojimage/twitter-text-php (fork of ngnpope/twitter-text-php)

Twitter text processing library (auto linking and extraction of usernames, lists and hashtags). Based on the Ruby and Java implementations by Matt Sanford

Language: PHP - Size: 518 KB - Last synced at: 8 months ago - Pushed at: almost 2 years ago - Stars: 117 - Forks: 21

ropensci/antiword

R wrapper for antiword utility

Language: C - Size: 299 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 55 - Forks: 4

hstanleycrow/EasyPHPArticleExtractor

Free PHP library to extract the main content from an article post or news post, including images and HTML

Language: PHP - Size: 39.1 KB - Last synced at: 1 day ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 2

andreysaf/JS-PDFTextDiff

Extract text from two PDFs and compare any of the differences

Language: CSS - Size: 123 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 1

majd-kontar/pdf-highlight-extractor

Language: Python - Size: 30.3 KB - Last synced at: 3 months ago - Pushed at: about 3 years ago - Stars: 3 - Forks: 0

greed2411/tokyo

tokyo, a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.

Language: Clojure - Size: 19.5 KB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 18 - Forks: 0

xChivalrouSx/PdfToText

An application to extract text from pdf files

Language: C# - Size: 89.8 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

TwistAtom/ZWSP-Tool

ZWSP-Tool is a toolkit for manipulating zero-width spaces quickly and easily. In particular, it can detect, clean, hide, extract, and brute-force text containing zero-width spaces.

Language: Python - Size: 147 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 11 - Forks: 0

AllanCameron/PDFR

An R package to extract text from pdf.

Language: C++ - Size: 15.4 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 33 - Forks: 1

Nanamare/ocr-android

Sample OCR using OpenCV (just a toy project), since 2017.

Language: Java - Size: 150 MB - Last synced at: 4 months ago - Pushed at: over 6 years ago - Stars: 3 - Forks: 1

jalal246/corename

Automatically extracts package root names for monorepos

Language: JavaScript - Size: 111 KB - Last synced at: 12 days ago - Pushed at: almost 3 years ago - Stars: 2 - Forks: 0

emmeryn/hocr-turtletext

A gem that parses positional text from hOCR output and provides convenience methods to find text.

Language: Ruby - Size: 17.6 KB - Last synced at: 24 days ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

vestiment-sh/zebrapad-scripts

Scripts engineered for R&D to extract text from audio, video, and websites necessary to improve their 'Unfold' app algorithm

Language: Python - Size: 7.81 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

thatcherclough/StegEmbed

A steganography program that can embed text into, and extract it from, the pixels of an image.

Language: Java - Size: 1.77 MB - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 5 - Forks: 2

Boadzie/textman

Optical Character Recognition app that extracts text from images, built with FastAPI, Tailwind CSS and Pytesseract

Language: HTML - Size: 2.28 MB - Last synced at: over 2 years ago - Pushed at: over 4 years ago - Stars: 7 - Forks: 1

docongminh/scan-document

Extract text from image using Pytesseract

Language: Python - Size: 106 MB - Last synced at: 25 days ago - Pushed at: about 5 years ago - Stars: 2 - Forks: 0

groupdocs-free-consulting/Extract-Text-from-Microsoft-PowerPoint-Presentation-using-JAVA-SDK-of-GroupDocs.Parser-REST-API

This is a GroupDocs free consulting project that helps you extract text from Microsoft PowerPoint presentations (PPTX/PPT) using the GroupDocs.Parser Cloud SDK for Java. https://www.groupdocs.cloud

Language: Java - Size: 260 KB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

dulekang1025/My-Web-Crawler

My web crawling tool using Scrapy.

Language: Python - Size: 4.88 KB - Last synced at: about 2 years ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

BaseMax/SmartFilter

A smart filter to keep or remove characters or words in a text. (SOON)

Language: PHP - Size: 102 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 2

golovers/xtract

extract text from html

Language: Go - Size: 3.91 KB - Last synced at: 12 months ago - Pushed at: almost 7 years ago - Stars: 1 - Forks: 0