Topic: "text-extraction"
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Language: Python - Size: 33.8 MB - Last synced at: about 2 hours ago - Pushed at: about 1 month ago - Stars: 4,151 - Forks: 289

miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
Language: Python - Size: 1.57 MB - Last synced at: 6 months ago - Pushed at: 11 months ago - Stars: 3,518 - Forks: 530

unidoc/unipdf
Golang PDF library for creating and processing PDF files (pure go)
Language: Go - Size: 123 MB - Last synced at: 12 days ago - Pushed at: 28 days ago - Stars: 2,746 - Forks: 264

Goldziher/kreuzberg
A text extraction library supporting PDFs, images, office documents and more
Language: Python - Size: 12.3 MB - Last synced at: 6 days ago - Pushed at: 10 days ago - Stars: 1,761 - Forks: 61

chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Language: Python - Size: 31.5 MB - Last synced at: 11 days ago - Pushed at: 14 days ago - Stars: 1,572 - Forks: 242

whitelok/image-text-localization-recognition
A general list of resources (papers and implementations) for scene text localization and recognition in images.
Size: 333 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 949 - Forks: 233

miso-belica/jusText
Heuristic based boilerplate removal tool
Language: Python - Size: 1.02 MB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 765 - Forks: 83

unidoc/unidoc
This repository has moved! https://github.com/unidoc/unipdf
Language: Go - Size: 29.3 MB - Last synced at: 11 months ago - Pushed at: almost 6 years ago - Stars: 705 - Forks: 87

ICIJ/datashare
A self-hosted search engine for documents.
Language: Java - Size: 395 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 626 - Forks: 57

ropensci/pdftools
Text Extraction, Rendering and Converting of PDF Documents
Language: C++ - Size: 1.05 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 529 - Forks: 71

cdown/srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
Language: Python - Size: 406 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 499 - Forks: 45

shixzie/nlp 📦
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Language: Go - Size: 50.8 KB - Last synced at: 9 months ago - Pushed at: over 7 years ago - Stars: 389 - Forks: 34

flairNLP/fundus
A very simple news crawler with a funny name
Language: Python - Size: 21.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 368 - Forks: 88

iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
Language: Python - Size: 374 KB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 340 - Forks: 45

pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Language: HTML - Size: 930 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 314 - Forks: 39

py-pdf/benchmarks
Benchmarking PDF libraries
Language: Python - Size: 3.73 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 269 - Forks: 15

bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Language: HTML - Size: 604 KB - Last synced at: 12 days ago - Pushed at: 12 months ago - Stars: 204 - Forks: 25

weareprestatech/hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Language: Python - Size: 16.5 MB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 186 - Forks: 9

SapienzaNLP/extend
Entity Disambiguation as text extraction (ACL 2022)
Language: Python - Size: 71.3 KB - Last synced at: 4 months ago - Pushed at: about 3 years ago - Stars: 177 - Forks: 13

skylander86/lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
Language: Python - Size: 111 MB - Last synced at: 5 months ago - Pushed at: about 7 years ago - Stars: 173 - Forks: 42

vsymbol/CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Language: Python - Size: 2.87 MB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 157 - Forks: 77

archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Language: Scala - Size: 39.5 MB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 143 - Forks: 32

sambitdash/PDFIO.jl
PDF Reader Library for Native Julia.
Language: Julia - Size: 24.5 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 131 - Forks: 14

vaites/php-apache-tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Language: PHP - Size: 13.8 MB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 116 - Forks: 23

victorqribeiro/ocr
Simple app to extract text from pictures using Tesseract
Language: HTML - Size: 256 KB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 105 - Forks: 9

lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
Language: Go - Size: 215 KB - Last synced at: 12 days ago - Pushed at: over 1 year ago - Stars: 98 - Forks: 16

jmriebold/BoilerPy3 (fork of mercuree/BoilerPy)
Python port of Boilerpipe library
Language: Python - Size: 188 KB - Last synced at: 2 months ago - Pushed at: 8 months ago - Stars: 86 - Forks: 18

docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
Language: C++ - Size: 35.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 81 - Forks: 18

gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
Language: TypeScript - Size: 2.15 MB - Last synced at: 11 days ago - Pushed at: 9 months ago - Stars: 73 - Forks: 7

iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Language: Python - Size: 69.3 KB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 69 - Forks: 7

nainiayoub/pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
Language: Python - Size: 24.4 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 69 - Forks: 42

JonathanRaiman/wikipedia_ner
📖 Labeled examples from wiki dumps in Python
Language: Jupyter Notebook - Size: 108 KB - Last synced at: 15 days ago - Pushed at: over 8 years ago - Stars: 67 - Forks: 7

abhinaba-ghosh/any-text
Get text content from any file
Language: JavaScript - Size: 226 KB - Last synced at: 8 days ago - Pushed at: 8 months ago - Stars: 65 - Forks: 11

ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
Language: TeX - Size: 505 MB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 65 - Forks: 11

iscc/mobi
python based software to unpack kindlegen generated ebooks
Language: Python - Size: 761 KB - Last synced at: 13 days ago - Pushed at: about 2 years ago - Stars: 62 - Forks: 9

rajesh-bhat/spark-ai-summit-2020-text-extraction
Language: Jupyter Notebook - Size: 105 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 58 - Forks: 33

fourdigits/wagtail_textract
Text extraction for Wagtail document search
Language: Python - Size: 1.02 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 34 - Forks: 14

pd3f/pd3f-core
📑 Python Package to reconstruct the original continuous text from PDFs with language models
Language: Jupyter Notebook - Size: 1.31 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 8

Goldziher/html-to-markdown
HTML to markdown converter
Language: Python - Size: 383 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 30 - Forks: 2

hscspring/pnlp
NLP pre-/post-processing tools.
Language: Python - Size: 106 KB - Last synced at: 12 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 6

dwatteau/scummtr
Fan translation tools for LucasArts SCUMM games
Language: C++ - Size: 559 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 28 - Forks: 4

spences10/mcp-jinaai-reader
🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
Language: JavaScript - Size: 116 KB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 25 - Forks: 3

Azure-Samples/doc-intelligence-in-a-box
The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework by Microsoft Customer Engineers and Architects, ensures quality, efficiency, and rapid deployment of AI and ML solutions across various industries.
Language: Bicep - Size: 6.36 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 25 - Forks: 7

Altabeh/tesseract-ocr-wrapper
This is a highly efficient python wrapper for tesseract-ocr.
Language: Python - Size: 26.4 KB - Last synced at: 5 months ago - Pushed at: almost 3 years ago - Stars: 20 - Forks: 4

ingmarboeschen/JATSdecoder
A text extraction and manipulation toolset for NISO-JATS coded XML files
Language: R - Size: 2.94 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 19 - Forks: 1

OwenOrcan/YiraBot-Crawler
YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.
Language: Python - Size: 221 KB - Last synced at: 24 days ago - Pushed at: 5 months ago - Stars: 19 - Forks: 0

mknz/mirusan 📦
A PDF collection reader with built-in full-text search engine
Language: JavaScript - Size: 2.71 MB - Last synced at: over 1 year ago - Pushed at: almost 8 years ago - Stars: 19 - Forks: 0

amenezes/aiopytesseract
A Python asyncio wrapper for Tesseract-OCR.
Language: Python - Size: 2.13 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 18 - Forks: 6

Aman-zishan/textextractor2.0
🔥 This web app extracts text in an image.
Language: JavaScript - Size: 53.9 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 18 - Forks: 12

greed2411/tokyo
tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.
Language: Clojure - Size: 19.5 KB - Last synced at: 20 days ago - Pushed at: almost 5 years ago - Stars: 18 - Forks: 0

arshad-yaseen/ocr-llm
⚡️ Fast, ultra-accurate text extraction from any image or PDF—including challenging ones—with structured markdown output powered by vision models.
Language: TypeScript - Size: 2.88 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 17 - Forks: 2

dotfurther/OpenDiscoverSDK
.NET 8 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.
Language: C# - Size: 170 MB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 16 - Forks: 0

ad-freiburg/pdftotext-plus-plus
A fast and accurate command line tool for extracting text from PDF files.
Language: C++ - Size: 18.2 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 16 - Forks: 0

Arxa/video_text_detection
Bachelor Thesis | Text extraction from complex video scenes
Language: Java - Size: 197 MB - Last synced at: 6 months ago - Pushed at: about 6 years ago - Stars: 15 - Forks: 3

shelfio/apache-tika-lambda-layer
AWS Lambda layer containing latest version of Apache Tika
Language: Shell - Size: 327 MB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 14 - Forks: 6

asepmaulanaismail/pdf-to-txt-python
Simple pdf to text with python using PDFtk and PyPDF2
Language: Python - Size: 550 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 9

MRGRD56/textractor-translator
Translate visual novels in real time
Language: TypeScript - Size: 2.19 MB - Last synced at: 12 days ago - Pushed at: 5 months ago - Stars: 12 - Forks: 0

solworktech/zaje
Highlight/colourise command output, logfiles (and anything else really) based on regex pattern matching
Language: Go - Size: 8.07 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11 - Forks: 1

dotfurther/OpenDiscoverPlatformCaseStudy
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.
Size: 5.93 MB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 11 - Forks: 0

andrealenzi11/py-poppleract
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents
Language: Python - Size: 202 KB - Last synced at: 25 days ago - Pushed at: 6 months ago - Stars: 10 - Forks: 2

bmoscon/ArticleParse
Heuristic text extraction from news sites in Python3
Language: Python - Size: 29.3 KB - Last synced at: 5 days ago - Pushed at: over 7 years ago - Stars: 10 - Forks: 4

funinkina/Gnome-OCR-Screenshot
A simple Python script to extract text from screenshots in the GNOME desktop environment using pytesseract.
Language: Python - Size: 453 KB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 9 - Forks: 0

TYPO3-Solr/ext-tika
A TYPO3 CMS extension that provides Apache Tika functionality
Language: PHP - Size: 2.16 MB - Last synced at: 16 days ago - Pushed at: 4 months ago - Stars: 9 - Forks: 29

heussd/pdftotext-go
Extract texts + their page numbers from PDF
Language: Go - Size: 229 KB - Last synced at: 8 months ago - Pushed at: 9 months ago - Stars: 8 - Forks: 2

yoshihikoueno/pdfminer-layout-scanner (fork of dpapathanasiou/pdfminer-layout-scanner)
A more complete example of programming with PDFMiner, which continues where the default documentation stops
Language: Python - Size: 26.4 KB - Last synced at: 5 days ago - Pushed at: over 5 years ago - Stars: 8 - Forks: 4

IDisposable/IFilterExtractor
A simple component to extract just the text from any file that has an IFilter installed. Available as a C++ COM component and as a C# .NET library.
Language: C++ - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: about 8 years ago - Stars: 8 - Forks: 5

globality-corp/deboiler
Deboiler - Boilerplate Identification and Removal
Language: Python - Size: 1.42 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 0

ssciwr/AMMICO
AI-based Media and Misinformation Content Analysis Tool: Analyze text and images
Language: Python - Size: 93.9 MB - Last synced at: 8 days ago - Pushed at: 10 days ago - Stars: 6 - Forks: 3

AndyTheFactory/article-extraction-dataset
Article title, authors, date and body extraction dataset.
Language: HTML - Size: 31.9 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

apyhub/apyhub.js
ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.
Language: TypeScript - Size: 695 KB - Last synced at: 6 days ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 0

Fisseha-Estifanos/LLM-API
A repository to demonstrate some of the concepts behind large language models, transformer (foundation) models, in-context learning, and prompt engineering using open source large language models like Bloom and co:here.
Language: Jupyter Notebook - Size: 327 KB - Last synced at: 13 days ago - Pushed at: over 2 years ago - Stars: 6 - Forks: 1

manofstrong/sitescrapper
A PHP library to Scrape Websites from their Sitemaps and Extract Relevant Content from the Webpage and Upload to a Database
Language: PHP - Size: 112 KB - Last synced at: 7 months ago - Pushed at: over 5 years ago - Stars: 6 - Forks: 1

mkalus/tika-page-extractor 📦
Tika per page PDF extractor server returning content as JSON.
Language: Java - Size: 19.5 KB - Last synced at: about 2 years ago - Pushed at: about 9 years ago - Stars: 6 - Forks: 3

lihanghang/TecRoom
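Online summary documentation of a tech stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.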
Language: HTML - Size: 33.8 MB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 5 - Forks: 2

senurah/java-tess4J-ocr
Tess4J CLI OCR Tool is a command-line application that extracts text from images and PDFs using the Tess4J library, with support for multiple languages. The extracted text is automatically copied to the clipboard for easy access.
Language: Java - Size: 9.96 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 5 - Forks: 0

Kind-Unes/Flutter-Translation-Application
Flutter Android & iOS Translation Education Application. It utilizes ObjectBox as a local database and Google API for translations, and is powered by GEMINI-ULTRA for AI capabilities
Size: 4.24 MB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 5 - Forks: 0

myanmartools/myanmar-text-extractor-js
Burmese language (Myanmar text) extractor JavaScript library for word segmentation, text extraction or syllable break.
Language: TypeScript - Size: 1.12 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 2

YuDavidCao/Automation-manager
Automation manager
Language: Python - Size: 342 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 0

nayyhah/PDFAutomation-OCRTextRecognition
PDF Automation - OCR Text Recognition
Language: JavaScript - Size: 12.1 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 5 - Forks: 1

rajdeep2804/Automated_Invoice_Processing
The number of types of physical documents being digitized is on the increase. Medical bills, bank documents and personal documents are examples of such documents. The objective of this repo is to implement and understand such use cases, using the example of extracting text information from invoice receipts.
Language: Jupyter Notebook - Size: 16.4 MB - Last synced at: 11 days ago - Pushed at: about 3 years ago - Stars: 5 - Forks: 0

carrliitos/NLPInformationExtraction 📦
My 2020 project focusing on NLP - Information Extraction
Language: Python - Size: 115 MB - Last synced at: almost 2 years ago - Pushed at: about 4 years ago - Stars: 5 - Forks: 1

scotthaleen/spark-hdfs-tika
spark hdfs tika
Language: Scala - Size: 1.95 KB - Last synced at: about 2 years ago - Pushed at: over 8 years ago - Stars: 5 - Forks: 3

alokkumary2j/RapidAutomaticKPExtractionRAKE
This repository contains my experiments with RAKE and its variants. RAKE is one of the most popular unsupervised approaches for automatically extracting key-phrases/keywords from an unstructured data source like reviews, news, articles, documents etc.
Language: Jupyter Notebook - Size: 2.33 MB - Last synced at: over 1 year ago - Pushed at: almost 9 years ago - Stars: 5 - Forks: 0

FileFormatInfo/ff-pdf2txt
Simple server to extract text from a PDF
Language: Java - Size: 6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4 - Forks: 2

PT-Perkasa-Pilar-Utama/ppu-pdf
PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvas.
Language: TypeScript - Size: 38.4 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 4 - Forks: 0

GURSV/URL-summ
A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.
Language: Python - Size: 112 KB - Last synced at: 22 days ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

atahanuz/yt2text
Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition model.
Size: 10.7 KB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 1

anhthuan1999/PhoBERT-Extraction
Extract vectors by sentences and words from one layer or by concatenating several layers
Language: Python - Size: 5.86 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 4 - Forks: 0

rosette-api-community/text-embeddings-sample
A little python code to show how to get similarity between word embeddings returned from the Rosette API's new /text-embedding endpoint.
Language: Python - Size: 2.93 KB - Last synced at: about 2 months ago - Pushed at: about 8 years ago - Stars: 4 - Forks: 0

GateNLP/wpextract
Create datasets from WordPress sites for research or archiving
Language: Python - Size: 1.94 MB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 3 - Forks: 0

nguyen-tho/ID-card-extract-module
Language: Python - Size: 59.8 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 0

WonhoZhung/ee474
EE474 Term Project
Language: Python - Size: 457 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 1

Nishant2018/Text-Extraction-OCR-OpenCV
Text extraction is the process of automatically extracting text from images or documents. Optical Character Recognition (OCR) is a technology that enables computers to convert images of text into machine-readable text.
Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 3 - Forks: 0

Shangamesh2805/TECH_OCULAR-
Smart eyeglasses with an AI-based audio transcriber for the visually impaired. They help visually challenged people read books and novels by themselves, even without a guide.
Language: Python - Size: 572 KB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 2

KalyanM45/Optical-Character-Recognition
This project is a Python-based Optical Character Recognition (OCR) application using the EasyOCR library. It provides a convenient way to detect and recognize text in images, making it useful for a wide range of applications such as document processing, image captioning, and text extraction.
Language: Jupyter Notebook - Size: 720 KB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 0

Zeeshanahmad4/NLP-Pdf-Minning-Extracting-text-from-pdf
NLP PDF mining: extracting text from PDF
Language: Python - Size: 2.86 MB - Last synced at: 19 days ago - Pushed at: about 5 years ago - Stars: 3 - Forks: 1

RocktimRajkumar/CV-Grader
🏃 CV parser is a compiler or interpreter that converts the unstructured form of data into a structured form.
Language: Python - Size: 182 KB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 3 - Forks: 0

simonkeng/pdf_parser
Textual & numeric data extraction with Python using textract, easily shareable with Docker.
Language: C - Size: 15.6 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 1

lpuettmann/kindle-to-md
Transform Kindle clippings to Markdown, to be displayed on a Jekyll website
Language: Python - Size: 3.91 KB - Last synced at: about 2 years ago - Pushed at: over 8 years ago - Stars: 3 - Forks: 0

rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
Language: Python - Size: 161 KB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 2 - Forks: 0
