An open API service providing repository metadata for many open source software ecosystems.

Topic: "text-extraction"

adbar/trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Language: Python - Size: 33.8 MB - Last synced at: about 2 hours ago - Pushed at: about 1 month ago - Stars: 4,151 - Forks: 289

miso-belica/sumy

Module for automatic summarization of text documents and HTML pages.

Language: Python - Size: 1.57 MB - Last synced at: 6 months ago - Pushed at: 11 months ago - Stars: 3,518 - Forks: 530

unidoc/unipdf

Golang PDF library for creating and processing PDF files (pure go)

Language: Go - Size: 123 MB - Last synced at: 12 days ago - Pushed at: 28 days ago - Stars: 2,746 - Forks: 264

Goldziher/kreuzberg

A text extraction library supporting PDFs, images, office documents and more

Language: Python - Size: 12.3 MB - Last synced at: 6 days ago - Pushed at: 10 days ago - Stars: 1,761 - Forks: 61

chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Language: Python - Size: 31.5 MB - Last synced at: 11 days ago - Pushed at: 14 days ago - Stars: 1,572 - Forks: 242

whitelok/image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

Size: 333 KB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 949 - Forks: 233

miso-belica/jusText

Heuristic based boilerplate removal tool

Language: Python - Size: 1.02 MB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 765 - Forks: 83

unidoc/unidoc

This repository has moved! https://github.com/unidoc/unipdf

Language: Go - Size: 29.3 MB - Last synced at: 11 months ago - Pushed at: almost 6 years ago - Stars: 705 - Forks: 87

ICIJ/datashare

A self-hosted search engine for documents.

Language: Java - Size: 395 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 626 - Forks: 57

ropensci/pdftools

Text Extraction, Rendering and Converting of PDF Documents

Language: C++ - Size: 1.05 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 529 - Forks: 71

cdown/srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

Language: Python - Size: 406 KB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 499 - Forks: 45

shixzie/nlp 📦

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

Language: Go - Size: 50.8 KB - Last synced at: 9 months ago - Pushed at: over 7 years ago - Stars: 389 - Forks: 34

flairNLP/fundus

A very simple news crawler with a funny name

Language: Python - Size: 21.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 368 - Forks: 88

iamarunbrahma/vision-parse

Parse PDFs into markdown using Vision LLMs

Language: Python - Size: 374 KB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 340 - Forks: 45

pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Language: HTML - Size: 930 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 314 - Forks: 39

py-pdf/benchmarks

Benchmarking PDF libraries

Language: Python - Size: 3.73 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 269 - Forks: 15

bookieio/breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)

Language: HTML - Size: 604 KB - Last synced at: 12 days ago - Pushed at: 12 months ago - Stars: 204 - Forks: 25

weareprestatech/hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

Language: Python - Size: 16.5 MB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 186 - Forks: 9

SapienzaNLP/extend

Entity Disambiguation as text extraction (ACL 2022)

Language: Python - Size: 71.3 KB - Last synced at: 4 months ago - Pushed at: about 3 years ago - Stars: 177 - Forks: 13

skylander86/lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

Language: Python - Size: 111 MB - Last synced at: 5 months ago - Pushed at: about 7 years ago - Stars: 173 - Forks: 42

vsymbol/CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

Language: Python - Size: 2.87 MB - Last synced at: 18 days ago - Pushed at: over 2 years ago - Stars: 157 - Forks: 77

archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Language: Scala - Size: 39.5 MB - Last synced at: 7 days ago - Pushed at: about 1 year ago - Stars: 143 - Forks: 32

sambitdash/PDFIO.jl

PDF Reader Library for Native Julia.

Language: Julia - Size: 24.5 MB - Last synced at: 6 days ago - Pushed at: 2 months ago - Stars: 131 - Forks: 14

vaites/php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

Language: PHP - Size: 13.8 MB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 116 - Forks: 23

victorqribeiro/ocr

Simple app to extract text from pictures using Tesseract

Language: HTML - Size: 256 KB - Last synced at: about 1 month ago - Pushed at: almost 4 years ago - Stars: 105 - Forks: 9

lu4p/cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure go.

Language: Go - Size: 215 KB - Last synced at: 12 days ago - Pushed at: over 1 year ago - Stars: 98 - Forks: 16

jmriebold/BoilerPy3 (fork of mercuree/BoilerPy)

Python port of Boilerpipe library

Language: Python - Size: 188 KB - Last synced at: 2 months ago - Pushed at: 8 months ago - Stars: 86 - Forks: 18

docwire/docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality

Language: C++ - Size: 35.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 81 - Forks: 18

gamemaker1/office-text-extractor

Yet another library to extract text from MS Office and PDF files

Language: TypeScript - Size: 2.15 MB - Last synced at: 11 days ago - Pushed at: 9 months ago - Stars: 73 - Forks: 7

iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

Language: Python - Size: 69.3 KB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 69 - Forks: 7

nainiayoub/pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

Language: Python - Size: 24.4 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 69 - Forks: 42

JonathanRaiman/wikipedia_ner

📖 Labeled examples from wiki dumps in Python

Language: Jupyter Notebook - Size: 108 KB - Last synced at: 15 days ago - Pushed at: over 8 years ago - Stars: 67 - Forks: 7

abhinaba-ghosh/any-text

Get text content from any file

Language: JavaScript - Size: 226 KB - Last synced at: 8 days ago - Pushed at: 8 months ago - Stars: 65 - Forks: 11

ckorzen/pdf-text-extraction-benchmark

A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.

Language: TeX - Size: 505 MB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 65 - Forks: 11

iscc/mobi

python based software to unpack kindlegen generated ebooks

Language: Python - Size: 761 KB - Last synced at: 13 days ago - Pushed at: about 2 years ago - Stars: 62 - Forks: 9

rajesh-bhat/spark-ai-summit-2020-text-extraction

Language: Jupyter Notebook - Size: 105 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 58 - Forks: 33

fourdigits/wagtail_textract

Text extraction for Wagtail document search

Language: Python - Size: 1.02 MB - Last synced at: 7 days ago - Pushed at: over 1 year ago - Stars: 34 - Forks: 14

pd3f/pd3f-core

📑 Python Package to reconstruct the original continuous text from PDFs with language models

Language: Jupyter Notebook - Size: 1.31 MB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 32 - Forks: 8

Goldziher/html-to-markdown

HTML to markdown converter

Language: Python - Size: 383 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 30 - Forks: 2

hscspring/pnlp

NLP pre-/post-processing tools.

Language: Python - Size: 106 KB - Last synced at: 12 days ago - Pushed at: 21 days ago - Stars: 29 - Forks: 6

dwatteau/scummtr

Fan translation tools for LucasArts SCUMM games

Language: C++ - Size: 559 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 28 - Forks: 4

spences10/mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

Language: JavaScript - Size: 116 KB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 25 - Forks: 3

Azure-Samples/doc-intelligence-in-a-box

The Doc Intelligence in-a-Box project leverages Azure AI Document Intelligence to extract data from PDF forms and store the data in an Azure Cosmos DB. This solution, part of the AI-in-a-Box framework by Microsoft Customer Engineers and Architects, ensures quality, efficiency, and rapid deployment of AI and ML solutions across various industries.

Language: Bicep - Size: 6.36 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 25 - Forks: 7

Altabeh/tesseract-ocr-wrapper

This is a highly efficient python wrapper for tesseract-ocr.

Language: Python - Size: 26.4 KB - Last synced at: 5 months ago - Pushed at: almost 3 years ago - Stars: 20 - Forks: 4

ingmarboeschen/JATSdecoder

A text extraction and manipulation toolset for NISO-JATS coded XML files

Language: R - Size: 2.94 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 19 - Forks: 1

OwenOrcan/YiraBot-Crawler

YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.

Language: Python - Size: 221 KB - Last synced at: 24 days ago - Pushed at: 5 months ago - Stars: 19 - Forks: 0

mknz/mirusan 📦

A PDF collection reader with built-in full-text search engine

Language: JavaScript - Size: 2.71 MB - Last synced at: over 1 year ago - Pushed at: almost 8 years ago - Stars: 19 - Forks: 0

amenezes/aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

Language: Python - Size: 2.13 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 18 - Forks: 6

Aman-zishan/textextractor2.0

🔥 This web app extracts text from an image.

Language: JavaScript - Size: 53.9 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 18 - Forks: 12

greed2411/tokyo

tokyo, a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and extracts the text 💪.

Language: Clojure - Size: 19.5 KB - Last synced at: 20 days ago - Pushed at: almost 5 years ago - Stars: 18 - Forks: 0

arshad-yaseen/ocr-llm

⚡️ Fast, ultra-accurate text extraction from any image or PDF—including challenging ones—with structured markdown output powered by vision models.

Language: TypeScript - Size: 2.88 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 17 - Forks: 2

dotfurther/OpenDiscoverSDK

.NET 8 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.

Language: C# - Size: 170 MB - Last synced at: 9 days ago - Pushed at: 4 months ago - Stars: 16 - Forks: 0

ad-freiburg/pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.

Language: C++ - Size: 18.2 MB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 16 - Forks: 0

Arxa/video_text_detection

Bachelor Thesis | Text extraction from complex video scenes

Language: Java - Size: 197 MB - Last synced at: 6 months ago - Pushed at: about 6 years ago - Stars: 15 - Forks: 3

shelfio/apache-tika-lambda-layer

AWS Lambda layer containing latest version of Apache Tika

Language: Shell - Size: 327 MB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 14 - Forks: 6

asepmaulanaismail/pdf-to-txt-python

Simple pdf to text with python using PDFtk and PyPDF2

Language: Python - Size: 550 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 9

MRGRD56/textractor-translator

Translate visual novels in real time

Language: TypeScript - Size: 2.19 MB - Last synced at: 12 days ago - Pushed at: 5 months ago - Stars: 12 - Forks: 0

solworktech/zaje

Highlight/colourise command output, logfiles (and anything else really) based on regex pattern matching

Language: Go - Size: 8.07 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11 - Forks: 1

dotfurther/OpenDiscoverPlatformCaseStudy

Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.

Size: 5.93 MB - Last synced at: 9 days ago - Pushed at: 11 months ago - Stars: 11 - Forks: 0

andrealenzi11/py-poppleract

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents

Language: Python - Size: 202 KB - Last synced at: 25 days ago - Pushed at: 6 months ago - Stars: 10 - Forks: 2

bmoscon/ArticleParse

Heuristic text extraction from news sites in Python3

Language: Python - Size: 29.3 KB - Last synced at: 5 days ago - Pushed at: over 7 years ago - Stars: 10 - Forks: 4

funinkina/Gnome-OCR-Screenshot

A simple Python script to extract text from a screenshot in the GNOME desktop environment using pytesseract.

Language: Python - Size: 453 KB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 9 - Forks: 0

TYPO3-Solr/ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality

Language: PHP - Size: 2.16 MB - Last synced at: 16 days ago - Pushed at: 4 months ago - Stars: 9 - Forks: 29

heussd/pdftotext-go

Extract texts + their page numbers from PDF

Language: Go - Size: 229 KB - Last synced at: 8 months ago - Pushed at: 9 months ago - Stars: 8 - Forks: 2

yoshihikoueno/pdfminer-layout-scanner (fork of dpapathanasiou/pdfminer-layout-scanner)

A more complete example of programming with PDFMiner, which continues where the default documentation stops

Language: Python - Size: 26.4 KB - Last synced at: 5 days ago - Pushed at: over 5 years ago - Stars: 8 - Forks: 4

IDisposable/IFilterExtractor

A simple component to extract just the text from any file that has an IFilter installed. Available as a C++ COM component and as a C# .NET library.

Language: C++ - Size: 42 KB - Last synced at: about 1 year ago - Pushed at: about 8 years ago - Stars: 8 - Forks: 5

globality-corp/deboiler

Deboiler - Boilerplate Identification and Removal

Language: Python - Size: 1.42 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 0

ssciwr/AMMICO

AI-based Media and Misinformation Content Analysis Tool: Analyze text and images

Language: Python - Size: 93.9 MB - Last synced at: 8 days ago - Pushed at: 10 days ago - Stars: 6 - Forks: 3

AndyTheFactory/article-extraction-dataset

Article title, authors, date and body extraction dataset.

Language: HTML - Size: 31.9 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

apyhub/apyhub.js

ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.

Language: TypeScript - Size: 695 KB - Last synced at: 6 days ago - Pushed at: almost 2 years ago - Stars: 6 - Forks: 0

Fisseha-Estifanos/LLM-API

A repository to demonstrate some of the concepts behind large language models, transformer (foundation) models, in-context learning, and prompt engineering using open source large language models like Bloom and co:here.

Language: Jupyter Notebook - Size: 327 KB - Last synced at: 13 days ago - Pushed at: over 2 years ago - Stars: 6 - Forks: 1

manofstrong/sitescrapper

A PHP library to Scrape Websites from their Sitemaps and Extract Relevant Content from the Webpage and Upload to a Database

Language: PHP - Size: 112 KB - Last synced at: 7 months ago - Pushed at: over 5 years ago - Stars: 6 - Forks: 1

mkalus/tika-page-extractor 📦

Tika per page PDF extractor server returning content as JSON.

Language: Java - Size: 19.5 KB - Last synced at: about 2 years ago - Pushed at: about 9 years ago - Stars: 6 - Forks: 3

lihanghang/TecRoom

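Online summary documentation of a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.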

Language: HTML - Size: 33.8 MB - Last synced at: 11 days ago - Pushed at: 5 months ago - Stars: 5 - Forks: 2

senurah/java-tess4J-ocr

Tess4J CLI OCR Tool is a command-line application that extracts text from images and PDFs using the Tess4J library, with support for multiple languages. The extracted text is automatically copied to the clipboard for easy access.

Language: Java - Size: 9.96 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 5 - Forks: 0

Kind-Unes/Flutter-Translation-Application

Flutter Android & iOS Translation Education Application. It utilizes ObjectBox as a local database and Google API for translations, and is powered by GEMINI-ULTRA for AI capabilities

Size: 4.24 MB - Last synced at: 11 days ago - Pushed at: 8 months ago - Stars: 5 - Forks: 0

myanmartools/myanmar-text-extractor-js

Burmese language (Myanmar text) extractor JavaScript library for word segmentation, text extraction or syllable break.

Language: TypeScript - Size: 1.12 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 2

YuDavidCao/Automation-manager

Automation manager

Language: Python - Size: 342 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 0

nayyhah/PDFAutomation-OCRTextRecognition

PDF Automation - OCR Text Recognition

Language: JavaScript - Size: 12.1 MB - Last synced at: over 1 year ago - Pushed at: about 3 years ago - Stars: 5 - Forks: 1

rajdeep2804/Automated_Invoice_Processing

The number of types of physical documents being digitized is increasing. Medical bills, bank documents, and personal documents are examples. The objective of this repo is to implement and understand such use cases through an example of extracting text information from invoice receipts.

Language: Jupyter Notebook - Size: 16.4 MB - Last synced at: 11 days ago - Pushed at: about 3 years ago - Stars: 5 - Forks: 0

carrliitos/NLPInformationExtraction 📦

My 2020 project focusing on NLP - Information Extraction

Language: Python - Size: 115 MB - Last synced at: almost 2 years ago - Pushed at: about 4 years ago - Stars: 5 - Forks: 1

scotthaleen/spark-hdfs-tika

spark hdfs tika

Language: Scala - Size: 1.95 KB - Last synced at: about 2 years ago - Pushed at: over 8 years ago - Stars: 5 - Forks: 3

alokkumary2j/RapidAutomaticKPExtractionRAKE

This repository contains my experiments with RAKE and its variants. RAKE is one of the most popular unsupervised approaches for automatically extracting key-phrases/keywords from an unstructured data source such as reviews, news, articles, or documents.

Language: Jupyter Notebook - Size: 2.33 MB - Last synced at: over 1 year ago - Pushed at: almost 9 years ago - Stars: 5 - Forks: 0

FileFormatInfo/ff-pdf2txt

Simple server to extract text from a PDF

Language: Java - Size: 6 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 4 - Forks: 2

PT-Perkasa-Pilar-Utama/ppu-pdf

PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvas.

Language: TypeScript - Size: 38.4 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 4 - Forks: 0

GURSV/URL-summ

A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.

Language: Python - Size: 112 KB - Last synced at: 22 days ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

atahanuz/yt2text

Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition model.

Size: 10.7 KB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 1

anhthuan1999/PhoBERT-Extraction

Extract vectors for sentences and words from one layer or by concatenating multiple layers

Language: Python - Size: 5.86 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 4 - Forks: 0

rosette-api-community/text-embeddings-sample

A little python code to show how to get similarity between word embeddings returned from the Rosette API's new /text-embedding endpoint.

Language: Python - Size: 2.93 KB - Last synced at: about 2 months ago - Pushed at: about 8 years ago - Stars: 4 - Forks: 0

GateNLP/wpextract

Create datasets from WordPress sites for research or archiving

Language: Python - Size: 1.94 MB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 3 - Forks: 0

nguyen-tho/ID-card-extract-module

Language: Python - Size: 59.8 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 0

WonhoZhung/ee474

EE474 Term Project

Language: Python - Size: 457 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 3 - Forks: 1

Nishant2018/Text-Extraction-OCR-OpenCV

Text extraction is the process of automatically extracting text from images or documents. Optical Character Recognition (OCR) is a technology that enables computers to convert images of text into machine-readable text.

Language: Jupyter Notebook - Size: 9.77 KB - Last synced at: about 2 months ago - Pushed at: 10 months ago - Stars: 3 - Forks: 0

Shangamesh2805/TECH_OCULAR-

Smart eyeglasses: an AI-based audio transcriber for the visually impaired. It helps visually challenged people read books and novels by themselves, even in the absence of a guide.

Language: Python - Size: 572 KB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 2

KalyanM45/Optical-Character-Recognition

This project is a Python-based Optical Character Recognition (OCR) application using the EasyOCR library. It provides a convenient way to detect and recognize text in images, making it useful for a wide range of applications such as document processing, image captioning, and text extraction.

Language: Jupyter Notebook - Size: 720 KB - Last synced at: 12 months ago - Pushed at: almost 2 years ago - Stars: 3 - Forks: 0

Zeeshanahmad4/NLP-Pdf-Minning-Extracting-text-from-pdf

NLP PDF mining: extracting text from PDF

Language: Python - Size: 2.86 MB - Last synced at: 19 days ago - Pushed at: about 5 years ago - Stars: 3 - Forks: 1

RocktimRajkumar/CV-Grader

🏃 CV parser is a compiler or interpreter that converts unstructured data into a structured form.

Language: Python - Size: 182 KB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 3 - Forks: 0

simonkeng/pdf_parser

Textual & numeric data extraction with Python using textract, easily shareable with Docker.

Language: C - Size: 15.6 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 3 - Forks: 1

lpuettmann/kindle-to-md

Transform Kindle clippings to Markdown, to be displayed on a Jekyll website

Language: Python - Size: 3.91 KB - Last synced at: about 2 years ago - Pushed at: over 8 years ago - Stars: 3 - Forks: 0

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 161 KB - Last synced at: 8 days ago - Pushed at: 20 days ago - Stars: 2 - Forks: 0