GitHub topics: document-image-processing
Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Language: HTML - Size: 192 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 11,380 - Forks: 948

Layout-Parser/layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
Language: Python - Size: 58.3 MB - Last synced at: 19 days ago - Pushed at: 10 months ago - Stars: 5,256 - Forks: 498

jiangnanboy/Doc-Image-Tool
Document image processing tool, including bleaching / text orientation correction / sharpening / handwritten-note denoising and beautification / shadow removal / dewarping / trimming enhancement (DocBleach / TextOrientationCorrection / DocSharpening / HandwritingDenoisingBeautifying / DocShadowRemoval / document_image_dewarping / DocTrimmingEnhancement).
Language: Python - Size: 11.7 MB - Last synced at: 21 days ago - Pushed at: 9 months ago - Stars: 52 - Forks: 10

fh2019ustc/Awesome-Document-Image-Rectification
A comprehensive list of awesome document image rectification papers.
Size: 188 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 418 - Forks: 30

fh2019ustc/DocScanner
The official repo for “DocScanner: Robust Document Image Rectification with Progressive Learning”, IJCV, 2025.
Language: Python - Size: 17.3 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 193 - Forks: 18

fh2019ustc/DocGeoNet
The official code for “Geometric Representation Learning for Document Image Rectification”, ECCV, 2022.
Language: Python - Size: 10.5 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 81 - Forks: 2

jiangnanboy/docimg_tool
Complex-background image bleaching, text orientation correction, sharpening, handwritten-note denoising and beautification, shadow removal, dewarping, black-spot removal, and trimming enhancement.
Size: 9.17 MB - Last synced at: 21 days ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

mx3123/Py-document-cropper
This script automates text extraction from various file formats (images, PDFs, DOCX) using Optical Character Recognition (OCR) powered by Azure Cognitive Services. It supports image preprocessing, text extraction, and uploading the processed files to Google Cloud Storage (GCS).
Language: Python - Size: 9.77 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

tony-xlh/quality-evaluation-of-scanned-document-images
A web app for evaluating the quality of scanned document images
Language: HTML - Size: 18.6 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 1

caltechlibrary/documentarist
Process Caltech Archives' digital documents and photos, and annotate each page or image with information about its contents
Language: Python - Size: 519 KB - Last synced at: about 2 months ago - Pushed at: about 3 years ago - Stars: 12 - Forks: 4

fh2019ustc/DocTr
The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.
Language: Python - Size: 50.7 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 340 - Forks: 48

jchazalon/smartdoc15-ch1-pywrapper
Python wrapper to facilitate data manipulation for the SmartDoc 2015 - Challenge 1 Dataset.
Language: Jupyter Notebook - Size: 6.11 MB - Last synced at: 4 days ago - Pushed at: 12 months ago - Stars: 6 - Forks: 2

hpanwar08/detectron2 (fork of facebookresearch/detectron2)
Detectron2 for Document Layout Analysis
Language: Python - Size: 4.53 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 178 - Forks: 62

YuanSiping/Similar-Document-Image-Retrieval-Dataset
Size: 321 MB - Last synced at: over 1 year ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 3

Transkribus/competitions
The ScriptNet / competitions site.
Language: Python - Size: 263 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 5 - Forks: 6

Nomiluks/Handwritting-OCR
Android App for English Handwritten Text Recognition
Language: Java - Size: 67.8 MB - Last synced at: over 2 years ago - Pushed at: over 7 years ago - Stars: 12 - Forks: 5

sfikas/sophia-trikoupi-handwritten-dataset
Sophia Trikoupi dataset (Collection of 46 handwritten, annotated pages)
Language: Python - Size: 70.1 MB - Last synced at: 3 months ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0
