An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: layout-analysis

UglyToad/PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Language: C# - Size: 180 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 2,067 - Forks: 263

opendatalab/MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

Language: Python - Size: 125 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 36,874 - Forks: 3,017

mittagessen/kraken

OCR engine for all the languages

Language: Python - Size: 28.5 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 841 - Forks: 141

breezedeus/Pix2Text

An Open-Source Python3 tool with SMALL models for recognizing layouts, tables, math formulas (LaTeX), and text in images, converting them into Markdown format. A free alternative to Mathpix, empowering seamless conversion of visual content into text-based representations. 80+ languages are supported.

Language: Jupyter Notebook - Size: 23.5 MB - Last synced at: 8 days ago - Pushed at: about 2 months ago - Stars: 2,470 - Forks: 227

MBAigner/PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

Language: Python - Size: 399 KB - Last synced at: 2 days ago - Pushed at: almost 5 years ago - Stars: 23 - Forks: 3

xushengfeng/eSearch-OCR

A Node.js library based on PaddleOCR

Language: TypeScript - Size: 16.4 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 74 - Forks: 8

BobLd/DocumentLayoutAnalysis

Document layout analysis resources for development with PdfPig.

Language: C# - Size: 41.6 MB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 619 - Forks: 67

kotaro-kinoshita/yomitoku

Yomitoku is an AI-powered document image analysis package designed specifically for the Japanese language.

Language: Python - Size: 237 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 812 - Forks: 26

qyhou/curated-document-layout-analysis

A curated list of resources on Document Layout Analysis

Size: 10.7 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

mindspore-lab/mindocr

A toolbox of OCR models and algorithms based on MindSpore

Language: Python - Size: 20.6 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 273 - Forks: 60

RapidAI/RapidLayout

Layout analysis for Chinese and English documents

Language: Python - Size: 774 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 212 - Forks: 14

Layout-Parser/layout-parser

A Unified Toolkit for Deep Learning Based Document Image Analysis

Language: Python - Size: 58.3 MB - Last synced at: about 1 month ago - Pushed at: 11 months ago - Stars: 5,256 - Forks: 498

AlecioP/forms-classifier

Classification of documents from an image

Language: Python - Size: 21.5 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

RapidAI/RapidDoc

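📝 Extracts content from document images and reproduces them one-to-one in Word or TXT for further use or processing. Planned: support for PDF/image input with output in JSON, TXT, Word, and Markdown formats.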

Language: Python - Size: 781 KB - Last synced at: about 1 month ago - Pushed at: 8 months ago - Stars: 193 - Forks: 8

andreagemelli/doc2graph

Doc2Graph transforms documents into graphs and exploits a GNN to solve several tasks.

Language: Jupyter Notebook - Size: 466 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 121 - Forks: 20

JPLeoRX/detectron2-publaynet

Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

Language: Python - Size: 7.76 MB - Last synced at: 22 days ago - Pushed at: about 2 years ago - Stars: 49 - Forks: 7

yoshihikoueno/pdfminer-layout-scanner Fork of dpapathanasiou/pdfminer-layout-scanner

A more complete example of programming with PDFMiner, which continues where the default documentation stops

Language: Python - Size: 26.4 KB - Last synced at: 11 days ago - Pushed at: almost 6 years ago - Stars: 7 - Forks: 4

BobLd/PdfPigMLNetBlockClassifier

Proof of concept of training a simple region classifier using PdfPig and ML.NET (LightGBM). The objective is to classify each text block in a PDF document page as one of title, text, list, table, or image.

Language: C# - Size: 1.1 MB - Last synced at: 4 days ago - Pushed at: over 5 years ago - Stars: 28 - Forks: 6

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 219 KB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

novaladai/novaladai

Novalad offers a unified, centralized platform enabling organizations to extract meaningful data and perform advanced processing at high speed.

Size: 6.08 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

calfa-co/catalog-manuscripts-venice

Raw data of the Catalog of Armenian Manuscripts of Venice

Size: 7.34 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

aidayang/MinerU-OneClick

An all-in-one, one-click launch package for deploying MinerU without installation

Size: 49.8 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 7 - Forks: 0

ppaanngggg/yolo-doclaynet

YOLO models trained on DocLayNet: power your document intelligence with layout analysis

Language: Python - Size: 44.9 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 86 - Forks: 16

gundasrikar/IC-Design-VCO-Cadence-Virtuoso-Digital-Circuit

Size: 15.6 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

jiangnanboy/layout_analysis4j

Layout detection for Chinese document images, implemented with java-yolov8

Language: Java - Size: 201 KB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 25 - Forks: 9

calfa-co/Patrologia-Graeca

Main repository of the CGPG project for OCR and Text Analysis of the Patrologia Graeca

Language: GLSL - Size: 16.5 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

calfa-co/chi-know-po

HTR ground truth of the Chi-Know-Po project (Collex Persée)

Size: 2.17 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

Alapipapi/MinerU Fork of opendatalab/MinerU

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

Language: Python - Size: 103 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

ColinTr/LiveDesktopTranslator

Live capture your screen and replace textual elements with their translations

Language: Python - Size: 184 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

VRI-UFPR/ocrd-gbn Fork of qurator-spk/sbb_textline_detection

OCR-D compliant toolset for optical layout recognition on historical German-language documents published in Brazil

Language: Python - Size: 600 KB - Last synced at: 25 days ago - Pushed at: almost 4 years ago - Stars: 10 - Forks: 0

empressabyss/nordrassil

Nordrassil is a keyboard layout that provides an elegant and balanced typing experience by its use of a thumb-alpha, emphasis on middle fingers, deprioritisation of pinkies, and repeat key (or arcane keys).

Size: 149 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 9 - Forks: 0

calfa-co/rasam-dataset

An Open Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi (ICDAR 2021)

Size: 9.13 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 1

marinagiardinetti/layout-parser-newspapers

Layout Parser notebook implementation and retrained model for image detection and extraction

Language: Jupyter Notebook - Size: 17.6 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

NormXU/Layout2Graph

Official implementation of the paper "Paragraph2Graph: A Language-independent GNN-based framework for layout analysis"

Language: Python - Size: 10.8 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 68 - Forks: 10

pleb631/PdfDet

PdfDet aims to simplify PDF layout detection tasks for users.

Language: Python - Size: 14.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

ixalodecte/filestruct

A Python package to structure files using visual and style information

Language: Python - Size: 45.9 KB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

os-climate/crrf-det

A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.

Language: C++ - Size: 6.63 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 3

dell-research-harvard/HJDataset

A Large Dataset of Historical Japanese Documents with Complex Layouts

Language: Jupyter Notebook - Size: 22.8 MB - Last synced at: about 1 year ago - Pushed at: almost 3 years ago - Stars: 27 - Forks: 4

MaitySubhajit/SelfDocSeg

[ICDAR 2023] SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation (Oral)

Language: Python - Size: 13.4 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 18 - Forks: 2

matthewleechen/BritishPatents

Curating a dataset of British patents

Language: Jupyter Notebook - Size: 7.84 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

heshiming/paddlefish Fork of os-climate/crrf-det

A Python + C implementation for image-based PDF page layout analysis and content extraction.

Language: C++ - Size: 5.26 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 0

majeek/scribble-segmentation

This repository presents the code for the paper "Scribble Based Interactive Page Layout Segmentation Using Gabor Filter", published at ICFHR 2016.

Language: C++ - Size: 8.75 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

GBN-DBP/ocrd-page-xml-draw

OCR-D wrapper for page-xml-draw

Language: Python - Size: 6.84 KB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

GBN-DBP/page-xml-draw

A powerful CLI tool for visualization and encoding of PAGE-XML files

Language: Python - Size: 18.4 MB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 6 - Forks: 1