document-analysis | Topic | Ecosyste.ms: Repos

opendatalab/MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

Language: Python - Size: 125 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 36,874 - Forks: 3,017

UglyToad/PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Language: C# - Size: 180 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 2,067 - Forks: 263

AlibabaResearch/AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Language: C++ - Size: 104 MB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 1,727 - Forks: 194

tstanislawek/awesome-document-understanding

A curated list of resources for Document Understanding (DU) topic

Size: 5.56 MB - Last synced at: 16 days ago - Pushed at: about 2 years ago - Stars: 1,416 - Forks: 160

DocumindHQ/documind

Open-source platform for extracting structured data from documents using AI.

Language: JavaScript - Size: 1020 KB - Last synced at: 14 days ago - Pushed at: about 2 months ago - Stars: 1,326 - Forks: 48

Yuliang-Liu/Curve-Text-Detector

This repository provides training and test code, dataset, detection and recognition annotations, evaluation script, annotation tool, and ranking.

Language: Jupyter Notebook - Size: 27.9 MB - Last synced at: about 1 month ago - Pushed at: almost 5 years ago - Stars: 646 - Forks: 156

NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Language: Python - Size: 2.94 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 620 - Forks: 47

wenwenyu/PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

Language: Python - Size: 9.72 MB - Last synced at: 2 months ago - Pushed at: 11 months ago - Stars: 563 - Forks: 192

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

Language: Python - Size: 235 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 493 - Forks: 37

CybercentreCanada/assemblyline

AssemblyLine 4: File triage and malware analysis

Language: Python - Size: 246 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 317 - Forks: 18

jpWang/LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

Language: Python - Size: 1.36 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 282 - Forks: 34

lazyFrogLOL/llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

Language: Python - Size: 1.21 MB - Last synced at: 28 days ago - Pushed at: 11 months ago - Stars: 271 - Forks: 9

pandora-analysis/pandora

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

Language: Python - Size: 6.99 MB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 263 - Forks: 42

masyagin1998/robin

RObust document image BINarization

Language: Python - Size: 24.8 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 180 - Forks: 38

chriswolfvision/local_adaptive_binarization

Local adaptive image binarization

Language: C++ - Size: 135 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 121 - Forks: 25

mirabdullahyaser/Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

Language: Python - Size: 11.3 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 119 - Forks: 59

aws-samples/amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding

Language: Python - Size: 3.71 MB - Last synced at: 25 days ago - Pushed at: 7 months ago - Stars: 96 - Forks: 25

ppaanngggg/yolo-doclaynet

YOLO models trained on DocLayNet - power your document intelligence with layout analysis

Language: Python - Size: 44.9 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 86 - Forks: 16

monniert/docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper

Language: Python - Size: 4.09 MB - Last synced at: 8 months ago - Pushed at: about 2 years ago - Stars: 85 - Forks: 10

anisha2102/docvqa

Document Visual Question Answering

Language: Python - Size: 146 KB - Last synced at: over 2 years ago - Pushed at: almost 5 years ago - Stars: 85 - Forks: 20

Xyntopia/pydoxtools

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

Language: Python - Size: 13.6 MB - Last synced at: 5 days ago - Pushed at: 10 months ago - Stars: 82 - Forks: 13

ZeningLin/ViBERTgrid-PyTorch

An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"

Language: Python - Size: 388 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 54 - Forks: 5

abdur75648/UTRNet-High-Resolution-Urdu-Text-Recognition

UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)

Language: Python - Size: 126 KB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 51 - Forks: 10

JPLeoRX/detectron2-publaynet

Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

Language: Python - Size: 7.76 MB - Last synced at: 21 days ago - Pushed at: about 2 years ago - Stars: 49 - Forks: 7

ankanbhunia/AdverseBiNet

Improving Document Binarization via Adversarial Noise-Texture Augmentation (ICIP 2019)

Language: Python - Size: 1.37 MB - Last synced at: over 2 years ago - Pushed at: about 6 years ago - Stars: 40 - Forks: 9

aws-solutions/enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

Language: JavaScript - Size: 62.5 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 38 - Forks: 16

microsoft/synthetic-rag-index

Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.

Language: Python - Size: 137 MB - Last synced at: 3 days ago - Pushed at: 9 months ago - Stars: 31 - Forks: 5

swapnil-ahlawat/Document_Layout_Analysis-MonkAI

DL models that take a document image file as input, locate the position of paragraphs, lines, images, etc. with their labels and confidence scores.

Language: Jupyter Notebook - Size: 50.6 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 26 - Forks: 6

muhd-umer/pyramidtabnet

Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents

Language: Python - Size: 93 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 25 - Forks: 2

ihdia/docvisor

An open-source tool for visualisation of outputs of deep-learning models for document analysis tasks such as fully automatic, bounding box and OCR.

Language: Python - Size: 109 MB - Last synced at: 11 months ago - Pushed at: over 3 years ago - Stars: 19 - Forks: 4

huyhoang17/kuzushiji_recognition

[Late Submission] Solution for Kuzushiji recognition (Kaggle competition)

Language: Python - Size: 90 MB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 17 - Forks: 2

ad-freiburg/pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.

Language: C++ - Size: 18.2 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 16 - Forks: 0

Retab-dev/retab

The developer starter pack for document processing

Language: Jupyter Notebook - Size: 17.9 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 15 - Forks: 1

AILab-UniFI/GNN-TableExtraction

Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"

Language: Python - Size: 121 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 15 - Forks: 2

bookalope/InDesign-CEP

Adobe CEP extension for InDesign to use the Bookalope cloud services.

Language: JavaScript - Size: 4.12 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 14 - Forks: 6

bookalope/Bookalope

Everything related to Bookalope and its REST API.

Language: Python - Size: 163 KB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 12 - Forks: 4

AymurAI/backend

This repository contains the backend API and machine learning models of AymurAI

Language: Jupyter Notebook - Size: 40.4 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 10 - Forks: 0

therealexpertai/nlapi-java

Java Client for the expert.ai Natural Language API

Language: Java - Size: 163 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 9 - Forks: 9

TUWien/ReadModules

CVL/READ Modules including Basic Layout Analysis and Writer Identification/Retrieval

Language: C++ - Size: 3.53 MB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 9 - Forks: 4

soduco/paper-ner-bench-das22

All the material (paper, code, dataset, results) of our DAS 2022 paper (OCR+NER benchmark)

Language: Jupyter Notebook - Size: 313 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 8 - Forks: 0

CXH-Research/StainRestorer

[WACV 2025] High-Fidelity Document Stain Removal via A Large-Scale Real-World Dataset and A Memory-Augmented Transformer

Language: Python - Size: 20.5 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 7 - Forks: 2

aidayang/MinerU-OneClick

One-click launch bundle for MinerU, with no installation or deployment required

Size: 49.8 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 7 - Forks: 0

gr8monk3ys/paper-summarizer

A Python-based tool for summarizing research papers and articles using NLP techniques. Simplify complex content efficiently

Language: HTML - Size: 19.5 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 7 - Forks: 1

ZeroBone/OfficialEye

An advanced AI-powered generic document-analysis tool

Language: Python - Size: 25.5 MB - Last synced at: 25 days ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 3

ethanhezhao/MetaLDA

The code for MetaLDA in ICDM 2017

Language: Java - Size: 2.86 MB - Last synced at: about 2 months ago - Pushed at: over 6 years ago - Stars: 7 - Forks: 4

TUWien/ReadFramework

The Core Framework for CVL/READ Modules

Language: C++ - Size: 26.4 MB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 6 - Forks: 7

nicolasfeyer/KWS-SIFT

Python code to perform keyword spotting using SIFT features

Language: Python - Size: 30.3 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 0

MBAigner/GraphConverter

A tool for creating a graph representation out of the content of PDFs or images.

Language: Python - Size: 486 KB - Last synced at: 10 days ago - Pushed at: almost 5 years ago - Stars: 5 - Forks: 0

omni-us/research-ContentDistillation-HTR

Source code for ICFHR20 "Distilling Content from Style for Handwritten Word Recognition"

Language: Python - Size: 330 KB - Last synced at: 2 months ago - Pushed at: about 5 years ago - Stars: 5 - Forks: 2

sohaib023/T-Truth

Labeling tool for Table Structures in Document Images.

Language: Java - Size: 4.68 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 4 - Forks: 3

ahmetkumass/contract-analyzer

Open-source tool for extracting and analyzing key information from legal contracts and documents with ease.

Language: Python - Size: 4.88 KB - Last synced at: 7 days ago - Pushed at: 9 months ago - Stars: 4 - Forks: 0

faizan1041/doc-understanding-gpt-langchain

Document understanding with GPT 3.5 integrated with Telegram

Language: Python - Size: 28.6 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 0

abdur75648/urdu-synth

High-quality synthetic text data generation for Urdu Text Recognition

Language: Python - Size: 291 MB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 4 - Forks: 1

Yosef-AlSabbah/Cloud-Based-Document-Analytics-Service-2

Cloud-based service for uploading, scraping, and managing PDF/DOCX documents. Features include title sorting, content search with highlights, rule-based classification, and storage stats. Integrated with cloud platforms for scalable document analytics.

Language: TypeScript - Size: 269 KB - Last synced at: 10 days ago - Pushed at: 25 days ago - Stars: 3 - Forks: 0

LATIS-DocumentAI-Team/DocumentAI-std

DocumentAI-std is a Python library designed to facilitate and standardize document analysis and processing tasks. It offers functionality for handling document elements, performing optical character recognition (OCR), and managing document datasets.

Language: Python - Size: 350 KB - Last synced at: 30 days ago - Pushed at: 4 months ago - Stars: 3 - Forks: 0

sohaib023/Truth-Py

Python module for handling XML files labelled using T-Truth tool.

Language: Python - Size: 19.5 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 3 - Forks: 2

qurator-spk/sbb_column_classifier

Get the number of columns for a document image

Language: Python - Size: 50.8 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 3 - Forks: 0

Schlafenhase/Document-Analyzer

CE-5505. Company document analysis with natural language processing for sensitive data detection. #Isaac

Language: C# - Size: 23 MB - Last synced at: about 2 years ago - Pushed at: about 4 years ago - Stars: 3 - Forks: 2

TTWJOE/dr-x-nlp-pipeline

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.

Language: Python - Size: 9.92 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0

bx0-0/CyberVisionAI

Cyber Vision AI is an award-winning, open-source AI assistant for cybersecurity, document analysis, and knowledge management. Built with advanced RAG, MindMap, and multi-agent AI, it empowers security professionals and researchers with unrestricted, ethical, and insightful tools.

Language: Python - Size: 11.9 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 2 - Forks: 0

ismailbokri/Combot

This project, Contract Compliance, was developed as part of a project for the Contract Lifecycle Management module in the Management Information Systems program at ESPRIT University. It focuses on automating the monitoring and enforcement of contractual obligations to ensure regulatory compliance and minimize operational risks.

Language: Python - Size: 6.17 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

ksm26/dr-x-nlp-pipeline

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.

Language: Python - Size: 9.92 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 219 KB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

miku/grobidclient

A Go (golang) client for GROBID.

Language: Go - Size: 7.52 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

arsath-eng/RAG1-NVIDIA-GENAI

A powerful Retrieval Augmented Generation (RAG) application built with NVIDIA AI endpoints and Streamlit. This solution enables intelligent document analysis and question-answering using state-of-the-art language models, featuring multi-PDF processing, FAISS vector store integration, and advanced prompt engineering.

Language: Python - Size: 153 MB - Last synced at: 27 days ago - Pushed at: 8 months ago - Stars: 2 - Forks: 1

AlinaBaber/Document-Analysis-Identification-with-RAG-Vector-Database-and-Mistral-LLM

This Document Analysis pipeline is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning & AI models like RAG (Retrieval Augmented Generation) & Mistral LLM to efficiently extract, match, enrich, process document

Language: Python - Size: 14.8 MB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0

JuanCarlosMartinezSevilla/MuRET-UserTool-deprecated

The objective of this repository is to provide MuRET's users a simple way to train deep learning models allowing an efficient transcription process.

Language: Python - Size: 733 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

moured/Document-Graphics-Digitization

official repo for the ICDAR 2023 paper "Line Graphics Digitization: A Step Towards Full Automation"

Size: 7.26 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

aquilu/muisca

Muisca: Modelo Unificado de Inteligencia Supervisada para la Computación y Aplicación (Unified Model of Supervised Intelligence for Computation and Application). A Streamlit tool for summarizing and asking questions about PDF and TXT documents using state-of-the-art language models.

Language: Python - Size: 24.4 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

dev-luckymhz/AIVisionText-invoice-OCR-typescript

AIVisionText is an advanced document analysis platform that harnesses the power of artificial intelligence (AI) to revolutionize the way you manage and extract insights from documents.

Language: TypeScript - Size: 104 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 2

chulwoopack/Zone2OCR

Mapping a set of zones generated by a segmentation algorithm to the regions generated by OCR engine

Language: Python - Size: 38.4 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

MILE-IISc/DegradedWordsKannada

Benchmarking dataset of degraded word images (with character splits) in Kannada along with their associated ground truth Unicode text

Language: Shell - Size: 7.48 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 2 - Forks: 0

MILE-IISc/MergedSymbolsKannada

Benchmarking dataset of merged symbols in Kannada along with their associated ground truth Unicode text

Language: Shell - Size: 3.64 MB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 2 - Forks: 0

fredrikwahlberg/das2018

Code for the paper "Gaussian Process Classification as Metric Learning for Forensic Writer Identification", published at DAS 2018

Language: Python - Size: 20.5 KB - Last synced at: about 2 years ago - Pushed at: about 7 years ago - Stars: 2 - Forks: 0

baharsateli/Dissertation_Supplementary_Materials

Datasets, tools and results from my doctoral dissertation

Language: Shell - Size: 305 KB - Last synced at: over 2 years ago - Pushed at: over 7 years ago - Stars: 2 - Forks: 0

Rayyan9477/ocr-app

State-of-the-art Optical Character Recognition (OCR) with Vision Language Model (VLM) integration for enhanced accuracy and optimal document processing.

Language: TypeScript - Size: 23.9 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

SIYAKS-ARES/smart-doc-insight

AI-powered PDF querying tool. Get information from your PDF documents instantly in natural language using Ollama, LM Studio, and APIs.

Language: HTML - Size: 33 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

rk-vashista/pitch

A modern web application that analyzes pitch decks using multi-agent AI technology. Upload your pitch deck and get comprehensive feedback on structure, content, and potential improvements!

Language: Python - Size: 6.06 MB - Last synced at: 26 days ago - Pushed at: 27 days ago - Stars: 1 - Forks: 0

GautamBytes/IITM_HACKATHON

An AI-powered contract management tool using NLP and LLMs, achieving 95% accuracy in document analysis. The project significantly enhanced decision-making in contract management and showcased innovative use of AI technologies.

Language: Python - Size: 2.59 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

mhdsedighi/DOC-Analyzer

Analyzing Many Documents with AI

Language: Python - Size: 324 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

acsenrafilho/cucaracha

A bureaucratic cockroach (cucaracha) assistant to help with document processing and analysis

Language: Python - Size: 6.44 MB - Last synced at: 7 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 1

BjornMelin/docmind-ai

DocMind AI is a powerful, open-source Streamlit application leveraging LangChain and local Large Language Models (LLMs) via Ollama for advanced document analysis. Analyze, summarize, and extract insights from a wide array of file formats—securely and privately, all offline.

Size: 32.2 KB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

DioCrafts/ai-book-summarizer

📚 AI-Powered Book PDF Knowledge Extractor & Summarizer. Transform your PDF books into structured knowledge effortlessly! This tool leverages AI to analyze books page by page, extracting key insights, definitions, and concepts, and organizes them into Markdown summaries for easier study

Language: Python - Size: 29.6 MB - Last synced at: 4 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

Topic: "document-analysis"

george-gca/asreview-top2vec Fork of asreview/semantic-clusters