An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: document-parsing

Metedout-biographer66/dots.ocr-fix-demo

🖼️ Upload images to experience accurate multilingual OCR results with the enhanced dots.ocr model and fix processor loading issues seamlessly.

Language: Jupyter Notebook - Size: 1.33 MB - Last synced at: about 24 hours ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

opendataloader-project/opendataloader-pdf

Safe, Open, High-Performance — PDF for AI

Language: Java - Size: 51.8 MB - Last synced at: about 20 hours ago - Pushed at: about 21 hours ago - Stars: 758 - Forks: 36

Unstructured-IO/unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Language: HTML - Size: 194 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 13,204 - Forks: 1,082

run-llama/llama_cloud_services

Knowledge Agents and Management in the Cloud

Language: TypeScript - Size: 72.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4,201 - Forks: 459

docling-project/docling

Get your documents ready for gen AI

Language: Python - Size: 153 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 43,796 - Forks: 3,127

boyang-zhang1/DocAgent-Arena

Unbiased PDF parser comparison tool with battle mode, cost estimation, and RAG benchmarking.

Language: Python - Size: 695 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0

PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

Language: Python - Size: 1.55 GB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 63,724 - Forks: 9,332

edenai/edenai-apis

Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines

Language: Python - Size: 173 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 461 - Forks: 68

pn98z4r66t-spec/alex-backend

AI-Powered Task Management Backend with Document Intelligence - Flask API with PDF/Word/Excel/PowerPoint parsing, httpOnly authentication, and comprehensive AI integration

Language: Python - Size: 289 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

syw2014/langparse

LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.

Size: 13.7 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 1 - Forks: 0

AdemBoukhris457/Documents-Parsing-Lab

Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

Language: Jupyter Notebook - Size: 161 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 72 - Forks: 8

NanoNets/docstrange

Extract and convert data from any document, image, PDF, Word file, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Language: Python - Size: 351 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 975 - Forks: 88

Bharathyalagi/OCR-Document-parser

Smart OCR application built with Tesseract and Streamlit that extracts structured data from inputs.

Language: Python - Size: 29.3 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 2 - Forks: 0

Anmol-Baranwal/doc-parsing

Python scripts to parse and structure invoice data from PDFs using OpenAI, Anthropic and Invofox APIs

Language: Python - Size: 245 KB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

PRITHIVSAKTHIUR/dots.ocr-fix-demo

This Gradio application demonstrates the capabilities of the "dots.ocr" model, a powerful multilingual document parser.

Language: Jupyter Notebook - Size: 72.3 KB - Last synced at: 15 days ago - Pushed at: 30 days ago - Stars: 1 - Forks: 0

PRITHIVSAKTHIUR/DocScope-R1

A powerful multi-modal AI application that combines three state-of-the-art vision-language models for comprehensive image and video analysis. DocScope-R1 provides OCR capabilities, detailed scene understanding, and video content analysis through an intuitive Gradio interface.

Language: Python - Size: 3.33 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 1

OneOffTech/parxyval

Evaluation framework for document parsing

Language: Python - Size: 227 KB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Qleric-labs/Contract-extraction-assistant

Turn contract PDFs into structured data in seconds. Local-first extraction

Language: TypeScript - Size: 2.41 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

harishdeivanayagam/rowfill

Open-source spreadsheets platform for deep research and document processing

Language: TypeScript - Size: 1.32 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 362 - Forks: 20

J-sephB-lt-n/pdf-bank-statement-parser

Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data

Language: Python - Size: 65.4 KB - Last synced at: 22 days ago - Pushed at: about 1 year ago - Stars: 5 - Forks: 4

CodeByVish/Parsing-Tool

Universal document parsing pipeline — extract text from PDFs, PPTs, Excels & images (OCR), and bundle clean .txt files for AI knowledge bases.

Language: Python - Size: 354 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Vijaypal64328/ResumeXpert-Ai

AI-powered resume builder with job matching, ATS optimization, and cover letter generation using React, Node.js, TypeScript and Google AI.

Language: TypeScript - Size: 9 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

ramesh-852000/document-parser-ai

A collection of Python scripts to parse and structure data from PDFs and other documents using AI APIs like OpenAI, Anthropic, and Invofox.

Language: Python - Size: 670 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

docling-project/docling4j

Docling4j brings the functionalities of Docling in document understanding to Java® projects

Language: Java - Size: 32.2 KB - Last synced at: 2 months ago - Pushed at: 8 months ago - Stars: 16 - Forks: 0

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Language: Python - Size: 20.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1,378 - Forks: 134

kevv1m/tikara

The metadata and text content extractor for almost every file type.

Size: 1000 Bytes - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

renswickd/document-parser-collection

A collection of document parsers and hands-on examples for constructing structured data for your RAG applications.

Language: Python - Size: 97.7 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

ziming/laravel-docparser

Docparser OCR Package for PHP Laravel

Language: PHP - Size: 37.1 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 3 - Forks: 0

Kathan-max/RAG-Enhanced-Chatbot-with-LoRA-Fine-Tuning

Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!

Language: Python - Size: 3.12 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

baughmann/tikara

The metadata and text content extractor for almost every file type.

Language: Python - Size: 161 MB - Last synced at: about 1 month ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

GiftMungmeeprued/document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

Size: 4.25 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 94 - Forks: 1

AhmedZeyadTareq/Llama-Parse-Content-Extraction

Extract and analyze content from various file formats, including PDFs, text files, and images.

Language: Python - Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

MegrezAI/LeapRAG

LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

Language: Python - Size: 8.86 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

azzubair01/zubairhub

ZubairHub is a Streamlit-based application that integrates various functionalities, including social graph visualization, object detection, document parsing, text extraction, generative AI interaction, and personal data transformation.

Language: Python - Size: 6.35 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 1

alexvargashn/doc23

Convert PDFs, DOCX, TXT & more into structured JSON trees using Python. Built for legal, institutional and NLP applications.

Language: Python - Size: 1.04 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

Mouez-Yazidi/Multilingual-Invoice-Parsing-with-LLaMA-4

Combining OCR for text extraction with LLMs for accurate, efficient document structuring.

Language: Python - Size: 401 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

acenji/ats

Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.

Language: JavaScript - Size: 14.5 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 3

cr4yfish/docling-js

Parsing documents to one datatype (TypeScript port of Docling) (NOT STARTED!)

Size: 23.4 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 219 KB - Last synced at: 24 days ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

papercast-dev/papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.

Language: Python - Size: 218 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 48 - Forks: 1

anyparser/anyparser_crewai

Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.

Language: Python - Size: 429 KB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0

qlfv/Docling-Testing

Repository for testing and demonstrating the capabilities of Docling for document conversion.

Language: HTML - Size: 18.4 MB - Last synced at: 7 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 2

imnotamr/English-to-French-app-using-STREAMLIT-

An interactive Streamlit app that translates English text and documents to French, featuring Google Translate API integration and text-to-speech functionality. Includes PDF and Word document translation.

Language: Python - Size: 72.3 KB - Last synced at: 8 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

karthik-monkey/quantgpt

AI-powered Financial Report Analysis Engine

Language: Python - Size: 11.6 MB - Last synced at: 11 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

dsidlo/pyreparse

Data Structure and Class to ease Parsing of Complex Documents.

Language: Python - Size: 150 KB - Last synced at: 23 days ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

flexidatalabs/flexidata

FlexiData is an open-source Python package designed for processing unstructured data.

Language: Python - Size: 22.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

Unstructured-IO/community 📦

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Size: 5.7 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 19 - Forks: 6

augustweinbren/PhraseSpeaker

PhraseSpeaker: Effortlessly dictate specific sections of text files with macOS's text-to-speech. Perfect for navigating and audibly extracting key content from large documents!

Language: Shell - Size: 3.91 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0
