GitHub topics: document-parsing
docling-project/docling
Get your documents ready for gen AI
Language: Python - Size: 103 MB - Last synced at: about 1 hour ago - Pushed at: about 2 hours ago - Stars: 33,549 - Forks: 2,224

kevv1m/tikara
The metadata and text content extractor for almost every file type.
Size: 1000 Bytes - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.
Language: HTML - Size: 192 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 11,768 - Forks: 972

edenai/edenai-apis
Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines
Language: Python - Size: 158 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 453 - Forks: 69

run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
Language: Python - Size: 49.3 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 4,031 - Forks: 432

docling-project/docling4j
Docling4j brings Docling's document-understanding functionality to Java® projects
Language: Java - Size: 32.2 KB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 12 - Forks: 0

PaddlePaddle/PaddleOCR
Awesome multilingual OCR and document parsing toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment across server, mobile, embedded and IoT devices)
Language: Python - Size: 1.32 GB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 50,943 - Forks: 8,372

baughmann/tikara
The metadata and text content extractor for almost every file type.
Language: Python - Size: 161 MB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

ziming/laravel-docparser
Docparser OCR Package for PHP Laravel
Language: PHP - Size: 31.3 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 3 - Forks: 0

enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Language: Python - Size: 20.4 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 1,274 - Forks: 129

MegrezAI/LeapRAG
LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.
Language: Python - Size: 8.86 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

J-sephB-lt-n/pdf-bank-statement-parser
Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data
Language: Python - Size: 65.4 KB - Last synced at: 27 days ago - Pushed at: 8 months ago - Stars: 3 - Forks: 2

harishdeivanayagam/rowfill
Open-source unstructured data (PDFs, images, audio files) processing platform built for knowledge workers
Language: TypeScript - Size: 1.2 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 277 - Forks: 14

azzubair01/zubairhub
ZubairHub is a Streamlit-based application that integrates various functionalities, including social graph visualization, object detection, document parsing, text extraction, generative AI interaction, and personal data transformation.
Language: Python - Size: 6.35 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 1

alexvargashn/doc23
Convert PDFs, DOCX, TXT & more into structured JSON trees using Python. Built for legal, institutional and NLP applications.
Language: Python - Size: 1.04 MB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Mouez-Yazidi/Multilingual-Invoice-Parsing-with-LLaMA-4
Combining OCR for text extraction with LLMs for accurate, efficient document structuring.
Language: Python - Size: 401 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

acenji/ats
Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.
Language: JavaScript - Size: 14.5 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 3

rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
Language: Python - Size: 219 KB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

papercast-dev/papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.
Language: Python - Size: 218 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 48 - Forks: 1

anyparser/anyparser_crewai
Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.
Language: Python - Size: 429 KB - Last synced at: 28 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

qlfv/Docling-Testing
Repository for testing and demonstrating the capabilities of Docling for document conversion.
Language: HTML - Size: 18.4 MB - Last synced at: 3 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 2

cr4yfish/docling-js
Parsing documents to one datatype (TypeScript port of Docling)
Size: 23.4 KB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

imnotamr/English-to-French-app-using-STREAMLIT-
An interactive Streamlit app that translates English text and documents to French, featuring Google Translate API integration and text-to-speech functionality. Includes PDF and Word document translation.
Language: Python - Size: 72.3 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

karthik-monkey/quantgpt
AI-powered Financial Report Analysis Engine
Language: Python - Size: 11.6 MB - Last synced at: 6 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

dsidlo/pyreparse
Data Structure and Class to ease Parsing of Complex Documents.
Language: Python - Size: 150 KB - Last synced at: 27 days ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

flexidatalabs/flexidata
FlexiData is an open-source Python package designed for processing unstructured data.
Language: Python - Size: 22.7 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Unstructured-IO/community 📦
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Size: 5.7 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 6

augustweinbren/PhraseSpeaker
PhraseSpeaker: Effortlessly dictate specific sections of text files with macOS's text-to-speech. Perfect for navigating and audibly extracting key content from large documents!
Language: Shell - Size: 3.91 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0
