GitHub topics: document-parser
GiftMungmeeprued/document-parsers-list
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
Size: 4.25 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 94 - Forks: 1

liweiphys/layra
LAYRA—an enterprise-ready, out-of-the-box solution—unlocks next-generation intelligent systems powered by visual RAG and limitless visual multi-step agent workflow orchestration.
Language: TypeScript - Size: 29.2 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 768 - Forks: 75

run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
Language: Python - Size: 53.6 MB - Last synced at: 1 day ago - Pushed at: 4 days ago - Stars: 4,046 - Forks: 433

graphlit/graphlit-client-typescript
TypeScript client for Graphlit Platform
Language: TypeScript - Size: 2.37 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 1

freeok/so-novel
Novel downloader | Web fiction downloader | Online novels
Language: Java - Size: 263 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 3,503 - Forks: 302

docling-project/docling
Get your documents ready for gen AI
Language: Python - Size: 114 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 34,040 - Forks: 2,268

graphlit/graphlit-client-python
Python client library for Graphlit Platform
Language: Python - Size: 1.97 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 13 - Forks: 2

infiniflow/ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Language: Python - Size: 75.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 59,396 - Forks: 5,914

anastasiashpilka/blood-test-lab-report-ocr-pdf-image-to-excel-csv
Easily convert medical reports (PDF, DOCX, images) to structured tables. Powered by Google Gemini API.
Language: JavaScript - Size: 20.7 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

Besthope-Official/predoc
Document preprocessing service for RAG (Retrieval-Augmented Generation)
Language: Python - Size: 104 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 1

AkandindaJunior/Cloud-Services
If it’s not documented, it never happened. 📝 Please check my README.md for more details. 🔍
Size: 1000 Bytes - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

deepdoctection/deepdoctection
A Repo For Document AI
Language: Python - Size: 29 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 2,872 - Forks: 162

iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
Language: Python - Size: 374 KB - Last synced at: 2 days ago - Pushed at: 5 months ago - Stars: 395 - Forks: 54

Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Language: HTML - Size: 192 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 11,844 - Forks: 979

Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language: Python - Size: 72.7 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 4,074 - Forks: 323

MaineDSA/voter_participation_extractor_portland
The City of Portland distributes voter participation info in PDF format. This makes it a CSV.
Language: Python - Size: 107 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

Vetrivel07/AI-Powered-Resume-Evaluator
An AI-powered resume evaluation app that compares a candidate’s resume with a job description using Google’s Gemini 1.5 Flash model to provide HR-style feedback and an ATS-style match score through a simple, interactive Streamlit interface.
Language: Python - Size: 62.5 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 3 - Forks: 0

Filimoa/open-parse
Improved file parsing for LLMs
Language: Python - Size: 7.23 MB - Last synced at: 12 days ago - Pushed at: 8 months ago - Stars: 3,010 - Forks: 126

docling-project/docling4j
Docling4j brings the functionalities of Docling in document understanding to Java® projects
Language: Java - Size: 32.2 KB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 12 - Forks: 0

suwa-sh/local-RAG-backend
This is the backend for a RAG system that runs on Docker Compose. It registers documents in a wide range of file formats, which can be searched using the MCP server.
Language: Python - Size: 295 KB - Last synced at: 18 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

decisionfacts/semantic-ai
An open-source Retrieval-Augmented Generation (RAG) framework that uses semantic search to retrieve the expected results and generates human-readable conversational responses with the help of an LLM (Large Language Model).
Language: Python - Size: 4.53 MB - Last synced at: 2 days ago - Pushed at: 12 months ago - Stars: 21 - Forks: 1

marieai/marie-ai
Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing
Language: Python - Size: 36.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 70 - Forks: 8

MegrezAI/LeapRAG
LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.
Language: Python - Size: 8.86 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

novaladai/novalad
Novalad offers a unified, centralized platform enabling organizations to extract meaningful data and perform advanced processing at high speed.
Language: Jupyter Notebook - Size: 2.64 MB - Last synced at: 20 days ago - Pushed at: 2 months ago - Stars: 15 - Forks: 0

has-abi/docparser
Extract text from your DOCX documents.
Language: Python - Size: 92.8 KB - Last synced at: 26 days ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 2

Gyanvir/DrParser
Dr.Parser 🩸📊 – AI-powered blood report parser that extracts and analyzes medical data from images/PDFs. Built with React, FastAPI, EasyOCR, and Gemini AI. 🚀 🔹 Local Setup Available | 🔹 Future Enhancements Planned | 🔹 Hackathon Project 👉 Clone, run, and explore the future of AI-driven healthcare!
Language: Python - Size: 64.4 MB - Last synced at: 1 day ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

hrbrmstr/docparser
🧰 Tools to Upload/Parse Documents to 'docparser' and Retrieve Extracted Results
Language: R - Size: 10.7 KB - Last synced at: 3 months ago - Pushed at: over 6 years ago - Stars: 4 - Forks: 0

papercast-dev/papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, and listen to them as podcasts. Customize your own pipelines.
Language: Python - Size: 218 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 48 - Forks: 1

anyparser/anyparser_crewai
Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.
Language: Python - Size: 429 KB - Last synced at: 6 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

RevanKumarD/LlaMarker
Your ultimate tool for effortlessly converting and parsing documents into clean, well-structured Markdown—fast, reliable, and 100% local! 💻✨
Language: Python - Size: 6.26 MB - Last synced at: 6 days ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

cr4yfish/docling-js
Parsing Documents to one datatype (Typescript port of Docling)
Size: 23.4 KB - Last synced at: 4 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

Clearedge-AI/clearedge
Build a RAG preprocessing pipeline
Language: Jupyter Notebook - Size: 24.7 MB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 0

shrimantasatpati/Document_Parser_using_AI
Parse documents using AI - any document converted to markdown suitable for RAG applications
Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

munenepeter/Case-Law-Search
A Simple Case parser and search
Language: PHP - Size: 573 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 1

JPLeoRX/opencv-text-deskew
Tutorial on how to deskew (straighten) text images
Language: Python - Size: 43 MB - Last synced at: 2 months ago - Pushed at: over 3 years ago - Stars: 51 - Forks: 16

urbanclap-engg/smart-docs-parser
An OCR based document parser to extract information from identity document images
Language: TypeScript - Size: 64.5 KB - Last synced at: 21 days ago - Pushed at: almost 3 years ago - Stars: 21 - Forks: 7

MidHunterX/Scholar-CAP
🎓 Set of powerful tools designed to streamline the extraction, parsing, and clean-up of data from DOCX and PDF forms. Saves time and eliminates manual data entry by automating the processing of structured data.
Language: Python - Size: 11.2 MB - Last synced at: 4 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

decisionfacts/df-extract
DF Extract Lib
Language: Python - Size: 29.3 KB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 14 - Forks: 0

InvoiceableAI/Invoiceable
The invoice, document, and résumé parser powered by AI.
Language: Python - Size: 43.9 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 0

atbasu/document-content-extractor
Python program that uses OpenAI APIs to parse user-specified content from text files
Language: Python - Size: 135 KB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

coderosh/docpa
A simple library that I use for web scraping. Uses htmlparser2 to parse the DOM.
Language: TypeScript - Size: 69.3 KB - Last synced at: 1 day ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

munenepeter/translate Fork of Abtez/translate 📦
A simple document uploader & parser
Language: Hack - Size: 202 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

brazilian-code/Resume_Parsing
Resume Parsing app to extract information using AI
Size: 25.8 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 3

agent87/IhuguraChatBotUX
Ihugura Chatbot Streamlit User Interface
Language: Python - Size: 4.05 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

lorenzbr/techStandards
Download and parse technical standard documents
Language: R - Size: 1.67 MB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

dills122/ShamWow
Who likes lawyers? Me neither; scrub your PII with ShamWow
Language: C# - Size: 97.7 KB - Last synced at: 5 days ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 1

JayLohokare/docX-REST-API Fork of shubh0906/hackNY
Shubham's REST APIs made at hackNY
Language: JavaScript - Size: 5.86 KB - Last synced at: over 2 years ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 0

JayLohokare/docX
Convert documents into quizzes! Built at HackNY (Android + NodeJS + Alexa skill)
Language: Java - Size: 288 KB - Last synced at: over 2 years ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 0

buren/document_parser
Small Rails API app to parse documents.
Language: Ruby - Size: 30.3 KB - Last synced at: 4 months ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 1
