An open API service providing repository metadata for many open source software ecosystems.

Topic: "document-parser"

infiniflow/ragflow

RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs

Language: Python - Size: 94 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 70,173 - Forks: 7,615

docling-project/docling

Get your documents ready for gen AI

Language: Python - Size: 160 MB - Last synced at: 5 days ago - Pushed at: 7 days ago - Stars: 47,411 - Forks: 3,331

Unstructured-IO/unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Language: HTML - Size: 194 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 13,460 - Forks: 1,111

freeok/so-novel

Novel download | Web fiction download | Online novels

Language: Java - Size: 3.64 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 5,780 - Forks: 458

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Language: Python - Size: 41.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4,395 - Forks: 352

run-llama/llama_cloud_services

Knowledge Agents and Management in the Cloud

Language: TypeScript - Size: 84.4 MB - Last synced at: 3 days ago - Pushed at: 14 days ago - Stars: 4,221 - Forks: 468

Filimoa/open-parse

Improved file parsing for LLMs

Language: Python - Size: 7.23 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 3,116 - Forks: 137

deepdoctection/deepdoctection

A Repo For Document AI

Language: Python - Size: 29.6 MB - Last synced at: 11 days ago - Pushed at: 13 days ago - Stars: 3,105 - Forks: 184

NanoNets/docstrange

Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Language: Python - Size: 351 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 975 - Forks: 88

opendataloader-project/opendataloader-pdf

PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU

Language: Java - Size: 52.2 MB - Last synced at: about 18 hours ago - Pushed at: 2 days ago - Stars: 812 - Forks: 42

liweiphys/layra

LAYRA—an enterprise-ready, out-of-the-box solution—unlocks next-generation intelligent systems powered by visual RAG and limitless visual multi-step agent workflow orchestration.

Language: TypeScript - Size: 30.8 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 811 - Forks: 80

iamarunbrahma/vision-parse

Parse PDFs into markdown using Vision LLMs

Language: Python - Size: 360 KB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 452 - Forks: 64

GiftMungmeeprued/document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

Size: 4.25 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 94 - Forks: 1

marieai/marie-ai

Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing

Language: Python - Size: 40.6 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 79 - Forks: 9

LianjiaTech/bella-domify

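Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. The built-in PDF Parser, Text Parser, and Word Parser support intelligent applications such as RAG, knowledge bases, and full-text search.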

Language: Python - Size: 32.1 MB - Last synced at: 25 days ago - Pushed at: 30 days ago - Stars: 61 - Forks: 9

JPLeoRX/opencv-text-deskew

Tutorial on how to deskew (straighten) text images

Language: Python - Size: 43 MB - Last synced at: 8 months ago - Pushed at: almost 4 years ago - Stars: 51 - Forks: 16

papercast-dev/papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.

Language: Python - Size: 218 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 48 - Forks: 1

graphlit/graphlit

Graphlit Platform

Size: 2.93 KB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 23 - Forks: 1

decisionfacts/semantic-ai

An open source framework for Retrieval-Augmented Generation (RAG) that uses semantic search to retrieve the expected results and generate human-readable conversational responses with the help of an LLM (Large Language Model).

Language: Python - Size: 4.53 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 22 - Forks: 1

urbanclap-engg/smart-docs-parser

An OCR based document parser to extract information from identity document images

Language: TypeScript - Size: 64.5 KB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 21 - Forks: 7

graphlit/graphlit-client-python

Python client library for Graphlit Platform

Language: Python - Size: 2.86 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 16 - Forks: 2

novaladai/novalad

Novalad offers a unified, centralized platform enabling organizations to extract meaningful data and perform advanced processing at high speed.

Language: Jupyter Notebook - Size: 4.61 MB - Last synced at: 29 days ago - Pushed at: 4 months ago - Stars: 16 - Forks: 0

docling-project/docling4j

Docling4j brings the functionalities of Docling in document understanding to Java® projects

Language: Java - Size: 32.2 KB - Last synced at: 4 months ago - Pushed at: 9 months ago - Stars: 16 - Forks: 0

decisionfacts/df-extract

DF Extract Lib

Language: Python - Size: 29.3 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 14 - Forks: 0

Clearedge-AI/clearedge

Build a RAG preprocessing pipeline

Language: Jupyter Notebook - Size: 24.7 MB - Last synced at: 19 days ago - Pushed at: over 1 year ago - Stars: 13 - Forks: 0

has-abi/docparser

Extract text from your DOCX documents.

Language: Python - Size: 92.8 KB - Last synced at: 10 days ago - Pushed at: almost 2 years ago - Stars: 11 - Forks: 2

InvoiceableAI/Invoiceable

The invoice, document, and résumé parser powered by AI.

Language: Python - Size: 43.9 KB - Last synced at: almost 2 years ago - Pushed at: about 2 years ago - Stars: 11 - Forks: 0

privateai-com/docviz

Advanced document contents extraction with multiple output formats

Language: Python - Size: 122 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 6 - Forks: 0

Gyanvir/DrParser

Dr.Parser 🩸📊 – AI-powered blood report parser that extracts and analyzes medical data from images/PDFs. Built with React, FastAPI, EasyOCR, and Gemini AI. 🚀 🔹 Local Setup Available | 🔹 Future Enhancements Planned | 🔹 Hackathon Project 👉 Clone, run, and explore the future of AI-driven healthcare!

Language: Python - Size: 64.4 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 4 - Forks: 0

hrbrmstr/docparser

🧰 Tools to Upload/Parse Documents to 'docparser' and Retrieve Extracted Results

Language: R - Size: 10.7 KB - Last synced at: about 2 months ago - Pushed at: about 7 years ago - Stars: 4 - Forks: 0

Qubit02/BrainTrace

BrainTrace: GraphRAG-based on-device knowledge management system for secure and efficient document analysis

Language: Python - Size: 325 MB - Last synced at: 27 days ago - Pushed at: 29 days ago - Stars: 3 - Forks: 4

brazilian-code/Resume_Parsing

Resume Parsing app to extract information using AI

Size: 25.8 MB - Last synced at: almost 3 years ago - Pushed at: almost 4 years ago - Stars: 3 - Forks: 3

coderosh/docpa

A simple library that I use for web scraping. Uses htmlparser2 to parse the DOM.

Language: TypeScript - Size: 69.3 KB - Last synced at: 3 months ago - Pushed at: almost 4 years ago - Stars: 3 - Forks: 0

graphlit/graphlit-client-typescript

TypeScript client for Graphlit Platform

Language: TypeScript - Size: 4.41 MB - Last synced at: 7 days ago - Pushed at: 9 days ago - Stars: 2 - Forks: 2

CyrilDesch/SRAG

An Open-source Scala-based Hybrid RAG offering deep document understanding and audio processing. Built with a flexible architecture that lets you easily plug in different models or storage systems, stateless and scalable by design.

Language: Scala - Size: 399 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 1

Bharathyalagi/OCR-Document-parser

Smart OCR application built with Tesseract and Streamlit that extracts structured data from inputs

Language: Python - Size: 29.3 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

pandaxbacon/AutoChunker

🪓 Lumberjack - AI-powered document parser with interactive tree editor. Transform PDFs, DOCX, PPTX into perfectly structured chunks for vector databases. 5 parsers, Firebase integration, live demo available.

Language: TypeScript - Size: 8.71 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

Juliofal4822/deepseek-ocr-multigpu-infer

🚀 Run efficient DeepSeek-OCR inference with Python scripts, supporting both single and multi-GPU setups for versatile performance on various hardware.

Language: Python - Size: 1.41 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

generalMG/DocumentParser

Python data pipeline for arXiv metadata: SQLAlchemy + Alembic schema, PostgreSQL storage, PDF download tracking, and optional PaddleOCR processing.

Language: Python - Size: 199 KB - Last synced at: 5 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 1

Pulkit12dhingra/automated-document-parser

A powerful and automated document parser built with LangChain for intelligent document processing. Automatically detects file types and uses appropriate loaders for PDF, DOCX, CSV, JSON, HTML, and more.

Language: Python - Size: 105 KB - Last synced at: 10 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

connectaman/deepseek-ocr-multigpu-infer

Efficient multi-GPU OCR inference framework leveraging parallel processes for accelerated token throughput and faster batch processing. Designed for scalable, high-performance optical character recognition workloads using PyTorch. Supports dynamic GPU assignment, optimized resource utilization, and easy integration for large-scale image datasets.

Language: Python - Size: 123 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

Besthope-Official/predoc

Document preprocessing service for RAG (Retrieval-Augmented Generation)

Language: Python - Size: 174 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 1

shijincai/fast360

The industry's first "Open Source OCR Arena," a free, no-login utility for one-click benchmarking of 7 top-tier models (Marker, MinerU, MonkeyOCR, Docling, Dolphin, OCRFlux, PP-StructureV3) on your PDF/image files, specializing in PDF-to-Markdown conversion.

Size: 3.78 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

anastasiashpilka/blood-test-lab-report-ocr-pdf-image-to-excel-csv

Easily convert medical reports (PDF, DOCX, images) to structured tables. Powered by Google Gemini API.

Language: JavaScript - Size: 21.5 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

MegrezAI/LeapRAG

LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

Language: Python - Size: 8.86 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

anyparser/anyparser_crewai

Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.

Language: Python - Size: 429 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

RevanKumarD/LlaMarker

Your ultimate tool for effortlessly converting and parsing documents into clean, well-structured Markdown—fast, reliable, and 100% local! 💻✨

Language: Python - Size: 6.26 MB - Last synced at: 3 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

cr4yfish/docling-js

Parsing Documents to one datatype (Typescript port of Docling) (NOT STARTED!)

Size: 23.4 KB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

agent87/IhuguraChatBotUX

Ihugure Chatbot Streamlit User Interface

Language: Python - Size: 4.05 MB - Last synced at: almost 3 years ago - Pushed at: almost 3 years ago - Stars: 1 - Forks: 0

MaineDSA/voter_participation_extractor_portland

The City of Portland distributes voter participation info in PDF format. This makes it a CSV.

Language: Python - Size: 149 KB - Last synced at: 23 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

neka-nat/rfdetr-doclayout

RF-DETR for Document Layout Analysis

Language: Python - Size: 1.46 MB - Last synced at: 20 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

lsjsj92/upstage-document-parser-playground

Upstage document parser playground (with Python Streamlit)

Language: Python - Size: 3.4 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

adaan2006/fast360

🚀 Evaluate and compare OCR engines effortlessly on Fast360, your go-to platform for testing real-world performance on specific documents.

Size: 3.77 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Samu-53/Table-Detection

📄 Detect tables in images and extract Persian text with OCR using Python, OpenCV, and Tesseract. Simplify your data analysis and visualization.

Language: Jupyter Notebook - Size: 137 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

AkandindaJunior/Cloud-Services

If it’s not documented, it never happened. 📝 Please check my README.md for more details. 🔍

Size: 1000 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

AI-Enginner/Intelligent-Document-Processing

AI-powered data extraction tool that converts PDFs, images, and scanned documents into structured data in seconds.

Size: 5.86 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

arv-fazriansyah/ocr-gemini-ai-api

OCR (Optical Character Recognition) API based on Google's Gemini AI, with clean output as structured JSON.

Language: HTML - Size: 259 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

suwa-sh/local-RAG-backend

This is the backend for a RAG system that runs on Docker Compose. It registers documents in a wide range of file formats, which can be searched using the MCP server.

Language: Python - Size: 313 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

atbasu/document-content-extractor

Python program that uses OpenAI APIs to parse user-specified content from text files

Language: Python - Size: 135 KB - Last synced at: 8 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

shrimantasatpati/Document_Parser_using_AI

Parse documents using AI - any document converted to markdown suitable for RAG applications

Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

munenepeter/Case-Law-Search

A Simple Case parser and search

Language: PHP - Size: 573 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 1

MidHunterX/Scholar-CAP

🎓 Set of powerful tools designed to streamline the extraction, parsing, and clean-up of data from DOCX and PDF forms. Saves time and eliminates manual data entry by automating the processing of structured data.

Language: Python - Size: 11.2 MB - Last synced at: 10 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

munenepeter/translate Fork of Abtez/translate 📦

A simple document uploader & parser

Language: Hack - Size: 202 KB - Last synced at: almost 3 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

buren/document_parser

Small Rails API app to parse documents.

Language: Ruby - Size: 30.3 KB - Last synced at: 4 months ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 1

lorenzbr/techStandards

Download and parse technical standard documents

Language: R - Size: 1.67 MB - Last synced at: almost 3 years ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 0

dills122/ShamWow

Who likes lawyers? Me either; scrub your PII with ShamWow

Language: C# - Size: 97.7 KB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 1

JayLohokare/docX

Convert documents into quizzes! Built at HackNY (Android + NodeJS + Alexa skill)

Language: Java - Size: 288 KB - Last synced at: almost 3 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0

JayLohokare/docX-REST-API Fork of shubh0906/hackNY

Shubham's REST APIs made at hackNY

Language: JavaScript - Size: 5.86 KB - Last synced at: almost 3 years ago - Pushed at: about 7 years ago - Stars: 0 - Forks: 0