An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: document-parser

infiniflow/ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Language: TypeScript - Size: 72 MB - Last synced at: about 3 hours ago - Pushed at: about 10 hours ago - Stars: 49,724 - Forks: 4,688

Besthope-Official/predoc

Document preprocessing service for RAG (Retrieval-Augmented Generation)

Language: Python - Size: 18.6 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Language: Python - Size: 70 MB - Last synced at: 2 days ago - Pushed at: about 2 months ago - Stars: 3,833 - Forks: 305

Unstructured-IO/unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Language: HTML - Size: 192 MB - Last synced at: 3 days ago - Pushed at: 13 days ago - Stars: 10,915 - Forks: 907

Filimoa/open-parse

Improved file parsing for LLMs

Language: Python - Size: 7.23 MB - Last synced at: 3 days ago - Pushed at: 5 months ago - Stars: 2,912 - Forks: 119

liweiphys/layra

LAYRA is a ready-to-use visual RAG system with a complete web UI built with Next.js and FastAPI, preserving document layout, tables, paragraphs, and graphical elements without any structural fragmentation.

Language: TypeScript - Size: 2.61 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 427 - Forks: 42

graphlit/graphlit-client-typescript

TypeScript client for Graphlit Platform

Language: TypeScript - Size: 829 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 1

graphlit/graphlit-client-python

Python client library for Graphlit Platform

Language: Python - Size: 2.09 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11 - Forks: 2

run-llama/llama_cloud_services

Knowledge Agents and Management in the Cloud

Language: Python - Size: 45.8 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 3,878 - Forks: 392

iamarunbrahma/vision-parse

Parse PDFs into markdown using Vision LLMs

Language: Python - Size: 374 KB - Last synced at: 7 days ago - Pushed at: 2 months ago - Stars: 340 - Forks: 45

AkandindaJunior/Cloud-Services

If it’s not documented, it never happened. 📝 Please check my README.md for more details. 🔍

Size: 1000 Bytes - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

MaineDSA/voter_participation_extractor_portland

The City of Portland distributes voter participation info in PDF format. This makes it a CSV.

Language: Python - Size: 88.9 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

docling-project/docling

Get your documents ready for gen AI

Language: Python - Size: 70.9 MB - Last synced at: 8 days ago - Pushed at: 10 days ago - Stars: 27,013 - Forks: 1,631

has-abi/docparser

Extract text from your DOCX documents.

Language: Python - Size: 92.8 KB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 2

deepdoctection/deepdoctection

A Repo For Document AI

Language: Python - Size: 22.4 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 2,778 - Forks: 153

Gyanvir/DrParser

Dr.Parser 🩸📊 – AI-powered blood report parser that extracts and analyzes medical data from images/PDFs. Built with React, FastAPI, EasyOCR, and Gemini AI. 🚀 🔹 Local Setup Available | 🔹 Future Enhancements Planned | 🔹 Hackathon Project 👉 Clone, run, and explore the future of AI-driven healthcare!

Language: Python - Size: 64.4 MB - Last synced at: 9 days ago - Pushed at: 22 days ago - Stars: 4 - Forks: 0

docling-project/docling4j

Docling4j brings the functionalities of Docling in document understanding to Java® projects

Language: Java - Size: 16.6 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

marieai/marie-ai

Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing

Language: Python - Size: 35.4 MB - Last synced at: 1 day ago - Pushed at: 21 days ago - Stars: 67 - Forks: 7

hrbrmstr/docparser

🧰 Tools to Upload/Parse Documents to 'docparser' and Retrieve Extracted Results

Language: R - Size: 10.7 KB - Last synced at: 12 days ago - Pushed at: over 6 years ago - Stars: 4 - Forks: 0

papercast-dev/papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF with GROBID and LangChain, then listen to them as a podcast. Customize your own pipelines.

Language: Python - Size: 218 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 48 - Forks: 1

anyparser/anyparser_crewai

Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.

Language: Python - Size: 429 KB - Last synced at: 18 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

RevanKumarD/LlaMarker

Your ultimate tool for effortlessly converting and parsing documents into clean, well-structured Markdown—fast, reliable, and 100% local! 💻✨

Language: Python - Size: 6.26 MB - Last synced at: 14 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

decisionfacts/semantic-ai

An open-source framework for Retrieval-Augmented Generation (RAG) systems that uses semantic search to retrieve the expected results and generate human-readable conversational responses with the help of an LLM (Large Language Model).

Language: Python - Size: 4.53 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 20 - Forks: 1

cr4yfish/docling-js

Parsing documents to one datatype (TypeScript port of Docling)

Size: 23.4 KB - Last synced at: 24 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Clearedge-AI/clearedge

Build a RAG preprocessing pipeline

Language: Jupyter Notebook - Size: 24.7 MB - Last synced at: 17 days ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 0

shrimantasatpati/Document_Parser_using_AI

Parse documents using AI - any document converted to markdown suitable for RAG applications

Language: Jupyter Notebook - Size: 12.2 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

munenepeter/Case-Law-Search

A simple case-law parser and search tool

Language: PHP - Size: 573 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 1

JPLeoRX/opencv-text-deskew

Tutorial on how to deskew (straighten) text images

Language: Python - Size: 43 MB - Last synced at: 21 days ago - Pushed at: about 3 years ago - Stars: 51 - Forks: 16

urbanclap-engg/smart-docs-parser

An OCR based document parser to extract information from identity document images

Language: TypeScript - Size: 64.5 KB - Last synced at: 6 days ago - Pushed at: over 2 years ago - Stars: 21 - Forks: 7

MidHunterX/Scholar-CAP

🎓 Set of powerful tools designed to streamline the extraction, parsing, and clean-up of data from DOCX and PDF forms. Saves time and eliminates manual data entry by automating the processing of structured data.

Language: Python - Size: 11.2 MB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

decisionfacts/df-extract

DF Extract Lib

Language: Python - Size: 29.3 KB - Last synced at: 3 days ago - Pushed at: about 1 year ago - Stars: 14 - Forks: 0

InvoiceableAI/Invoiceable

The invoice, document, and résumé parser powered by AI.

Language: Python - Size: 43.9 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 0

atbasu/document-content-extractor

Python program that uses OpenAI APIs to parse user-specified content from text files

Language: Python - Size: 123 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

coderosh/docpa

A simple library that I use for web scraping. Uses htmlparser2 to parse the DOM.

Language: TypeScript - Size: 69.3 KB - Last synced at: 12 months ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 0

munenepeter/translate (fork of Abtez/translate) 📦

A simple document uploader & parser

Language: Hack - Size: 202 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

brazilian-code/Resume_Parsing

Resume Parsing app to extract information using AI

Size: 25.8 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 3 - Forks: 3

agent87/IhuguraChatBotUX

Ihugura Chatbot Streamlit User Interface

Language: Python - Size: 4.05 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

lorenzbr/techStandards

Download and parse technical standard documents

Language: R - Size: 1.67 MB - Last synced at: about 2 years ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

dills122/ShamWow

Who likes lawyers? Me either; scrub your PII with ShamWow

Language: C# - Size: 97.7 KB - Last synced at: 21 days ago - Pushed at: almost 6 years ago - Stars: 0 - Forks: 1

JayLohokare/docX-REST-API (fork of shubh0906/hackNY)

Shubham's REST APIs made at hackNY

Language: JavaScript - Size: 5.86 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

JayLohokare/docX

Convert documents into quizzes! Built at HackNY (Android + NodeJS + Alexa skill)

Language: Java - Size: 288 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

buren/document_parser

Small Rails API app to parse documents.

Language: Ruby - Size: 30.3 KB - Last synced at: about 1 month ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 1