GitHub topics: document-processing

syw2014/langparse

LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.

Language: Python - Size: 76.2 KB - Last synced at: about 10 hours ago - Pushed at: about 11 hours ago - Stars: 3 - Forks: 0

ucbepic/docetl

A system for agentic LLM-powered data processing and ETL

Language: Python - Size: 62.2 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 3,102 - Forks: 323

belumume/claude-skills

Personal collection of Claude skills - growing as I discover patterns and solve real-world problems

Language: Python - Size: 93.8 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

SylphxAI/pdf-reader-mcp

📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage

Language: TypeScript - Size: 851 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 331 - Forks: 42

faisal-990/Latex_customiser_pdf_generator

A PyQt5 GUI tool for extracting and reassembling LaTeX documents. Load a .tex file, select the sections/equations/tables you want, and generate a new PDF with just those components. Perfect for creating custom documents from existing LaTeX files.

Language: Python - Size: 67.1 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

SenseiOguz/Dual-AI-Chat

🤖 Leverage dual AI models to generate precise and thoughtful responses using a flexible backend, enhancing interaction quality and reliability.

Language: TypeScript - Size: 199 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

jwill9999/Vector-DB-Service

A microservice that lets you upload documents from Google services and then embeds them into a vector database.

Language: TypeScript - Size: 1.08 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

NoliNobdon/TriStage-RAG

🎯 Optimize retrieval with TriStage-RAG, a 3-stage pipeline that enhances document discovery while overcoming the limits of single-vector embeddings.

Language: Python - Size: 1.45 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

anisderoual/Document_Archiver_Korean-NLP_BERTClustering

📂 Extract, embed, cluster, and securely store Korean text from documents using BERT, enhancing research efficiency and organization.

Language: Python - Size: 11.7 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

apathi/PM_book_RAG

Retrieval-Augmented Generation (RAG) system for PMs. Ask Product Management questions and receive answers rooted in cutting-edge, battle-tested frameworks and industry-proven insights from the world's most authoritative and current PM resources.

Language: Python - Size: 537 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

lareira-digital/hecate-escriba

Escriba is a powerful document generation API with programmatic templates

Language: Python - Size: 207 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 1 - Forks: 0

quarkiverse/quarkus-docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem

Language: Java - Size: 134 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11 - Forks: 4

H0NEYP0T-466/Pen2PDF

⚡ Pen2PDF Suite – an all-in-one 🚀 productivity platform ✨ with 🤖 AI-powered text extraction (PDF/Images → Markdown 📝), 📅 smart timetable management (CSV/Excel import 📊), ✅ todo lists with subtasks 📈, 🧠 AI-generated notes library 📚 and 💬 Isabella AI assistant (OpenAI/Microsoft/llama/Mistral/LongCat/Gemini models 🔄) for context-aware help 🧩.

Language: JavaScript - Size: 1.74 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 6 - Forks: 1

syncfusion/document-sdk-asp-net-mvc-demos

Explore the Syncfusion ASP.NET MVC demos featuring our advanced PDF, Word, Excel, and PowerPoint document processing libraries

Language: C# - Size: 102 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

IBM/docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

Language: Python - Size: 31.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

ResetNetwork/n8n-nodes

A collection of custom n8n nodes for enhanced document processing, text splitting, and embeddings generation

Language: TypeScript - Size: 766 KB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 4

tommcd/doctk

A composable toolkit for structured document manipulation

Language: Python - Size: 221 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Goblanch/Expediente-Index

Small desktop application (Tkinter + ttkbootstrap) that generates a Document Index from all the PDFs in a folder, exporting to Word (.docx) and/or PDF (.pdf).

Language: Python - Size: 43.9 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

Poolchaos/artemis-insight

AI-powered document intelligence platform for automated PDF summarization with customizable templates

Language: Python - Size: 12.6 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

Delshi/image_walker

Smart file organization with metadata intelligence. Automatically sort and categorize your files using advanced filtering and plugin system.

Language: Python - Size: 5.29 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

arsalanafzal010/SmartRAG

📄 Enable smart conversations with documents, images, and audio files using this advanced Retrieval-Augmented Generation system.

Language: Python - Size: 1.39 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

zoepranataksm/Mind_Vault_AI

📚 Automate knowledge transfer with Mind Vault, an AI-driven system that converts unstructured data into searchable insights, enhancing onboarding and team efficiency.

Language: JavaScript - Size: 1.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

Valido-App/valido-app.github.io

Official website and download page for Valido - Professional PDF validation and data extraction tool for Windows

Language: HTML - Size: 367 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

MantisAI/sieves

Plug-and-play, zero-shot document processing pipelines.

Language: Python - Size: 2.86 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 113 - Forks: 8

Ai4GenXers/pdf-sentinel

Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero idle resources

Language: Python - Size: 34.2 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

B-A-M-N/FlockParser

Distributed document RAG system with intelligent GPU/CPU orchestration. Auto-discovers heterogeneous nodes, routes workloads adaptively, and achieves 60x+ speedups through VRAM-aware load balancing. Privacy-first architecture with 4 interfaces (CLI, API, MCP, Web UI). Real distributed systems engineering, not just an API wrapper.

Language: Python - Size: 95.3 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 4 - Forks: 2

eclaire-labs/eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

Language: TypeScript - Size: 2.89 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 501 - Forks: 51

katrina-09/pdf-scraper

PDF scraper to extract text

Language: Python - Size: 7.81 KB - Last synced at: 12 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

deadhand777/doc-redaction

Document Redaction Automation Service

Language: Python - Size: 2.08 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

Rayyan9477/ocr-app

State-of-the-art Optical Character Recognition (OCR) with Vision Language Model (VLM) integration for enhanced accuracy and optimal document processing.

Language: TypeScript - Size: 23 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 3 - Forks: 0

martin-papy/qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

Language: Python - Size: 26.8 MB - Last synced at: 10 days ago - Pushed at: 2 months ago - Stars: 15 - Forks: 9

KikuAI-Lab/DocStripper

🧹 DocStripper is a lightweight CLI utility that automatically cleans text documents

Language: Python - Size: 1.27 MB - Last synced at: 10 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 1

asbah-ramzan/HackRx-6.0-Intelligent-Query-Retrieval

🧠 Elevate document intelligence with HackRx 6.0, a powerful RAG system for extracting insights from complex files like PDFs and DOCX.

Language: Python - Size: 1.3 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

abhaydixit07/ayurguru-frontend

AyurGuru - Revolutionizing Wellness with Ayurveda and AI. AyurGuru is an AI-powered platform delivering Ayurvedic health solutions in real time. Users can consult a smart chatbot, upload medical reports for tailored insights, and explore comprehensive Ayurvedic blogs. Built with modern web technologies for a secure and seamless user experience.

Language: JavaScript - Size: 13.7 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 2

byerlikaya/SmartRAG

⚡ Production-ready .NET Standard 2.1 RAG library with 🤖 multi-AI provider support, 🏢 enterprise vector storage, 📄 intelligent document processing, and 🗄️ multi-database query coordination. 🌍 Cross-platform compatible.

Language: C# - Size: 11.7 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 6 - Forks: 2

watat83/document-chat-system

Open-source document chat platform with semantic search, RAG (Retrieval Augmented Generation), and multi-provider AI support (OpenRouter, OpenAI, ImageRouter).

Language: TypeScript - Size: 71.3 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 24 - Forks: 10

formkiq/formkiq-core

Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.

Language: Java - Size: 24.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 144 - Forks: 24

Keremunce/nodejs-pdf-extractor

Node.js + Express app that extracts plain text from uploaded PDFs, with a browser UI for manual tests and pdf-parse driving the extraction pipeline.

Language: HTML - Size: 13.7 KB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

jmanhype/DSPy-Multi-Document-Agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

Language: Python - Size: 143 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 49 - Forks: 5

nezchan0/SecureCompress

Privacy-first image compressor. Resize, convert & compress images offline with DPI control and cm↔px conversion. No uploads, no tracking. 🔒

Language: HTML - Size: 21.5 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

aget-framework/template-document-processor-AGET

Production-ready template for creating document processing agents with LLM pipelines, security protocols, and multi-provider support

Language: Python - Size: 296 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

awslabs/rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

Language: Python - Size: 32 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 96 - Forks: 14

jvahedi/doc-sqz

Converts PDF and DOCX files to text for use in a pipeline.

Language: Python - Size: 52.7 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

Cerno-AI/Cerno-Insight

High-performance RAG system for intelligent document Q&A with hybrid retrieval, GPU acceleration, and citation-backed answers. Upload docs, ask questions, get precise responses.

Language: Python - Size: 32.2 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 1

OlegCheban/WaterMarkIt

A lightweight, framework-agnostic Java library for adding watermarks to various file types, including PDFs and videos

Language: Java - Size: 3.22 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 21 - Forks: 21

B-A-M-N/FlockParser-legacy

Legacy version of FlockParser PDF processing system

Language: Python - Size: 3.02 MB - Last synced at: 11 days ago - Pushed at: 29 days ago - Stars: 1 - Forks: 0

Unsiloed-AI/Unsiloed-Parser

Language: Python - Size: 114 KB - Last synced at: 29 days ago - Pushed at: 30 days ago - Stars: 148 - Forks: 40

RafiBG/AIChatDiscordBotWeb

Local AI chat bot for Discord with a web interface for startup and configuration

Language: C# - Size: 1.07 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

xsukax/xsukax-CamScanner-PDF-Watermark-Remover

A robust, privacy-focused command-line utility that intelligently removes CamScanner watermarks from PDF documents and exports clean results to multiple formats including PDF, PNG, and multi-page TIFF.

Language: Python - Size: 26.4 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

RanitDERIA/tata-rfp

AI-powered RFP response platform that automates proposal generation using LlamaIndex and GPT-4. Extract questions from documents and generate contextual responses 80% faster with intelligent document indexing and team collaboration.

Language: TypeScript - Size: 2.92 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

syncfusion/document-sdk-uwp-demos

Explore the Syncfusion Universal Windows Platform demos featuring our advanced PDF, Word, Excel, and PowerPoint document processing libraries.

Language: C# - Size: 36.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

syncfusion/document-sdk-wpf-demos

This repository contains WPF demos for creating, reading, editing, and converting Excel, Word, PDF, and Presentation documents programmatically using Syncfusion .NET Document Processing libraries.

Language: C# - Size: 97.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 3

syncfusion/document-sdk-winforms-demos

This repository contains Windows Forms demos for creating, reading, editing, and converting Excel, Word, PDF, and Presentation documents programmatically using Syncfusion .NET Document Processing libraries.

Language: C# - Size: 55.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

syncfusion/document-sdk-blazor-demos

Explore the Syncfusion Blazor demos featuring our advanced PDF, Word, Excel, and PowerPoint document processing libraries.

Language: CSS - Size: 46.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

eklem/stopword-trainer

A module for creating stopword lists for any language, based on a set of documents.

Language: JavaScript - Size: 5.01 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 15 - Forks: 0

awslabs/project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

Language: TypeScript - Size: 177 MB - Last synced at: 4 days ago - Pushed at: 8 months ago - Stars: 185 - Forks: 26

hammad-haque/felice-legal-ai

Felice Legal AI - AI-powered personal injury document processing platform (eve.legal replacement)

Size: 0 Bytes - Last synced at: 16 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

ma3u/neo4j-agentframework

🚀 Hybrid RAG: Local Neo4j + BitNet.cpp RAG System and Azure SaaS deployment. Fast vector search, instant Docker deployment via GitHub Container Registry. Complete RAG pipeline with ultra-efficient LLMs for enterprise knowledge management.

Language: Python - Size: 35.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Asamaurdhava/Claria

Claria instantly transforms any complex document—legal contracts, medical reports, technical specs—into crystal-clear language anyone can understand, powered by Chrome's revolutionary built-in AI that runs entirely on-device for complete privacy.

Language: JavaScript - Size: 266 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Alijanloo/Pdf2Table

A Python library for extracting tables from PDF documents using computer vision and image processing techniques. It converts PDF pages to images, detects tables, recognizes their structure, and outputs clean data in JSON format.

Language: Python - Size: 2.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

mehmetaltinbas/ExtralyzUI

An AI-powered study platform where users upload documents (PDF, DOCX, TXT, ...) to get more understandable abstractive summaries of chosen length and auto-generated practice exercises (open-ended, multiple-choice, true-false, ...).

Language: TypeScript - Size: 637 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

manasbhansali27/chat-with-files

A lightweight local AI assistant that lets you chat with your files — PDFs, documents, images, videos, and code — using semantic search, embeddings, OCR, and multimodal LLMs. Optimized to run on modest GPUs (e.g., RTX 3050 4GB) without requiring heavy VRAM like ChatRTX.

Language: Python - Size: 31.3 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

parsee-ai/parsee-core

Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

Language: Python - Size: 1.23 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 75 - Forks: 1

indu-explores-data/Automated-Resume-Data-Extraction

Automated resume information extraction using NLP. The project extracts Name, Email, and Phone from TXT, DOCX, and PDF files using spaCy and regex. It converts unstructured data into structured formats, improving recruitment efficiency and enabling scalable candidate profiling.

Language: Jupyter Notebook - Size: 71.3 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

MyGovHub-Goodbye-World/backend-agent-mcp

AI-Powered Government Services Assistant - Serverless AWS Lambda function built for MyGovHub that intelligently handles Malaysian driving license renewals and TNB electricity bill payments through document OCR, AI chat responses, and secure payment processing.

Language: Python - Size: 484 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

xsukax/xsukax-Word-Document-Comparison-Tool

A powerful, privacy-focused web application for side-by-side comparison of Word documents with intelligent diff highlighting, comprehensive analytics, and multilingual support including Arabic and RTL languages.

Language: Python - Size: 27.3 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

KaramelBytes/docloom-cli

AI‑augmented document analysis and lightweight retrieval (Go) with OpenRouter and Ollama. Cross‑platform binaries, cost guardrails, and streaming.

Language: Go - Size: 128 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

akoutsop1909/pdf-to-txt-converter

A simple Java CLI tool for batch-converting PDF files to TXT format. Supports file filtering by filename wildcards and last modified date.

Language: Java - Size: 12.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

airiseworks/doc2md-api

📄 Convert DOCX, PDF, PPTX, and images to Markdown effortlessly with this secure API built in Python, featuring API key protection and Docker support.

Language: Python - Size: 1.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

nvisycom/run

Multimodal extraction runtime for the platform. Processes images, PDFs, and scanned documents to enable automated detection and removal of sensitive information.

Size: 35.2 KB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

ucbepic/TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

Language: Python - Size: 177 MB - Last synced at: 18 days ago - Pushed at: 6 months ago - Stars: 207 - Forks: 16

PSPDFKit/nutrient-document-engine-mcp-server

A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.

Language: TypeScript - Size: 25 MB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 56 - Forks: 1

jayll1303/table2html

A Python package that converts table images into HTML format using an object detection model and OCR.

Language: Python - Size: 381 MB - Last synced at: about 1 month ago - Pushed at: 12 months ago - Stars: 7 - Forks: 0

felixdittrich92/docling-OCR-OnnxTR

OnnxTR OCR plugin for Docling

Language: Python - Size: 1.5 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 12 - Forks: 0

MyGovHub-Goodbye-World/document-ingestion-and-text-extraction

AI-powered document analysis service combining AWS Textract, Bedrock, and intelligent blur detection. Supports CLI and serverless Lambda API for Malaysian documents (licenses, receipts, ID cards, utility bills).

Language: Python - Size: 5.35 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

MrSpecks/365-QnA-Chatbot

General question-and-answer chatbot using LangChain

Language: Python - Size: 9.77 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

danielsobrado/OxideRAG

A Rust-first RAG toolkit that blends page indexing, mindmaps, and knowledge graphs to retrieve and reason over structured data, chats, emails, and PDFs.

Language: Rust - Size: 20.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

pandaxbacon/AutoChunker

🪓 Lumberjack - AI-powered document parser with interactive tree editor. Transform PDFs, DOCX, PPTX into perfectly structured chunks for vector databases. 5 parsers, Firebase integration, live demo available.

Language: TypeScript - Size: 8.71 MB - Last synced at: 25 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

easytocloud/Mac-letterhead

A macOS utility for merging letterhead templates with PDF and Markdown documents using a drag-and-drop interface

Language: Python - Size: 4.83 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Qleric-labs/Contract-extraction-assistant

Turn contract PDFs into structured data in seconds. Local-first extraction

Language: TypeScript - Size: 2.41 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Magnet-AI/Quanta

Advanced PDF layout analysis engine for extracting figures, tables, and structured content from complex engineering documents using computer vision and machine learning.

Language: Python - Size: 85.9 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

Language: Python - Size: 69.3 KB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 95 - Forks: 8

kili-technology/awesome-datasets

A comprehensive list of annotated training datasets classified by use case.

Size: 24.9 MB - Last synced at: 5 days ago - Pushed at: over 3 years ago - Stars: 35 - Forks: 6

MBAigner/PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

Language: Python - Size: 399 KB - Last synced at: 28 days ago - Pushed at: about 5 years ago - Stars: 22 - Forks: 3

jmragsdale/azure-blob-ai-doc-summarizer

Serverless AI document summarization using Azure Functions, Blob Storage, and Azure OpenAI. Automatically extract and summarize PDFs, DOCX, TXT, and Markdown files.

Language: Python - Size: 23.4 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

abdullahshafiq-20/ResumeTex

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.

Language: JavaScript - Size: 163 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 37 - Forks: 5

aws-solutions/enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

Language: JavaScript - Size: 62.9 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 40 - Forks: 19

smart-models/Sentences-Chunker

Cutting-edge tool designed to intelligently segment text documents into optimally-sized chunks

Language: Python - Size: 1.98 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 6 - Forks: 0

saksham-1304/AskMyPDF

🤖 AI-Powered PDF Chat App | Dual AI Engine (Alchemyst + Gemini) | RAG Pipeline | Vector Search | MERN + TypeScript

Language: TypeScript - Size: 615 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 7 - Forks: 0

jasoncobra3/Floorplan-Dimractor

A sophisticated Python pipeline for automatically extracting dimensions and cabinet codes from architectural floorplan PDFs. This tool converts various dimension formats into standardized measurements and provides structured output with visualization capabilities.

Language: Python - Size: 2.16 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0