An open API service providing repository metadata for many open source software ecosystems.

Topic: "document-processing"

ucbepic/docetl

A system for agentic LLM-powered data processing and ETL

Language: Python - Size: 62.3 MB - Last synced at: 4 days ago - Pushed at: 7 days ago - Stars: 3,290 - Forks: 352

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Language: Python - Size: 20.5 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1,378 - Forks: 134

eclaire-labs/eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

Language: TypeScript - Size: 3.07 MB - Last synced at: 19 days ago - Pushed at: 22 days ago - Stars: 706 - Forks: 64

SylphxAI/pdf-reader-mcp

📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage

Language: TypeScript - Size: 2.39 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 381 - Forks: 50

dhlab-epfl/dhSegment

Generic framework for historical document processing

Language: Python - Size: 5.89 MB - Last synced at: 6 months ago - Pushed at: over 4 years ago - Stars: 378 - Forks: 115

ucbepic/TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

Language: Python - Size: 174 MB - Last synced at: 27 days ago - Pushed at: 30 days ago - Stars: 209 - Forks: 17

awslabs/project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

Language: TypeScript - Size: 177 MB - Last synced at: 10 days ago - Pushed at: 9 months ago - Stars: 185 - Forks: 25

formkiq/formkiq-core

Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.

Language: Java - Size: 25 MB - Last synced at: about 6 hours ago - Pushed at: 2 days ago - Stars: 148 - Forks: 25

Unsiloed-AI/Unsiloed-Parser

Language: Python - Size: 114 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 148 - Forks: 40

Tele-AI/doc-ops-mcp

MCP server for seamless document format conversion and processing

Language: TypeScript - Size: 631 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 129 - Forks: 2

MantisAI/sieves

Plug-and-play document AI with zero-shot models.

Language: Python - Size: 3 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 120 - Forks: 8

awslabs/rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

Language: Python - Size: 32.4 MB - Last synced at: 10 days ago - Pushed at: 20 days ago - Stars: 98 - Forks: 14

iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

Language: Python - Size: 69.3 KB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 95 - Forks: 8

parsee-ai/parsee-core

Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

Language: Python - Size: 1.24 MB - Last synced at: 4 days ago - Pushed at: 7 days ago - Stars: 79 - Forks: 1

steindani/pandoc-include

An include filter for Pandoc

Language: Haskell - Size: 9.77 KB - Last synced at: 18 days ago - Pushed at: about 5 years ago - Stars: 62 - Forks: 20

PSPDFKit/nutrient-document-engine-mcp-server

A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.

Language: TypeScript - Size: 25 MB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 56 - Forks: 1

jmanhype/DSPy-Multi-Document-Agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

Language: Python - Size: 143 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 49 - Forks: 5

aws-solutions/enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

Language: JavaScript - Size: 62.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 40 - Forks: 19

abdullahshafiq-20/ResumeTex

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.

Language: JavaScript - Size: 163 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 37 - Forks: 5

cburschka/lyx

Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)

Language: C++ - Size: 616 MB - Last synced at: 8 months ago - Pushed at: almost 3 years ago - Stars: 36 - Forks: 7

kili-technology/awesome-datasets

A comprehensive list of annotated training datasets classified by use case.

Size: 24.9 MB - Last synced at: 11 days ago - Pushed at: over 3 years ago - Stars: 36 - Forks: 6

belumume/claude-skills

Personal collection of Claude skills - growing as I discover patterns and solve real-world problems

Language: Python - Size: 182 KB - Last synced at: 10 days ago - Pushed at: 12 days ago - Stars: 31 - Forks: 0

afrozas/proceedings

Semantic extraction from conference proceedings.

Language: Python - Size: 1.06 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 31 - Forks: 1

ucbepic/BARGAIN

Low-Cost LLM-Powered Data Processing with Theoretical Guarantees

Language: Python - Size: 18.9 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 28 - Forks: 3

watat83/document-chat-system

Open-source document chat platform with semantic search, RAG (Retrieval Augmented Generation), and multi-provider AI support (OpenRouter, OpenAI, ImageRouter).

Language: TypeScript - Size: 71.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 24 - Forks: 10

autollama/autollama

Anthropic's Contextual Retrieval implementation with visual chunk comparison. Preview context enrichment before/after embedding.

Language: HTML - Size: 21 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 24 - Forks: 0

MBAigner/PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

Language: Python - Size: 399 KB - Last synced at: 2 months ago - Pushed at: over 5 years ago - Stars: 22 - Forks: 3

OlegCheban/WaterMarkIt

A lightweight, framework-agnostic Java library for adding watermarks to various file types, including PDFs and videos

Language: Java - Size: 3.22 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 21 - Forks: 21

martin-papy/qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

Language: Python - Size: 26.6 MB - Last synced at: 6 days ago - Pushed at: 11 days ago - Stars: 20 - Forks: 16

greed2411/tokyo

tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.

Language: Clojure - Size: 19.5 KB - Last synced at: 8 months ago - Pushed at: over 5 years ago - Stars: 18 - Forks: 0

smart-models/Normalized-Semantic-Chunker

Cutting-edge tool that unlocks the full potential of semantic chunking

Language: Python - Size: 3.82 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 17 - Forks: 4

quarkiverse/quarkus-docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem

Language: Java - Size: 2.23 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 16 - Forks: 7

felixdittrich92/docling-OCR-OnnxTR

OnnxTR OCR plugin for Docling

Language: Python - Size: 1.5 MB - Last synced at: 10 days ago - Pushed at: 12 days ago - Stars: 15 - Forks: 0

eklem/stopword-trainer

A module for creating stopword lists for any language, based on a set of documents.

Language: JavaScript - Size: 5.01 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 15 - Forks: 0

Mellow-Artificial-Intelligence/open-xtract

Extract structured data from documents, images, audio, and video using LLMs.

Language: Python - Size: 645 KB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 14 - Forks: 2

byerlikaya/SmartRAG

⚡ Production-ready .NET Standard 2.1 RAG library with 🤖 multi-AI provider support, 🏢 enterprise vector storage, 📄 intelligent document processing, and 🗄️ multi-database query coordination. 🌍 Cross-platform compatible.

Language: C# - Size: 23.5 MB - Last synced at: 13 days ago - Pushed at: 14 days ago - Stars: 14 - Forks: 4

3DCF-Labs/doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

Language: Rust - Size: 515 KB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 14 - Forks: 1

thammuio/doc-genius-ai

DocGenius AI - Generative AI Chatbot for your Documents

Language: Python - Size: 4.48 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 13 - Forks: 6

aws-samples/sample-document-processing-with-amazon-bedrock-data-automation

This repository contains examples for customers to get started using Amazon Bedrock Data Automation. The samples focus mainly on document processing use cases

Language: Jupyter Notebook - Size: 9.46 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 12 - Forks: 10

vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

Language: TypeScript - Size: 876 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 10 - Forks: 1

aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai

This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.

Language: Jupyter Notebook - Size: 3.24 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 10 - Forks: 2

ResetNetwork/n8n-nodes

A collection of custom n8n nodes for enhanced document processing, text splitting, and embeddings generation

Language: TypeScript - Size: 907 KB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 9 - Forks: 5

saksham-1304/AskMyPDF

🤖 AI-Powered PDF Chat App | Dual AI Engine (Alchemyst + Gemini) | RAG Pipeline | Vector Search | MERN + TypeScript

Language: TypeScript - Size: 615 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 7 - Forks: 0

aws-samples/idp-invoice-automation-using-bedrock-data-automation-cdk

Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.

Language: Python - Size: 204 KB - Last synced at: 7 months ago - Pushed at: 11 months ago - Stars: 7 - Forks: 1

jayll1303/table2html

A Python package that converts table images into HTML format using Object Detection model and OCR.

Language: Python - Size: 381 MB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 0

H0NEYP0T-466/Pen2PDF

⚡ Pen2PDF Suite – an all-in-one 🚀 productivity platform ✨ with 🤖 AI-powered text extraction (PDF/Images → Markdown 📝), 📅 smart timetable management (CSV/Excel import 📊), ✅ todo lists with subtasks 📈, 🧠 AI-generated notes library 📚 and 💬 Isabella AI assistant (OpenAI/Microsoft/llama/Mistral/LongCat/Gemini models 🔄) for context-aware help 🧩.

Language: JavaScript - Size: 1.74 MB - Last synced at: 27 days ago - Pushed at: 29 days ago - Stars: 6 - Forks: 1

smart-models/Sentences-Chunker

Cutting-edge tool designed to intelligently segment text documents into optimally-sized chunks

Language: Python - Size: 1.98 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

aidalinfo/extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

Language: TypeScript - Size: 1.27 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 6 - Forks: 0

drgsn/filefusion

FileFusion is a powerful file concatenation tool designed specifically for Large Language Model (LLM)

Language: Go - Size: 173 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 6 - Forks: 0

abdur75648/urdu-text-detection

Text line detection for Urdu OCR (UTRNet)

Language: Python - Size: 48.5 MB - Last synced at: 8 months ago - Pushed at: about 1 year ago - Stars: 6 - Forks: 1

jeanbaptisteb/doccleaner

A Python command-line utility intended for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub), through XSLT stylesheets. Can be rather easily extended with your own custom xsl stylesheets.

Language: XSLT - Size: 81.1 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 6 - Forks: 2

r-uben/deepseek-ocr-cli

CLI tool for OCR using DeepSeek-OCR model via Ollama. Local processing with zero cloud dependencies.

Language: Python - Size: 1.19 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 5 - Forks: 1

m4nd0mb3/document-templater

Document Templater is a powerful tool for automated document generation. Streamline the process of creating standard documents, such as contracts, reports, and forms, using predefined templates. This repository contains the source code for Document Templater, allowing you to easily integrate this functionality into your projects and automate docs.

Language: JavaScript - Size: 579 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 0

B-A-M-N/FlockParser

Distributed document RAG system with intelligent GPU/CPU orchestration. Auto-discovers heterogeneous nodes, routes workloads adaptively, and achieves 60x+ speedups through VRAM-aware load balancing. Privacy-first architecture with 4 interfaces (CLI, API, MCP, Web UI). Real distributed systems engineering, not just an API wrapper.

Language: Python - Size: 95.3 MB - Last synced at: 9 days ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 2

CentralFloridaAttorney/zmongo_retriever

zmongo_toolbag contains an easy-to-use MongoDB wrapper with a Langchain Vector Search Retriever implementation

Language: Python - Size: 27.8 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 4 - Forks: 2

Addepto/graph_builder

Open-source toolkit to extract structured knowledge graphs from documents and tables — power analytics, digital twins, and AI-driven assistants.

Language: Python - Size: 163 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 0

BABIN-JOE/NeuroDoc

NeuroDoc is a powerful AI-based offline document summarization tool that leverages OCR and NLP to intelligently analyze PDFs and generate structured summaries. Built using Flask, this tool is designed to run completely offline and supports both text-based and scanned/image-based documents.

Language: Python - Size: 13.7 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

thearpankumar/GPUaccelerated-multilingual-RAG

GPU - vector DB - AI-powered document processing platform for financial services. Features intelligent question answering, multi-format document support (PDF, DOCX, Excel), vector search with Qdrant, and multiple LLM integrations (OpenAI, Gemini, GROQ). Built with FastAPI, optimized for performance with parallel processing and advanced RAG

Language: Python - Size: 76.3 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 1

AmadeusITGroup/docs2vecs

CLI that helps with docs splitting, embedding and exposing them in a seamless manner

Language: Python - Size: 1.54 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 6

Swiftgum/swiftgum

The user data connection layer for AI applications. Transform any source into LLM-ready markdown. Focus on your AI, not integrations.

Language: TypeScript - Size: 3.05 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 4 - Forks: 0

baughmann/tikara

The metadata and text content extractor for almost every file type.

Language: Python - Size: 161 MB - Last synced at: 28 days ago - Pushed at: 11 months ago - Stars: 4 - Forks: 0

syw2014/langparse

LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.

Language: Python - Size: 76.2 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

Rayyan9477/ocr-app

State-of-the-art Optical Character Recognition (OCR) with Vision Language Model (VLM) integration for enhanced accuracy and optimal document processing.

Language: TypeScript - Size: 23 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 3 - Forks: 0

eiceblue/Spire.Doc-for-C

Spire.Doc for C++ is a professional Word C++ library specifically designed for developers to create, read, write, convert, merge, split, and compare Word documents on any C++ platforms with fast and high-quality performance.

Language: C++ - Size: 371 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 3 - Forks: 1

jmndao/mongoose-ai

AI-powered Mongoose plugin for intelligent document processing

Language: TypeScript - Size: 581 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

ahnafnafee/local-llm-pdf-ocr

Convert scanned PDFs into searchable text locally using Vision LLMs (olmOCR). 100% private, offline, and free. Features a modern Web UI & CLI.

Language: Python - Size: 1.13 MB - Last synced at: about 8 hours ago - Pushed at: 2 days ago - Stars: 2 - Forks: 0

Sourish-Kanna/SmartAudit-LLM

Autonomous multi-agent system for intelligent invoice auditing using LLaMA 3 + Mistral. Features rule-based compliance checks, role-based summaries (Legal, Managerial, Accounting), and a React + FastAPI pipeline.

Language: Python - Size: 532 KB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 2 - Forks: 0

mohandshamada/AI_Drawings_Reader

AI-powered PDF OCR using Vision-Language Models. Extract text from technical drawings, blueprints, and complex documents with 6 AI providers (OpenAI, Gemini, Claude, local models). Features resume processing, batch operations, and multiple output formats.

Language: Python - Size: 370 KB - Last synced at: 3 days ago - Pushed at: 5 days ago - Stars: 2 - Forks: 1

xjustloveux/aspose-mcp-server

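Aspose MCP Server - an MCP server for office document processing that provides AI assistants with 90 office document processing tools. Supports Word, Excel, PowerPoint, PDF, and cross-format conversion. Tools are enabled on demand; cross-platform (Windows/Linux/macOS); works out of the box. Download a precompiled build from Releases and configure the license file to start using it.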

Language: C# - Size: 1.19 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 2 - Forks: 0

bcfeen/DocMine

Knowledge-centric document ingestion with stable IDs, provenance, entities, and exact recall

Language: Python - Size: 7.79 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 2 - Forks: 0

Rushi-Balapure/pdf_2_json_extractor

A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure including headings (H1-H6) and body text, outputting clean JSON format.

Language: Python - Size: 1.68 MB - Last synced at: 6 days ago - Pushed at: 9 days ago - Stars: 2 - Forks: 1

retkowsky/azure-content-understanding-ga

Azure Content Understanding demo notebooks

Language: Jupyter Notebook - Size: 24.4 MB - Last synced at: 18 days ago - Pushed at: 21 days ago - Stars: 2 - Forks: 0

abhaydixit07/ayurguru-frontend

AyurGuru - Revolutionizing Wellness with Ayurveda and AI. AyurGuru is an AI-powered platform delivering Ayurvedic health solutions in real time. Users can consult a smart chatbot, upload medical reports for tailored insights, and explore comprehensive Ayurvedic blogs. Built with modern web technologies for a secure and seamless user experience.

Language: JavaScript - Size: 13.7 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 2

Magnet-AI/Quanta

Advanced PDF layout analysis engine for extracting figures, tables, and structured content from complex engineering documents using computer vision and machine learning.

Language: Python - Size: 86 MB - Last synced at: 14 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 1

pandaxbacon/AutoChunker

🪓 Lumberjack - AI-powered document parser with interactive tree editor. Transform PDFs, DOCX, PPTX into perfectly structured chunks for vector databases. 5 parsers, Firebase integration, live demo available.

Language: TypeScript - Size: 8.71 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

trsdn/MistralDocAI-mcp

MCP (Model Context Protocol) server for document-to-Markdown conversion using Mistral AI OCR. Compatible with Claude Desktop and other MCP clients.

Language: Python - Size: 78.1 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

Daniel-codi/Concept_Curve_Embeddings_Indexation

Code that gives any AI unlimited-context persistent memory. The example is software that lets any AI read the Uniform Commercial Code of Michigan, a document of 220,000 tokens.

Language: JavaScript - Size: 22.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

johnsirmon/clearcouncil

ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.

Language: Python - Size: 134 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 2 - Forks: 2

Ftjjgfgh/scientific-pdf-translator

Scientific PDF Translator. This project offers a high-quality translation system for scanned scientific PDFs, converting documents from English to French while preserving formatting and mathematical expressions. With features like OCR integration and LaTeX output, it ensures accurate and professional results for academic use. 🛠️📄

Language: Shell - Size: 30.3 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

Jayanth-MKV/advanced-rag-cookbooks

Advanced RAG Techniques and Projects

Language: Jupyter Notebook - Size: 4.25 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

mancrurod/Resume-Optimization

Personal project that automates resume adaptation using LLMs. Converts .docx resumes to Markdown, tailors them to job descriptions with GPT-4o-mini or Gemini, and exports clean HTML and PDF resumes — with built-in editing and logging features.

Language: Python - Size: 71.3 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0

Danitilahun/Document-processing-Pdf-Structured-Data-Extractor

This project demonstrates how to extract structured information from PDF documents using a combination of Langchain, OpenAI models, and the DocLing library. It provides a framework for parsing PDFs and leveraging LLMs to identify and format key data points.

Language: Jupyter Notebook - Size: 64.5 KB - Last synced at: 4 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 1

kallebysantos/ocrlot

A distributed OCR engine 🐆

Language: Elixir - Size: 296 KB - Last synced at: 2 months ago - Pushed at: 11 months ago - Stars: 2 - Forks: 0

caltechlibrary/popstar

Phone-Oriented Processing SofTware for ARchives

Language: Makefile - Size: 49.2 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 0

Oneirocom/generative-intent-detection

Generative intent detection with Magick

Language: TypeScript - Size: 42 KB - Last synced at: 8 months ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

RPetitpierre/Generic_Semantic_Segmentation_of_Historical_Maps

Language: Jupyter Notebook - Size: 94.4 MB - Last synced at: almost 3 years ago - Pushed at: almost 4 years ago - Stars: 2 - Forks: 0

KikuAI-Lab/DocStripper

🧹 DocStripper is a lightweight CLI utility that automatically cleans text documents

Language: Python - Size: 1.32 MB - Last synced at: about 22 hours ago - Pushed at: 1 day ago - Stars: 1 - Forks: 2

OthmaneBlial/pdfsmarteditor

Stop paying for basic PDF tools. PDF Smart Editor is a free, open-source alternative to costly PDF platforms, offering powerful features with full local processing to keep your documents private and secure.

Language: TypeScript - Size: 398 KB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 1 - Forks: 0

samu9086/langparse

📄 Parse complex documents with ease using LangParse, a developer-friendly engine designed for efficient text chunking in LLM and agent applications.

Language: Python - Size: 1.37 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

arsalanafzal010/SmartRAG

📄 Enable smart conversations with documents, images, and audio files using this advanced Retrieval-Augmented Generation system.

Language: Python - Size: 1.39 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

syncfusion/document-sdk-uwp-demos

Explore the Syncfusion Universal Windows Platform demos featuring our advanced PDF, Word, Excel, and PowerPoint document processing libraries.

Language: C# - Size: 37.4 MB - Last synced at: 5 days ago - Pushed at: 8 days ago - Stars: 1 - Forks: 0

chirag127/ReadablePDF-AI-Documents-To-Speech-CLI-Tool

Converts PDFs into readable, spokable text using AI-driven parsing and TTS synthesis. Optimized for accessibility and fast batch processing.

Size: 48.8 KB - Last synced at: 10 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

gabriele-mastrapasqua/fastapi-ocr

FastAPI OCR service using tesseract or paddleOCR

Language: Python - Size: 87.9 KB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

NeelPatra/PDF2DocX_ConvertAI

A smart Jupyter Notebook tool that uses Google Gemini AI to convert complex PDFs into clean, unformatted DOCX files. Features intelligent rate limiting for free-tier APIs, interactive UI, and auto-removal of headers/footers.

Language: Jupyter Notebook - Size: 4.06 MB - Last synced at: 17 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 0

Hecate-Enterprise/hecate-escriba

Escriba is a powerful document generation API with programmatic templates

Language: Python - Size: 207 KB - Last synced at: 1 day ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Goblanch/Expediente-Index

Small desktop application (Tkinter + ttkbootstrap) that generates a Document Index from all the PDFs in a folder, exporting to Word (.docx) and/or PDF (.pdf).

Language: Python - Size: 43.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

Ai4GenXers/pdf-sentinel

Event-driven PDF to Markdown conversion for LLM workflows - 60x faster, zero idle resources

Language: Python - Size: 34.2 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

deadhand777/doc-redaction

Document Redaction Automation Service

Language: Python - Size: 2.08 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

Pulkit12dhingra/automated-document-parser

A powerful and automated document parser built with LangChain for intelligent document processing. Automatically detects file types and uses appropriate loaders for PDF, DOCX, CSV, JSON, HTML, and more.

Language: Python - Size: 105 KB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0

asbah-ramzan/HackRx-6.0-Intelligent-Query-Retrieval

🧠 Elevate document intelligence with HackRx 6.0, a powerful RAG system for extracting insights from complex files like PDFs and DOCX.

Language: Python - Size: 1.3 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1 - Forks: 0