An open API service providing repository metadata for many open source software ecosystems.

Topic: "document-processing"

ucbepic/docetl

A system for agentic LLM-powered data processing and ETL

Language: Python - Size: 54.4 MB - Last synced at: 8 days ago - Pushed at: 15 days ago - Stars: 2,282 - Forks: 223

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Language: Python - Size: 20.4 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 1,274 - Forks: 129

dhlab-epfl/dhSegment

Generic framework for historical document processing

Language: Python - Size: 5.89 MB - Last synced at: 4 months ago - Pushed at: almost 4 years ago - Stars: 374 - Forks: 115

ucbepic/TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

Language: Python - Size: 177 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 185 - Forks: 8

awslabs/project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

Language: TypeScript - Size: 177 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 179 - Forks: 25

formkiq/formkiq-core

A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please 🌟 star to support our work!

Language: Java - Size: 20.4 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 130 - Forks: 19

awslabs/rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

Language: Python - Size: 31.9 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 90 - Forks: 9

iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

Language: Python - Size: 69.3 KB - Last synced at: 7 days ago - Pushed at: 7 months ago - Stars: 84 - Forks: 7

parsee-ai/parsee-core

Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

Language: Python - Size: 1.24 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 66 - Forks: 1

steindani/pandoc-include

An include filter for Pandoc

Language: Haskell - Size: 9.77 KB - Last synced at: 22 days ago - Pushed at: over 4 years ago - Stars: 62 - Forks: 20

aws-solutions/enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

Language: JavaScript - Size: 62.5 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 38 - Forks: 16

cburschka/lyx

Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)

Language: C++ - Size: 616 MB - Last synced at: about 2 months ago - Pushed at: over 2 years ago - Stars: 36 - Forks: 7

jmanhype/DSPy-Multi-Document-Agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

Language: Python - Size: 135 KB - Last synced at: 5 days ago - Pushed at: 11 months ago - Stars: 34 - Forks: 3

kili-technology/awesome-datasets

A comprehensive list of annotated training datasets classified by use case.

Size: 24.9 MB - Last synced at: 5 days ago - Pushed at: almost 3 years ago - Stars: 34 - Forks: 6

abdullahshafiq-20/ResumeTex

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.

Language: JavaScript - Size: 397 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 32 - Forks: 3

afrozas/proceedings

Semantic extraction from conference proceedings.

Language: Python - Size: 1.06 MB - Last synced at: about 1 year ago - Pushed at: almost 5 years ago - Stars: 31 - Forks: 1

MBAigner/PDFSegmenter

This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are retrieved in CSV format.

Language: Python - Size: 399 KB - Last synced at: 2 days ago - Pushed at: almost 5 years ago - Stars: 23 - Forks: 3

greed2411/tokyo

tokyo, a REST API that, when given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and extracts the text 💪.

Language: Clojure - Size: 19.5 KB - Last synced at: about 2 months ago - Pushed at: about 5 years ago - Stars: 18 - Forks: 0

eklem/stopword-trainer

A module for creating stopword lists for any language, based on a set of documents.

Language: JavaScript - Size: 6.16 MB - Last synced at: 2 days ago - Pushed at: 9 months ago - Stars: 15 - Forks: 0

smart-models/Normalized-Semantic-Chunker

Cutting-edge tool that unlocks the full potential of semantic chunking

Language: Python - Size: 3.76 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 13 - Forks: 4

vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

Language: TypeScript - Size: 866 KB - Last synced at: 12 days ago - Pushed at: 14 days ago - Stars: 9 - Forks: 1

aws-samples/sample-document-processing-with-amazon-bedrock-data-automation

This repository contains examples for customers to get started using Amazon Bedrock Data Automation. The samples focus mainly on document processing use cases

Language: Jupyter Notebook - Size: 9.09 MB - Last synced at: 25 days ago - Pushed at: 2 months ago - Stars: 9 - Forks: 5

felixdittrich92/docling-OCR-OnnxTR

OnnxTR OCR plugin for Docling

Language: Python - Size: 1.49 MB - Last synced at: about 2 hours ago - Pushed at: about 3 hours ago - Stars: 7 - Forks: 0

aws-samples/idp-invoice-automation-using-bedrock-data-automation-cdk

Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.

Language: Python - Size: 204 KB - Last synced at: 25 days ago - Pushed at: 6 months ago - Stars: 7 - Forks: 1

drgsn/filefusion

FileFusion is a powerful file concatenation tool designed specifically for Large Language Model (LLM)

Language: Go - Size: 173 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 6 - Forks: 0

abdur75648/urdu-text-detection

Text line detection for Urdu OCR (UTRNet)

Language: Python - Size: 48.5 MB - Last synced at: 2 months ago - Pushed at: 9 months ago - Stars: 6 - Forks: 1

jeanbaptisteb/doccleaner

A Python command-line utility intended for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub), through XSLT stylesheets. Can be rather easily extended with your own custom xsl stylesheets.

Language: XSLT - Size: 81.1 KB - Last synced at: about 1 year ago - Pushed at: almost 7 years ago - Stars: 6 - Forks: 2

aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai

This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.

Language: Jupyter Notebook - Size: 3.18 MB - Last synced at: 25 days ago - Pushed at: 3 months ago - Stars: 5 - Forks: 1

m4nd0mb3/document-templater

Document Templater is a powerful tool for automated document generation. Streamline the process of creating standard documents, such as contracts, reports, and forms, using predefined templates. This repository contains the source code for Document Templater, allowing you to easily integrate this functionality into your projects and automate docs.

Language: JavaScript - Size: 579 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 0

CentralFloridaAttorney/zmongo_retriever

Use data from MongoDB in LangChain, Llama and OpenAI

Language: Python - Size: 27.3 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 4 - Forks: 1

Swiftgum/swiftgum

The user data connection layer for AI applications. Transform any source into LLM-ready markdown. Focus on your AI, not integrations.

Language: TypeScript - Size: 3.05 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

AmadeusITGroup/docs2vecs

CLI that helps with docs splitting, embedding and exposing them in a seamless manner

Language: Python - Size: 1.51 MB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 3 - Forks: 5

baughmann/tikara

The metadata and text content extractor for almost every file type.

Language: Python - Size: 161 MB - Last synced at: 2 days ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

martin-papy/qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration in development environments.

Language: Python - Size: 4.12 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 2 - Forks: 0

quarkiverse/quarkus-docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem

Language: Java - Size: 96.7 KB - Last synced at: 11 days ago - Pushed at: 17 days ago - Stars: 2 - Forks: 0

Jayanth-MKV/advanced-rag-cookbooks

Advanced RAG Techniques and Projects

Language: Jupyter Notebook - Size: 4.25 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

mancrurod/Resume-Optimization

Personal project that automates resume adaptation using LLMs. Converts .docx resumes to Markdown, tailors them to job descriptions with GPT-4o-mini or Gemini, and exports clean HTML and PDF resumes — with built-in editing and logging features.

Language: Python - Size: 71.3 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

Danitilahun/Document-processing-Pdf-Structured-Data-Extractor

This project demonstrates how to extract structured information from PDF documents using a combination of Langchain, OpenAI models, and the DocLing library. It provides a framework for parsing PDFs and leveraging LLMs to identify and format key data points.

Language: Jupyter Notebook - Size: 64.5 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 2 - Forks: 1

jayllfpt/table2html

A Python package that converts table images into HTML format using Object Detection model and OCR.

Language: Python - Size: 365 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

caltechlibrary/popstar

Phone-Oriented Processing SofTware for ARchives

Language: Makefile - Size: 49.2 MB - Last synced at: 3 months ago - Pushed at: 12 months ago - Stars: 2 - Forks: 0

Oneirocom/generative-intent-detection

Generative intent detection with Magick

Language: TypeScript - Size: 42 KB - Last synced at: about 2 months ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

RPetitpierre/Generic_Semantic_Segmentation_of_Historical_Maps

Language: Jupyter Notebook - Size: 94.4 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

adamrangwala/DirCity_Directory_Crop-out-with-Key-Lines

Turn Old City Directory scans into searchable data. Automated pipeline handles column detection, OCR processing, and accuracy evaluation for historical document digitization.

Language: Python - Size: 40.4 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 0

Ftjjgfgh/scientific-pdf-translator

Scientific PDF Translator. This project offers a high-quality translation system for scanned scientific PDFs, converting documents from English to French while preserving formatting and mathematical expressions. With features like OCR integration and LaTeX output, it ensures accurate and professional results for academic use. 🛠️📄

Language: Shell - Size: 30.3 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 0

Rayyan9477/ocr-app

State-of-the-art Optical Character Recognition (OCR) with Vision Language Model (VLM) integration for enhanced accuracy and optimal document processing.

Language: TypeScript - Size: 23.9 MB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

Daniel-codi/Concept_Curve_Embeddings_Indexation

Code that gives any AI unlimited-context persistent memory. The example is software that lets any AI read the Uniform Commercial Code of Michigan, a document of 220,000 tokens.

Language: JavaScript - Size: 22.6 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 1 - Forks: 0

natgluons/AI-docs-analyzer-API

Automate invoice analysis and identity verification, built with an open-source multimodal LLM and OCR (DocTR/TrOCR), using FastAPI, Supabase, PgVector, and Neo4j.

Language: Python - Size: 8.79 KB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

reinelt88/rag-chatbot-documents

This project implements a RAG (Retrieval-Augmented Generation) based chatbot that allows you to upload PDF documents, index them with embeddings, and ask questions about their content. It supports both OpenAI and Hugging Face models via the Inference API.

Language: Python - Size: 215 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

gs-ai/PDFProfessor

PDF Professor 2.0 extracts and processes PDF text, analyzed by Ollama for summarization, data extraction, and insights. More coming soon!

Language: Python - Size: 1.95 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

jromero132/pdf-splitter

PDF Splitter is a Python tool that takes a multi-page PDF file and splits it into individual PDF files, one for each page of the original document.

Language: Python - Size: 2.93 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

jromero132/pdf-merger

A Python utility for merging multiple PDFs and images into a single PDF file. This tool maintains aspect ratios, centers content on custom-sized pages (default A4), and supports recursive directory processing. Perfect for organizing documents and creating cohesive PDF compilations.

Language: Python - Size: 2.93 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

FayazK/Document-Metadata-Extractor

A Python tool that uses Google's Gemini AI to automatically extract structured metadata from PDF and DOCX documents, saving results to Excel for easy analysis and organizing raw responses as JSON files.

Language: Python - Size: 11.7 KB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

souvik03-136/TenderBot

Task

Language: Python - Size: 127 MB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

jcaperella29/Document_cleaning_CLI

A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR. 🚀

Language: MATLAB - Size: 94.5 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

Md-Emon-Hasan/LangChain

Powerful framework for building applications with Large Language Models (LLMs), enabling seamless integration with memory, agents, and external data sources.

Language: Jupyter Notebook - Size: 737 KB - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

acsenrafilho/cucaracha

A bureaucratic cockroach (cucaracha) assistant to help with document processing and analysis

Language: Python - Size: 6.44 MB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 1

kallebysantos/ocrlot

A distributed ocr engine 🐆

Language: Elixir - Size: 291 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

JDM-Github/debahra-efficio

DEHBARA (Efficio) is a React and Express-based web application designed to streamline service requests for DTI, SSS, and other document processing needs. It simplifies the process of requesting official papers and services, integrating cloud storage for efficient data management.

Language: TypeScript - Size: 13.3 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Huang-lab/figure-extractor

Flask-based service using PDFFigures 2.0 to extract figures and tables from scholarly PDFs. Features REST API, CLI, Docker support, and JSON metadata output (~1.5s/page processing). Designed for document processing and RAG pipelines.

Language: Python - Size: 16.8 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

BjornMelin/pdfusion

A lightweight Python utility for effortlessly merging multiple PDF files into a single document.

Language: Python - Size: 40 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

Shahrom-S/BarsAI

AI assistant

Language: Python - Size: 11.2 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

dayang4321/MSc-Team-Project-CMPU9010-2023-24-Group-3

TU Dublin Computer Science MSc. Final Project Group 3 - Accessibilator

Language: Jupyter Notebook - Size: 100 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

johnsirmon/clearcouncil

ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.

Language: Python - Size: 71.3 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 1

x1ao4/doc-merger

Merge two relatively incomplete documents into one complete document via a Python script

Language: Python - Size: 22.5 KB - Last synced at: 2 days ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

joseferrerh/invoices-leanautomation

This set of robots provides support for automatically obtaining information from invoices using the docDigitizer API and keeping track of the processed invoices in an Airtable repository

Language: RobotFramework - Size: 403 KB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 0

thoth2357/Watermark-removal

A program that helps remove watermarks from a PDF document

Language: Python - Size: 3.91 KB - Last synced at: 4 months ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

anne27/Information-Retrieval

An implementation of basic IR techniques from scratch.

Language: Python - Size: 27.8 MB - Last synced at: about 1 year ago - Pushed at: about 6 years ago - Stars: 1 - Forks: 0

Jackojc/old-wotpp 📦

A document preprocessor that works in conjunction with tools like groff/troff & refer.

Language: C++ - Size: 60.5 KB - Last synced at: about 1 year ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

bylickilabs/pdfAnalyzer

PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.

Language: Python - Size: 0 Bytes - Last synced at: about 4 hours ago - Pushed at: about 4 hours ago - Stars: 0 - Forks: 0

Marbrgr/DocProc

AI-Powered processing platform with RAG-based Q&A capabilities

Size: 417 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

jonathanfavorite/RAGamuffin

A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage. Perfect for developers who need privacy-focused, offline-capable document search and AI-powered question answering without external API dependencies.

Language: C# - Size: 6.71 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 0 - Forks: 0

rgianordoli/rgianordoli

Architecture overview and documentation of a modular system for automated document processing.

Size: 167 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

rijwan10/rf001-gas-report-formatter-pro

This repository contains a report generation system that leverages the Claude API for natural language processing, allowing users to create professional reports efficiently. Explore the project to see how it streamlines report creation while maintaining high-quality standards. 🛠️📊

Language: JavaScript - Size: 33.2 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

kaptinka/-GiGTakaful-AI-Insurance

Advanced AI fraud detection for Takaful motor insurance claims. Automate analysis of police reports and estimates with OCR and real-time analytics. 🚀💻

Size: 4.88 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

ResetNetwork/n8n-nodes

A collection of custom n8n nodes for enhanced document processing, text splitting, and embeddings generation

Language: TypeScript - Size: 1.24 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

kevv1m/tikara

The metadata and text content extractor for almost every file type.

Size: 1000 Bytes - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

nsourlos/OCR_and_RAG

Tests of OCR and RAG with LLMs

Language: Jupyter Notebook - Size: 21.5 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

e-candeloro/Credem_Hack_2025

AI-powered document processing pipeline for Credem Hackathon 2025. Leverages Google Cloud AI services to intelligently extract, classify, and process HR documents through a robust ETL pipeline.

Language: Jupyter Notebook - Size: 13.9 MB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

amijadoamijado/rf001-gas-report-formatter-pro

McKinsey/BCG quality professional report generator using Google Apps Script. Automatically converts documents into professionally formatted PDFs with consulting-grade layouts. Serverless, secure, and enterprise-ready.

Language: JavaScript - Size: 26.4 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

Nattapolch/work-order-pdf-extractor

AI-powered Work Order PDF Extractor with OpenAI GPT-4 Vision integration for automated text extraction and file organization

Language: Python - Size: 22.5 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

josh-janse/pdf-to-markdown-extractor

Convert PDF documents to clean markdown using Google's Gemini API.

Language: JavaScript - Size: 18.6 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

ChayannFamali/AutoHR

AI platform for recruiting automation that integrates artificial intelligence to analyze resumes and evaluate candidates. The system automatically processes documents in PDF/DOCX formats, extracts skills and experience, and then matches them against job requirements.

Language: HTML - Size: 653 KB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

Roberto-A-Cardenas/Intellidoc-Engine

Serverless OCR pipeline on AWS using Lambda, API Gateway, S3, and Textract. Accepts base64 PDFs and returns extracted text via API. Built with Terraform.

Language: HCL - Size: 1.79 MB - Last synced at: 26 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

Snoowp/VerbaSync

Intelligent document harmonization using NLP, LLMs, and ML. Three-stage clustering pipeline: Liberal clustering → BERT validation → LLM review. Advanced NLP/AI prototyping with async processing and cost optimization.

Language: Python - Size: 253 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

ejazalam831/rag-customer-support-chatbot

RAG-powered customer support chatbot using LangChain, LangGraph, and Mistral AI. An intelligent assistant that eliminates hallucinations by grounding responses in knowledge bases with conversation memory.

Language: Jupyter Notebook - Size: 3.24 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

michael-abdo/learning-lab-module

Advanced document processing and RAG implementation with MongoDB Vector Search, AWS integrations, and LLM-powered answer generation

Language: JavaScript - Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AI-Data-Space/happymatrix-eco-assistant

AI-powered assistant for analyzing Engineering Change Orders (ECOs) using Google Gemini and RAG

Language: Jupyter Notebook - Size: 255 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Not70xic/web-ocr2

A fast, CLI-based tool to crawl a website path, download PDFs, OCR scanned files, and search text content using Whoosh indexing.

Language: Python - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

diegoabeltran16/OpenPages-pipeline

Open-source tool for turning technical documents into AI-ready formats. Built for better access to knowledge.

Language: Python - Size: 1.78 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

AshrafulAlamShaqib/pdf-page-counter

Offline web app to count pages in PDF files using PDF.js

Language: JavaScript - Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

AhmedZeyadTareq/Smart-markdown-Extractor

A smart AI-powered application to extract, reorganize, and interact with file content, converting it into clean Markdown format using OpenAI and Streamlit.

Language: Python - Size: 5.86 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

0x22B9/ai-telegram-bot

AI Telegram bot using Gemini for chat, audio, and docs, with HuggingFace image gen. Deploy on Fly.io. Try it now!

Language: Python - Size: 233 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

credeed/credeed-pdf-to-markdown

Convert PDF to Markdown using AI, can be used for Agent to understand documents.

Language: Python - Size: 5.86 KB - Last synced at: 1 day ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Node0/timbermill

OCR-powered chat session renderer that slices long conversations into paginated, searchable PDFs

Size: 3.91 KB - Last synced at: 1 day ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

swiss-ai-center/layout-analysis-service

Layout Analysis Service detects parts of an image-based document using PP-PicoDet.

Language: Python - Size: 9.99 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

easytocloud/Mac-letterhead

A macOS utility for merging letterhead templates with PDF and Markdown documents using a drag-and-drop interface

Language: Python - Size: 3.2 MB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

QuiddityAI/PDFerret

An all-in-one converter to make your files LLM-understandable

Language: HTML - Size: 32.1 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

aswinpradeepc/llmsearch

AI-powered search tool for querying financial reports, mutual fund documents, and market research using natural language. Built with FastAPI, Streamlit, OpenAI embeddings, and Pinecone vector search.

Language: Python - Size: 17.6 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

KrzysztofTybinka/DocMiner

RAG API with an OCR feature and the option to use local embeddings and language models for secure, offline document processing and intelligent retrieval.

Language: C# - Size: 547 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

adibshirazi/PDFMerger

PDF Merger Tool

Language: TypeScript - Size: 13.7 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0