Topic: "document-processing"
ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
Language: Python - Size: 54.6 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 2,366 - Forks: 237

enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Language: Python - Size: 20.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 1,274 - Forks: 129

dhlab-epfl/dhSegment
Generic framework for historical document processing
Language: Python - Size: 5.89 MB - Last synced at: 24 days ago - Pushed at: about 4 years ago - Stars: 378 - Forks: 115

ucbepic/TWIX
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents
Language: Python - Size: 177 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 185 - Forks: 8

awslabs/project-lakechain
Cloud-native, AI-powered, document processing pipelines on AWS.
Language: TypeScript - Size: 177 MB - Last synced at: 24 days ago - Pushed at: 4 months ago - Stars: 180 - Forks: 26

formkiq/formkiq-core
A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please star to support our work!
Language: Java - Size: 20.7 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 131 - Forks: 20

awslabs/rhubarb
A Python framework for multi-modal document understanding with Amazon Bedrock
Language: Python - Size: 32 MB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 93 - Forks: 11

iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Language: Python - Size: 69.3 KB - Last synced at: 8 days ago - Pushed at: 8 months ago - Stars: 86 - Forks: 8

parsee-ai/parsee-core
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
Language: Python - Size: 1.24 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 71 - Forks: 1

steindani/pandoc-include
An include filter for Pandoc
Language: Haskell - Size: 9.77 KB - Last synced at: 19 days ago - Pushed at: over 4 years ago - Stars: 62 - Forks: 20

PSPDFKit/nutrient-document-engine-mcp-server
A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.
Language: TypeScript - Size: 25 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 54 - Forks: 0

aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
Language: JavaScript - Size: 61.3 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 38 - Forks: 16

abdullahshafiq-20/ResumeTex
ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.
Language: JavaScript - Size: 689 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 37 - Forks: 4

cburschka/lyx
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)
Language: C++ - Size: 616 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 36 - Forks: 7

kili-technology/awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
Size: 24.9 MB - Last synced at: about 7 hours ago - Pushed at: about 3 years ago - Stars: 35 - Forks: 6

jmanhype/DSPy-Multi-Document-Agents
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
Language: Python - Size: 135 KB - Last synced at: about 8 hours ago - Pushed at: 11 months ago - Stars: 33 - Forks: 3

afrozas/proceedings
Semantic extraction from conference proceedings.
Language: Python - Size: 1.06 MB - Last synced at: over 1 year ago - Pushed at: about 5 years ago - Stars: 31 - Forks: 1

MBAigner/PDFSegmenter
This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified and returned. Tables are retrieved formatted as a CSV.
Language: Python - Size: 399 KB - Last synced at: 29 days ago - Pushed at: almost 5 years ago - Stars: 23 - Forks: 3

greed2411/tokyo
tokyo, a REST API that, when given any type of document, identifies the mime-type, suggests an extension, and also extracts text.
Language: Clojure - Size: 19.5 KB - Last synced at: 3 months ago - Pushed at: about 5 years ago - Stars: 18 - Forks: 0

eklem/stopword-trainer
A module for creating stopword lists for any language, based on a set of documents.
Language: JavaScript - Size: 6.16 MB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 15 - Forks: 0

smart-models/Normalized-Semantic-Chunker
Cutting-edge tool that unlocks the full potential of semantic chunking
Language: Python - Size: 3.76 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 13 - Forks: 4

aws-samples/sample-document-processing-with-amazon-bedrock-data-automation
This repository contains examples for customers to get started using Amazon Bedrock Data Automation. The samples focus mainly on document processing use cases
Language: Jupyter Notebook - Size: 9.46 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 12 - Forks: 10

vakharwalad23/mark-minion
The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.
Language: TypeScript - Size: 868 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 9 - Forks: 1

aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai
This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.
Language: Jupyter Notebook - Size: 3.22 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 7 - Forks: 1

felixdittrich92/docling-OCR-OnnxTR
OnnxTR OCR plugin for Docling
Language: Python - Size: 1.49 MB - Last synced at: 8 days ago - Pushed at: 27 days ago - Stars: 7 - Forks: 0

aws-samples/idp-invoice-automation-using-bedrock-data-automation-cdk
Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.
Language: Python - Size: 204 KB - Last synced at: about 2 months ago - Pushed at: 6 months ago - Stars: 7 - Forks: 1

drgsn/filefusion
FileFusion is a powerful file concatenation tool designed specifically for Large Language Model (LLM)
Language: Go - Size: 173 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 6 - Forks: 0

abdur75648/urdu-text-detection
Text line detection for Urdu OCR (UTRNet)
Language: Python - Size: 48.5 MB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 6 - Forks: 1

jeanbaptisteb/doccleaner
A Python command-line utility intended for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub), through XSLT stylesheets. Can be rather easily extended with your own custom xsl stylesheets.
Language: XSLT - Size: 81.1 KB - Last synced at: over 1 year ago - Pushed at: about 7 years ago - Stars: 6 - Forks: 2

martin-papy/qdrant-loader
Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration in development environments.
Language: Python - Size: 4.94 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 5 - Forks: 1

quarkiverse/quarkus-docling
Docling simplifies document processing, parsing diverse formats (including advanced PDF understanding) and providing seamless integrations with the gen AI ecosystem
Language: Java - Size: 123 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 5 - Forks: 3

m4nd0mb3/document-templater
Document Templater is a powerful tool for automated document generation. Streamline the process of creating standard documents, such as contracts, reports, and forms, using predefined templates. This repository contains the source code for Document Templater, allowing you to easily integrate this functionality into your projects and automate docs.
Language: JavaScript - Size: 579 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 0

CentralFloridaAttorney/zmongo_retriever
Use data from MongoDB in LangChain, Llama and OpenAI
Language: Python - Size: 27.3 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 4 - Forks: 1

Swiftgum/swiftgum
The user data connection layer for AI applications. Transform any source into LLM-ready markdown. Focus on your AI, not integrations.
Language: TypeScript - Size: 3.05 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 0

baughmann/tikara
The metadata and text content extractor for almost every file type.
Language: Python - Size: 161 MB - Last synced at: 2 days ago - Pushed at: 6 months ago - Stars: 4 - Forks: 0

AmadeusITGroup/docs2vecs
CLI that helps with docs splitting, embedding and exposing them in a seamless manner
Language: Python - Size: 1.51 MB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 3 - Forks: 5

johnsirmon/clearcouncil
ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.
Language: Python - Size: 134 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 2 - Forks: 2

Ftjjgfgh/scientific-pdf-translator
Scientific PDF Translator. This project offers a high-quality translation system for scanned scientific PDFs, converting documents from English to French while preserving formatting and mathematical expressions. With features like OCR integration and LaTeX output, it ensures accurate and professional results for academic use.
Language: Shell - Size: 30.3 KB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 0

ResetNetwork/n8n-nodes
A collection of custom n8n nodes for enhanced document processing, text splitting, and embeddings generation
Language: TypeScript - Size: 1.23 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 2 - Forks: 1

Jayanth-MKV/advanced-rag-cookbooks
Advanced RAG Techniques and Projects
Language: Jupyter Notebook - Size: 4.25 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

mancrurod/Resume-Optimization
Personal project that automates resume adaptation using LLMs. Converts .docx resumes to Markdown, tailors them to job descriptions with GPT-4o-mini or Gemini, and exports clean HTML and PDF resumes, with built-in editing and logging features.
Language: Python - Size: 71.3 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

Danitilahun/Document-processing-Pdf-Structured-Data-Extractor
This project demonstrates how to extract structured information from PDF documents using a combination of Langchain, OpenAI models, and the DocLing library. It provides a framework for parsing PDFs and leveraging LLMs to identify and format key data points.
Language: Jupyter Notebook - Size: 64.5 KB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 1

jayllfpt/table2html
A Python package that converts table images into HTML format using Object Detection model and OCR.
Language: Python - Size: 365 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 2 - Forks: 0

caltechlibrary/popstar
Phone-Oriented Processing SofTware for ARchives
Language: Makefile - Size: 49.2 MB - Last synced at: 4 months ago - Pushed at: about 1 year ago - Stars: 2 - Forks: 0

Oneirocom/generative-intent-detection
Generative intent detection with Magick
Language: TypeScript - Size: 42 KB - Last synced at: 3 months ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

RPetitpierre/Generic_Semantic_Segmentation_of_Historical_Maps
Language: Jupyter Notebook - Size: 94.4 MB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 2 - Forks: 0

AhmedZeyadTareq/Smart-markdown-Extractor
AI-powered document processing tool with smart extraction, OCR, and intelligent content analysis
Language: Python - Size: 152 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

Rayyan9477/ocr-app
State-of-the-art Optical Character Recognition (OCR) with Vision Language Model (VLM) integration for enhanced accuracy and optimal document processing.
Language: TypeScript - Size: 22.9 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

samay-jain/Voice_Assistant_RAG_System_using_LangChain_and_Streamlit
Voice Assistant RAG System using LangChain, Whisper, and Streamlit - A voice-enabled assistant that lets you ask questions by speaking, processes your custom documents, and responds with natural speech. Built with LangChain, Ollama, Whisper, ElevenLabs, and Streamlit.
Language: Python - Size: 367 KB - Last synced at: 4 days ago - Pushed at: 21 days ago - Stars: 1 - Forks: 0

JDM-Github/debahra-efficio
DEHBARA (Efficio) is a React and Express-based web application designed to streamline service requests for DTI, SSS, and other document processing needs. It simplifies the process of requesting official papers and services, integrating cloud storage for efficient data management.
Language: TypeScript - Size: 14.7 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

acsenrafilho/cucaracha
A bureaucratic cockroach (cucaracha) assistant to help in document processing and analysis
Language: Python - Size: 5.93 MB - Last synced at: 8 days ago - Pushed at: 24 days ago - Stars: 1 - Forks: 1

adamrangwala/DirCity_Directory_Crop-out-with-Key-Lines
Turn Old City Directory scans into searchable data. Automated pipeline handles column detection, OCR processing, and accuracy evaluation for historical document digitization.
Language: Python - Size: 40.4 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 1 - Forks: 0

Daniel-codi/Concept_Curve_Embeddings_Indexation
Code that gives any AI unlimited-context persistent memory. The example is software that lets any AI read the Uniform Commercial Code of Michigan, a document of 220,000 tokens.
Language: JavaScript - Size: 22.6 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

natgluons/AI-docs-analyzer-API
Automate invoice analysis and identity verification, built with an open-source multimodal LLM and OCR (DocTR/TrOCR), using FastAPI, Supabase, PgVector, and Neo4j.
Language: Python - Size: 8.79 KB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

reinelt88/rag-chatbot-documents
This project implements a RAG (Retrieval-Augmented Generation) based chatbot that allows you to upload PDF documents, index them with embeddings, and ask questions about their content. It supports both OpenAI and Hugging Face models via the Inference API.
Language: Python - Size: 215 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

gs-ai/PDFProfessor
PDF Professor 2.0 extracts and processes PDF text, analyzed by Ollama for summarization, data extraction, and insights. More coming soon!
Language: Python - Size: 1.95 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

jromero132/pdf-splitter
PDF Splitter is a Python tool that takes a multi-page PDF file and splits it into individual PDF files, one for each page of the original document.
Language: Python - Size: 2.93 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

jromero132/pdf-merger
A Python utility for merging multiple PDFs and images into a single PDF file. This tool maintains aspect ratios, centers content on custom-sized pages (default A4), and supports recursive directory processing. Perfect for organizing documents and creating cohesive PDF compilations.
Language: Python - Size: 2.93 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

FayazK/Document-Metadata-Extractor
A Python tool that uses Google's Gemini AI to automatically extract structured metadata from PDF and DOCX documents, saving results to Excel for easy analysis and organizing raw responses as JSON files.
Language: Python - Size: 11.7 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

souvik03-136/TenderBot
Task
Language: Python - Size: 127 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

jcaperella29/Document_cleaning_CLI
A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR.
Language: MATLAB - Size: 94.5 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

Md-Emon-Hasan/LangChain
Powerful framework for building applications with Large Language Models (LLMs), enabling seamless integration with memory, agents, and external data sources.
Language: Jupyter Notebook - Size: 737 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

kallebysantos/ocrlot
A distributed OCR engine
Language: Elixir - Size: 291 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

Huang-lab/figure-extractor
Flask-based service using PDFFigures 2.0 to extract figures and tables from scholarly PDFs. Features REST API, CLI, Docker support, and JSON metadata output (~1.5s/page processing). Designed for document processing and RAG pipelines.
Language: Python - Size: 16.8 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

BjornMelin/pdfusion
A lightweight Python utility for effortlessly merging multiple PDF files into a single document.
Language: Python - Size: 40 KB - Last synced at: 4 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

Shahrom-S/BarsAI
AI assistant
Language: Python - Size: 11.2 MB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 1 - Forks: 0

dayang4321/MSc-Team-Project-CMPU9010-2023-24-Group-3
TU Dublin Computer Science MSc. Final Project Group 3 - Accessibilator
Language: Jupyter Notebook - Size: 100 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

x1ao4/doc-merger
Merge two relatively incomplete documents into one complete document via a Python script
Language: Python - Size: 22.5 KB - Last synced at: 29 days ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

joseferrerh/invoices-leanautomation
This set of robots provides support for automatically obtaining information from invoices using docDigitizer API and keep track of the processed invoices on an Airtable repository
Language: RobotFramework - Size: 403 KB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 1 - Forks: 0

thoth2357/Watermark-removal
Program that helps remove watermarks from a PDF document
Language: Python - Size: 3.91 KB - Last synced at: 26 days ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

anne27/Information-Retrieval
An implementation of basic IR techniques from scratch.
Language: Python - Size: 27.8 MB - Last synced at: about 1 year ago - Pushed at: about 6 years ago - Stars: 1 - Forks: 0

Jackojc/old-wotpp
A document preprocessor that works in conjunction with tools like groff/troff & refer.
Language: C++ - Size: 60.5 KB - Last synced at: about 1 year ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

GuglielmoCerri/test-assets
A version-controlled collection of stable assets (documents, images, etc..) for integration testing
Size: 139 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

Horneychan/rag-contract-analyzer
Analyze rental and purchase contracts with RAG technology. Identify risks, ensure compliance, and extract key insights effortlessly.
Language: Python - Size: 47.9 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 0 - Forks: 0

easytocloud/Mac-letterhead
A macOS utility for merging letterhead templates with PDF and Markdown documents using a drag-and-drop interface
Language: Python - Size: 3.36 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

deeksha006/PII-DETECTION
A Streamlit web application for detecting and redacting Personally Identifiable Information (PII) in documents including PDFs, images, and text files. Supports Aadhaar, PAN, Driving License, and Voter ID detection with automated redaction capabilities.
Language: Python - Size: 5.86 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

DaandinhoPy94/rag-contract-analyzer
RAG-powered contract analyzer using Gemini API, LangChain & ChromaDB for intelligent legal document analysis
Language: Python - Size: 31.3 KB - Last synced at: 7 days ago - Pushed at: 8 days ago - Stars: 0 - Forks: 0

divyasree-dolly/Smart-Resume-Cover-Letter-Generator
AI-powered web app that generates tailored cover letters and enhances resume bullet points using OpenAI GPT. Built with Streamlit for easy job application optimization.
Language: Python - Size: 22.5 KB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 0 - Forks: 0

klapom/generic-kg-pipeline
A flexible, plugin-based pipeline system for extracting knowledge graphs from documents
Language: Python - Size: 6.21 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

hasnaintypes/lawbotics-v2
LawBotics v2 is an AI-powered legal contract analysis platform that combines machine learning with modern web technologies to automate legal document review and clause extraction.
Language: TypeScript - Size: 83.6 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

kevv1m/tikara
The metadata and text content extractor for almost every file type.
Size: 1000 Bytes - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

lifenture/flash-mail-merge
Streamline Document Automation with Serverless Mail Merge and DOCX Processing.
Language: Go - Size: 15.2 MB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 0 - Forks: 0

akshitharsola/overleaf-automation
Intelligent document analysis and LaTeX conversion automation tool. Converts Word documents (.docx) to LaTeX with automatic table detection, equation recognition, and multi-format support (ACM, IEEE, Springer). Built with React & TypeScript.
Language: TypeScript - Size: 692 KB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 0 - Forks: 0

ibarani/boxsavant
AI-powered document organization system for Box.com with full account reorganization capabilities
Language: Python - Size: 6.58 MB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

evaibhav/Ai-File-Analyzer
AI-powered document analysis webapp. Upload files (PDF, DOCX, TXT, CSV, XLSX) and get intelligent analysis using local Ollama AI. Built with Flask and Python. Privacy-first with local processing.
Language: Python - Size: 219 KB - Last synced at: 17 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

TimInTech/pdf-text-duplicate-checker
PDF Duplicate Detector & Mover (Text + Image Hashing)
Language: Python - Size: 98.4 MB - Last synced at: 6 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

Hassan-Memon/ai-book-translator
An AI-powered Urdu to Arabic book translator that intelligently processes documents (PDF, Word, Excel, or images), chunks content based on structure, and uses multi-stage LLM agents to ensure accurate, context-aware, and faithful translation without omissions or additions.
Language: Python - Size: 5.86 KB - Last synced at: 11 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

jonathanfavorite/RAGamuffin
A lightweight, cross-platform .NET library for building RAG (Retrieval-Augmented Generation) pipelines with local embedding models and SQLite vector storage. Perfect for developers who need privacy-focused, offline-capable document search and AI-powered question answering without external API dependencies.
Language: C# - Size: 6.75 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

bneweling/neuronode
Neuronode - Enterprise-grade Knowledge Management System with LiteLLM, Neo4j, and Vector Search. AI-powered document processing, intelligent relationship discovery, and advanced query orchestration.
Language: Python - Size: 4.27 MB - Last synced at: 13 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

mhashas/financeQA
financeQA is a modular Retrieval-Augmented Generation (RAG) system for finance question answering. It features document preprocessing, image and table extraction, vector database indexing, and OpenAI-powered chat interfaces, designed for robust financial data analysis and evaluation.
Language: Python - Size: 150 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

GovindKurapati/dev_docs_chat
RAG-powered document/url Q&A system with ChromaDB + Groq LLM. Upload docs, ingest URLs, get AI answers.
Language: Python - Size: 801 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

rijwan10/rf001-gas-report-formatter-pro
This repository contains a report generation system that leverages the Claude API for natural language processing, allowing users to create professional reports efficiently. Explore the project to see how it streamlines report creation while maintaining high-quality standards.
Language: JavaScript - Size: 33.2 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

syedaliwaqar12/Resume-Parser
A beautiful, production-ready web app that extracts structured data from PDF resumes using AI and NLP. Built with React + TypeScript + FastAPI.
Language: Python - Size: 53.7 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

chatfin-tara/Chatfin
AI-powered finance automation platform for reconciliation, compliance, and intelligent data operations, integrated with Oracle NetSuite.
Size: 0 Bytes - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 0 - Forks: 0

bylickilabs/pdfAnalyzer
PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.
Language: Python - Size: 0 Bytes - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

Marbrgr/DocProc
AI-Powered processing platform with RAG-based Q&A capabilities
Size: 417 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

rgianordoli/rgianordoli
Architecture overview and documentation of a modular system for automated document processing.
Size: 167 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

kaptinka/-GiGTakaful-AI-Insurance
Advanced AI fraud detection for Takaful motor insurance claims. Automate analysis of police reports and estimates with OCR and real-time analytics.
Size: 4.88 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

nsourlos/OCR_and_RAG
Tests of OCR and RAG with LLMs
Language: Jupyter Notebook - Size: 21.5 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

e-candeloro/Credem_Hack_2025
AI-powered document processing pipeline for Credem Hackathon 2025. Leverages Google Cloud AI services to intelligently extract, classify, and process HR documents through a robust ETL pipeline.
Language: Jupyter Notebook - Size: 13.9 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0
