GitHub topics: pdf-extraction
matheus-rech/systematic-review-extractor
AI-powered systematic review data extraction system with zero hallucination guarantee
Size: 188 KB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 0 - Forks: 0

Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
Language: Python - Size: 28.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,329 - Forks: 95

24eme/signaturepdf
Free open-source web software for signing PDFs (alone or with others), as well as organizing pages, editing metadata, and compressing PDFs
Language: JavaScript - Size: 6.83 MB - Last synced at: 8 days ago - Pushed at: 29 days ago - Stars: 619 - Forks: 70

pytr-org/pytr
Use TradeRepublic in terminal and mass download all documents
Language: Python - Size: 270 KB - Last synced at: 9 days ago - Pushed at: 29 days ago - Stars: 573 - Forks: 113

billy-enrizky/pdf-extraction
Scalable PDF Extraction using Multimodal GPT-4o
Language: Python - Size: 144 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Language: Python - Size: 69.3 KB - Last synced at: 7 days ago - Pushed at: 10 months ago - Stars: 92 - Forks: 8

MarkShawn2020/video2ppt Fork of wudududu/extract-video-ppt
Extract presentation slides from videos with accurate timestamps
Language: Swift - Size: 476 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 6 - Forks: 0

Aumlo123/pdfdoom
DOOM in a PDF (as ascii art)
Size: 1000 Bytes - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

aidalinfo/extract-kit
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
Language: TypeScript - Size: 455 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 6 - Forks: 0

arv-fazriansyah/ekstrak-pdf-kartu-keluarga
Ekstrak PDF Kartu Keluarga is a React + Vite web application that uses the Google Gemini API to automatically extract data from KK (Kartu Keluarga, family card) documents (PDF or ZIP), display it in an interactive table, and export the results to Excel.
Language: TypeScript - Size: 44.9 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

ArtifexSoftware/mupdf.js
JavaScript bindings for MuPDF
Size: 2.64 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 558 - Forks: 39

Khanna-Aman/tesseract-invoice-ocr
Python CLI tool for extracting structured data from scanned invoices using Tesseract OCR. Converts PDF/image invoices to CSV/JSON with vendor details, line items, and totals. Features robust error handling, batch processing, and professional-grade code quality.
Language: Python - Size: 141 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

iodize6399/wwmai-copper-data
Historical copper price data from WWMAI circulars. Raw PDFs and structured CSV data tracking electrolytic copper wire rod prices and calculation components.
Size: 22.4 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

cam-rodrigues/fydsync
FidSync is a professional-grade web tool that helps financial teams extract fund statuses from PDF scorecards and update Excel templates accurately — without manual matching or formatting headaches. Built with Streamlit · PDF + Excel automation · Fuzzy matching · Secure and client-ready
Language: Python - Size: 11.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

MohamedAziz15/MLOps-pipeline
End-to-End LLMOps Pipeline
Language: Jupyter Notebook - Size: 1020 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Vejandlachakrish/PersonaPrep-Persona-Aligned-Educational-PDF-Extractor
Extracts and ranks example problems, derivations, and formulas from physics PDFs using PyMuPDF. Fully containerized with Docker, no network access required.
Language: Python - Size: 2.43 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

anyparser/anyparserjs
Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.
Language: TypeScript - Size: 408 KB - Last synced at: 3 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

javaidb/personal-finance-tracker
Personal finance tracker via interpretation of bank statements from Scotiabank. Insights into spending habits, trends and long-term growth.
Language: Jupyter Notebook - Size: 851 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

rrayhka/GRI-Extractor
A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.
Language: Python - Size: 13.9 MB - Last synced at: 29 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

bylickilabs/pdfAnalyzer
PDF Analyzer is an efficient Python tool for automatically analyzing PDF documents.
Language: Python - Size: 6.84 KB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

sgrimee/waste-calendar-extractor
Extract waste collection dates for the Luxembourgish commune of Niederanven from PDF calendars and generate iCal files.
Language: Python - Size: 2.06 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

nickchristopherson/duluth-tourism-analysis
End-to-End Data Pipeline for Tourism Industry Analysis
Language: HTML - Size: 6.84 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

olympus-terminal/data-processing
Data analysis and processing tools
Language: Python - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Atul-vaibhav/OCR-Extraction-Using-Python
Extract text from images and PDFs using Python and store it in JSON format. Store the extracted data in a MySQL database.
Language: Python - Size: 743 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

JoseLVillaronga/teccam_pdf
Teccam PDF is a Python/Flask web application that extracts text from PDF documents and web pages, automatically converts it to Markdown, and stores it in MongoDB. It offers a responsive interface with light/dark mode, permission management (public/private), reading-position bookmarks, and deployment as a systemd service.
Language: HTML - Size: 41 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

BenjaminDanker/Data-AI-Prepare
A collection of Python utilities for preparing and transforming text data—PDF extraction, paragraph analysis, embedding generation, URL scraping, CSV conversion, and Astra DB uploads
Language: Python - Size: 473 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

RaghuSharma14/PDF-Reader
A PDF Reader application powered by AI, allowing users to upload PDF documents and extract meaningful information using advanced NLP models. Built with Streamlit, Transformers, and Langchain, this app provides a seamless interface for interacting with and analyzing PDF content.
Language: Python - Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

ozcanmiraay/opsbot
AI-powered PDF extraction suite for structured insights from contracts, forms, and documents. Built with Streamlit, LangChain, GPT-4o, and PDFPlumber.
Language: Python - Size: 9.61 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

souvik03-136/TenderBot
Task
Language: Python - Size: 127 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

LorysHamadache/pdf2txt-multipage-extractor
Fast batch tool to extract first-page text from all PDFs in a folder using Python. Optimized with multiprocessing to handle thousands of PDFs efficiently.
Language: Python - Size: 609 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

mateogon/pdf-narrator
Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.
Language: Python - Size: 4.38 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 56 - Forks: 10

AnhDungPham2901/extract_data_from_pdf
Using an LLM to extract unstructured data from PDF files into a structured format
Language: Jupyter Notebook - Size: 217 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

ascender1729/vodafone-financial-analysis
Automated financial table extraction and standardization from Vodafone's annual report using GPT-4o-mini
Language: Rich Text Format - Size: 797 KB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

anquetos/gcp-professional-data-engineer-rag
Build a local RAG (Retrieval Augmented Generation) to generate exam questions for the Google Cloud Platform professional Data Engineer certification.
Language: Jupyter Notebook - Size: 289 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

SSAYKO/schedule_app
Efficient algorithm for generating optimized academic schedules based on subject priorities and group availability.
Language: Python - Size: 59.6 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

siddharth-nandagopal/billionaires-rag-query
Billionaires RAG Query uses LLMs and a RAG framework to analyze the world's billionaires list. Extracts tabular data from PDFs, converts to multiple formats, and enables precise queries about net worth, age, and more. Integrates with Poetry and asdf for easy setup and management.
Language: Python - Size: 707 KB - Last synced at: 5 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

Amartya-007/Pdf-Reader
An app for easily reading and extracting information from PDFs, or chatting with them.
Language: Python - Size: 7.81 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

lectrician1/extract-text-app
Web app to allow users to batch extract text from images and PDFs
Language: Svelte - Size: 536 KB - Last synced at: 6 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

rishisolanke/PDF_Query_Langchain
PDF Query LangChain is a tool that extracts and queries information from PDF documents using advanced language processing. Leveraging LangChain, OpenAI, and Cassandra, this app enables efficient, interactive querying of PDF content. Ideal for data analysis, research, and automated reporting, it simplifies detailed document analysis with ease.
Language: Python - Size: 4.88 KB - Last synced at: 6 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

adobe/pdftools-extract-java-sdk-samples
This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.
Language: Java - Size: 604 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 6

tracywong117/extract-info-from-pdf-paper
This Python script uses pdfminer.six, PyPDF2, and pdf2image to extract information (text, images) from PDF papers.
Language: Python - Size: 3.37 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 1

heijul/pdf2gtfs
A Python tool to extract schedule data from PDF timetables and output it in GTFS.
Language: Python - Size: 14.2 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

heshiming/paddlefish Fork of os-climate/crrf-det
A Python + C implementation for image-based PDF page layout analysis and content extraction.
Language: C++ - Size: 5.26 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

pcschreiber1/PDF_Extraction-Translation
Translate many large PDF Reports for free using Python.
Language: Jupyter Notebook - Size: 5.61 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 3

FTiniNadhirah/Text-Preprocessing
Language: Python - Size: 1.08 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 0 - Forks: 0
