GitHub topics: pdf-extraction

Repositories

matheus-rech/systematic-review-extractor

AI-powered systematic review data extraction system with zero hallucination guarantee

Size: 188 KB - Last synced at: about 6 hours ago - Pushed at: about 7 hours ago - Stars: 0 - Forks: 0

Goldziher/kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

Language: Python - Size: 28.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 2,329 - Forks: 95

24eme/signaturepdf

Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs

Language: JavaScript - Size: 6.83 MB - Last synced at: 8 days ago - Pushed at: 29 days ago - Stars: 619 - Forks: 70

pytr-org/pytr

Use TradeRepublic in terminal and mass download all documents

Language: Python - Size: 270 KB - Last synced at: 9 days ago - Pushed at: 29 days ago - Stars: 573 - Forks: 113

billy-enrizky/pdf-extraction

Scalable PDF Extraction using Multimodal GPT-4o

Language: Python - Size: 144 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

Language: Python - Size: 69.3 KB - Last synced at: 7 days ago - Pushed at: 10 months ago - Stars: 92 - Forks: 8

MarkShawn2020/video2ppt Fork of wudududu/extract-video-ppt

Extract presentation slides from videos with accurate timestamps

Language: Swift - Size: 476 KB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 6 - Forks: 0

Aumlo123/pdfdoom

DOOM in a PDF (as ascii art)

Size: 1000 Bytes - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

aidalinfo/extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

Language: TypeScript - Size: 455 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 6 - Forks: 0

arv-fazriansyah/ekstrak-pdf-kartu-keluarga

Ekstrak PDF Kartu Keluarga is a React + Vite web application that uses the Google Gemini API to automatically extract data from KK (Kartu Keluarga, family card) documents (PDF or ZIP), display it in an interactive table, and export the results to Excel.

Language: TypeScript - Size: 44.9 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

ArtifexSoftware/mupdf.js

JavaScript bindings for MuPDF

Size: 2.64 MB - Last synced at: 28 days ago - Pushed at: 3 months ago - Stars: 558 - Forks: 39

Khanna-Aman/tesseract-invoice-ocr

Python CLI tool for extracting structured data from scanned invoices using Tesseract OCR. Converts PDF/image invoices to CSV/JSON with vendor details, line items, and totals. Features robust error handling, batch processing, and professional-grade code quality.

Language: Python - Size: 141 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

iodize6399/wwmai-copper-data

Historical copper price data from WWMAI circulars. Raw PDFs and structured CSV data tracking electrolytic copper wire rod prices and calculation components.

Size: 22.4 MB - Last synced at: 30 days ago - Pushed at: 30 days ago - Stars: 0 - Forks: 0

cam-rodrigues/fydsync

FidSync is a professional-grade web tool that helps financial teams extract fund statuses from PDF scorecards and update Excel templates accurately — without manual matching or formatting headaches. Built with Streamlit · PDF + Excel automation · Fuzzy matching · Secure and client-ready

Language: Python - Size: 11.1 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

MohamedAziz15/MLOps-pipeline

End-to-End LLMOps Pipeline

Language: Jupyter Notebook - Size: 1020 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Vejandlachakrish/PersonaPrep-Persona-Aligned-Educational-PDF-Extractor

Extracts and ranks example problems, derivations, and formulas from physics PDFs using PyMuPDF. Fully containerized with Docker, no network access required.

Language: Python - Size: 2.43 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

anyparser/anyparserjs

Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.

Language: TypeScript - Size: 408 KB - Last synced at: 3 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

javaidb/personal-finance-tracker

Personal finance tracker via interpretation of bank statements from Scotiabank. Insights into spending habits, trends and long-term growth.

Language: Jupyter Notebook - Size: 851 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

rrayhka/GRI-Extractor

A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.

Language: Python - Size: 13.9 MB - Last synced at: 29 days ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

bylickilabs/pdfAnalyzer

PDF Analyzer is an efficient Python tool for automatically analyzing PDF documents.

Language: Python - Size: 6.84 KB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

sgrimee/waste-calendar-extractor

Extract waste collection dates for the Luxembourgish commune of Niederanven from PDF calendars and generate iCal files.

Language: Python - Size: 2.06 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

nickchristopherson/duluth-tourism-analysis

End-to-End Data Pipeline for Tourism Industry Analysis

Language: HTML - Size: 6.84 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

olympus-terminal/data-processing

Data analysis and processing tools

Language: Python - Size: 14.6 KB - Last synced at: about 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Atul-vaibhav/OCR-Extraction-Using-Python

Extract text from images and PDFs using Python, store it in JSON format, and save the extracted data in a MySQL database.

Language: Python - Size: 743 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

JoseLVillaronga/teccam_pdf

Teccam PDF is a Python/Flask web application that extracts text from PDF documents and web pages, automatically converts it to Markdown, and stores it in MongoDB. It offers a responsive interface with light/dark mode, permission management (public/private), reading-position bookmarks, and deployment as a systemd service.

Language: HTML - Size: 41 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

BenjaminDanker/Data-AI-Prepare

A collection of Python utilities for preparing and transforming text data—PDF extraction, paragraph analysis, embedding generation, URL scraping, CSV conversion, and Astra DB uploads

Language: Python - Size: 473 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

RaghuSharma14/PDF-Reader

A PDF Reader application powered by AI, allowing users to upload PDF documents and extract meaningful information using advanced NLP models. Built with Streamlit, Transformers, and Langchain, this app provides a seamless interface for interacting with and analyzing PDF content.

Language: Python - Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

ozcanmiraay/opsbot

AI-powered PDF extraction suite for structured insights from contracts, forms, and documents. Built with Streamlit, LangChain, GPT-4o, and PDFPlumber.

Language: Python - Size: 9.61 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

souvik03-136/TenderBot

Task

Language: Python - Size: 127 MB - Last synced at: 5 months ago - Pushed at: 6 months ago - Stars: 1 - Forks: 0

LorysHamadache/pdf2txt-multipage-extractor

Fast batch tool to extract first-page text from all PDFs in a folder using Python. Optimized with multiprocessing to handle thousands of PDFs efficiently.

Language: Python - Size: 609 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 1 - Forks: 0

mateogon/pdf-narrator

Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

Language: Python - Size: 4.38 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 56 - Forks: 10

AnhDungPham2901/extract_data_from_pdf

Using LLM to extract unstructured data from pdf file into structured format

Language: Jupyter Notebook - Size: 217 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

ascender1729/vodafone-financial-analysis

Automated financial table extraction and standardization from Vodafone's annual report using GPT-4o-mini

Language: Rich Text Format - Size: 797 KB - Last synced at: 9 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

anquetos/gcp-professional-data-engineer-rag

Build a local RAG (Retrieval Augmented Generation) to generate exam questions for the Google Cloud Platform professional Data Engineer certification.

Language: Jupyter Notebook - Size: 289 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

SSAYKO/schedule_app

Efficient algorithm for generating optimized academic schedules based on subject priorities and group availability.

Language: Python - Size: 59.6 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

siddharth-nandagopal/billionaires-rag-query

Billionaires RAG Query uses LLMs and a RAG framework to analyze the world's billionaires list. Extracts tabular data from PDFs, converts to multiple formats, and enables precise queries about net worth, age, and more. Integrates with Poetry and asdf for easy setup and management.

Language: Python - Size: 707 KB - Last synced at: 5 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

Amartya-007/Pdf-Reader

An app for easily reading and extracting information from PDFs, or chatting with your PDFs.

Language: Python - Size: 7.81 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 1 - Forks: 0

lectrician1/extract-text-app

Web app to allow users to batch extract text from images and PDFs

Language: Svelte - Size: 536 KB - Last synced at: 6 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

rishisolanke/PDF_Query_Langchain

PDF Query LangChain is a tool that extracts and queries information from PDF documents using advanced language processing. Leveraging LangChain, OpenAI, and Cassandra, this app enables efficient, interactive querying of PDF content. Ideal for data analysis, research, and automated reporting, it simplifies detailed document analysis.

Language: Python - Size: 4.88 KB - Last synced at: 6 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

adobe/pdftools-extract-java-sdk-samples

This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.

Language: Java - Size: 604 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 6 - Forks: 6

tracywong117/extract-info-from-pdf-paper

This Python script uses pdfminer.six, PyPDF2, and pdf2image to extract information (text, images) from PDF papers.

Language: Python - Size: 3.37 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 1

heijul/pdf2gtfs

A python tool to extract schedule data from PDF timetables and output it in GTFS.

Language: Python - Size: 14.2 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 0

heshiming/paddlefish Fork of os-climate/crrf-det

A Python + C implementation for image-based PDF page layout analysis and content extraction.

Language: C++ - Size: 5.26 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

pcschreiber1/PDF_Extraction-Translation

Translate many large PDF Reports for free using Python.

Language: Jupyter Notebook - Size: 5.61 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 5 - Forks: 3

FTiniNadhirah/Text-Preprocessing

Language: Python - Size: 1.08 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 0 - Forks: 0

Related Keywords

pdf-extraction 45 python 18 pdf 10 ocr 8 machine-learning 7 automation 6 streamlit 6 text-extraction 5 langchain 5 rag 5 document-processing 5 nlp 4 data-analysis 4 llm 4 openai 4 json 3 csv 3 table-extraction 3 pdf-analysis 3 data-extraction 3 opencv 2 financial-analysis 2 pdf-viewer 2 typescript 2 cli-tool 2 retrieval-augmented-generation 2 gpt-4o 2 pytesseract 2 pdfplumber 2 camelot 2 structured-data 2 pymupdf 2 embeddings 2 artificial-intelligence 2 data-visualization 2 pdf-extractor 2 image-processing 2 web-scraping 2 pandas 2 text-mining 2 pdf-tools 2 pdf-manipulation 2 pdf-editor 2 finance 2 natural-language-processing 2 ical 1 api 1 beautifulsoup4 1 dark-mode 1 dotenv 1 flask 1 markdown 1 mongodb 1 extract 1 pymupdf4llm 1 responsive-design 1 systemd 1 astra-db 1 research-tool 1 document-query 1 data-preparation 1 relative-text 1 question-answering 1 text-processing 1 url-scraping 1 vector-database 1 natural-language-processing-nlp 1 pdf-reader 1 google-api-client 1 preprocessing 1 luxembourg 1 merge-pdf 1 waste-management 1 duluth 1 economic-analysis 1 jupyter 1 anaconda 1 tourism 1 data-processing 1 data-science 1 etl 1 r 1 pdf-translation 1 research 1 statistics 1 layout-analysis 1 image-segmentation 1 image-analysis 1 mysql 1 ocr-python 1 ocr-recognition 1 ocr-text-reader 1 python3 1 gtfs 1 java 1 low-resource 1 pdf-to-audiobook 1 text-to-speech 1 tts 1 ai 1