Topic: "pdf-to-text"
infiniflow/ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Language: TypeScript - Size: 60.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 52,874 - Forks: 5,060

docling-project/docling
Get your documents ready for gen AI
Language: Python - Size: 80 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 30,095 - Forks: 1,895

Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.
Language: HTML - Size: 192 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,234 - Forks: 938

run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
Language: Python - Size: 45.8 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 3,975 - Forks: 412

enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Language: Python - Size: 20.3 MB - Last synced at: 2 days ago - Pushed at: 9 days ago - Stars: 1,252 - Forks: 124

Academic-Hammer/SciTSR
Table structure recognition dataset of the paper: Complicated Table Structure Recognition
Language: Python - Size: 47.9 KB - Last synced at: about 2 months ago - Pushed at: almost 5 years ago - Stars: 360 - Forks: 58

pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Language: HTML - Size: 930 KB - Last synced at: 2 days ago - Pushed at: over 1 year ago - Stars: 316 - Forks: 38

shoryasethia/markdrop
A Python package for converting PDFs to markdown while extracting images and tables, and for generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.
Language: Python - Size: 158 KB - Last synced at: 17 days ago - Pushed at: about 2 months ago - Stars: 101 - Forks: 5

BitMiracle/Docotic.Pdf.Samples
C# and VB.NET samples for Docotic.Pdf library
Language: Visual Basic .NET - Size: 53.5 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 78 - Forks: 39

datalogics/adobe-pdf-library-samples
Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library
Size: 43.3 MB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 77 - Forks: 62

galkahana/pdf-text-extraction
CLI for extracting text from PDF files (and possibly tables)
Language: C++ - Size: 5.67 MB - Last synced at: 5 days ago - Pushed at: 2 months ago - Stars: 76 - Forks: 20

nainiayoub/pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
Language: Python - Size: 24.4 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 69 - Forks: 42

autokent/pdf-parse
Pure javascript cross-platform module to extract texts from PDFs.
Last synced at: 21 days ago - Stars: 66 - Forks: 53

papercast-dev/papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.
Language: Python - Size: 218 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 48 - Forks: 1

seinecle/nocodefunctions-web-app
The code base of the front-end of nocodefunctions.com
Language: CSS - Size: 37.7 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 39 - Forks: 7

mbzuai-oryx/KITAB-Bench
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Language: Python - Size: 26.3 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 35 - Forks: 1

iditectweb/converter
Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop Assemblies to convert PDF, DOCX, XLSX, HTML, Image, CSV, RTF, TXT in .NET Framework
Language: C# - Size: 10.7 KB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 31 - Forks: 12

asika32764/php-pdf-2-text
Simple PHP PDF to Text class
Language: PHP - Size: 204 KB - Last synced at: 20 days ago - Pushed at: over 1 year ago - Stars: 24 - Forks: 17

NanoNets/ocr-python
OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.
Language: Jupyter Notebook - Size: 5.52 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 24 - Forks: 4

asepmaulanaismail/pdf-to-txt-python
Simple pdf to text with python using PDFtk and PyPDF2
Language: Python - Size: 550 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 9

LuisAraujo/API-Tabua-Mare
API for obtaining daily data from the Tide Table, using web scraping with PHP.
Language: JavaScript - Size: 1.28 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 12 - Forks: 7

aspose-pdf/Aspose.PDF-for-JavaScript-via-CPP
Aspose.PDF for Javascript via C++
Language: HTML - Size: 923 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 11 - Forks: 0

Clearedge-AI/clearedge
Build a RAG preprocessing pipeline
Language: Jupyter Notebook - Size: 24.7 MB - Last synced at: 26 days ago - Pushed at: about 1 year ago - Stars: 11 - Forks: 0

andrealenzi11/py-poppleract
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents
Language: Python - Size: 202 KB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 10 - Forks: 2

shine-jayakumar/Extract-Data-From-PDF-In-Python
Batch-convert pdf to text, extract data from pdf in python
Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 9 - Forks: 4

AshkanAbd/pdf2word-GUI
convert pdf to word
Language: Java - Size: 18.6 MB - Last synced at: about 2 months ago - Pushed at: over 6 years ago - Stars: 9 - Forks: 6

datalogics/apdfl-csharp-dotnet-samples
Sample code for the Datalogics .NET interface of the Adobe PDF Library
Language: C# - Size: 315 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8 - Forks: 9

datalogics/apdfl-cplusplus-samples
Sample code for the Datalogics C++ interface of the Adobe PDF Library
Language: C++ - Size: 11.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 8 - Forks: 7

bytescout/pdf-extractor-sdk-samples
ByteScout PDF Extractor SDK source code samples
Language: C# - Size: 27.5 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 8 - Forks: 5

asiff00/bangla-pdf-ocr
Bangla PDF to text converter that works on Windows, macOS, and Linux without any extra downloads or configurations.
Language: Python - Size: 94.7 KB - Last synced at: 7 days ago - Pushed at: 7 months ago - Stars: 8 - Forks: 2

mic-kul/pdf-textstream
JRuby gem to convert PDF to text while keeping the layout of the original PDF file
Language: Java - Size: 3.59 MB - Last synced at: about 1 month ago - Pushed at: about 7 years ago - Stars: 8 - Forks: 0

madnight/pdf-layout-text-stripper (fork of JonathanLink/PDFLayoutTextStripper)
Converts a pdf file into a text file while keeping the layout of the original pdf.
Language: Java - Size: 1.8 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 7 - Forks: 4

monambike/pdfconverter-pdftables-to-csv
Python project that converts tables inside PDFs to CSV for convenient data manipulation. It has log and exception handling.
Language: Python - Size: 142 MB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 1

renan-siqueira/python-pdf-tool
This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing the choice among different text extraction libraries and supporting both single PDF file and directory containing multiple PDF files.
Language: Python - Size: 7.81 KB - Last synced at: about 2 months ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

datalogics/apdfl-java-maven-samples
Sample code for the Datalogics Java interface of the Adobe PDF Library, set up to build with Maven
Language: Java - Size: 1.18 MB - Last synced at: 9 days ago - Pushed at: 10 days ago - Stars: 4 - Forks: 11

adaptaware/ragit
A RAG back-end and front-end application
Language: Python - Size: 985 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 4 - Forks: 0

datalogics/apdfl-csharp-dotnet-framework-samples
Sample code for the Datalogics .NET Framework interface of the Adobe PDF Library
Language: C# - Size: 563 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 3 - Forks: 9

arjun-mavonic/scanned-pdf-text-extractor
This is a Python application that converts non-readable PDF files, such as scanned documents, into readable Word documents. It achieves this by first converting the PDF files into images and then extracting the text from the images to create the Word documents. The application provides a user-friendly interface to do the above task.
Language: Python - Size: 28.3 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 3 - Forks: 2

KOUISAmine/pdf-tools
A collection of PDF tools to convert, merge, and compress PDFs. Free & No installation.
Size: 2.93 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

jdonohue44/NOAA-Weather-Modification-Forms-LLM-Extractor
Extract key information from 1,000s of NOAA Form 17-4 (Initial Report On Weather Modification Activities) using LLM.
Language: Python - Size: 982 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 2 - Forks: 0

seinecle/nocodefunctions-io
io for nocodefunctions: csv, txt, pdf, and xlsx so far
Language: Java - Size: 174 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 2 - Forks: 0

baughmann/tikara
The metadata and text content extractor for almost every file type.
Language: Python - Size: 161 MB - Last synced at: about 3 hours ago - Pushed at: 4 months ago - Stars: 2 - Forks: 0

autokent/crawler-request
HTTP request module customized for crawlers.
Last synced at: about 1 month ago - Stars: 2 - Forks: 2

codewithalihamza/SummarizeAI
SummarizeAI is a powerful AI-driven SaaS tool that converts lengthy PDF documents into clear, concise summaries in seconds. Whether you're a student, researcher, or busy professional, SummarizeAI helps you save time and extract key insights effortlessly.
Language: TypeScript - Size: 943 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

datalogics/apdfl-vb-dotnet-samples
Adobe PDF Library Samples in Visual Basic for .NET
Language: Visual Basic .NET - Size: 174 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 4

FurqanHun/textnomnom-py
Extract text from PDFs, PPTs, & URLs (with OCR support). Converts PPT to PDF & handles files or folders. 🦍
Language: Python - Size: 46.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

StellarExplorerGuy/projects
Repo for all projects
Size: 8.75 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 1 - Forks: 0

sushantnair/arxiv_extractor
This code can effectively convert PDF Research Papers to clean Text files, avoiding images and tables.
Language: Python - Size: 7.81 KB - Last synced at: about 2 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

cr4yfish/docling-js
Parsing Documents to one datatype (Typescript port of Docling)
Size: 23.4 KB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

princebhatt9588/Versatile_Code_Hub
VersatileCodeHub: Your one-stop repository for an array of coding projects. Explore diverse applications, from games like Flappy Bird to tools like QRCode Scanners. Expand your skills across various domains, all in one place.
Language: Python - Size: 4.98 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 1

SaiGanesh-S/OCR-Django
Implementing the concept of Optical Character Recognition in Django
Language: Python - Size: 290 KB - Last synced at: 7 months ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

selectpdf/selectpdf-api-nodejs-client
Node.js client for SelectPdf Online REST API
Language: JavaScript - Size: 30.3 KB - Last synced at: 3 days ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

selectpdf/selectpdf-api-ruby-client
Ruby client for SelectPdf Online REST API
Language: Ruby - Size: 21.5 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

selectpdf/selectpdf-api-perl-client
Perl client for SelectPdf Online REST API
Language: Perl - Size: 27.3 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

bytescout/pdfco-rails
PDF.co Gem plugin for Ruby on Rails
Language: Ruby - Size: 13.7 KB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 1

amitbd1508/Blind-EYE
A book reader with voice control functionality for blind people
Language: C# - Size: 7.04 MB - Last synced at: 2 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 1

zevio/pcu_pdf
PDF parser component (Apache Tika) for PCU project
Language: Python - Size: 53.3 MB - Last synced at: 26 days ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

orijtech/tikago
Apache Tika adapter in Go
Language: Go - Size: 48 MB - Last synced at: about 9 hours ago - Pushed at: over 8 years ago - Stars: 1 - Forks: 0

joinsime/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 1.95 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

kevv1m/tikara
The metadata and text content extractor for almost every file type.
Size: 1000 Bytes - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

AkandindaJunior/Cloud-Services
If it’s not documented, it never happened. 📝 Please check my README.md for more details. 🔍
Size: 1000 Bytes - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

datalogics/apdfl-kotlin-samples
Adobe PDF Library Samples in Kotlin
Language: Kotlin - Size: 139 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 6

l1m1nal/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 0 Bytes - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

islam-bld/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Language: JavaScript - Size: 2.93 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

genisis2025/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Language: JavaScript - Size: 2.93 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

IlyaFerens/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 1.95 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

akii2423/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 0 Bytes - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Feysis/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 1.95 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

shahidmanzoor1/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 1.95 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

bdbhaislive/Adobe-Acrobat
Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents
Size: 3.91 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

JaspreetSingh-exe/MedXpert-Backend-FastAPI
AI-powered medical report analyzer that extracts text from PDFs/images, summarizes reports, detects abnormalities, and provides a chatbot for medical queries. Built with FastAPI, OCR (Tesseract, pdfplumber), OpenAI GPT-3.5, and deployed on Google Cloud. Future enhancements include medical image classification and predictions. Contributions Welcome!
Language: Python - Size: 56.6 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

JaspreetSingh-exe/MedXpert-FrontEnd
MedXpert is an Android-based healthcare application that leverages OCR (Tesseract, pdfplumber) and LLMs (OpenAI GPT-3.5) to automate medical report extraction, abnormality detection, and natural language summarization. It features Firebase-powered user authentication, role-based access control, and real-time chatbot integration for medical queries.
Language: Kotlin - Size: 2.98 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

davibusanello/pdf2txt
A simple CLI to convert PDF files into TXT using OCR
Language: Python - Size: 23.4 KB - Last synced at: 3 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

AliAlWahayb/Arabic_OCR_From_PDF (fork of zaakki-ahamed/Arabic_OCR_From_PDF)
Perform Optical Character Recognition (OCR) on a scanned PDF file containing Arabic text and output a searchable PDF
Language: Python - Size: 197 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

AlexTkDev/PDF-to-Word-Conversion
A parser that retypes text from a PDF into an MS Word document according to the given specifications
Language: Python - Size: 35.2 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

CllsPy/PyPTE
The PDF Text Extractor API allows users to upload PDF files and receive the extracted text from those files. This API is built using FastAPI and leverages the PyMuPDF library for efficient text extraction.
Language: Python - Size: 11.7 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

gabriel-batistuta/pdf-to-any
A simple and functional multi-format conversion system using a number of Python libraries
Language: Python - Size: 36.1 KB - Last synced at: 11 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

dongju93/extract-ti-from-reports
Convert PDFs to text, then transform that text into structured JSON objects for Threat Intelligence.
Language: Jupyter Notebook - Size: 134 MB - Last synced at: 3 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

Kamaruddheen/document-scanner
Extract structured text and data from documents such as invoices, book pages, and tables, using OpenCV and Tesseract OCR
Language: HTML - Size: 68.7 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

mehmet-kozan/pdf-parse
Pure javascript cross-platform module to extract texts from PDFs.
Language: JavaScript - Size: 7.78 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

ExceptedPrism3/PDFToAudio
"PDF To Audio" is a Python tool that transforms PDF documents into audio files using OCR and Text-to-Speech technology. Ideal for accessibility and auditory learning, it supports multiple languages, parallel processing, and smart rate limit handling.
Language: Python - Size: 2.81 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

53buahapel/pdf-to-text-converter
Python script that I made to convert PDF to text
Language: Python - Size: 1000 Bytes - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

aishwarya-art/Pdf-to-text-extract
Sample code for PDF-to-text extraction using the PDF Parser library in CodeIgniter 3
Language: PHP - Size: 2.93 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

revanthkalagudi/pdf-to-text-python
This code is designed to analyze a PDF document and determine the percentage of AI-generated content within the text. It utilizes the PyPDF2 library to extract the text from each page of the PDF and the NLTK library to check for AI-generated words.
Language: Python - Size: 10.7 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

fabriziomiano/pdf2txt-azure-ocr
A script to convert PDF files to TXT
Language: Python - Size: 8.79 KB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Directorman9/Optical-character-recognition
The notebook in this repository uses pytesseract to extract text from a PDF document. The script can be used to automate text acquisition from a large body of printed resources such as books. The acquired text can then be used for downstream tasks, such as training language models, topic models, and document summarization.
Size: 1000 Bytes - Last synced at: almost 2 years ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

mfakca/pdf2text
Converts PDFs to text
Language: Roff - Size: 21.4 MB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

zevio/pcu_io
IO management for PCU project
Language: Python - Size: 361 KB - Last synced at: 28 days ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0
