Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: pdf-to-text

BitMiracle/Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

Language: Visual Basic .NET - Size: 52.9 MB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 65 - Forks: 39

datalogics/apdfl-cplusplus-samples

Sample code for the Datalogics C++ interface of the Adobe PDF Library

Language: C++ - Size: 11 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 6 - Forks: 5

infiniflow/ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Language: Python - Size: 19.4 MB - Last synced: 21 days ago - Pushed: 22 days ago - Stars: 5,952 - Forks: 499

aspose-pdf/Aspose.PDF-for-JavaScript-via-CPP

Aspose.PDF for Javascript via C++

Language: HTML - Size: 314 MB - Last synced: 21 days ago - Pushed: 23 days ago - Stars: 7 - Forks: 0

seinecle/nocodefunctions-io

io for nocodefunctions: csv, txt, pdf, and xlsx so far

Language: Java - Size: 156 KB - Last synced: 24 days ago - Pushed: 24 days ago - Stars: 2 - Forks: 0

fabriziomiano/pdf2txt-azure-ocr

A script to convert PDF files to TXT

Language: Python - Size: 8.79 KB - Last synced: 25 days ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0

seinecle/nocodefunctions-web-app

The code base of the front-end of nocodefunctions.com

Language: CSS - Size: 27 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 30 - Forks: 5

asika32764/php-pdf-2-text

Simple PHP PDF to Text class

Language: PHP - Size: 204 KB - Last synced: 28 days ago - Pushed: 6 months ago - Stars: 24 - Forks: 17

Unstructured-IO/unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Language: HTML - Size: 124 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 5,819 - Forks: 424

Clearedge-AI/clearedge

Build a RAG preprocessing pipeline

Language: Jupyter Notebook - Size: 24.7 MB - Last synced: 18 days ago - Pushed: about 1 month ago - Stars: 9 - Forks: 0

princebhatt9588/Versatile_Code_Hub

VersatileCodeHub: Your one-stop repository for an array of coding projects. Explore diverse applications, from games like Flappy Bird to tools like QRCode Scanners. Expand your skills across various domains, all in one place.

Language: Python - Size: 4.98 MB - Last synced: 25 days ago - Pushed: 9 months ago - Stars: 1 - Forks: 1

datalogics/apdfl-java-maven-samples

Sample code for the Datalogics Java interface of the Adobe PDF Library, set up to build with Maven

Language: Java - Size: 862 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 2 - Forks: 10

datalogics/apdfl-csharp-dotnet-samples

Sample code for the Datalogics .NET interface of the Adobe PDF Library

Language: C# - Size: 258 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 4 - Forks: 7

monambike/pdfconverter-pdftables-to-csv

Python project that converts tables inside PDFs to CSV for convenient data manipulation. It includes logging and exception handling.

Language: Python - Size: 142 MB - Last synced: 26 days ago - Pushed: about 2 months ago - Stars: 4 - Forks: 1

bytescout/pdfco-rails

PDF.co Gem plugin for Ruby on Rails

Language: Ruby - Size: 13.7 KB - Last synced: about 2 months ago - Pushed: over 3 years ago - Stars: 1 - Forks: 1

dongju93/extract-ti-from-reports

Convert PDFs to text, then transform that text into structured JSON objects for Threat Intelligence.

Language: Jupyter Notebook - Size: 134 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0

bytescout/pdf-extractor-sdk-samples

ByteScout PDF Extractor SDK source code samples

Language: C# - Size: 27.4 MB - Last synced: about 2 months ago - Pushed: 10 months ago - Stars: 7 - Forks: 5

Kamaruddheen/document-scanner

Extract structured text and data from documents like invoices, book pages, tables, etc., using OpenCV and Tesseract OCR

Language: HTML - Size: 68.7 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 0 - Forks: 0

madnight/pdf-layout-text-stripper (fork of JonathanLink/PDFLayoutTextStripper)

Converts a pdf file into a text file while keeping the layout of the original pdf.

Language: Java - Size: 1.8 MB - Last synced: about 1 month ago - Pushed: about 7 years ago - Stars: 6 - Forks: 3

Academic-Hammer/SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

Language: Python - Size: 47.9 KB - Last synced: 2 months ago - Pushed: almost 4 years ago - Stars: 316 - Forks: 56

mic-kul/pdf-textstream

JRuby gem to convert PDF to text while keeping the layout of the original PDF file

Language: Java - Size: 3.59 MB - Last synced: 7 days ago - Pushed: about 6 years ago - Stars: 8 - Forks: 0

mehmet-kozan/pdf-parse

Pure javascript cross-platform module to extract texts from PDFs.

Language: JavaScript - Size: 7.78 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Language: HTML - Size: 930 KB - Last synced: 3 months ago - Pushed: 7 months ago - Stars: 246 - Forks: 33

reirualluap/PDF-to-TEXT

Transform any .pdf into a .txt file in a few seconds

Language: Jupyter Notebook - Size: 28.3 KB - Last synced: 3 months ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

datalogics/apdfl-csharp-dotnet-framework-samples

Sample code for the Datalogics .NET Framework interface of the Adobe PDF Library

Language: C# - Size: 517 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1 - Forks: 6

andrealenzi11/py-poppleract

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents

Language: Python - Size: 195 KB - Last synced: 3 months ago - Pushed: 6 months ago - Stars: 5 - Forks: 0

ExceptedPrism3/PDFToAudio

"PDF To Audio" is a Python tool that transforms PDF documents into audio files using OCR and Text-to-Speech technology. Ideal for accessibility and auditory learning, it supports multiple languages, parallel processing, and smart rate limit handling.

Language: Python - Size: 2.81 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

galkahana/pdf-text-extraction

CLI for extracting text from PDF files (and possibly tables)

Language: C++ - Size: 5.41 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 58 - Forks: 15

KOUISAmine/pdf-tools

A collection of PDF tools to convert, merge, and compress PDFs. Free & No installation.

Size: 2.93 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 1 - Forks: 0

53buahapel/pdf-to-text-converter

python script that i made to convert pdf to text

Language: Python - Size: 1000 Bytes - Last synced: 3 months ago - Pushed: 5 months ago - Stars: 0 - Forks: 0

renan-siqueira/python-pdf-tool

This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing a choice among different text extraction libraries and supporting both a single PDF file and a directory containing multiple PDF files.

Language: Python - Size: 6.84 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 1 - Forks: 0

iditectweb/converter

Standalone .NET Converter library that requires neither the Adobe Acrobat component nor Microsoft Office Interop Assemblies to convert PDF, DOCX, XLSX, HTML, Image, CSV, RTF, and TXT in .NET Framework

Language: C# - Size: 10.7 KB - Last synced: 7 months ago - Pushed: over 5 years ago - Stars: 31 - Forks: 12

aishwarya-art/Pdf-to-text-extract

Sample code for PDF-to-text extraction using the PDF Parser library in CodeIgniter 3

Language: PHP - Size: 2.93 KB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 0 - Forks: 0

datalogics/adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

Size: 43.3 MB - Last synced: about 1 month ago - Pushed: 12 months ago - Stars: 77 - Forks: 62

nainiayoub/pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

Language: Python - Size: 56.6 KB - Last synced: 9 months ago - Pushed: 11 months ago - Stars: 43 - Forks: 21

LuisAraujo/API-Tabua-Mare

API for obtaining daily data from the Tide Table, using web scraping with PHP.

Language: JavaScript - Size: 1.28 MB - Last synced: 9 months ago - Pushed: 11 months ago - Stars: 12 - Forks: 7

Directorman9/Optical-character-recognition

The notebook in this repository uses pytesseract to extract text from a PDF document. The script can be used to automate text acquisition from a large body of printed resources such as books. The acquired text can then be used for downstream tasks, such as training language models, topic models, document summarization, etc.

Size: 1000 Bytes - Last synced: 10 months ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0

NanoNets/ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

Language: Jupyter Notebook - Size: 5.52 MB - Last synced: 10 months ago - Pushed: over 1 year ago - Stars: 24 - Forks: 4

isuruwa/PDF-TOOLBOX

A Multi Purpose PDF Toolkit

Language: Python - Size: 121 KB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 39 - Forks: 6

revanthkalagudi/pdf-to-text-python

This code is designed to analyze a PDF document and determine the percentage of AI-generated content within the text. It utilizes the PyPDF2 library to extract the text from each page of the PDF and the NLTK library to check for AI-generated words.

Language: Python - Size: 10.7 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

amitbd1508/Blind-EYE

A book reader with voice control functionality for blind people

Language: C# - Size: 7.04 MB - Last synced: about 1 year ago - Pushed: almost 4 years ago - Stars: 1 - Forks: 0

shine-jayakumar/Extract-Data-From-PDF-In-Python

Batch-convert pdf to text, extract data from pdf in python

Language: Python - Size: 13.7 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 9 - Forks: 4

asepmaulanaismail/pdf-to-txt-python

Simple pdf to text with python using PDFtk and PyPDF2

Language: Python - Size: 550 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 13 - Forks: 9

AshkanAbd/pdf2word-GUI

convert pdf to word

Language: Java - Size: 18.6 MB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 8 - Forks: 5

selectpdf/selectpdf-api-nodejs-client

Node.js client for SelectPdf Online REST API

Language: JavaScript - Size: 30.3 KB - Last synced: 9 months ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

selectpdf/selectpdf-api-ruby-client

Ruby client for SelectPdf Online REST API

Language: Ruby - Size: 21.5 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

selectpdf/selectpdf-api-perl-client

Perl client for SelectPdf Online REST API

Language: Perl - Size: 27.3 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

mfakca/pdf2text

Converts PDFs to text

Language: Roff - Size: 21.4 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0

zevio/pcu_pdf

PDF parser component (Apache Tika) for PCU project

Language: Python - Size: 53.3 MB - Last synced: 15 days ago - Pushed: over 5 years ago - Stars: 1 - Forks: 0

zevio/pcu_io

IO management for PCU project

Language: Python - Size: 361 KB - Last synced: 9 days ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

orijtech/tikago

Apache Tika adapter in Go

Language: Go - Size: 48 MB - Last synced: about 1 month ago - Pushed: over 7 years ago - Stars: 0 - Forks: 0