GitHub topics: pdf-to-text | Ecosyste.ms: Repos

Unstructured-IO/unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Language: HTML - Size: 193 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 12,544 - Forks: 1,030

run-llama/llama_cloud_services

Knowledge Agents and Management in the Cloud

Language: TypeScript - Size: 66.5 MB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 4,122 - Forks: 443

proafxin/screener

AI-based HR screener/resume helper

Language: Python - Size: 200 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

docling-project/docling

Get your documents ready for gen AI

Language: Python - Size: 132 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 37,086 - Forks: 2,565

aspose-pdf/Aspose.PDF-for-Node.js-via-CPP

Aspose.PDF for Node.js via C++

Language: JavaScript - Size: 145 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

opendataloader-project/opendataloader-pdf

🔓 Unlock Your PDFs for AI

Language: Java - Size: 47.9 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 6 - Forks: 0

aspose-pdf/Aspose.PDF-for-JavaScript-via-CPP

Aspose.PDF for JavaScript via C++

Language: HTML - Size: 1.04 GB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 12 - Forks: 0

datalogics/apdfl-vb-dotnet-samples

Adobe PDF Library Samples in Visual Basic for .NET

Language: Visual Basic .NET - Size: 176 KB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 1 - Forks: 4

seinecle/nocodefunctions-web-app

The code base of the front-end of nocodefunctions.com

Language: Java - Size: 38.2 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 40 - Forks: 7

tacticaxyz/tactica.llama.pdftotext.net

Pdf to Text .NET transcriber CLI app using Ollama

Language: C# - Size: 1.09 MB - Last synced at: 8 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 1

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Language: Python - Size: 20.5 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1,378 - Forks: 134

Titan-327/Resume-Analyser

A Resume Analyzer that scores resumes, highlights issues, suggests improvements, generates LaTeX code for each section, and visualizes ATS scores through interactive performance graphs.

Language: TypeScript - Size: 134 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

MohammadHNdev/Persian_OCR_Arvangram

🔥 Best free Persian/Farsi OCR tool - Extract text from PDF/images with 95%+ accuracy. Fast, reliable, Google Colab ready.

Language: Jupyter Notebook - Size: 12.7 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 3 - Forks: 0

datalogics/apdfl-csharp-dotnet-samples

Sample code for the Datalogics .NET interface of the Adobe PDF Library

Language: C# - Size: 315 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 8 - Forks: 10

AkandindaJunior/Cloud-Services

If it’s not documented, it never happened. 📝 Please check my README.md for more details. 🔍

Size: 1000 Bytes - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

kevv1m/tikara

The metadata and text content extractor for almost every file type.

Size: 1000 Bytes - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 0 - Forks: 0

galkahana/pdf-text-extraction

CLI for extracting text from PDF files (and possibly tables)

Language: C++ - Size: 5.67 MB - Last synced at: 3 days ago - Pushed at: 3 months ago - Stars: 74 - Forks: 20

datalogics/apdfl-kotlin-samples

Adobe PDF Library Samples in Kotlin

Language: Kotlin - Size: 146 KB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 0 - Forks: 7

hddevteam/vscode-md-converter

OneClick Markdown Converter: a VS Code extension that converts Word, Excel, PDF, and PowerPoint documents to Markdown in one click. Supports batch processing, smart text extraction, and speaker-notes extraction, with a bilingual Chinese/English interface.

Language: TypeScript - Size: 6.41 MB - Last synced at: 22 days ago - Pushed at: 23 days ago - Stars: 6 - Forks: 1

gabriele-mastrapasqua/fastapi-ocr

FastAPI OCR service using tesseract or paddleOCR

Language: Python - Size: 86.9 KB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 1 - Forks: 0

pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Language: HTML - Size: 930 KB - Last synced at: 5 days ago - Pushed at: almost 2 years ago - Stars: 326 - Forks: 39

BitMiracle/Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

Language: Visual Basic .NET - Size: 53.8 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 78 - Forks: 39

seinecle/nocodefunctions-io

io for nocodefunctions: csv, txt, pdf, and xlsx so far

Language: Java - Size: 273 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

AI-Enginner/Intelligent-Document-Processing

AI-powered data extraction tool that converts PDFs, images, and scanned documents into structured data in seconds.

Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

datalogics/apdfl-java-maven-samples

Sample code for the Datalogics Java interface of the Adobe PDF Library, set up to build with Maven

Language: Java - Size: 1.2 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 5 - Forks: 12

taher-el-mehdi/story-to-video

🎥 A command-line Python tool that converts a PDF story into a video.

Language: Python - Size: 226 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

jdonohue44/NOAA-Weather-Modification-Forms-LLM-Extractor

Extract key information from thousands of NOAA Form 17-4 reports (Initial Report On Weather Modification Activities) using an LLM.

Language: Python - Size: 993 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

asiff00/bangla-pdf-ocr

Bangla PDF to text converter that works on Windows, macOS, and Linux without any extra downloads or configurations.

Language: Python - Size: 94.7 KB - Last synced at: 6 days ago - Pushed at: 11 months ago - Stars: 13 - Forks: 2

datalogics/apdfl-cplusplus-samples

Sample code for the Datalogics C++ interface of the Adobe PDF Library

Language: C++ - Size: 35.3 MB - Last synced at: about 23 hours ago - Pushed at: 1 day ago - Stars: 9 - Forks: 9

datalogics/apdfl-csharp-dotnet-framework-samples

Sample code for the Datalogics .NET Framework interface of the Adobe PDF Library

Language: C# - Size: 564 KB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 3 - Forks: 9

baughmann/tikara

The metadata and text content extractor for almost every file type.

Language: Python - Size: 161 MB - Last synced at: 4 days ago - Pushed at: 7 months ago - Stars: 4 - Forks: 0

Rishabh9559/pdf-to-text

Welcome to the PDF to Text Converter built with Python and Streamlit! Upload any PDF file, extract clean text in seconds, and download it as a .txt file — all through a beautiful and easy-to-use web interface.

Language: Python - Size: 72.3 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

GiftMungmeeprued/document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

Size: 4.25 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 94 - Forks: 1

shoryasethia/markdrop

A Python package for converting PDFs to Markdown while extracting images and tables and generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

Language: Python - Size: 158 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 123 - Forks: 5

StellarExplorerGuy/projects

Repo for all projects

Size: 12.9 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

FurqanHun/textnomnom-py

Extract text from PDFs, PPTs, & URLs (with OCR support). Converts PPT to PDF & handles files or folders. 🦍

Language: Python - Size: 76.2 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

codewithalihamza/SummarizeAI

SummarizeAI is a powerful AI-driven SaaS tool that converts lengthy PDF documents into clear, concise summaries in seconds. Whether you're a student, researcher, or busy professional, SummarizeAI helps you save time and extract key insights effortlessly.

Language: TypeScript - Size: 990 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

mbzuai-oryx/KITAB-Bench

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Language: Python - Size: 26.3 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 40 - Forks: 2

TechAtNYU/ClassroomLM

ClassroomLM allows educational institutions to create specialized AI assistants for their classrooms.

Language: TypeScript - Size: 5.17 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 3 - Forks: 2

Utkarsh212/react-pdftotext

Lightweight, memory-safe client library for extracting plain text from PDF files.

Language: TypeScript - Size: 143 KB - Last synced at: 2 months ago - Pushed at: 10 months ago - Stars: 19 - Forks: 6

monambike/pdfconverter-pdftables-to-csv

Python project that converts tables inside PDFs to CSV for convenient data manipulation. It has log and exception handling.

Language: Python - Size: 142 MB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

tusharkolekar24/GenAI-for-Pdf-Summarizer

Generative AI for Pdf Summarizer

Language: CSS - Size: 1.92 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

MegrezAI/LeapRAG

LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

Language: Python - Size: 8.86 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

joinsime/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 1.95 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

adaptaware/ragit

A RAG back and front end application

Language: Python - Size: 985 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

l1m1nal/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 0 Bytes - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

islam-bld/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Language: JavaScript - Size: 2.93 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

genisis2025/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Language: JavaScript - Size: 2.93 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

cr4yfish/docling-js

Parsing documents to one datatype (TypeScript port of Docling) (NOT STARTED!)

Size: 23.4 KB - Last synced at: 4 days ago - Pushed at: 10 months ago - Stars: 1 - Forks: 0

IlyaFerens/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

akii2423/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 0 Bytes - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Feysis/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

shahidmanzoor1/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

bdbhaislive/Adobe-Acrobat

Adobe Acrobat is a powerful PDF solution for creating, editing, managing, and securing documents

Size: 3.91 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

papercast-dev/papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.

Language: Python - Size: 218 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 48 - Forks: 1

JaspreetSingh-exe/MedXpert-Backend-FastAPI

AI-powered medical report analyzer that extracts text from PDFs/images, summarizes reports, detects abnormalities, and provides a chatbot for medical queries. Built with FastAPI, OCR (Tesseract, pdfplumber), OpenAI GPT-3.5, and deployed on Google Cloud. Future enhancements include medical image classification and predictions. Contributions Welcome!

Language: Python - Size: 56.6 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

JaspreetSingh-exe/MedXpert-FrontEnd

MedXpert is an Android-based healthcare application that leverages OCR (Tesseract, pdfplumber) and LLMs (OpenAI GPT-3.5) to automate medical report extraction, abnormality detection, and natural language summarization. It features Firebase-powered user authentication, role-based access control, and real-time chatbot integration for medical queries.

Language: Kotlin - Size: 2.98 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

Academic-Hammer/SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

Language: Python - Size: 47.9 KB - Last synced at: 5 months ago - Pushed at: about 5 years ago - Stars: 360 - Forks: 58

sushantnair/arxiv_extractor

Converts PDF research papers to clean text files, skipping images and tables.

Language: Python - Size: 7.81 KB - Last synced at: 2 months ago - Pushed at: 8 months ago - Stars: 1 - Forks: 0

bytescout/pdf-extractor-sdk-samples

ByteScout PDF Extractor SDK source code samples

Language: C# - Size: 27.5 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 8 - Forks: 5

andrealenzi11/py-poppleract

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents

Language: Python - Size: 202 KB - Last synced at: 5 months ago - Pushed at: 11 months ago - Stars: 10 - Forks: 2

davibusanello/pdf2txt

A simple CLI to convert PDF files into TXT using OCR

Language: Python - Size: 23.4 KB - Last synced at: 3 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

arjun-mavonic/scanned-pdf-text-extractor

This is a Python application that converts non-readable PDF files, such as scanned documents, into readable Word documents. It achieves this by first converting the PDF files into images and then extracting the text from the images to create the Word documents. The application provides a user-friendly interface to do the above task.

Language: Python - Size: 28.3 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 3 - Forks: 2

Clearedge-AI/clearedge

Build a RAG preprocessing pipeline

Language: Jupyter Notebook - Size: 24.7 MB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 11 - Forks: 0

KOUISAmine/pdf-tools

A collection of PDF tools to convert, merge, and compress PDFs. Free & No installation.

Size: 2.93 KB - Last synced at: 6 months ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

AlexTkDev/PDF-to-Word-Conversion

A parser that retypes text from a PDF into an MS Word document according to given specifications

Language: Python - Size: 35.2 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

AliAlWahayb/Arabic_OCR_From_PDF (fork of zaakki-ahamed/Arabic_OCR_From_PDF)

Perform Optical Character Recognition (OCR) on a scanned PDF file containing Arabic text and output a searchable PDF

Language: Python - Size: 197 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

CllsPy/PyPTE

The PDF Text Extractor API allows users to upload PDF files and receive the extracted text from those files. This API is built using FastAPI and leverages the PyMuPDF library for efficient text extraction.

Language: Python - Size: 11.7 KB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

orijtech/tikago

Apache Tika adapter in Go

Language: Go - Size: 48 MB - Last synced at: 4 days ago - Pushed at: over 8 years ago - Stars: 1 - Forks: 0

renan-siqueira/python-pdf-tool

This project extracts text from PDF files using various Python libraries. It is designed to be flexible, letting you choose among different text extraction libraries and supporting both a single PDF file and a directory containing multiple PDF files.

Language: Python - Size: 7.81 KB - Last synced at: 5 months ago - Pushed at: almost 2 years ago - Stars: 5 - Forks: 1

AshkanAbd/pdf2word-GUI

Convert PDF to Word

Language: Java - Size: 18.6 MB - Last synced at: 5 months ago - Pushed at: almost 7 years ago - Stars: 9 - Forks: 6

gabriel-batistuta/pdf-to-any

A simple, functional multi-format conversion system built on a number of Python libraries

Language: Python - Size: 36.1 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

madnight/pdf-layout-text-stripper (fork of JonathanLink/PDFLayoutTextStripper)

Converts a PDF file into a text file while keeping the layout of the original PDF.

Language: Java - Size: 1.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 7 - Forks: 4

nainiayoub/pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

Language: Python - Size: 24.4 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 69 - Forks: 42

fabriziomiano/pdf2txt-azure-ocr

A script to convert PDF files to TXT

Language: Python - Size: 8.79 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

asika32764/php-pdf-2-text

Simple PHP PDF to Text class

Language: PHP - Size: 204 KB - Last synced at: 15 days ago - Pushed at: almost 2 years ago - Stars: 24 - Forks: 17

princebhatt9588/Versatile_Code_Hub

VersatileCodeHub: Your one-stop repository for an array of coding projects. Explore diverse applications, from games like Flappy Bird to tools like QRCode Scanners. Expand your skills across various domains, all in one place.

Language: Python - Size: 4.98 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 1 - Forks: 1

bytescout/pdfco-rails

PDF.co Gem plugin for Ruby on Rails

Language: Ruby - Size: 13.7 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 1

Kamaruddheen/document-scanner

Extract structured text and data from documents like invoices, book pages, tables, etc., using OpenCV and Tesseract OCR

Language: HTML - Size: 68.7 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mic-kul/pdf-textstream

JRuby gem to convert PDF to text while keeping the layout of the original PDF file

Language: Java - Size: 3.59 MB - Last synced at: 5 days ago - Pushed at: over 7 years ago - Stars: 8 - Forks: 0

mehmet-kozan/pdf-parse

Pure JavaScript cross-platform module to extract text from PDFs.

Language: JavaScript - Size: 7.78 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ExceptedPrism3/PDFToAudio

"PDF To Audio" is a Python tool that transforms PDF documents into audio files using OCR and Text-to-Speech technology. Ideal for accessibility and auditory learning, it supports multiple languages, parallel processing, and smart rate limit handling.

Language: Python - Size: 2.81 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

53buahapel/pdf-to-text-converter

Python script that I made to convert PDF to text

Language: Python - Size: 1000 Bytes - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

iditectweb/converter

Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies; converts PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT in the .NET Framework

Language: C# - Size: 10.7 KB - Last synced at: almost 2 years ago - Pushed at: almost 7 years ago - Stars: 31 - Forks: 12

aishwarya-art/Pdf-to-text-extract

Sample code for PDF-to-text extraction using the PDF Parser library in CodeIgniter 3

Language: PHP - Size: 2.93 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

dongju93/extract-ti-from-reports

Convert PDFs to text, then transform that text into structured JSON objects for Threat Intelligence.

Language: Jupyter Notebook - Size: 134 MB - Last synced at: 6 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

datalogics/adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

Size: 43.3 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 77 - Forks: 62

LuisAraujo/API-Tabua-Mare

API for obtaining daily Tide Table data, using web scraping with PHP.

Language: JavaScript - Size: 1.28 MB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 12 - Forks: 7

Directorman9/Optical-character-recognition

The notebook in this repository uses pytesseract to extract text from a PDF document. The script can be used to automate text acquisition from a large body of printed resources such as books. The acquired text can then be used for downstream tasks, such as training language models, topic models, document summarization, etc.

Size: 1000 Bytes - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0