GitHub topics: pdf-text-extraction

Repositories

hyeonsangjeon/PDF2LLM-Tuning-Studio

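A solution that automatically generates high-quality question-answering (QA) data from PDF documents with GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models with the LoRA technique.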

Language: Jupyter Notebook - Size: 1.03 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 6 - Forks: 1

nsourlos/OCR_and_RAG

Tests of OCR and RAG with LLMs

Language: Jupyter Notebook - Size: 21.5 KB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

kushalpatel0265/Resume-Parser

A resume parser that extracts key details from PDF files using Groq's LLM

Language: Jupyter Notebook - Size: 239 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 219 KB - Last synced at: 1 day ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

eli64s/pdflex

CLI for merging PDF contexts.

Language: Python - Size: 465 KB - Last synced at: 20 days ago - Pushed at: 6 months ago - Stars: 3 - Forks: 0

rmottanet/unchainedtext

UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.

Language: Python - Size: 31.3 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

VirajMadhu/pdf_key_matcher

Highlights the key matches between a given PDF and the description text

Language: Python - Size: 19.5 KB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 2 - Forks: 0

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

Language: Python - Size: 87.9 KB - Last synced at: 5 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

towfique-elahe/pdf-to-structured-csv

A Python-based tool for extracting structured data from PDFs using OCR and regex, and exporting it to CSV. Ideal for processing invoices, logs, or scanned documents into organized, usable datasets.

Language: Jupyter Notebook - Size: 27.3 KB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 0

Zeeshanahmad4/NLP-Pdf-Minning-Extracting-text-from-pdf

NLP PDF mining: extracting text from PDF

Language: Python - Size: 2.86 MB - Last synced at: 5 months ago - Pushed at: over 5 years ago - Stars: 3 - Forks: 1

PrathameshDhande22/PdfTxtBot

A Telegram bot that extracts text from PDFs and also extracts images of PDF pages. Made with Python.

Language: Python - Size: 12.7 KB - Last synced at: 5 months ago - Pushed at: over 2 years ago - Stars: 4 - Forks: 2

RealBlueSwan/BSPDFDataExtractor

Extracts data from a provided PDF, using keywords to identify relevant data points. Uses UglyToad PdfPig (great lib btw).

Language: C# - Size: 7.8 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

mamiriqbal1/rag_book_qa_prompt

A simple demonstration of how you can implement retrieval augmented generation (RAG) for a book.

Language: Jupyter Notebook - Size: 12.3 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 1 - Forks: 0

vijayengineer/PDFTextSpeechConverter

Converts scanned and ordinary documents into MP3 speech using Amazon Polly

Language: Python - Size: 1.18 MB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 4 - Forks: 1

Spikes2012/DjangoBusPriority

This is for the Technology Application Project at Swinburne University of Technology

Language: Python - Size: 249 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

houking-can/PDFSDK

A Python interface based on the Foxit Quick PDF Library

Language: Python - Size: 8.27 MB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 8 - Forks: 2

Related Keywords

pdf-text-extraction (16), text-extraction (5), python (5), data-extraction (4), ocr (3), pdf (3), pdf-document-processor (3), llm (3), python3 (2), document-processing (2), rag (2), text-processing (2), content-extraction (2), pdf-converter (2), pdf-to-csv (1), yaml-configuration (1), web-crawling (1), pdf2image (1), pytesseract (1), python-automation (1), regex-parsing (1), structured-data-extraction (1), extract-text (1), pdf-files (1), terminal-based (1), text-compression (1), virajmadhu (1), concurrent-crawling (1), data-collection (1), data-extraction-pipeline (1), data-preservation-and-recovery (1), data-scraping (1), error-handling (1), html-parsing (1), http-requests (1), metadata-storage (1), modular-design (1), python-crawler (1), rate-limiting (1), structured-data-storage (1), url-normalization (1), question-answering (1), retrieval-augmented-generation (1), audiobook (1), aws-polly (1), images (1), scanned-documents (1), speech (1), synthesis (1), text (1), django (1), file-upload (1), image-to-text (1), webapplication (1), pdf-merge (1), pdf-sdk (1), pdf-split (1), pdf-format (1), pdfcon (1), pdfkit (1), pdftohtml (1), pdftoimage (1), pdftools (1), pdftotext (1), image-extractor (1), pdf-image (1), pdf-text (1), python-telegram (1), python-telegram-bot (1), telegram (1), telegram-bot (1), chatgpt-web (1), large-language-models (1), document-parsing (1), document-analysis (1), streamlit-webapp (1), nlp (1), google-colab (1), api (1), qwen2-vl (1), openai (1), mistral (1), information-retrieval (1), gemini (1), colpali (1), cohere (1), unstructured (1), unsloth (1), text-disti (1), sagemaker (1), processing-job (1), processing (1), pdf-generation (1), gpu (1), finetuning (1), docker (1), distillation (1), data-argumantation (1), cuda (1), claude (1)
