GitHub topics: content-extraction
pinkpixel-dev/web-scout-mcp
A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.
Language: JavaScript - Size: 440 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 4 - Forks: 2

vakharwalad23/mark-minion
The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.
Language: TypeScript - Size: 866 KB - Last synced at: 1 day ago - Pushed at: 3 days ago - Stars: 9 - Forks: 1

mendableai/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
Language: JavaScript - Size: 337 KB - Last synced at: 6 days ago - Pushed at: 14 days ago - Stars: 3,408 - Forks: 324

graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
Language: TypeScript - Size: 625 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 301 - Forks: 34

suebksnn/web-scout-mcp
A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.
Language: JavaScript - Size: 353 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

DeepKariaX/Analysis-Alpaca-Researcher
Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface
Language: Python - Size: 218 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

JustM3Sunny/AI_WEB_INFO_RETRIVAL
AI-powered web info retriever that performs real-time search, intelligent content extraction, and Gemini-based summarization with CLI & Python support.
Language: Python - Size: 102 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

baughmann/tikara
The metadata and text content extractor for almost every file type.
Language: Python - Size: 161 MB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

mohammad6706/export-data-url
Language: Python - Size: 7.81 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

currentslab/extractnet
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
Language: HTML - Size: 421 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 276 - Forks: 24

dust-ai-mr/dust-html
Dust library for HTML processing
Language: Java - Size: 86.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

oiwn/dom-content-extraction
DOM Based Content Extraction via Text Density
Language: Rust - Size: 1.29 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 2

spences10/mcp-jinaai-reader
Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
Language: JavaScript - Size: 116 KB - Last synced at: 1 day ago - Pushed at: 2 months ago - Stars: 28 - Forks: 4

Prakashmaheshwaran/docscraperforai
A Python library for scraping technical documentation with ease. Supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.
Language: Python - Size: 18.6 KB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

shyndman/defuddler
Defuddler is a CLI tool for extracting the content of web pages and articles, and leaving the noisy aggravations behind. Features multiple output formats, browser preview, and customizable user-agent options. Wraps the excellent kepano/defuddle tool.
Language: TypeScript - Size: 319 KB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

01one/wordpress-content-extractor
WordPress Content Extractor: XML to Structured Text Converter
Language: Python - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

pinkpixel-dev/prysm
Prysm is a blazing-smart Puppeteer-based web scraper that doesn't just extract - it understands structure. Capable of scraping virtually any website with intelligent content detection and 14 specialized scroll strategies that adapt to different page layouts, Prysm excels at extracting content that other scrapers miss.
Language: JavaScript - Size: 1.16 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
Language: Python - Size: 219 KB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

helioLJ/youtube-transcript-copier
Chrome extension to copy YouTube transcripts with AI-friendly features
Language: JavaScript - Size: 2.29 MB - Last synced at: 17 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

leroyanders/acrticle-scrapper
This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…
Language: Python - Size: 23.4 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

mlibre/Deep-Truth
DeepTruth is your ultimate research buddy that uses next-gen AI (via Ollama and Google Generative AI) to dig deep, extract exact quotes, and stitch them into a narrative. No fluff, just facts!
Language: JavaScript - Size: 10.7 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Solrikk/DataDigger
DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.
Language: Go - Size: 38.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Aish-p/WebScraperAPI
WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.
Language: Python - Size: 287 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

pdfix/pdfix_sdk_example_npm
Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Language: JavaScript - Size: 882 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

newben420/gdelt_utility
A web-based utility for fetching, categorizing, summarizing and managing global news and articles using the GDELT 2.0 API. Designed for content creators, news aggregators, and researchers, this tool simplifies access to up-to-date articles with an intuitive UI and customizable configurations.
Language: JavaScript - Size: 1.77 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

amirthfultehrani/Youtube-Transcript-Copier
A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.
Language: JavaScript - Size: 1.17 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

pdfix/pdfix_sdk_example_cpp
Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Language: C++ - Size: 21.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 20 - Forks: 4

peremenov/seize 📦
Seize is a light Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader
Language: HTML - Size: 11.2 MB - Last synced at: 27 days ago - Pushed at: about 8 years ago - Stars: 12 - Forks: 1

gregors/boilerpipe-ruby
Pure Ruby implementation of the Boilerpipe content extraction algorithm tuned for online articles
Language: Ruby - Size: 240 KB - Last synced at: 8 days ago - Pushed at: over 4 years ago - Stars: 43 - Forks: 5

simonpierreboucher/Crawler
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
Language: Python - Size: 87.9 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

mvasilkov/readability2
Readability2 converts HTML to plain text.
Language: TypeScript - Size: 106 KB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 109 - Forks: 15

minarc/godensity
This repository is an implementation of DOM-based content extraction via text density. Tested for Korean web pages.
Language: Go - Size: 2.7 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

midstreeeam/peduncle 📦
Content extraction from HTML
Language: Python - Size: 22.5 KB - Last synced at: 20 days ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

gdamdam/sumo
Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summarization, and more
Language: Python - Size: 34.2 KB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 20 - Forks: 5

nikitautiu/learnhtml
Web content extraction using machine learning
Language: HTML - Size: 29.7 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 32 - Forks: 9

TypesetIO/jsuite
Tools for parsing and manipulating JATS XML documents.
Language: Python - Size: 20.5 KB - Last synced at: 8 days ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 2

LandWhale2/TD-Spider
Simple web crawler in Go, based on text density
Language: Go - Size: 4.88 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 0

tuffstuff9/nextjs-pdf-parser
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
Language: TypeScript - Size: 44.9 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 17 - Forks: 2

bencmc/youtube_video_summarizer
This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.
Language: Python - Size: 744 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

rmwkwok/crawler
Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.
Language: Python - Size: 85.9 KB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 1

HarryDulaney/news-feed-scraper
Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.
Language: Java - Size: 6.57 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

SbstnErhrdt/node-readability
Simple node server to extract relevant content from website source code using Mozilla's Readability.js
Language: JavaScript - Size: 14.6 KB - Last synced at: 12 days ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 0

timoteostewart/benson
Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!
Language: Python - Size: 144 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 11 - Forks: 0

zeoagency/mobile-first-indexing-tool
Mobile First Indexing Tool
Language: Python - Size: 43.6 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 10 - Forks: 2

sebischair/LowestCommonAncestorExtractor
A Python content extraction library for the structured extraction of Terms and Conditions from German and English online shops
Language: Python - Size: 19.5 KB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 2

pdfix/pdfix_sdk_example_node_js
Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Language: JavaScript - Size: 329 KB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 2 - Forks: 0

thorkill/dbce
Diff Based Content Extraction is a part of my Bachelor Thesis: Joint Approach to Boilerplate Detection in Web Archives
Language: HTML - Size: 5.53 MB - Last synced at: about 2 years ago - Pushed at: about 8 years ago - Stars: 1 - Forks: 1

masud-technope/ContentSuggest-Replication-Package-CASCON2015
Recommending Relevant Sections from a Webpage About Programming Errors and Exceptions
Language: Hack - Size: 4.59 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

KunlinY/DistributedCrawlSystem
Distributed crawler system
Language: Java - Size: 109 MB - Last synced at: over 2 years ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 3
