GitHub topics: content-extraction

Repositories

pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 440 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 4 - Forks: 2

vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

Language: TypeScript - Size: 866 KB - Last synced at: 1 day ago - Pushed at: 3 days ago - Stars: 9 - Forks: 1

mendableai/firecrawl-mcp-server

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

Language: JavaScript - Size: 337 KB - Last synced at: 6 days ago - Pushed at: 14 days ago - Stars: 3,408 - Forks: 324

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

Language: TypeScript - Size: 625 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 301 - Forks: 34

suebksnn/web-scout-mcp

Language: JavaScript - Size: 353 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

DeepKariaX/Analysis-Alpaca-Researcher

🦙 Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface

Language: Python - Size: 218 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

JustM3Sunny/AI_WEB_INFO_RETRIVAL

AI-powered web info retriever that performs real-time search, intelligent content extraction, and Gemini-based summarization with CLI & Python support.

Language: Python - Size: 102 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

baughmann/tikara

The metadata and text content extractor for almost every file type.

Language: Python - Size: 161 MB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

mohammad6706/export-data-url

Language: Python - Size: 7.81 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

currentslab/extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

Language: HTML - Size: 421 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 276 - Forks: 24

dust-ai-mr/dust-html

Dust library for HTML processing

Language: Java - Size: 86.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

oiwn/dom-content-extraction

DOM-Based Content Extraction via Text Density

Language: Rust - Size: 1.29 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 2

spences10/mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

Language: JavaScript - Size: 116 KB - Last synced at: 1 day ago - Pushed at: 2 months ago - Stars: 28 - Forks: 4

Prakashmaheshwaran/docscraperforai

A Python library for scraping technical documentation with ease — supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

Language: Python - Size: 18.6 KB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

shyndman/defuddler

Defuddler is a CLI tool for extracting the content of web pages and articles, and leaving the noisy aggravations behind. Features multiple output formats, browser preview, and customizable user-agent options. Wraps the excellent kepano/defuddle tool.

Language: TypeScript - Size: 319 KB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

01one/wordpress-content-extractor

WordPress Content Extractor: XML to Structured Text Converter

Language: Python - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

pinkpixel-dev/prysm

Prysm is a blazing-smart Puppeteer-based web scraper that doesn't just extract - it understands structure. Capable of scraping virtually any website with intelligent content detection and 14 specialized scroll strategies that adapt to different page layouts, Prysm excels at extracting content that other scrapers miss.

Language: JavaScript - Size: 1.16 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 219 KB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

helioLJ/youtube-transcript-copier

Chrome extension to copy YouTube transcripts with AI-friendly features

Language: JavaScript - Size: 2.29 MB - Last synced at: 17 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

leroyanders/acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

Language: Python - Size: 23.4 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

mlibre/Deep-Truth

DeepTruth is your ultimate research buddy 🤖 that uses next-gen AI (via Ollama and Google Generative AI) to dig deep, extract exact quotes, and stitch them into a narrative. No fluff, just facts! 🔍🚀

Language: JavaScript - Size: 10.7 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Solrikk/DataDigger

DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.

Language: Go - Size: 38.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Aish-p/WebScraperAPI

WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.

Language: Python - Size: 287 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

pdfix/pdfix_sdk_example_npm

Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

Language: JavaScript - Size: 882 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

newben420/gdelt_utility

A web-based utility for fetching, categorizing, summarizing and managing global news and articles using the GDELT 2.0 API. Designed for content creators, news aggregators, and researchers, this tool simplifies access to up-to-date articles with an intuitive UI and customizable configurations.

Language: JavaScript - Size: 1.77 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

amirthfultehrani/Youtube-Transcript-Copier

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

Language: JavaScript - Size: 1.17 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

pdfix/pdfix_sdk_example_cpp

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

Language: C++ - Size: 21.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 20 - Forks: 4

peremenov/seize 📦

Seize is a light Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

Language: HTML - Size: 11.2 MB - Last synced at: 27 days ago - Pushed at: about 8 years ago - Stars: 12 - Forks: 1

gregors/boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

Language: Ruby - Size: 240 KB - Last synced at: 8 days ago - Pushed at: over 4 years ago - Stars: 43 - Forks: 5

simonpierreboucher/Crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

Language: Python - Size: 87.9 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0