An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: content-extraction

pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 440 KB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 4 - Forks: 2

vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

Language: TypeScript - Size: 866 KB - Last synced at: 1 day ago - Pushed at: 3 days ago - Stars: 9 - Forks: 1

mendableai/firecrawl-mcp-server

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

Language: JavaScript - Size: 337 KB - Last synced at: 6 days ago - Pushed at: 14 days ago - Stars: 3,408 - Forks: 324

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

Language: TypeScript - Size: 625 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 301 - Forks: 34

suebksnn/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 353 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

DeepKariaX/Analysis-Alpaca-Researcher

🦙 Production-ready MCP server for comprehensive research and analysis with web + academic search, content extraction, and optional React web interface

Language: Python - Size: 218 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

JustM3Sunny/AI_WEB_INFO_RETRIVAL

AI-powered web info retriever that performs real-time search, intelligent content extraction, and Gemini-based summarization with CLI & Python support.

Language: Python - Size: 102 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

baughmann/tikara

The metadata and text content extractor for almost every file type.

Language: Python - Size: 161 MB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 2 - Forks: 0

mohammad6706/export-data-url

Language: Python - Size: 7.81 KB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 0 - Forks: 0

currentslab/extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

Language: HTML - Size: 421 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 276 - Forks: 24

dust-ai-mr/dust-html

Dust library for html processing

Language: Java - Size: 86.9 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

oiwn/dom-content-extraction

DOM Based Content Extraction via Text Density

Language: Rust - Size: 1.29 MB - Last synced at: 7 days ago - Pushed at: about 1 month ago - Stars: 31 - Forks: 2

spences10/mcp-jinaai-reader

πŸ” Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

Language: JavaScript - Size: 116 KB - Last synced at: 1 day ago - Pushed at: 2 months ago - Stars: 28 - Forks: 4

Prakashmaheshwaran/docscraperforai

A Python library for scraping technical documentation with ease: supports single-page and domain-wide crawling, multiple output formats (Markdown, JSON, TXT), and proxy-friendly setups. Ideal for AI, research, and content generation workflows.

Language: Python - Size: 18.6 KB - Last synced at: 3 days ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

shyndman/defuddler

Defuddler is a CLI tool for extracting the content of web pages and articles, and leaving the noisy aggravations behind. Features multiple output formats, browser preview, and customizable user-agent options. Wraps the excellent kepano/defuddle tool.

Language: TypeScript - Size: 319 KB - Last synced at: 7 days ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

01one/wordpress-content-extractor

WordPress Content Extractor: XML to Structured Text Converter

Language: Python - Size: 4.88 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

pinkpixel-dev/prysm

Prysm is a blazing-smart Puppeteer-based web scraper that doesn't just extract - it understands structure. Capable of scraping virtually any website with intelligent content detection and 14 specialized scroll strategies that adapt to different page layouts, Prysm excels at extracting content that other scrapers miss.

Language: JavaScript - Size: 1.16 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

rithulkamesh/docproc

Opinionated and Sophisticated Document Region Analyzer.

Language: Python - Size: 219 KB - Last synced at: 4 days ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

helioLJ/youtube-transcript-copier

Chrome extension to copy YouTube transcripts with AI-friendly features

Language: JavaScript - Size: 2.29 MB - Last synced at: 17 days ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

leroyanders/acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

Language: Python - Size: 23.4 KB - Last synced at: 2 months ago - Pushed at: over 1 year ago - Stars: 5 - Forks: 1

mlibre/Deep-Truth

DeepTruth is your ultimate research buddy 🤖 that uses next-gen AI (via Ollama and Google Generative AI) to dig deep, extract exact quotes, and stitch them into a narrative. No fluff, just facts! 🔍🚀

Language: JavaScript - Size: 10.7 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Solrikk/DataDigger

DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.

Language: Go - Size: 38.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 4 - Forks: 0

Aish-p/WebScraperAPI

WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.

Language: Python - Size: 287 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

pdfix/pdfix_sdk_example_npm

Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

Language: JavaScript - Size: 882 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

newben420/gdelt_utility

A web-based utility for fetching, categorizing, summarizing and managing global news and articles using the GDELT 2.0 API. Designed for content creators, news aggregators, and researchers, this tool simplifies access to up-to-date articles with an intuitive UI and customizable configurations.

Language: JavaScript - Size: 1.77 MB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 2 - Forks: 0

amirthfultehrani/Youtube-Transcript-Copier

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

Language: JavaScript - Size: 1.17 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 1 - Forks: 0

pdfix/pdfix_sdk_example_cpp

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

Language: C++ - Size: 21.4 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 20 - Forks: 4

peremenov/seize πŸ“¦

Seize is a light Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

Language: HTML - Size: 11.2 MB - Last synced at: 27 days ago - Pushed at: about 8 years ago - Stars: 12 - Forks: 1

gregors/boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

Language: Ruby - Size: 240 KB - Last synced at: 8 days ago - Pushed at: over 4 years ago - Stars: 43 - Forks: 5

simonpierreboucher/Crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

Language: Python - Size: 87.9 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

mvasilkov/readability2

Readability2 converts HTML to plain text.

Language: TypeScript - Size: 106 KB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 109 - Forks: 15

minarc/godensity

This repository is an implementation of 📄 DOM-based content extraction via text density. Tested on Korean web pages.

Language: Go - Size: 2.7 MB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 5 - Forks: 0

midstreeeam/peduncle πŸ“¦

content extraction from html

Language: Python - Size: 22.5 KB - Last synced at: 20 days ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

gdamdam/sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

Language: Python - Size: 34.2 KB - Last synced at: 2 months ago - Pushed at: over 6 years ago - Stars: 20 - Forks: 5

nikitautiu/learnhtml

Web content extraction using machine learning

Language: HTML - Size: 29.7 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 32 - Forks: 9

TypesetIO/jsuite

Tools for parsing and manipulating JATS XML documents.

Language: Python - Size: 20.5 KB - Last synced at: 8 days ago - Pushed at: almost 3 years ago - Stars: 3 - Forks: 2

LandWhale2/TD-Spider

Simple web crawler in Go, based on text density

Language: Go - Size: 4.88 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 13 - Forks: 0

tuffstuff9/nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

Language: TypeScript - Size: 44.9 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 17 - Forks: 2

bencmc/youtube_video_summarizer

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

Language: Python - Size: 744 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

rmwkwok/crawler

Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.

Language: Python - Size: 85.9 KB - Last synced at: almost 2 years ago - Pushed at: over 4 years ago - Stars: 2 - Forks: 1

HarryDulaney/news-feed-scraper

Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.

Language: Java - Size: 6.57 MB - Last synced at: 4 months ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

SbstnErhrdt/node-readability

Simple node server to extract relevant content from website source code using Mozilla's Readability.js

Language: JavaScript - Size: 14.6 KB - Last synced at: 12 days ago - Pushed at: over 4 years ago - Stars: 3 - Forks: 0

timoteostewart/benson

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

Language: Python - Size: 144 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 11 - Forks: 0

zeoagency/mobile-first-indexing-tool

Mobile First Indexing Tool

Language: Python - Size: 43.6 MB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 10 - Forks: 2

sebischair/LowestCommonAncestorExtractor

A Python content extraction library for the structured extraction of Terms and Conditions from German and English online shops

Language: Python - Size: 19.5 KB - Last synced at: 5 months ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 2

pdfix/pdfix_sdk_example_node_js

Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

Language: JavaScript - Size: 329 KB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 2 - Forks: 0

thorkill/dbce

Diff Based Content Extraction is a part of my Bachelor Thesis: Joint Approach to Boilerplate Detection in Web Archives

Language: HTML - Size: 5.53 MB - Last synced at: about 2 years ago - Pushed at: about 8 years ago - Stars: 1 - Forks: 1

masud-technope/ContentSuggest-Replication-Package-CASCON2015

Recommending Relevant Sections from a Webpage About Programming Errors and Exceptions

Language: Hack - Size: 4.59 MB - Last synced at: about 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

KunlinY/DistributedCrawlSystem

Distributed crawler system

Language: Java - Size: 109 MB - Last synced at: over 2 years ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 3