GitHub topics: web-crawler | Ecosyste.ms: Repos

Synterraapt/Https-Site-Screaper-Social-Commerce-Court-Web

A powerful web scraping tool designed to extract data from HTTPS-secured websites, including social media, e-commerce platforms, and court records, for analysis, monitoring, or research purposes.

Language: C# - Size: 1.82 MB - Last synced at: about 3 hours ago - Pushed at: about 3 hours ago - Stars: 3 - Forks: 0

scrapfly/scrapfly-scrapers

Scalable Python web scraping scripts for 40+ popular domains

Language: Python - Size: 4.93 MB - Last synced at: about 11 hours ago - Pushed at: 7 days ago - Stars: 540 - Forks: 129

platonai/PulsarRPA

PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖

Language: Kotlin - Size: 30.4 MB - Last synced at: about 18 hours ago - Pushed at: about 23 hours ago - Stars: 883 - Forks: 128

exbomkimaa/Google-News-scraper

Scraper for Google News articles with headline extraction, keyword targeting, and proxy support.

Language: Python - Size: 8.79 KB - Last synced at: about 22 hours ago - Pushed at: about 23 hours ago - Stars: 0 - Forks: 0

kelvinweijun/AI-Powered-Search-Engine

AI-powered search engine that uses FAISS and DenseNet-50 for both text and reverse image search. Comes with an asynchronous web crawler.

Language: Python - Size: 617 KB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 0 - Forks: 0

mendableai/firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Language: TypeScript - Size: 57.5 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 40,225 - Forks: 3,756

apify/crawlee

Crawlee: a web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Language: TypeScript - Size: 140 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 17,966 - Forks: 839

GreenMeeple/MensaarLecker

A fully automated scraper and static website for the Saarbrücken Mensa, powered by Python, Selenium, Google Sheets, and GitHub Actions.

Language: HTML - Size: 1.13 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

a-b-z-b/web-spider

A Humble Web Crawler

Language: Go - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Palioboro/Argus

Modern C library for command-line argument parsing. Advanced features: subcommands, validation, multi-inputs, environment variables

Language: C - Size: 1.65 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

mendableai/firecrawl-app-examples

🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.

Language: Jupyter Notebook - Size: 13.6 MB - Last synced at: 2 days ago - Pushed at: 18 days ago - Stars: 416 - Forks: 111

KhoaLon/scrapedo-scrapers

Web scraping scripts with Scrape.do 😎

Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

Saptha-Harsh/LilHomie

A machine learning project implemented from scratch, involving web scraping, data engineering, exploratory data analysis, and machine learning to predict housing prices in the New York Tri-State Area.

Language: Jupyter Notebook - Size: 2.22 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 440 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 2

apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Language: Python - Size: 28.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,737 - Forks: 391

hyperplasma/hyfetcher

A high-performance web content downloader and localizer built with Rust. Leverages Rust's powerful concurrency to efficiently batch download web pages and save them as local files.

Language: Rust - Size: 37.1 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

ScrapeGraphAI/scrapegraph-sdk

🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.

Language: Jupyter Notebook - Size: 6.64 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 54 - Forks: 8

crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

Language: Go - Size: 23.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,769 - Forks: 1,837

hominee/dyer

Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.

Language: Rust - Size: 75 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 126 - Forks: 7

Arman171/WebForensic

WebForensicAnalyzer is an advanced all-in-one tool for web reconnaissance, digital forensics, OSINT, and cybersecurity professionals. It automates deep website analysis—leveraging Shodan, Nmap, and more—to detect vulnerabilities, extract data, and deliver structured forensic results

Language: Python - Size: 3.05 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 2 - Forks: 1

webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

Language: TypeScript - Size: 53.1 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 805 - Forks: 104

gosom/scrapemate

Golang crawling and scraping framework

Language: Go - Size: 193 KB - Last synced at: 1 day ago - Pushed at: about 2 months ago - Stars: 126 - Forks: 17

mamy2008/BrokenLinkFinder

Broken Link Finder is a straightforward Python CLI tool that helps users spot broken links on websites, improving SEO and user experience. With features like deep site crawling and URL normalization, it ensures efficient and accurate audits. 🐙🔗

Language: Python - Size: 13.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

cloudy-sfu/Compare-electricity-plans-NZ

Compare electricity fees across different electricity plans in New Zealand, based on personal electricity usage history

Language: Python - Size: 622 KB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0

platonai/PulsarRPAPro

Fully automated and hands-free, accurately extracting and understanding web content — powered by machine learning agents.

Language: Kotlin - Size: 24.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 119 - Forks: 27

jpjacobpadilla/SearchAI

Google Search tool with advanced filters and LLM-friendly outputs!

Language: Python - Size: 562 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19 - Forks: 0

amelia05-spec/crowdfunding-real-estate-scrapy

This project is a powerful and extensible scrapy-based crawler designed to extract and aggregate data from multiple real estate crowdfunding platforms. Ideal for investors, analysts and researchers interested in tracking investment opportunities, platform performance and market trends

Language: Python - Size: 31.3 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

devflowinc/firecrawl-simple Fork of mendableai/firecrawl

➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.

Language: TypeScript - Size: 40 MB - Last synced at: 7 days ago - Pushed at: 28 days ago - Stars: 470 - Forks: 37

MarginaliaSearch/MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

Language: HTML - Size: 15.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,375 - Forks: 33

kaioobrabo/mcp-client-server

An MCP Server that's also an MCP Client. Useful for letting Claude develop and test MCPs without needing to reset the application.

Language: TypeScript - Size: 141 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 4 - Forks: 2

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

Language: Python - Size: 15.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 19,984 - Forks: 1,699

internetarchive/Zeno

State-of-the-art web crawler 🔱

Language: Go - Size: 2.78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 178 - Forks: 34

ScrapeGraphAI/scrapegraph-mcp

ScrapeGraph MCP Server

Language: Python - Size: 200 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 4

mendableai/firecrawl-mcp-server

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

Language: JavaScript - Size: 337 KB - Last synced at: 8 days ago - Pushed at: 16 days ago - Stars: 3,408 - Forks: 324

apache/stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

Language: Java - Size: 7.28 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 915 - Forks: 265

kan01234/ur-web-spider

Web spider that scans UR for available rooms and outputs the results as CSV

Language: Python - Size: 53.8 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 6 - Forks: 1

Harvey-AU/blue-banded-bee

Cache warming app that crawls sites efficiently.

Language: Go - Size: 185 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

Language: TypeScript - Size: 625 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 301 - Forks: 34

jepsh/jepsh-ssg

A static site generator for modern web frameworks with route crawling and hydration support

Language: JavaScript - Size: 81.1 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 2 - Forks: 1

crwlrsoft/crawler

Library for Rapid (Web) Crawler and Scraper Development

Language: PHP - Size: 1.02 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 364 - Forks: 13

Debajyoti0-0/TriNetra

TriNetra is a fast web recon tool that uncovers hidden endpoints, API keys, and tokens — built for bug hunters and OSINT pros with Tor support and rich CLI output.

Language: Python - Size: 102 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

ialejandro/crowdfunding-real-estate-scrapy

This project is a powerful and extensible scrapy-based crawler designed to extract and aggregate data from multiple real estate crowdfunding platforms. Ideal for investors, analysts and researchers interested in tracking investment opportunities, platform performance and market trends

Language: Python - Size: 70.3 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

cloudy-sfu/Web-crawler-chorus-outage

Record Internet outage data in New Zealand

Language: Python - Size: 21.5 KB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

abaykan/CrawlBox 📦

Easy way to brute-force web directories.

Language: Python - Size: 71.3 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 153 - Forks: 40

MultiX0/froxy

🕸️ Froxy – A chill open-source web indexing engine built with Go, Node.js, and Next.js. Crawls, analyzes, and serves structured web data with TF-IDF magic and Supabase as the brain.

Language: TypeScript - Size: 725 KB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

laurentvv/crawl4ai-mcp

Web crawling tool that integrates with AI assistants via the MCP

Language: Python - Size: 64.5 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 11 - Forks: 5

HHN/crawler4j Fork of yasserg/crawler4j

Open Source Web Crawler for Java - A fork of yasserg/crawler4j

Language: Java - Size: 2 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 27 - Forks: 7

suebksnn/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 353 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

nicksherron/proxi 📦

Proxy pool. Finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.

Language: Go - Size: 1.09 MB - Last synced at: 4 days ago - Pushed at: about 5 years ago - Stars: 34 - Forks: 4

crawler-commons/crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java - Size: 3.73 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 244 - Forks: 80

sgogo0228/news-crawler-pipeline

Fetch news on Facebook and Google for content posting and internal reporting.

Language: Python - Size: 9.63 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

apache/nutch

Apache Nutch is an extensible and scalable web crawler

Language: Java - Size: 132 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 3,030 - Forks: 1,253

privacy-tech-lab/gpc-web-crawler

GPC Web Crawler for detecting websites' compliance with GPC privacy preference signals at scale

Language: Python - Size: 106 MB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 6 - Forks: 3

Norconex/crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Language: Java - Size: 15.6 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 188 - Forks: 69

privacy-tech-lab/privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer

Language: JavaScript - Size: 529 MB - Last synced at: 6 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

Yashh56/web-crawler

A high-performance, feature-rich web crawler built in Go with real-time CLI visualization.

Language: Go - Size: 4.84 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

dmytrochumakov/web-crawler

web-crawler

Language: Go - Size: 10.7 KB - Last synced at: 5 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Language: Go - Size: 56.6 KB - Last synced at: 16 days ago - Pushed at: about 5 years ago - Stars: 262 - Forks: 41

code-418-dpr/SportHub-parser

Parser for the Unified Calendar Plan (EKP) PDF file of the Russian Ministry of Sport, for the SportHub project

Language: Python - Size: 4.1 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

anlp-team/LTI_Neural_Navigator

"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang

Language: HTML - Size: 32.3 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 45 - Forks: 3

BHM-Bob/BA_PY

BA_PY: Optimize Your Workflow with Python!

Language: Python - Size: 2 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 2 - Forks: 1

bAndie91/tools

all-in collection of productivity scripts, CLI tools, utility libraries, fuse filesystems, and also some stuff

Language: Perl - Size: 1.42 MB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 18 - Forks: 2

sjdirect/abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.

Language: C# - Size: 17 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 134 - Forks: 23

scrape-do/scrapedo-scrapers

Web scraping examples with Scrape.do 😎

Language: Python - Size: 38.1 KB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 6 - Forks: 0

darsan-in/Job-Crawler

The Job Crawler is an integral component of the Job RAID project, designed to automatically scrape and collect data from various job listing websites. This crawler enables Job RAID to aggregate comprehensive job listings, ensuring that users have access to up-to-date and relevant job opportunities.

Language: Python - Size: 6.83 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 6 - Forks: 0