GitHub topics: web-crawler
Synterraapt/Https-Site-Screaper-Social-Commerce-Court-Web
A powerful web scraping tool designed to extract data from HTTPS-secured websites, including social media, e-commerce platforms, and court records, for analysis, monitoring, or research purposes.
Language: C# - Size: 1.82 MB - Last synced at: about 3 hours ago - Pushed at: about 3 hours ago - Stars: 3 - Forks: 0

scrapfly/scrapfly-scrapers
Scalable Python web scraping scripts for 40+ popular domains
Language: Python - Size: 4.93 MB - Last synced at: about 11 hours ago - Pushed at: 7 days ago - Stars: 540 - Forks: 129

platonai/PulsarRPA
PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖
Language: Kotlin - Size: 30.4 MB - Last synced at: about 18 hours ago - Pushed at: about 23 hours ago - Stars: 883 - Forks: 128

exbomkimaa/Google-News-scraper
Scraper for Google News articles with headline extraction, keyword targeting, and proxy support.
Language: Python - Size: 8.79 KB - Last synced at: about 22 hours ago - Pushed at: about 23 hours ago - Stars: 0 - Forks: 0

kelvinweijun/AI-Powered-Search-Engine
AI-powered search engine that uses FAISS and DenseNet-50 for both text and reverse image search. Comes with an asynchronous web crawler.
Language: Python - Size: 617 KB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 0 - Forks: 0

mendableai/firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Language: TypeScript - Size: 57.5 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 40,225 - Forks: 3,756

apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: TypeScript - Size: 140 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 17,966 - Forks: 839

GreenMeeple/MensaarLecker
A fully automated scraper and static website for the Saarbrücken Mensa, powered by Python, Selenium, Google Sheets, and GitHub Actions.
Language: HTML - Size: 1.13 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

a-b-z-b/web-spider
A Humble Web Crawler
Language: Go - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Palioboro/Argus
Modern C library for command-line argument parsing. Advanced features: subcommands, validation, multi-inputs, environment variables
Language: C - Size: 1.65 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

mendableai/firecrawl-app-examples
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
Language: Jupyter Notebook - Size: 13.6 MB - Last synced at: 2 days ago - Pushed at: 18 days ago - Stars: 416 - Forks: 111

KhoaLon/scrapedo-scrapers
Web scraping scripts with Scrape.do 😎
Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

Saptha-Harsh/LilHomie
A Machine Learning Project implemented from scratch which involves web scraping, data engineering, exploratory data analysis and machine learning to predict housing prices in New York Tri-State Area.
Language: Jupyter Notebook - Size: 2.22 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

pinkpixel-dev/web-scout-mcp
A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.
Language: JavaScript - Size: 440 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 2

apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: Python - Size: 28.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,737 - Forks: 391

hyperplasma/hyfetcher
A high-performance web content downloader and localizer built with Rust. Leverages Rust's powerful concurrency to efficiently batch download web pages and save them as local files.
Language: Rust - Size: 37.1 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

ScrapeGraphAI/scrapegraph-sdk
🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.
Language: Jupyter Notebook - Size: 6.64 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 54 - Forks: 8

crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders regardless of language or framework.
Language: Go - Size: 23.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,769 - Forks: 1,837

hominee/dyer
Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
Language: Rust - Size: 75 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 126 - Forks: 7

Arman171/WebForensic
WebForensicAnalyzer is an advanced all-in-one tool for web reconnaissance, digital forensics, OSINT, and cybersecurity professionals. It automates deep website analysis—leveraging Shodan, Nmap, and more—to detect vulnerabilities, extract data, and deliver structured forensic results
Language: Python - Size: 3.05 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 2 - Forks: 1

webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Language: TypeScript - Size: 53.1 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 805 - Forks: 104

gosom/scrapemate
Golang Crawling and scraping framework
Language: Go - Size: 193 KB - Last synced at: 1 day ago - Pushed at: about 2 months ago - Stars: 126 - Forks: 17

mamy2008/BrokenLinkFinder
Broken Link Finder is a straightforward Python CLI tool that helps users spot broken links on websites, improving SEO and user experience. With features like deep site crawling and URL normalization, it ensures efficient and accurate audits. 🐙🔗
Language: Python - Size: 13.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

cloudy-sfu/Compare-electricity-plans-NZ
Compare electricity costs across different electricity plans in New Zealand, based on personal usage history
Language: Python - Size: 622 KB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0

platonai/PulsarRPAPro
Fully automated and hands-free, accurately extracting and understanding web content — powered by machine learning agents.
Language: Kotlin - Size: 24.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 119 - Forks: 27

jpjacobpadilla/SearchAI
Google Search tool with advanced filters and LLM-friendly outputs!
Language: Python - Size: 562 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19 - Forks: 0

amelia05-spec/crowdfunding-real-estate-scrapy
This project is a powerful and extensible scrapy-based crawler designed to extract and aggregate data from multiple real estate crowdfunding platforms. Ideal for investors, analysts and researchers interested in tracking investment opportunities, platform performance and market trends
Language: Python - Size: 31.3 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

devflowinc/firecrawl-simple Fork of mendableai/firecrawl
➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
Language: TypeScript - Size: 40 MB - Last synced at: 7 days ago - Pushed at: 28 days ago - Stars: 470 - Forks: 37

MarginaliaSearch/MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
Language: HTML - Size: 15.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,375 - Forks: 33

kaioobrabo/mcp-client-server
An MCP Server that's also an MCP Client. Useful for letting Claude develop and test MCPs without needing to reset the application.
Language: TypeScript - Size: 141 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 4 - Forks: 2

ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
Language: Python - Size: 15.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 19,984 - Forks: 1,699

internetarchive/Zeno
State-of-the-art web crawler 🔱
Language: Go - Size: 2.78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 178 - Forks: 34

ScrapeGraphAI/scrapegraph-mcp
ScrapeGraph MCP Server
Language: Python - Size: 200 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 4

mendableai/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
Language: JavaScript - Size: 337 KB - Last synced at: 8 days ago - Pushed at: 16 days ago - Stars: 3,408 - Forks: 324

apache/stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
Language: Java - Size: 7.28 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 915 - Forks: 265

kan01234/ur-web-spider
Web spider that scans for available UR rooms and outputs the results as CSV
Language: Python - Size: 53.8 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 6 - Forks: 1

Harvey-AU/blue-banded-bee
Cache warming app that crawls sites efficiently.
Language: Go - Size: 185 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
Language: TypeScript - Size: 625 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 301 - Forks: 34

jepsh/jepsh-ssg
A static site generator for modern web frameworks with route crawling and hydration support
Language: JavaScript - Size: 81.1 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 2 - Forks: 1

crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
Language: PHP - Size: 1.02 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 364 - Forks: 13

Debajyoti0-0/TriNetra
TriNetra is a fast web recon tool that uncovers hidden endpoints, API keys, and tokens — built for bug hunters and OSINT pros with Tor support and rich CLI output.
Language: Python - Size: 102 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

ialejandro/crowdfunding-real-estate-scrapy
This project is a powerful and extensible scrapy-based crawler designed to extract and aggregate data from multiple real estate crowdfunding platforms. Ideal for investors, analysts and researchers interested in tracking investment opportunities, platform performance and market trends
Language: Python - Size: 70.3 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

cloudy-sfu/Web-crawler-chorus-outage
Record Internet outage data in New Zealand
Language: Python - Size: 21.5 KB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

abaykan/CrawlBox 📦
An easy way to brute-force web directories.
Language: Python - Size: 71.3 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 153 - Forks: 40

MultiX0/froxy
🕸️ Froxy – A chill open-source web indexing engine built with Go, Node.js, and Next.js. Crawls, analyzes, and serves structured web data with TF-IDF magic and Supabase as the brain.
Language: TypeScript - Size: 725 KB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

laurentvv/crawl4ai-mcp
Web crawling tool that integrates with AI assistants via the MCP
Language: Python - Size: 64.5 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 11 - Forks: 5

HHN/crawler4j Fork of yasserg/crawler4j
Open Source Web Crawler for Java - A fork of yasserg/crawler4j
Language: Java - Size: 2 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 27 - Forks: 7

suebksnn/web-scout-mcp
A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.
Language: JavaScript - Size: 353 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

nicksherron/proxi 📦
Proxy pool. Finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.
Language: Go - Size: 1.09 MB - Last synced at: 4 days ago - Pushed at: about 5 years ago - Stars: 34 - Forks: 4

crawler-commons/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
Language: Java - Size: 3.73 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 244 - Forks: 80

sgogo0228/news-crawler-pipeline
Fetch news on Facebook and Google for content posting and internal reporting.
Language: Python - Size: 9.63 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

apache/nutch
Apache Nutch is an extensible and scalable web crawler
Language: Java - Size: 132 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 3,030 - Forks: 1,253

privacy-tech-lab/gpc-web-crawler
GPC Web Crawler for detecting websites' compliance with GPC privacy preference signals at scale
Language: Python - Size: 106 MB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 6 - Forks: 3

Norconex/crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
Language: Java - Size: 15.6 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 188 - Forks: 69

privacy-tech-lab/privacy-pioneer-web-crawler
Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer
Language: JavaScript - Size: 529 MB - Last synced at: 6 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

Yashh56/web-crawler
A high-performance, feature-rich web crawler built in Go with real-time CLI visualization.
Language: Go - Size: 4.84 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

dmytrochumakov/web-crawler
web-crawler
Language: Go - Size: 10.7 KB - Last synced at: 5 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

antchfx/antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Language: Go - Size: 56.6 KB - Last synced at: 16 days ago - Pushed at: about 5 years ago - Stars: 262 - Forks: 41

code-418-dpr/SportHub-parser
Parser for the PDF file of the Unified Calendar Plan (EKP) of the Russian Ministry of Sport, for the SportHub project
Language: Python - Size: 4.1 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

anlp-team/LTI_Neural_Navigator
"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang
Language: HTML - Size: 32.3 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 45 - Forks: 3

BHM-Bob/BA_PY
BA_PY: Optimize Your Workflow with Python!
Language: Python - Size: 2 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 2 - Forks: 1

bAndie91/tools
all-in collection of productivity scripts, CLI tools, utility libraries, fuse filesystems, and also some stuff
Language: Perl - Size: 1.42 MB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 18 - Forks: 2

sjdirect/abotx
Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.
Language: C# - Size: 17 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 134 - Forks: 23

scrape-do/scrapedo-scrapers
Web scraping examples with Scrape.do 😎
Language: Python - Size: 38.1 KB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 6 - Forks: 0

darsan-in/Job-Crawler
The Job Crawler is an integral component of the Job RAID project, designed to automatically scrape and collect data from various job listing websites. This crawler enables Job RAID to aggregate comprehensive job listings, ensuring that users have access to up-to-date and relevant job opportunities.
Language: Python - Size: 6.83 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 6 - Forks: 0

jpjacobpadilla/Stealth-Requests
Undetected web-scraping & seamless HTML parsing in Python!
Language: Python - Size: 714 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 251 - Forks: 13

TurnerSoftware/InfinityCrawler
A simple but powerful web crawler library for .NET
Language: C# - Size: 326 KB - Last synced at: 15 days ago - Pushed at: over 1 year ago - Stars: 252 - Forks: 37

blueheron786/google-sites-to-google-doc
Crawls and exports Classic Google Sites into a .docx file, preserving recipe titles and bullet lists. Written in Python.
Language: Python - Size: 8.79 KB - Last synced at: 21 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

atymri/WebCrawler
WebCrawler is a C# console application that recursively scans a website starting from a given URL, collects all discovered links, and saves them to a file. It’s useful for site mapping, link analysis, and content discovery.
Language: C# - Size: 6.84 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

lewisakura/spiderboi
A web crawling library written in TypeScript.
Language: TypeScript - Size: 376 KB - Last synced at: 4 days ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
Language: JavaScript - Size: 5.16 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 830 - Forks: 83

LSmyrnaios/PublicationsRetriever
A Java program that retrieves full texts or datasets from publication web pages.
Language: Java - Size: 8.76 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

lewisdonovan/google-news-scraper
Lightweight scraper for Google News
Language: TypeScript - Size: 895 KB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 330 - Forks: 66

Hecate2/Ignareo-ISML-auto-voter
Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
Language: Python - Size: 33.1 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 187 - Forks: 11

cloudy-sfu/Web-crawler-Manuka-honey
Compare prices of Manuka honey in New Zealand
Language: Python - Size: 61.5 KB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

amrnima/BrokenLinkFinder
A simple yet powerful Python CLI tool for technical SEO audits. Find and fix broken links, manage site-wide crawls with depth and page limits, and generate detailed JSON reports.
Language: Python - Size: 11.7 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

hyunwoongko/kochat
Opensource Korean chatbot framework
Language: Python - Size: 310 MB - Last synced at: 8 days ago - Pushed at: about 2 years ago - Stars: 457 - Forks: 185

cloudy-sfu/Web-crawler-gaspy
Record petrol and diesel fuel prices in major New Zealand cities
Language: Python - Size: 51.8 KB - Last synced at: 3 days ago - Pushed at: 26 days ago - Stars: 1 - Forks: 0

cloudy-sfu/Web-crawler-petrolspy
Record petrol and diesel fuel prices in major Australian cities
Language: Python - Size: 78.1 KB - Last synced at: 3 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

devopsgroup-io/siteshooter
📷 Automate full website screenshots and PDF generation with multiple viewport support.
Language: JavaScript - Size: 496 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 65 - Forks: 13

ScrapingAnt/amazon_scraper
Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt
Language: JavaScript - Size: 52.7 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 82 - Forks: 19

Young-TW/dl
Download all you play
Language: Python - Size: 36.1 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Language: Java - Size: 247 KB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 346 - Forks: 37

andrei-punko/java-crawlers
Collection of Java web crawlers
Language: Java - Size: 462 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 1

cloudy-sfu/AI-agent-rednote
AI agent for www.xiaohongshu.com thread
Language: Python - Size: 96.7 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 1

Algebra-FUN/WeReadScan
A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs
Language: Python - Size: 520 KB - Last synced at: 29 days ago - Pushed at: almost 2 years ago - Stars: 949 - Forks: 168

sjdirect/abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Language: C# - Size: 6.92 MB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 2,281 - Forks: 560

ssssssss-team/spider-flow
A next-generation crawler platform: define crawler workflows graphically and build crawlers without writing code.
Language: Java - Size: 3.23 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 9,965 - Forks: 1,916

qfcy/Python
This repository stores Python source code for more than 40 Python projects spanning many fields, including web crawlers, algorithms, OpenGL, tkinter, and object-oriented programming.
Language: Python - Size: 188 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 58 - Forks: 4

adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
Language: Python - Size: 592 KB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 6,543 - Forks: 527

Not70xic/web-ocr2
A fast, CLI-based tool to crawl a website path, download PDFs, OCR scanned files, and search text content using Whoosh indexing.
Language: Python - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

rivermont/spidy
The simple, easy to use command line web crawler.
Language: Python - Size: 81.8 MB - Last synced at: 24 days ago - Pushed at: 11 months ago - Stars: 346 - Forks: 69

rowyio/LLM-Web-Crawler
Web Scraper and Crawler for LLM Apps and AI Workflows with NoCode / LowCode. Plug and play with your own logic and customize it flexibly and scalably on BuildShip.
Language: TypeScript - Size: 271 KB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 25 - Forks: 7

spider-rs/spider-py
Spider ported to Python
Language: Rust - Size: 1.36 MB - Last synced at: 27 days ago - Pushed at: 5 months ago - Stars: 83 - Forks: 13

dangtoi05122003/Threads
Language: Jupyter Notebook - Size: 9.96 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

gpassero/uol-redacoes-xml
The UOL essay bank (http://educacao.uol.com.br/bancoderedacoes/) in XML, as a model for testing and validating NLP (natural language processing) techniques on essays.
Language: Python - Size: 23.4 MB - Last synced at: 16 days ago - Pushed at: almost 5 years ago - Stars: 34 - Forks: 10

s0rg/crawley
The unix-way web crawler
Language: Go - Size: 213 KB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 296 - Forks: 16

hc-nolan/pycrawler
Asynchronous Python web crawler. Goes as fast as you let it.
Language: Python - Size: 25.4 KB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AhmedSobhy01/sher-look
A high-performance search engine that crawls, indexes, and ranks web content, with support for Boolean queries, phrase searching, and an attractive web interface
Language: Java - Size: 570 KB - Last synced at: about 2 hours ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 1

xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Language: Python - Size: 814 KB - Last synced at: 29 days ago - Pushed at: about 3 years ago - Stars: 1,837 - Forks: 501
