An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: web-crawler

Synterraapt/Https-Site-Screaper-Social-Commerce-Court-Web

A powerful web scraping tool designed to extract data from HTTPS-secured websites, including social media, e-commerce platforms, and court records, for analysis, monitoring, or research purposes.

Language: C# - Size: 1.82 MB - Last synced at: about 3 hours ago - Pushed at: about 3 hours ago - Stars: 3 - Forks: 0

scrapfly/scrapfly-scrapers

Scalable Python web scraping scripts for 40+ popular domains

Language: Python - Size: 4.93 MB - Last synced at: about 11 hours ago - Pushed at: 7 days ago - Stars: 540 - Forks: 129

platonai/PulsarRPA

PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖

Language: Kotlin - Size: 30.4 MB - Last synced at: about 18 hours ago - Pushed at: about 23 hours ago - Stars: 883 - Forks: 128

exbomkimaa/Google-News-scraper

Scraper for Google News articles with headline extraction, keyword targeting, and proxy support.

Language: Python - Size: 8.79 KB - Last synced at: about 22 hours ago - Pushed at: about 23 hours ago - Stars: 0 - Forks: 0

kelvinweijun/AI-Powered-Search-Engine

AI-powered search engine that uses FAISS and DenseNet-50 for both text and reverse image search. Comes with an asynchronous web crawler.

Language: Python - Size: 617 KB - Last synced at: about 23 hours ago - Pushed at: about 24 hours ago - Stars: 0 - Forks: 0

mendableai/firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Language: TypeScript - Size: 57.5 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 40,225 - Forks: 3,756

apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Language: TypeScript - Size: 140 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 17,966 - Forks: 839

GreenMeeple/MensaarLecker

A fully automated scraper and static website for the Saarbrücken Mensa, powered by Python, Selenium, Google Sheets, and GitHub Actions.

Language: HTML - Size: 1.13 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

a-b-z-b/web-spider

A Humble Web Crawler

Language: Go - Size: 12.7 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

Palioboro/Argus

Modern C library for command-line argument parsing. Advanced features: subcommands, validation, multi-inputs, environment variables

Language: C - Size: 1.65 MB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 0 - Forks: 0

mendableai/firecrawl-app-examples

🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.

Language: Jupyter Notebook - Size: 13.6 MB - Last synced at: 2 days ago - Pushed at: 18 days ago - Stars: 416 - Forks: 111

KhoaLon/scrapedo-scrapers

Web scraping scripts with Scrape.do 😎

Size: 1000 Bytes - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 1 - Forks: 0

Saptha-Harsh/LilHomie

A machine learning project implemented from scratch, involving web scraping, data engineering, exploratory data analysis, and machine learning to predict housing prices in the New York Tri-State Area.

Language: Jupyter Notebook - Size: 2.22 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 440 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 4 - Forks: 2

apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Language: Python - Size: 28.3 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 5,737 - Forks: 391

hyperplasma/hyfetcher

A high-performance web content downloader and localizer built with Rust. Leverages Rust's powerful concurrency to efficiently batch download web pages and save them as local files.

Language: Rust - Size: 37.1 KB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 0

ScrapeGraphAI/scrapegraph-sdk

🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.

Language: Jupyter Notebook - Size: 6.64 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 54 - Forks: 8

crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders regardless of language or framework.

Language: Go - Size: 23.9 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 11,769 - Forks: 1,837

hominee/dyer

Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.

Language: Rust - Size: 75 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 126 - Forks: 7

Arman171/WebForensic

WebForensicAnalyzer is an advanced all-in-one tool for web reconnaissance, digital forensics, OSINT, and cybersecurity professionals. It automates deep website analysis—leveraging Shodan, Nmap, and more—to detect vulnerabilities, extract data, and deliver structured forensic results

Language: Python - Size: 3.05 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 2 - Forks: 1

webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

Language: TypeScript - Size: 53.1 MB - Last synced at: 5 days ago - Pushed at: 6 days ago - Stars: 805 - Forks: 104

gosom/scrapemate

Golang Crawling and scraping framework

Language: Go - Size: 193 KB - Last synced at: 1 day ago - Pushed at: about 2 months ago - Stars: 126 - Forks: 17

mamy2008/BrokenLinkFinder

Broken Link Finder is a straightforward Python CLI tool that helps users spot broken links on websites, improving SEO and user experience. With features like deep site crawling and URL normalization, it ensures efficient and accurate audits. 🐙🔗

Language: Python - Size: 13.7 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 0 - Forks: 0

cloudy-sfu/Compare-electricity-plans-NZ

Compare electricity fees across different electricity plans in New Zealand, based on personal electricity usage history

Language: Python - Size: 622 KB - Last synced at: 3 days ago - Pushed at: 6 days ago - Stars: 2 - Forks: 0

platonai/PulsarRPAPro

Fully automated and hands-free, accurately extracting and understanding web content — powered by machine learning agents.

Language: Kotlin - Size: 24.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 119 - Forks: 27

jpjacobpadilla/SearchAI

Google Search tool with advanced filters and LLM-friendly outputs!

Language: Python - Size: 562 KB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19 - Forks: 0

amelia05-spec/crowdfunding-real-estate-scrapy

This project is a powerful and extensible scrapy-based crawler designed to extract and aggregate data from multiple real estate crowdfunding platforms. Ideal for investors, analysts and researchers interested in tracking investment opportunities, platform performance and market trends

Language: Python - Size: 31.3 KB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

devflowinc/firecrawl-simple (fork of mendableai/firecrawl)

➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.

Language: TypeScript - Size: 40 MB - Last synced at: 7 days ago - Pushed at: 28 days ago - Stars: 470 - Forks: 37

MarginaliaSearch/MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

Language: HTML - Size: 15.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1,375 - Forks: 33

kaioobrabo/mcp-client-server

An MCP Server that's also an MCP Client. Useful for letting Claude develop and test MCPs without needing to reset the application.

Language: TypeScript - Size: 141 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 4 - Forks: 2

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

Language: Python - Size: 15.4 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 19,984 - Forks: 1,699

internetarchive/Zeno

State-of-the-art web crawler 🔱

Language: Go - Size: 2.78 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 178 - Forks: 34

ScrapeGraphAI/scrapegraph-mcp

ScrapeGraph MCP Server

Language: Python - Size: 200 KB - Last synced at: 6 days ago - Pushed at: about 1 month ago - Stars: 25 - Forks: 4

mendableai/firecrawl-mcp-server

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

Language: JavaScript - Size: 337 KB - Last synced at: 8 days ago - Pushed at: 16 days ago - Stars: 3,408 - Forks: 324

apache/stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

Language: Java - Size: 7.28 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 915 - Forks: 265

kan01234/ur-web-spider

Web spider that scans available UR rooms and outputs them as CSV

Language: Python - Size: 53.8 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 6 - Forks: 1

Harvey-AU/blue-banded-bee

Cache warming app that crawls sites efficiently.

Language: Go - Size: 185 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 1 - Forks: 0

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

Language: TypeScript - Size: 625 KB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 301 - Forks: 34

jepsh/jepsh-ssg

A static site generator for modern web frameworks with route crawling and hydration support

Language: JavaScript - Size: 81.1 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 2 - Forks: 1

crwlrsoft/crawler

Library for Rapid (Web) Crawler and Scraper Development

Language: PHP - Size: 1.02 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 364 - Forks: 13

Debajyoti0-0/TriNetra

TriNetra is a fast web recon tool that uncovers hidden endpoints, API keys, and tokens — built for bug hunters and OSINT pros with Tor support and rich CLI output.

Language: Python - Size: 102 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

ialejandro/crowdfunding-real-estate-scrapy

This project is a powerful and extensible scrapy-based crawler designed to extract and aggregate data from multiple real estate crowdfunding platforms. Ideal for investors, analysts and researchers interested in tracking investment opportunities, platform performance and market trends

Language: Python - Size: 70.3 KB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 1 - Forks: 0

cloudy-sfu/Web-crawler-chorus-outage

Record Internet outage data in New Zealand

Language: Python - Size: 21.5 KB - Last synced at: 3 days ago - Pushed at: 10 days ago - Stars: 0 - Forks: 0

abaykan/CrawlBox 📦

An easy way to brute-force web directories.

Language: Python - Size: 71.3 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 153 - Forks: 40

MultiX0/froxy

🕸️ Froxy – A chill open-source web indexing engine built with Go, Node.js, and Next.js. Crawls, analyzes, and serves structured web data with TF-IDF magic and Supabase as the brain.

Language: TypeScript - Size: 725 KB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 1 - Forks: 0

laurentvv/crawl4ai-mcp

Web crawling tool that integrates with AI assistants via the MCP

Language: Python - Size: 64.5 KB - Last synced at: 9 days ago - Pushed at: 3 months ago - Stars: 11 - Forks: 5

HHN/crawler4j (fork of yasserg/crawler4j)

Open Source Web Crawler for Java - A fork of yasserg/crawler4j

Language: Java - Size: 2 MB - Last synced at: 3 days ago - Pushed at: 4 days ago - Stars: 27 - Forks: 7

suebksnn/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

Language: JavaScript - Size: 353 KB - Last synced at: 11 days ago - Pushed at: 11 days ago - Stars: 0 - Forks: 0

nicksherron/proxi 📦

Proxy pool. Finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.

Language: Go - Size: 1.09 MB - Last synced at: 4 days ago - Pushed at: about 5 years ago - Stars: 34 - Forks: 4

crawler-commons/crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java - Size: 3.73 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 244 - Forks: 80

sgogo0228/news-crawler-pipeline

Fetch news on Facebook and Google for content posting and internal reporting.

Language: Python - Size: 9.63 MB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 0 - Forks: 0

apache/nutch

Apache Nutch is an extensible and scalable web crawler

Language: Java - Size: 132 MB - Last synced at: 7 days ago - Pushed at: 3 months ago - Stars: 3,030 - Forks: 1,253

privacy-tech-lab/gpc-web-crawler

GPC Web Crawler for detecting websites' compliance with GPC privacy preference signals at scale

Language: Python - Size: 106 MB - Last synced at: 6 days ago - Pushed at: 15 days ago - Stars: 6 - Forks: 3

Norconex/crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Language: Java - Size: 15.6 MB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 188 - Forks: 69

privacy-tech-lab/privacy-pioneer-web-crawler

Web crawler for detecting websites' data collection and sharing practices at scale using Privacy Pioneer

Language: JavaScript - Size: 529 MB - Last synced at: 6 days ago - Pushed at: 17 days ago - Stars: 1 - Forks: 0

Yashh56/web-crawler

A high-performance, feature-rich web crawler built in Go with real-time CLI visualization.

Language: Go - Size: 4.84 MB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 1 - Forks: 0

dmytrochumakov/web-crawler

web-crawler

Language: Go - Size: 10.7 KB - Last synced at: 5 days ago - Pushed at: 18 days ago - Stars: 0 - Forks: 0

antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Language: Go - Size: 56.6 KB - Last synced at: 16 days ago - Pushed at: about 5 years ago - Stars: 262 - Forks: 41

code-418-dpr/SportHub-parser

Parser for the Russian Ministry of Sport's Unified Calendar Plan (EKP) PDF file, for the SportHub project

Language: Python - Size: 4.1 MB - Last synced at: 18 days ago - Pushed at: 19 days ago - Stars: 0 - Forks: 0

anlp-team/LTI_Neural_Navigator

"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang

Language: HTML - Size: 32.3 MB - Last synced at: 8 days ago - Pushed at: over 1 year ago - Stars: 45 - Forks: 3

BHM-Bob/BA_PY

BA_PY: Optimize Your Workflow with Python!

Language: Python - Size: 2 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 2 - Forks: 1

bAndie91/tools

all-in collection of productivity scripts, CLI tools, utility libraries, fuse filesystems, and also some stuff

Language: Perl - Size: 1.42 MB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 18 - Forks: 2

sjdirect/abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.

Language: C# - Size: 17 MB - Last synced at: 13 days ago - Pushed at: over 1 year ago - Stars: 134 - Forks: 23

scrape-do/scrapedo-scrapers

Web scraping examples with Scrape.do 😎

Language: Python - Size: 38.1 KB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 6 - Forks: 0

darsan-in/Job-Crawler

The Job Crawler is an integral component of the Job RAID project, designed to automatically scrape and collect data from various job listing websites. This crawler enables Job RAID to aggregate comprehensive job listings, ensuring that users have access to up-to-date and relevant job opportunities.

Language: Python - Size: 6.83 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 6 - Forks: 0

jpjacobpadilla/Stealth-Requests

Undetected web-scraping & seamless HTML parsing in Python!

Language: Python - Size: 714 KB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 251 - Forks: 13

TurnerSoftware/InfinityCrawler

A simple but powerful web crawler library for .NET

Language: C# - Size: 326 KB - Last synced at: 15 days ago - Pushed at: over 1 year ago - Stars: 252 - Forks: 37

blueheron786/google-sites-to-google-doc

Crawls and exports Classic Google Sites into a .docx file, preserving recipe titles and bullet lists. Written in Python.

Language: Python - Size: 8.79 KB - Last synced at: 21 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

atymri/WebCrawler

WebCrawler is a C# console application that recursively scans a website starting from a given URL, collects all discovered links, and saves them to a file. It’s useful for site mapping, link analysis, and content discovery.

Language: C# - Size: 6.84 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 0

lewisakura/spiderboi

A web crawling library written in TypeScript.

Language: TypeScript - Size: 376 KB - Last synced at: 4 days ago - Pushed at: over 2 years ago - Stars: 8 - Forks: 1

gildas-lormeau/single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

Language: JavaScript - Size: 5.16 MB - Last synced at: 22 days ago - Pushed at: 3 months ago - Stars: 830 - Forks: 83

LSmyrnaios/PublicationsRetriever

A Java program that retrieves full texts or datasets from publication web pages.

Language: Java - Size: 8.76 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 1 - Forks: 0

lewisdonovan/google-news-scraper

Lightweight scraper for Google News

Language: TypeScript - Size: 895 KB - Last synced at: 19 days ago - Pushed at: 3 months ago - Stars: 330 - Forks: 66

Hecate2/Ignareo-ISML-auto-voter

Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)

Language: Python - Size: 33.1 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 187 - Forks: 11

cloudy-sfu/Web-crawler-Manuka-honey

Compare prices of Manuka honey in New Zealand

Language: Python - Size: 61.5 KB - Last synced at: 3 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

amrnima/BrokenLinkFinder

A simple yet powerful Python CLI tool for technical SEO audits. Find and fix broken links, manage site-wide crawls with depth and page limits, and generate detailed JSON reports.

Language: Python - Size: 11.7 KB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 0 - Forks: 0

hyunwoongko/kochat

Opensource Korean chatbot framework

Language: Python - Size: 310 MB - Last synced at: 8 days ago - Pushed at: about 2 years ago - Stars: 457 - Forks: 185

cloudy-sfu/Web-crawler-gaspy

Record petrol and diesel fuel prices in major New Zealand cities

Language: Python - Size: 51.8 KB - Last synced at: 3 days ago - Pushed at: 26 days ago - Stars: 1 - Forks: 0

cloudy-sfu/Web-crawler-petrolspy

Record petrol and diesel fuel prices in major Australian cities

Language: Python - Size: 78.1 KB - Last synced at: 3 days ago - Pushed at: 26 days ago - Stars: 0 - Forks: 0

devopsgroup-io/siteshooter

📷 Automate full website screenshots and PDF generation with multiple viewport support.

Language: JavaScript - Size: 496 KB - Last synced at: 3 days ago - Pushed at: about 6 years ago - Stars: 65 - Forks: 13

ScrapingAnt/amazon_scraper

Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt

Language: JavaScript - Size: 52.7 KB - Last synced at: 11 days ago - Pushed at: over 1 year ago - Stars: 82 - Forks: 19

Young-TW/dl

Download all you play

Language: Python - Size: 36.1 MB - Last synced at: 27 days ago - Pushed at: 27 days ago - Stars: 0 - Forks: 0

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 247 KB - Last synced at: 8 days ago - Pushed at: 4 months ago - Stars: 346 - Forks: 37

andrei-punko/java-crawlers

Collection of Java web crawlers

Language: Java - Size: 462 KB - Last synced at: 28 days ago - Pushed at: 28 days ago - Stars: 0 - Forks: 1

cloudy-sfu/AI-agent-rednote

AI agent for www.xiaohongshu.com thread

Language: Python - Size: 96.7 KB - Last synced at: 3 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 1

Algebra-FUN/WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

Language: Python - Size: 520 KB - Last synced at: 29 days ago - Pushed at: almost 2 years ago - Stars: 949 - Forks: 168

sjdirect/abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Language: C# - Size: 6.92 MB - Last synced at: about 1 month ago - Pushed at: 9 months ago - Stars: 2,281 - Forks: 560

ssssssss-team/spider-flow

A next-generation crawler platform that defines crawling workflows graphically, so crawlers can be built without writing code.

Language: Java - Size: 3.23 MB - Last synced at: about 1 month ago - Pushed at: about 2 years ago - Stars: 9,965 - Forks: 1,916

qfcy/Python

This repository stores Python source code for more than 40 Python projects spanning many fields, including web crawlers, algorithms, OpenGL, tkinter, and object-oriented programming.

Language: Python - Size: 188 MB - Last synced at: 11 days ago - Pushed at: about 1 month ago - Stars: 58 - Forks: 4

adithya-s-k/omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

Language: Python - Size: 592 KB - Last synced at: about 1 month ago - Pushed at: 2 months ago - Stars: 6,543 - Forks: 527

Not70xic/web-ocr2

A fast, CLI-based tool to crawl a website path, download PDFs, OCR scanned files, and search text content using Whoosh indexing.

Language: Python - Size: 5.86 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

rivermont/spidy

The simple, easy to use command line web crawler.

Language: Python - Size: 81.8 MB - Last synced at: 24 days ago - Pushed at: 11 months ago - Stars: 346 - Forks: 69

rowyio/LLM-Web-Crawler

Web Scraper and Crawler for LLM Apps and AI Workflows with NoCode / LowCode. Plug and play with your own logic and customize it flexibly and scalably on BuildShip.

Language: TypeScript - Size: 271 KB - Last synced at: 8 days ago - Pushed at: 11 months ago - Stars: 25 - Forks: 7

spider-rs/spider-py

Spider ported to Python

Language: Rust - Size: 1.36 MB - Last synced at: 27 days ago - Pushed at: 5 months ago - Stars: 83 - Forks: 13

dangtoi05122003/Threads

Language: Jupyter Notebook - Size: 9.96 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

gpassero/uol-redacoes-xml

The UOL essay bank (http://educacao.uol.com.br/bancoderedacoes/) in XML, as a test and validation model for NLP (Natural Language Processing) techniques applied to essays.

Language: Python - Size: 23.4 MB - Last synced at: 16 days ago - Pushed at: almost 5 years ago - Stars: 34 - Forks: 10

s0rg/crawley

The unix-way web crawler

Language: Go - Size: 213 KB - Last synced at: 28 days ago - Pushed at: about 1 month ago - Stars: 296 - Forks: 16

hc-nolan/pycrawler

Asynchronous Python web crawler. Goes as fast as you let it.

Language: Python - Size: 25.4 KB - Last synced at: 20 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

AhmedSobhy01/sher-look

A high-performance search engine that crawls, indexes, and ranks web content, with support for Boolean queries, phrase searching, and an attractive web interface

Language: Java - Size: 570 KB - Last synced at: about 2 hours ago - Pushed at: about 1 month ago - Stars: 8 - Forks: 1

xianhu/PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

Language: Python - Size: 814 KB - Last synced at: 29 days ago - Pushed at: about 3 years ago - Stars: 1,837 - Forks: 501