Topic: "web-crawler"
mendableai/firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Language: TypeScript - Size: 57.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 39,788 - Forks: 3,705

ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
Language: Python - Size: 15.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19,984 - Forks: 1,699

apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: TypeScript - Size: 141 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 17,898 - Forks: 833

crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders regardless of language or framework.
Language: Go - Size: 23.9 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 11,769 - Forks: 1,837

ssssssss-team/spider-flow
A next-generation crawler platform: define crawl workflows graphically and build crawlers without writing code.
Language: Java - Size: 3.23 MB - Last synced at: 29 days ago - Pushed at: about 2 years ago - Stars: 9,965 - Forks: 1,916

BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
Size: 74.2 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 6,744 - Forks: 716

adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
Language: Python - Size: 592 KB - Last synced at: 29 days ago - Pushed at: 2 months ago - Stars: 6,543 - Forks: 527

apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: Python - Size: 28.3 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 5,737 - Forks: 391

mendableai/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
Language: JavaScript - Size: 337 KB - Last synced at: 6 days ago - Pushed at: 14 days ago - Stars: 3,408 - Forks: 324

apache/nutch
Apache Nutch is an extensible and scalable web crawler
Language: Java - Size: 132 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 3,030 - Forks: 1,253

sjdirect/abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Language: C# - Size: 6.92 MB - Last synced at: 29 days ago - Pushed at: 9 months ago - Stars: 2,281 - Forks: 560

xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Language: Python - Size: 814 KB - Last synced at: 27 days ago - Pushed at: about 3 years ago - Stars: 1,837 - Forks: 501

jasonxtn/Argus
The Ultimate Information Gathering Toolkit
Language: Python - Size: 131 KB - Last synced at: 6 months ago - Pushed at: 8 months ago - Stars: 1,413 - Forks: 148

MarginaliaSearch/MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
Language: HTML - Size: 15.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,375 - Forks: 33

Algebra-FUN/WeReadScan
A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs
Language: Python - Size: 520 KB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 949 - Forks: 168

apache/stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
Language: Java - Size: 7.28 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 915 - Forks: 265

platonai/PulsarRPA
PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖
Language: Kotlin - Size: 29.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 882 - Forks: 128

gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
Language: JavaScript - Size: 5.16 MB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 830 - Forks: 83

postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Language: Ruby - Size: 685 KB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 818 - Forks: 107

webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Language: TypeScript - Size: 53.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 805 - Forks: 104

cxcscmu/Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
Language: Python - Size: 79.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 608 - Forks: 56

scrapfly/scrapfly-scrapers
Scalable Python web scraping scripts for 40+ popular domains
Language: Python - Size: 4.79 MB - Last synced at: 8 days ago - Pushed at: 11 days ago - Stars: 534 - Forks: 129

devflowinc/firecrawl-simple Fork of mendableai/firecrawl
➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
Language: TypeScript - Size: 40 MB - Last synced at: 5 days ago - Pushed at: 26 days ago - Stars: 470 - Forks: 37

VIDA-NYU/ache
ACHE is a web crawler for domain-specific search.
Language: Java - Size: 66.6 MB - Last synced at: 26 days ago - Pushed at: almost 2 years ago - Stars: 468 - Forks: 134

hyunwoongko/kochat
Opensource Korean chatbot framework
Language: Python - Size: 310 MB - Last synced at: 6 days ago - Pushed at: about 2 years ago - Stars: 457 - Forks: 185

mendableai/firecrawl-app-examples
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
Language: Jupyter Notebook - Size: 13.6 MB - Last synced at: about 20 hours ago - Pushed at: 16 days ago - Stars: 416 - Forks: 111

USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Language: Java - Size: 23.1 MB - Last synced at: 3 months ago - Pushed at: about 2 years ago - Stars: 412 - Forks: 141

brendonboshell/supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Language: JavaScript - Size: 664 KB - Last synced at: 22 days ago - Pushed at: over 2 years ago - Stars: 381 - Forks: 61

lefterisloukas/edgar-crawler
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files.
Language: Python - Size: 63 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 365 - Forks: 100

crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
Language: PHP - Size: 1.02 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 364 - Forks: 13

commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Language: Java - Size: 247 KB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 346 - Forks: 37

rivermont/spidy
The simple, easy to use command line web crawler.
Language: Python - Size: 81.8 MB - Last synced at: 22 days ago - Pushed at: 10 months ago - Stars: 346 - Forks: 69

lewisdonovan/google-news-scraper
Lightweight scraper for Google News
Language: TypeScript - Size: 895 KB - Last synced at: 18 days ago - Pushed at: 3 months ago - Stars: 330 - Forks: 66

infinilabs/crawler
🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)
Language: Go - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 308 - Forks: 82

graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
Language: TypeScript - Size: 625 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 301 - Forks: 34

s0rg/crawley
The unix-way web crawler
Language: Go - Size: 213 KB - Last synced at: 26 days ago - Pushed at: about 1 month ago - Stars: 296 - Forks: 16

yields/ant
A web crawler for Go
Language: Go - Size: 168 KB - Last synced at: 26 days ago - Pushed at: 3 months ago - Stars: 278 - Forks: 17

microfisher/Strong-Web-Crawler
An advanced web crawler based on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page DOM structure.
Language: C# - Size: 18.3 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 275 - Forks: 154

duyet/awesome-web-scraper
A collection of awesome web scrapers and crawlers.
Size: 48.8 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 273 - Forks: 46

antchfx/antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Language: Go - Size: 56.6 KB - Last synced at: 15 days ago - Pushed at: about 5 years ago - Stars: 262 - Forks: 41

lucasxlu/LagouJob
Data Analysis & Mining for lagou.com
Language: Python - Size: 28.1 MB - Last synced at: 7 months ago - Pushed at: about 6 years ago - Stars: 259 - Forks: 127

TurnerSoftware/InfinityCrawler
A simple but powerful web crawler library for .NET
Language: C# - Size: 326 KB - Last synced at: 14 days ago - Pushed at: over 1 year ago - Stars: 252 - Forks: 37

jpjacobpadilla/Stealth-Requests
Undetected web-scraping & seamless HTML parsing in Python!
Language: Python - Size: 714 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 251 - Forks: 13

crawler-commons/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
Language: Java - Size: 3.73 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 244 - Forks: 80

crawlab-team/crawlab-lite
Lite version of Crawlab, the crawler management platform.
Language: Vue - Size: 2.36 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 224 - Forks: 75

xiayouran/Musicer
Aims to bring together music platforms such as NetEase Cloud Music, KuGou, QQ Music, and Kuwo in one place
Language: Python - Size: 10 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 202 - Forks: 21

elliotxx/zhihu-crawler-people
A simple distributed crawler for zhihu && data analysis
Language: Python - Size: 183 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 192 - Forks: 89

Norconex/crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
Language: Java - Size: 15.6 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 188 - Forks: 69

Hecate2/Ignareo-ISML-auto-voter
Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
Language: Python - Size: 33.1 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 187 - Forks: 11

internetarchive/Zeno
State-of-the-art web crawler 🔱
Language: Go - Size: 2.78 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 178 - Forks: 34

saeeddhqan/evine
Interactive CLI Web Crawler
Language: Go - Size: 789 KB - Last synced at: 11 months ago - Pushed at: over 3 years ago - Stars: 175 - Forks: 32

Madi-S/Lead-Generation
A Python script that empowers people with no programming background to generate robust leads on a mass scale. This repo will collect various versatile techniques used in lead generation.
Language: Python - Size: 9.67 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 153 - Forks: 38

abaykan/CrawlBox 📦
An easy way to brute-force web directories.
Language: Python - Size: 71.3 KB - Last synced at: 9 days ago - Pushed at: about 6 years ago - Stars: 152 - Forks: 40

sjdirect/abotx
Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.
Language: C# - Size: 17 MB - Last synced at: 12 days ago - Pushed at: over 1 year ago - Stars: 134 - Forks: 23

skytruine/OSpider
An open-source tool for acquiring and preprocessing vector geographic data (POI/AOI/administrative divisions/road networks/land use)
Language: Python - Size: 81.6 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 130 - Forks: 22

brianmadden/krawler
A web crawling framework written in Kotlin
Language: Kotlin - Size: 403 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 130 - Forks: 16

mazzzystar/Proxy
A simple tool for fetching usable proxies from several websites.
Language: Python - Size: 64.5 KB - Last synced at: about 2 months ago - Pushed at: over 4 years ago - Stars: 127 - Forks: 68

hominee/dyer
Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
Language: Rust - Size: 75 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 126 - Forks: 7

gosom/scrapemate
Golang Crawling and scraping framework
Language: Go - Size: 193 KB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 125 - Forks: 17

DwarfThief/Raspagem-de-dados-para-iniciantes
Data scraping for beginners using Scrapy and other basic libraries
Language: Python - Size: 2.04 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 122 - Forks: 20

platonai/PulsarRPAPro
Fully automated and hands-free, accurately extracting and understanding web content — powered by machine learning agents.
Language: Kotlin - Size: 24.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 119 - Forks: 27

MaxValue/Terpene-Profile-Parser-for-Cannabis-Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Language: Python - Size: 21.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 118 - Forks: 18

monkey-soft/SchweizerMesser
🎯 A collection of hands-on Python 3 web crawler and data analysis projects | Dangdang | NetEase Cloud Music | Unsplash | Pizza Hut | Maoyan |
Language: HTML - Size: 517 KB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 97 - Forks: 73

creekorful/bathyscaphe
Fast, highly configurable, cloud native dark web crawler.
Language: Go - Size: 830 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 93 - Forks: 22

KaidiGuo/keyword_based_Sina_weibo_crawler
A web crawler for Sina Weibo that searches for and retrieves microblogs containing certain keywords. A simple Python crawler exercise.
Language: Python - Size: 47.9 KB - Last synced at: almost 2 years ago - Pushed at: over 6 years ago - Stars: 93 - Forks: 31

Viveckh/LilHomie
A Machine Learning Project implemented from scratch which involves web scraping, data engineering, exploratory data analysis and machine learning to predict housing prices in New York Tri-State Area.
Language: Jupyter Notebook - Size: 10.5 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 92 - Forks: 19

tech-engine/goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
Language: Go - Size: 6.16 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 89 - Forks: 2

havanagrawal/GoodreadsScraper
Scrape data from Goodreads using Scrapy and Selenium :books:
Language: Python - Size: 5.11 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 86 - Forks: 20

joe-habel/YouTube-View-Bot
A rotating proxy solution to bot YouTube views
Language: Python - Size: 2.93 KB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 86 - Forks: 81

spider-rs/spider-py
Spider ported to Python
Language: Rust - Size: 1.36 MB - Last synced at: 25 days ago - Pushed at: 5 months ago - Stars: 83 - Forks: 13

ScrapingAnt/amazon_scraper
Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt
Language: JavaScript - Size: 52.7 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 82 - Forks: 19

redcode-labs/UnChain
A tool to find redirection chains in multiple URLs
Language: Go - Size: 3.3 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 80 - Forks: 13

wyu-du/StockForecast
:dart: predict the price trend of individual stocks using deep learning and natural language processing
Language: Python - Size: 29.4 MB - Last synced at: about 1 year ago - Pushed at: over 7 years ago - Stars: 74 - Forks: 34

mattdeitke/CVPR2019
Displays all the 2019 CVPR Accepted Papers in a way that they are easy to parse.
Language: HTML - Size: 27.8 MB - Last synced at: 7 days ago - Pushed at: over 4 years ago - Stars: 70 - Forks: 10

devopsgroup-io/siteshooter
:camera: Automate full website screenshots and PDF generation with multiple viewport support.
Language: JavaScript - Size: 496 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 65 - Forks: 13

abo123456789/leek
Distributed task redisqueue (the simplest Python distributed function scheduling framework)
Language: Python - Size: 412 KB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 63 - Forks: 19

gicornachini/bolsa
A Python library that aims to make it easier to access data on your stock exchange investments (B3/CEI) through the Portal CEI.
Language: Python - Size: 125 KB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 62 - Forks: 18

amalrajan/learncpp-download
Multi-threaded web scraper to download all the tutorials from www.learncpp.com and convert them to PDF files concurrently.
Language: Python - Size: 494 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 60 - Forks: 19

Misterhex/WebCrawler
Just a simple web crawler that returns crawled links as IObservable using Reactive Extensions and async/await.
Language: C# - Size: 7.13 MB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 60 - Forks: 33

qfcy/Python
This repository stores Python source code for more than 40 Python projects spanning many fields, including web crawlers, algorithms, OpenGL, tkinter, and object-oriented programming.
Language: Python - Size: 188 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 58 - Forks: 4

Cheng-Lin-Li/Market-Trend-Prediction
A project for a course on building knowledge graphs. It leverages historical stock prices and integrates social media listening from customers to predict the market trend of the Dow Jones Industrial Average (DJIA).
Language: Julia - Size: 143 MB - Last synced at: about 2 months ago - Pushed at: about 7 years ago - Stars: 58 - Forks: 28

avilum/smart-url-fuzzer
Explore URLs of domains fast and efficiently using fuzzing techniques
Language: Python - Size: 338 KB - Last synced at: 3 months ago - Pushed at: about 4 years ago - Stars: 56 - Forks: 18

ScrapeGraphAI/scrapegraph-sdk
🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.
Language: Jupyter Notebook - Size: 6.64 MB - Last synced at: about 3 hours ago - Pushed at: about 17 hours ago - Stars: 54 - Forks: 8

ScaleUnlimited/flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Language: Java - Size: 1.38 MB - Last synced at: about 1 year ago - Pushed at: about 6 years ago - Stars: 52 - Forks: 18

shenfe/puppeteer-service
🎠 Run headless Chrome (aka Puppeteer) as a service.
Language: JavaScript - Size: 133 KB - Last synced at: 5 days ago - Pushed at: over 7 years ago - Stars: 49 - Forks: 6

anlp-team/LTI_Neural_Navigator
"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang
Language: HTML - Size: 32.3 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 45 - Forks: 3

spk/maman
Rust Web Crawler saving pages on Redis
Language: Rust - Size: 203 KB - Last synced at: 28 days ago - Pushed at: about 4 years ago - Stars: 44 - Forks: 5

ahmedshahriar/youtube-comment-scraper
This script will dump youtube video comments to a CSV from youtube video links. Video links can be placed inside a variable or list or CSV
Language: Jupyter Notebook - Size: 256 KB - Last synced at: about 2 months ago - Pushed at: over 3 years ago - Stars: 42 - Forks: 15

yuanyuanzijin/dutsso
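A Python module for quickly logging in to the Dalian University of Technology unified identity authentication system (SSO). It makes it easy to implement features such as grade alerts, course grabbing, Yulan card information, and personal information lookup.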
Language: Python - Size: 109 KB - Last synced at: over 2 years ago - Pushed at: about 6 years ago - Stars: 40 - Forks: 10

spk/validate-website
Web crawler for checking the validity of your documents.
Language: HTML - Size: 934 KB - Last synced at: 15 days ago - Pushed at: almost 2 years ago - Stars: 39 - Forks: 9

GoncaloMark/CobWeb-lnx
CobWeb is a Python library for web scraping. The library consists of two classes: Spider and Scraper.
Language: Python - Size: 7.75 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 38 - Forks: 2

SylvainDe/ComicBookMaker
Script to fetch webcomics and use them to create ebooks.
Language: Python - Size: 2.48 MB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 36 - Forks: 7

weiyu666/Graduation_Design-Distributed_Web_Spider
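A graduation project built on a distributed crawler for Weibo user profile data, with a small amount of simple data analysis. It is also a memento of four years of university! It includes the source code, the first and second drafts of the thesis, the final plagiarism-checked draft, UML diagrams, slides (PPT), and more.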
Language: Python - Size: 92.5 MB - Last synced at: over 2 years ago - Pushed at: about 7 years ago - Stars: 36 - Forks: 4

gpassero/uol-redacoes-xml
The UOL essay bank (http://educacao.uol.com.br/bancoderedacoes/) in XML, as a test and validation model for NLP (natural language processing) techniques applied to essays.
Language: Python - Size: 23.4 MB - Last synced at: 15 days ago - Pushed at: almost 5 years ago - Stars: 34 - Forks: 10

nicksherron/proxi 📦
Proxy pool. Finds and checks proxies with rest api for querying results. Can find over 25k proxies in under 5 minutes.
Language: Go - Size: 1.09 MB - Last synced at: 3 days ago - Pushed at: about 5 years ago - Stars: 34 - Forks: 4

commoncrawl/nutch Fork of Aloisius/nutch
Common Crawl fork of Apache Nutch
Language: Java - Size: 132 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 33 - Forks: 2

xofred/deviantart-gallery-downloader
Fetch DeviantArt images using Mechanize
Language: Ruby - Size: 69.3 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 33 - Forks: 4

ScrapingAnt/zoominfo_scraper
ZoomInfo scraper using rotating proxies and headless Chrome from ScrapingAnt
Language: Python - Size: 7.81 KB - Last synced at: 7 days ago - Pushed at: about 4 years ago - Stars: 33 - Forks: 9

Keep-Current/web-miner
Crawls sites to find new content and scrape it
Language: Python - Size: 215 KB - Last synced at: about 1 year ago - Pushed at: about 4 years ago - Stars: 33 - Forks: 29

LeaFrock/SpiderX
A simple web-crawler development framework based on .NET Core.
Language: C# - Size: 2.34 MB - Last synced at: 2 months ago - Pushed at: 11 months ago - Stars: 32 - Forks: 10
