An open API service providing repository metadata for many open source software ecosystems.

Topic: "web-crawler"

mendableai/firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Language: TypeScript - Size: 57.1 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 39,788 - Forks: 3,705

ScrapeGraphAI/Scrapegraph-ai

Python scraper based on AI

Language: Python - Size: 15.4 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 19,984 - Forks: 1,699

apify/crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Language: TypeScript - Size: 141 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 17,898 - Forks: 833

crawlab-team/crawlab

Distributed web crawler admin platform for managing spiders, regardless of language or framework.

Language: Go - Size: 23.9 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 11,769 - Forks: 1,837

ssssssss-team/spider-flow

A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.

Language: Java - Size: 3.23 MB - Last synced at: 29 days ago - Pushed at: about 2 years ago - Stars: 9,965 - Forks: 1,916

BruceDone/awesome-crawler

A collection of awesome web crawlers and spiders in different languages

Size: 74.2 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 6,744 - Forks: 716

adithya-s-k/omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

Language: Python - Size: 592 KB - Last synced at: 29 days ago - Pushed at: 2 months ago - Stars: 6,543 - Forks: 527

apify/crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Language: Python - Size: 28.3 MB - Last synced at: 1 day ago - Pushed at: 2 days ago - Stars: 5,737 - Forks: 391

mendableai/firecrawl-mcp-server

Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

Language: JavaScript - Size: 337 KB - Last synced at: 6 days ago - Pushed at: 14 days ago - Stars: 3,408 - Forks: 324

apache/nutch

Apache Nutch is an extensible and scalable web crawler

Language: Java - Size: 132 MB - Last synced at: 6 days ago - Pushed at: 3 months ago - Stars: 3,030 - Forks: 1,253

sjdirect/abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Language: C# - Size: 6.92 MB - Last synced at: 29 days ago - Pushed at: 9 months ago - Stars: 2,281 - Forks: 560

xianhu/PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

Language: Python - Size: 814 KB - Last synced at: 27 days ago - Pushed at: about 3 years ago - Stars: 1,837 - Forks: 501

jasonxtn/Argus

The Ultimate Information Gathering Toolkit

Language: Python - Size: 131 KB - Last synced at: 6 months ago - Pushed at: 8 months ago - Stars: 1,413 - Forks: 148

MarginaliaSearch/MarginaliaSearch

Internet search engine for text-oriented websites. Indexing the small, old and weird web.

Language: HTML - Size: 15.2 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1,375 - Forks: 33

Algebra-FUN/WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

Language: Python - Size: 520 KB - Last synced at: 27 days ago - Pushed at: over 1 year ago - Stars: 949 - Forks: 168

apache/stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

Language: Java - Size: 7.28 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 915 - Forks: 265

platonai/PulsarRPA

PulsarRPA: An AI-Enabled, Super-Fast, Thread-Safe Browser Automation Solution! 💖

Language: Kotlin - Size: 29.9 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 882 - Forks: 128

gildas-lormeau/single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

Language: JavaScript - Size: 5.16 MB - Last synced at: 21 days ago - Pushed at: 3 months ago - Stars: 830 - Forks: 83

postmodern/spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

Language: Ruby - Size: 685 KB - Last synced at: 29 days ago - Pushed at: 5 months ago - Stars: 818 - Forks: 107

webrecorder/browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container

Language: TypeScript - Size: 53.1 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 805 - Forks: 104

cxcscmu/Craw4LLM

Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

Language: Python - Size: 79.1 KB - Last synced at: 2 months ago - Pushed at: 4 months ago - Stars: 608 - Forks: 56

scrapfly/scrapfly-scrapers

Scalable Python web scraping scripts for 40+ popular domains

Language: Python - Size: 4.79 MB - Last synced at: 8 days ago - Pushed at: 11 days ago - Stars: 534 - Forks: 129

devflowinc/firecrawl-simple (fork of mendableai/firecrawl)

➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.

Language: TypeScript - Size: 40 MB - Last synced at: 5 days ago - Pushed at: 26 days ago - Stars: 470 - Forks: 37

VIDA-NYU/ache

ACHE is a web crawler for domain-specific search.

Language: Java - Size: 66.6 MB - Last synced at: 26 days ago - Pushed at: almost 2 years ago - Stars: 468 - Forks: 134

hyunwoongko/kochat

Opensource Korean chatbot framework

Language: Python - Size: 310 MB - Last synced at: 6 days ago - Pushed at: about 2 years ago - Stars: 457 - Forks: 185

mendableai/firecrawl-app-examples

🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.

Language: Jupyter Notebook - Size: 13.6 MB - Last synced at: about 20 hours ago - Pushed at: 16 days ago - Stars: 416 - Forks: 111

USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Language: Java - Size: 23.1 MB - Last synced at: 3 months ago - Pushed at: about 2 years ago - Stars: 412 - Forks: 141

brendonboshell/supercrawler

A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

Language: JavaScript - Size: 664 KB - Last synced at: 22 days ago - Pushed at: over 2 years ago - Stars: 381 - Forks: 61

lefterisloukas/edgar-crawler

The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files.

Language: Python - Size: 63 MB - Last synced at: 2 months ago - Pushed at: 3 months ago - Stars: 365 - Forks: 100

crwlrsoft/crawler

Library for Rapid (Web) Crawler and Scraper Development

Language: PHP - Size: 1.02 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 364 - Forks: 13

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 247 KB - Last synced at: 7 days ago - Pushed at: 4 months ago - Stars: 346 - Forks: 37

rivermont/spidy

The simple, easy to use command line web crawler.

Language: Python - Size: 81.8 MB - Last synced at: 22 days ago - Pushed at: 10 months ago - Stars: 346 - Forks: 69

lewisdonovan/google-news-scraper

Lightweight scraper for Google News

Language: TypeScript - Size: 895 KB - Last synced at: 18 days ago - Pushed at: 3 months ago - Stars: 330 - Forks: 66

infinilabs/crawler

🕷️ An easy-to-use spider written in Golang (previously named GOPA).

Language: Go - Size: 54.6 MB - Last synced at: about 1 month ago - Pushed at: about 4 years ago - Stars: 308 - Forks: 82

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

Language: TypeScript - Size: 625 KB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 301 - Forks: 34

s0rg/crawley

The unix-way web crawler

Language: Go - Size: 213 KB - Last synced at: 26 days ago - Pushed at: about 1 month ago - Stars: 296 - Forks: 16

yields/ant

A web crawler for Go

Language: Go - Size: 168 KB - Last synced at: 26 days ago - Pushed at: 3 months ago - Stars: 278 - Forks: 17

microfisher/Strong-Web-Crawler

An advanced web crawler based on C#.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.

Language: C# - Size: 18.3 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 275 - Forks: 154

duyet/awesome-web-scraper

A collection of awesome web scrapers and crawlers.

Size: 48.8 KB - Last synced at: about 1 month ago - Pushed at: about 1 year ago - Stars: 273 - Forks: 46

antchfx/antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Language: Go - Size: 56.6 KB - Last synced at: 15 days ago - Pushed at: about 5 years ago - Stars: 262 - Forks: 41

lucasxlu/LagouJob

Data Analysis & Mining for lagou.com

Language: Python - Size: 28.1 MB - Last synced at: 7 months ago - Pushed at: about 6 years ago - Stars: 259 - Forks: 127

TurnerSoftware/InfinityCrawler

A simple but powerful web crawler library for .NET

Language: C# - Size: 326 KB - Last synced at: 14 days ago - Pushed at: over 1 year ago - Stars: 252 - Forks: 37

jpjacobpadilla/Stealth-Requests

Undetected web-scraping & seamless HTML parsing in Python!

Language: Python - Size: 714 KB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 251 - Forks: 13

crawler-commons/crawler-commons

A set of reusable Java components that implement functionality common to any web crawler

Language: Java - Size: 3.73 MB - Last synced at: 13 days ago - Pushed at: 13 days ago - Stars: 244 - Forks: 80

crawlab-team/crawlab-lite

Lite version of Crawlab, a lightweight crawler management platform.

Language: Vue - Size: 2.36 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 224 - Forks: 75

xiayouran/Musicer

Aims to bring together music platforms such as NetEase Cloud Music, Kugou, QQ Music, and Kuwo in one place.

Language: Python - Size: 10 MB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 202 - Forks: 21

elliotxx/zhihu-crawler-people

A simple distributed crawler for Zhihu, plus data analysis

Language: Python - Size: 183 KB - Last synced at: 2 months ago - Pushed at: over 2 years ago - Stars: 192 - Forks: 89

Norconex/crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

Language: Java - Size: 15.6 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 188 - Forks: 69

Hecate2/Ignareo-ISML-auto-voter

Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)

Language: Python - Size: 33.1 MB - Last synced at: 23 days ago - Pushed at: 23 days ago - Stars: 187 - Forks: 11

internetarchive/Zeno

State-of-the-art web crawler 🔱

Language: Go - Size: 2.78 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 178 - Forks: 34

saeeddhqan/evine

Interactive CLI Web Crawler

Language: Go - Size: 789 KB - Last synced at: 11 months ago - Pushed at: over 3 years ago - Stars: 175 - Forks: 32

Madi-S/Lead-Generation

A Python script that empowers people with no programming background to generate robust leads at scale. This repo compiles various versatile techniques used in lead generation.

Language: Python - Size: 9.67 MB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 153 - Forks: 38

abaykan/CrawlBox 📦

An easy way to brute-force web directories.

Language: Python - Size: 71.3 KB - Last synced at: 9 days ago - Pushed at: about 6 years ago - Stars: 152 - Forks: 40

sjdirect/abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.

Language: C# - Size: 17 MB - Last synced at: 12 days ago - Pushed at: over 1 year ago - Stars: 134 - Forks: 23

skytruine/OSpider

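An open-source tool for acquiring and preprocessing vector geographic data (POI, AOI, administrative districts, road networks, land use).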

Language: Python - Size: 81.6 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 130 - Forks: 22

brianmadden/krawler

A web crawling framework written in Kotlin

Language: Kotlin - Size: 403 KB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 130 - Forks: 16

mazzzystar/Proxy

A simple tool for fetching usable proxies from several websites.

Language: Python - Size: 64.5 KB - Last synced at: about 2 months ago - Pushed at: over 4 years ago - Stars: 127 - Forks: 68

hominee/dyer

Dyer is designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.

Language: Rust - Size: 75 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 126 - Forks: 7

gosom/scrapemate

Golang Crawling and scraping framework

Language: Go - Size: 193 KB - Last synced at: 9 days ago - Pushed at: about 2 months ago - Stars: 125 - Forks: 17

DwarfThief/Raspagem-de-dados-para-iniciantes

Data scraping for beginners using Scrapy and other basic libraries

Language: Python - Size: 2.04 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 122 - Forks: 20

platonai/PulsarRPAPro

Fully automated and hands-free, accurately extracting and understanding web content — powered by machine learning agents.

Language: Kotlin - Size: 24.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 119 - Forks: 27

MaxValue/Terpene-Profile-Parser-for-Cannabis-Strains

Parser and database to index the terpene profile of different strains of Cannabis from online databases

Language: Python - Size: 21.4 MB - Last synced at: about 2 months ago - Pushed at: about 2 years ago - Stars: 118 - Forks: 18

monkey-soft/SchweizerMesser

🎯 A collection of hands-on Python 3 web crawlers and data analysis | Dangdang | NetEase Cloud Music | Unsplash | Pizza Hut | Maoyan |

Language: HTML - Size: 517 KB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 97 - Forks: 73

creekorful/bathyscaphe

Fast, highly configurable, cloud native dark web crawler.

Language: Go - Size: 830 KB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 93 - Forks: 22

KaidiGuo/keyword_based_Sina_weibo_crawler

A web crawler for Sina Weibo that searches for and retrieves microblogs containing certain keywords. A simple Python crawler exercise.

Language: Python - Size: 47.9 KB - Last synced at: almost 2 years ago - Pushed at: over 6 years ago - Stars: 93 - Forks: 31

Viveckh/LilHomie

A Machine Learning Project implemented from scratch which involves web scraping, data engineering, exploratory data analysis and machine learning to predict housing prices in New York Tri-State Area.

Language: Jupyter Notebook - Size: 10.5 MB - Last synced at: 3 months ago - Pushed at: over 2 years ago - Stars: 92 - Forks: 19

tech-engine/goscrapy

GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.

Language: Go - Size: 6.16 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 89 - Forks: 2

havanagrawal/GoodreadsScraper

Scrape data from Goodreads using Scrapy and Selenium 📚

Language: Python - Size: 5.11 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 86 - Forks: 20

joe-habel/YouTube-View-Bot

A rotating proxy solution to bot YouTube views

Language: Python - Size: 2.93 KB - Last synced at: over 2 years ago - Pushed at: over 6 years ago - Stars: 86 - Forks: 81

spider-rs/spider-py

Spider ported to Python

Language: Rust - Size: 1.36 MB - Last synced at: 25 days ago - Pushed at: 5 months ago - Stars: 83 - Forks: 13

ScrapingAnt/amazon_scraper

Amazon product scraper using rotating proxies and headless Chrome from ScrapingAnt

Language: JavaScript - Size: 52.7 KB - Last synced at: 10 days ago - Pushed at: over 1 year ago - Stars: 82 - Forks: 19

redcode-labs/UnChain

A tool to find redirection chains in multiple URLs

Language: Go - Size: 3.3 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 80 - Forks: 13

wyu-du/StockForecast

🎯 Predict the price trend of individual stocks using deep learning and natural language processing

Language: Python - Size: 29.4 MB - Last synced at: about 1 year ago - Pushed at: over 7 years ago - Stars: 74 - Forks: 34

mattdeitke/CVPR2019

Displays all the 2019 CVPR Accepted Papers in a way that they are easy to parse.

Language: HTML - Size: 27.8 MB - Last synced at: 7 days ago - Pushed at: over 4 years ago - Stars: 70 - Forks: 10

devopsgroup-io/siteshooter

📷 Automate full website screenshots and PDF generation with multiple viewport support.

Language: JavaScript - Size: 496 KB - Last synced at: 1 day ago - Pushed at: about 6 years ago - Stars: 65 - Forks: 13

abo123456789/leek

Distributed task redisqueue (the simplest Python distributed function-scheduling framework)

Language: Python - Size: 412 KB - Last synced at: 4 days ago - Pushed at: over 1 year ago - Stars: 63 - Forks: 19

gicornachini/bolsa

A Python library that aims to make it easier to access data on your stock exchange investments (B3/CEI) through the Portal CEI.

Language: Python - Size: 125 KB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 62 - Forks: 18

amalrajan/learncpp-download

Multi-threaded web scraper to download all the tutorials from www.learncpp.com and convert them to PDF files concurrently.

Language: Python - Size: 494 KB - Last synced at: 11 months ago - Pushed at: 11 months ago - Stars: 60 - Forks: 19

Misterhex/WebCrawler

Just a simple web crawler that returns crawled links as an IObservable using Reactive Extensions and async/await.

Language: C# - Size: 7.13 MB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 60 - Forks: 33

qfcy/Python

This repository stores Python source code for more than 40 Python projects spanning many fields, including web crawlers, algorithms, OpenGL, tkinter, and object-oriented programming.

Language: Python - Size: 188 MB - Last synced at: 10 days ago - Pushed at: about 1 month ago - Stars: 58 - Forks: 4

Cheng-Lin-Li/Market-Trend-Prediction

This is a project for a course on building knowledge graphs. It leverages historical stock prices and integrates social media listening from customers to predict the market trend of the Dow Jones Industrial Average (DJIA).

Language: Julia - Size: 143 MB - Last synced at: about 2 months ago - Pushed at: about 7 years ago - Stars: 58 - Forks: 28

avilum/smart-url-fuzzer

Explore URLs of domains fast and efficiently using fuzzing techniques

Language: Python - Size: 338 KB - Last synced at: 3 months ago - Pushed at: about 4 years ago - Stars: 56 - Forks: 18

ScrapeGraphAI/scrapegraph-sdk

🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.

Language: Jupyter Notebook - Size: 6.64 MB - Last synced at: about 3 hours ago - Pushed at: about 17 hours ago - Stars: 54 - Forks: 8

ScaleUnlimited/flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-commons

Language: Java - Size: 1.38 MB - Last synced at: about 1 year ago - Pushed at: about 6 years ago - Stars: 52 - Forks: 18

shenfe/puppeteer-service

🎠 Run headless Chrome (aka Puppeteer) as a service.

Language: JavaScript - Size: 133 KB - Last synced at: 5 days ago - Pushed at: over 7 years ago - Stars: 49 - Forks: 6

anlp-team/LTI_Neural_Navigator

"Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases" by Jiarui Li and Ye Yuan and Zehua Zhang

Language: HTML - Size: 32.3 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 45 - Forks: 3

spk/maman

Rust Web Crawler saving pages on Redis

Language: Rust - Size: 203 KB - Last synced at: 28 days ago - Pushed at: about 4 years ago - Stars: 44 - Forks: 5

ahmedshahriar/youtube-comment-scraper

This script will dump youtube video comments to a CSV from youtube video links. Video links can be placed inside a variable or list or CSV

Language: Jupyter Notebook - Size: 256 KB - Last synced at: about 2 months ago - Pushed at: over 3 years ago - Stars: 42 - Forks: 15

yuanyuanzijin/dutsso

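A Python module for quickly logging in to the Dalian University of Technology unified identity authentication system (SSO). It makes it easy to implement grade alerts, course grabbing, Yulan card information lookup, personal information queries, and similar features.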

Language: Python - Size: 109 KB - Last synced at: over 2 years ago - Pushed at: about 6 years ago - Stars: 40 - Forks: 10

spk/validate-website

Web crawler for checking the validity of your documents.

Language: HTML - Size: 934 KB - Last synced at: 15 days ago - Pushed at: almost 2 years ago - Stars: 39 - Forks: 9

GoncaloMark/CobWeb-lnx

CobWeb is a Python library for web scraping. The library consists of two classes: Spider and Scraper.

Language: Python - Size: 7.75 MB - Last synced at: 6 days ago - Pushed at: over 1 year ago - Stars: 38 - Forks: 2

SylvainDe/ComicBookMaker

Script to fetch webcomics and use them to create ebooks.

Language: Python - Size: 2.48 MB - Last synced at: 2 months ago - Pushed at: almost 4 years ago - Stars: 36 - Forks: 7

weiyu666/Graduation_Design-Distributed_Web_Spider

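A graduation project built around a distributed crawler for Weibo user information data, with a small amount of simple data analysis. It also commemorates four years of university! It includes the source code, the first and second drafts of the thesis, the final plagiarism-checked draft, UML diagrams, PPT slides, and more.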

Language: Python - Size: 92.5 MB - Last synced at: over 2 years ago - Pushed at: about 7 years ago - Stars: 36 - Forks: 4

gpassero/uol-redacoes-xml

The UOL essay bank (http://educacao.uol.com.br/bancoderedacoes/) in XML, as a model for testing and validating NLP (natural language processing) techniques on essays.

Language: Python - Size: 23.4 MB - Last synced at: 15 days ago - Pushed at: almost 5 years ago - Stars: 34 - Forks: 10

nicksherron/proxi 📦

Proxy pool. Finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.

Language: Go - Size: 1.09 MB - Last synced at: 3 days ago - Pushed at: about 5 years ago - Stars: 34 - Forks: 4

commoncrawl/nutch (fork of Aloisius/nutch)

Common Crawl fork of Apache Nutch

Language: Java - Size: 132 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 33 - Forks: 2

xofred/deviantart-gallery-downloader

Fetch DeviantArt images using Mechanize

Language: Ruby - Size: 69.3 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 33 - Forks: 4

ScrapingAnt/zoominfo_scraper

ZoomInfo scraper using rotating proxies and headless Chrome from ScrapingAnt

Language: Python - Size: 7.81 KB - Last synced at: 7 days ago - Pushed at: about 4 years ago - Stars: 33 - Forks: 9

Keep-Current/web-miner

Crawls sites to find new content and scrape it

Language: Python - Size: 215 KB - Last synced at: about 1 year ago - Pushed at: about 4 years ago - Stars: 33 - Forks: 29

LeaFrock/SpiderX

A simple web-crawler development framework based on .NET Core.

Language: C# - Size: 2.34 MB - Last synced at: 2 months ago - Pushed at: 11 months ago - Stars: 32 - Forks: 10