GitHub topics: commoncrawl

Repositories

fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works

Language: Python - Size: 3 MB - Last synced at: 8 days ago - Pushed at: 3 months ago - Stars: 2,303 - Forks: 444

👻 GhostPath — A powerful modular reconnaissance toolkit built for hackers, OSINT professionals & bug bounty hunters — passive + active recon in a sleek CLI shell. Discover subdomains, probe paths, mine archives and hunt certificates — all from one interactive terminal interface.

Language: Python - Size: 5.09 MB - Last synced at: 10 days ago - Pushed at: 11 days ago - Stars: 2 - Forks: 0

commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python - Size: 387 MB - Last synced at: 10 days ago - Pushed at: 10 days ago - Stars: 188 - Forks: 12

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 89.8 KB - Last synced at: 11 days ago - Pushed at: 11 months ago - Stars: 11 - Forks: 0

openculinary/tardir 📦

Migrating to: https://codeberg.org/openculinary/tardir

Language: Python - Size: 15.6 KB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

flairNLP/fundus

A very simple news crawler with a funny name

Language: Python - Size: 22 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 398 - Forks: 88

commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

Language: Rust - Size: 144 KB - Last synced at: 27 days ago - Pushed at: 2 months ago - Stars: 51 - Forks: 3

commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

Language: Java - Size: 206 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 123 - Forks: 12

commoncrawl/nutch Fork of Aloisius/nutch

Common Crawl fork of Apache Nutch

Language: Java - Size: 132 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 34 - Forks: 2

preciz/common_crawl

Work with Common Crawl data from Elixir.

Language: Elixir - Size: 195 KB - Last synced at: 11 days ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

shjwudp/c4-dataset-script

Inspired by Google's C4, this is a series of Colossal Clean data-cleaning scripts focused on Common Crawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Language: Python - Size: 586 KB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 129 - Forks: 16

michaelharms/comcrawl 📦

A Python utility for downloading Common Crawl data

Language: Python - Size: 135 KB - Last synced at: 20 days ago - Pushed at: about 2 years ago - Stars: 242 - Forks: 41

BojanMakivic/WARC-File-Processor

Download Common Crawl .gz files, extract relevant segments, parse the required content from robots.txt entries, and compile the results into an Excel file.

Language: Python - Size: 43.9 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

commoncrawl/cc-webgraph

Tools to construct and process Common Crawl webgraphs

Language: Java - Size: 154 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 92 - Forks: 5

centic9/CommonCrawlDocumentDownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika

Language: Java - Size: 1.03 MB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 68 - Forks: 17

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 169 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 432 - Forks: 89

commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 54 - Forks: 11

ahcm/tantivy_warc_indexer

Builds a tantivy index from Common Crawl warc.wet files

Language: Rust - Size: 26.4 KB - Last synced at: 25 days ago - Pushed at: 9 months ago - Stars: 12 - Forks: 1

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 247 KB - Last synced at: 3 months ago - Pushed at: 7 months ago - Stars: 346 - Forks: 37

rix4uni/uforall

uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl

Language: Go - Size: 50.8 KB - Last synced at: 5 months ago - Pushed at: 9 months ago - Stars: 40 - Forks: 8

cisnlp/GlotCC

🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 18 - Forks: 0

cloudtracer/paskto

Paskto - Passive Web Scanner

Language: JavaScript - Size: 53.2 MB - Last synced at: 20 days ago - Pushed at: over 6 years ago - Stars: 151 - Forks: 37

richardjennings/cc_crawl

Common Crawl Crawler

Language: Go - Size: 4.88 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

Language: Shell - Size: 24.4 KB - Last synced at: 2 months ago - Pushed at: over 4 years ago - Stars: 6 - Forks: 7

karust/gogetcrawl

Extract web archive data using Wayback Machine and Common Crawl

Language: Go - Size: 58.6 KB - Last synced at: 5 months ago - Pushed at: 10 months ago - Stars: 155 - Forks: 17

cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Language: Python - Size: 209 KB - Last synced at: 5 months ago - Pushed at: 8 months ago - Stars: 169 - Forks: 31

oscar-project/ungoliant

🕷 The pipeline for the OSCAR corpus

Language: Rust - Size: 4.72 MB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 167 - Forks: 15

thunderpoot/cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

Language: Python - Size: 1.72 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 2 - Forks: 0

commoncrawl/cc-warc-examples Fork of Smerity/cc-warc-examples

CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

Language: Java - Size: 30.3 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 38 - Forks: 19

CI-Research/KeywordAnalysis

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

Size: 27.9 MB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 57 - Forks: 13

generals-space/site-mirror-go

From [Gitee](https://gitee.com/generals-space/site-mirror-go): general-purpose crawler, site-cloning tool, whole-site downloader

Language: Go - Size: 400 KB - Last synced at: about 1 year ago - Pushed at: over 6 years ago - Stars: 25 - Forks: 3

commoncrawl/cc-mrjob Fork of Smerity/cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

Language: Python - Size: 1020 KB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 166 - Forks: 65

isplab-unil/CommonCrawlSRI

Analysing SRI usage on CommonCrawl

Language: Python - Size: 33.2 MB - Last synced at: 10 months ago - Pushed at: about 5 years ago - Stars: 1 - Forks: 0

generals-space/site-mirror-py

[Gitee](https://gitee.com/generals-space/site-mirror-py): general-purpose crawler, site-cloning tool, whole-site downloader

Language: Python - Size: 403 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 54 - Forks: 19

ngc7292/query_of_cc

This project contains the dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".

Size: 674 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

networkdynamics/seldonite

A News Article Collection Library

Language: Python - Size: 268 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 19 - Forks: 3

ArtificialOSS/WebCrawl

Crawls the web to generate a huge dataset for training

Language: Python - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

vladserkoff/common-crawler

Load HTML pages from Common Crawl

Language: Python - Size: 6.84 KB - Last synced at: over 1 year ago - Pushed at: about 6 years ago - Stars: 1 - Forks: 0

astralway/webindex 📦

Apache Fluo application that creates a web index using Common Crawl data

Language: Java - Size: 646 KB - Last synced at: over 1 year ago - Pushed at: over 7 years ago - Stars: 4 - Forks: 3

uhussain/WebCrawlerForOnlineInflation

Price Crawler - Tracking Price Inflation

Language: Python - Size: 387 KB - Last synced at: almost 2 years ago - Pushed at: about 5 years ago - Stars: 155 - Forks: 47

imfht/super-Django-CC

super-Django-CC is a simple web interface for commoncrawl.org

Language: Python - Size: 16.6 KB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 12 - Forks: 4

lavafroth/gauf

Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, Common Crawl, and URLScan

Language: Rust - Size: 88.9 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

ChrisCates/CommonCrawler 📦

🕸 A simple way to extract data from Common Crawl

Language: Go - Size: 2.42 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 33 - Forks: 12

lxucs/commoncrawl-warc-retrieval

Python tools to retrieve text from CommonCrawl WARC files based on cdx index.

Language: Python - Size: 11.7 KB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 17 - Forks: 3

Krisalyd/aws-s3-file-downloader

Testing file download from AWS's S3 Bucket with Python.

Language: Python - Size: 1.95 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Tarasa24/PWA-Store

The largest collection of publicly accessible Progressive Web Apps*

Language: HTML - Size: 25.5 MB - Last synced at: over 2 years ago - Pushed at: about 3 years ago - Stars: 3 - Forks: 0

vrkansagara/common-crawler 📦

Common Crawler Index

Language: PHP - Size: 74.2 KB - Last synced at: 24 days ago - Pushed at: over 7 years ago - Stars: 3 - Forks: 1

jgonsior/dwtc-table-manual-classificator

A tool for manual classification of dwtc tables. The result is then used as a training data set.

Language: Java - Size: 17.4 MB - Last synced at: over 1 year ago - Pushed at: about 2 years ago - Stars: 2 - Forks: 1

Damian89/commonCrawlParser

Simple multi-threaded tool to extract domain-related data from commoncrawl.org

Language: Python - Size: 13.7 KB - Last synced at: over 2 years ago - Pushed at: about 7 years ago - Stars: 29 - Forks: 10

BhagyashriT/DICLAB2-DataAggregationBigDataAnalysisAndVisualization

Collected data from three sources for the same topic or key phrase and from similar time periods: opinion-based social media data from Twitter, research data from the New York Times, and Common Crawl data. Processed each of the three data sets individually using classical big data methods such as MapReduce on Google Dataproc clusters, then compared the outcomes using popular visualization methods in Tableau.

Language: Python - Size: 39.8 MB - Last synced at: almost 2 years ago - Pushed at: almost 6 years ago - Stars: 2 - Forks: 0