GitHub topics: commoncrawl
oritwoen/omnichron
Unified TypeScript interface for multiple web archive platforms.
Language: TypeScript - Size: 63.5 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

flairNLP/fundus
A very simple news crawler with a funny name
Language: Python - Size: 21.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 368 - Forks: 88

commoncrawl/cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
Language: Python - Size: 350 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 177 - Forks: 11

rix4uni/uforall
uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl
Language: Go - Size: 50.8 KB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 40 - Forks: 8

fhamborg/news-please
news-please - an integrated web crawler and information extractor for news that just works
Language: Python - Size: 2.99 MB - Last synced at: 12 days ago - Pushed at: 27 days ago - Stars: 2,197 - Forks: 431

cisnlp/GlotCC
🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024
Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 18 - Forks: 0

centic9/CommonCrawlDocumentDownload
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
Language: Java - Size: 1000 KB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 65 - Forks: 17

commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 51 - Forks: 9

cloudtracer/paskto
Paskto - Passive Web Scanner
Language: JavaScript - Size: 53.2 MB - Last synced at: 17 days ago - Pushed at: over 6 years ago - Stars: 151 - Forks: 37

preciz/common_crawl
Work with Common Crawl data from Elixir.
Language: Elixir - Size: 187 KB - Last synced at: 11 days ago - Pushed at: 21 days ago - Stars: 2 - Forks: 0

michaelharms/comcrawl 📦
A Python utility for downloading Common Crawl data
Language: Python - Size: 135 KB - Last synced at: 13 days ago - Pushed at: almost 2 years ago - Stars: 237 - Forks: 42

code402/warc-benchmark
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
Language: Shell - Size: 24.4 KB - Last synced at: 23 days ago - Pushed at: almost 4 years ago - Stars: 6 - Forks: 7

commoncrawl/cc-downloader
A polite and user-friendly downloader for Common Crawl data
Language: Rust - Size: 116 KB - Last synced at: 19 days ago - Pushed at: 29 days ago - Stars: 36 - Forks: 1

shjwudp/c4-dataset-script
Inspired by Google's C4, this is a series of colossal-clean data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods in MassiveText.
Language: Python - Size: 586 KB - Last synced at: 17 days ago - Pushed at: almost 2 years ago - Stars: 122 - Forks: 14

karust/gogetcrawl
Extract web archive data using Wayback Machine and Common Crawl
Language: Go - Size: 58.6 KB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 155 - Forks: 17

cocrawler/cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Language: Python - Size: 209 KB - Last synced at: 14 days ago - Pushed at: 4 months ago - Stars: 169 - Forks: 31

oscar-project/ungoliant
🕷 The pipeline for the OSCAR corpus
Language: Rust - Size: 4.72 MB - Last synced at: 17 days ago - Pushed at: over 1 year ago - Stars: 167 - Forks: 15

commoncrawl/nutch Fork of Aloisius/nutch
Common Crawl fork of Apache Nutch
Language: Java - Size: 132 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 32 - Forks: 2

commoncrawl/cc-index-table
Index Common Crawl archives in tabular format
Language: Java - Size: 192 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 113 - Forks: 9

thunderpoot/cc-getpage
Lightweight Python utility for retrieving individual pages from the Common Crawl archives.
Language: Python - Size: 1.72 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

toimik/CommonCrawl
Common Crawl's processing tools
Language: C# - Size: 89.8 KB - Last synced at: 2 days ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

ahcm/tantivy_warc_indexer
builds a tantivy index from common crawl warc.wet files
Language: Rust - Size: 26.4 KB - Last synced at: 22 days ago - Pushed at: 4 months ago - Stars: 11 - Forks: 1

commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python - Size: 155 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 416 - Forks: 88

openculinary/tardir
Time And Relative Dimensions In Recipes
Language: Python - Size: 14.6 KB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

commoncrawl/cc-warc-examples Fork of Smerity/cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
Language: Java - Size: 30.3 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 38 - Forks: 19

CI-Research/KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
Size: 27.9 MB - Last synced at: 7 months ago - Pushed at: about 1 year ago - Stars: 57 - Forks: 13

commoncrawl/cc-webgraph
Tools to construct and process webgraphs from Common Crawl data
Language: Java - Size: 159 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 63 - Forks: 4

generals-space/site-mirror-go
From [Gitee](https://gitee.com/generals-space/site-mirror-go): general-purpose crawler, site-cloning tool, whole-site downloader
Language: Go - Size: 400 KB - Last synced at: 10 months ago - Pushed at: almost 6 years ago - Stars: 25 - Forks: 3

commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Language: Java - Size: 231 KB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 251 - Forks: 31

commoncrawl/cc-mrjob Fork of Smerity/cc-mrjob
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
Language: Python - Size: 1020 KB - Last synced at: 12 months ago - Pushed at: about 3 years ago - Stars: 166 - Forks: 65

isplab-unil/CommonCrawlSRI
Analysing SRI usage on CommonCrawl
Language: Python - Size: 33.2 MB - Last synced at: 6 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

generals-space/site-mirror-py
[Gitee](https://gitee.com/generals-space/site-mirror-py): general-purpose crawler, site-cloning tool, whole-site downloader
Language: Python - Size: 403 KB - Last synced at: 12 months ago - Pushed at: almost 6 years ago - Stars: 54 - Forks: 19

ngc7292/query_of_cc
This project is dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".
Size: 674 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

networkdynamics/seldonite
A News Article Collection Library
Language: Python - Size: 268 KB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 3

ArtificialOSS/WebCrawl
Crawls the web to generate a huge dataset for training
Language: Python - Size: 18.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

vladserkoff/common-crawler
Load HTML pages from Common Crawl
Language: Python - Size: 6.84 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

astralway/webindex 📦
Apache Fluo application that creates a web index using Common Crawl data
Language: Java - Size: 646 KB - Last synced at: over 1 year ago - Pushed at: about 7 years ago - Stars: 4 - Forks: 3

uhussain/WebCrawlerForOnlineInflation
Price Crawler - Tracking Price Inflation
Language: Python - Size: 387 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 155 - Forks: 47

imfht/super-Django-CC
super-Django-CC is a simple web interface for commoncrawl.org
Language: Python - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 12 - Forks: 4

lavafroth/gauf
Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, Common Crawl, and URLScan
Language: Rust - Size: 88.9 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ChrisCates/CommonCrawler 📦
🕸 A simple way to extract data from Common Crawl
Language: Go - Size: 2.42 MB - Last synced at: 10 months ago - Pushed at: about 5 years ago - Stars: 33 - Forks: 12

lxucs/commoncrawl-warc-retrieval
Python tools to retrieve text from CommonCrawl WARC files based on the CDX index.
Language: Python - Size: 11.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 17 - Forks: 3

Krisalyd/aws-s3-file-downloader
Testing file download from AWS's S3 Bucket with Python.
Language: Python - Size: 1.95 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

Tarasa24/PWA-Store
The largest collection of publicly accessible Progressive Web Apps*
Language: HTML - Size: 25.5 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

vrkansagara/common-crawler 📦
Common Crawler Index
Language: PHP - Size: 74.2 KB - Last synced at: 5 months ago - Pushed at: about 7 years ago - Stars: 3 - Forks: 1

jgonsior/dwtc-table-manual-classificator
A tool for manual classification of dwtc tables. The result is then used as a training data set.
Language: Java - Size: 17.4 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 1

Damian89/commonCrawlParser
Simple multi threaded tool to extract domain related data from commoncrawl.org
Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 29 - Forks: 10

BhagyashriT/DICLAB2-DataAggregationBigDataAnalysisAndVisualization
Collected data from three sources on the same topic or key phrase and from similar time periods: opinion-based social media (Twitter), research data (the New York Times), and Common Crawl data. Processed the three data sets individually using classical big data methods such as MapReduce on Google Dataproc clusters, then compared the outcomes using popular visualization methods in Tableau.
Language: Python - Size: 39.8 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 2 - Forks: 0

fabianmurariu/OfflineESIndexGenerator Fork of andybab/OfflineESIndexGenerator
Offline Elasticsearch index generator
Language: Scala - Size: 15.9 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

umanlp/webisadb-extractor
Relation Extractor for WebIsADb
Language: Java - Size: 14.6 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

sara-nl/spark-warcutils-example
Example of using warcutils with Apache Spark
Language: Scala - Size: 55.7 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 1
