An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: commoncrawl

oritwoen/omnichron

Unified TypeScript interface for multiple web archive platforms.

Language: TypeScript - Size: 63.5 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 1 - Forks: 0

flairNLP/fundus

A very simple news crawler with a funny name

Language: Python - Size: 21.2 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 368 - Forks: 88

commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python - Size: 350 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 177 - Forks: 11

rix4uni/uforall

uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl

Language: Go - Size: 50.8 KB - Last synced at: 6 days ago - Pushed at: 4 months ago - Stars: 40 - Forks: 8

fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works

Language: Python - Size: 2.99 MB - Last synced at: 12 days ago - Pushed at: 27 days ago - Stars: 2,197 - Forks: 431

cisnlp/GlotCC

🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 18 - Forks: 0

centic9/CommonCrawlDocumentDownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika

Language: Java - Size: 1000 KB - Last synced at: 14 days ago - Pushed at: 15 days ago - Stars: 65 - Forks: 17

commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 51 - Forks: 9

cloudtracer/paskto

Paskto - Passive Web Scanner

Language: JavaScript - Size: 53.2 MB - Last synced at: 17 days ago - Pushed at: over 6 years ago - Stars: 151 - Forks: 37

preciz/common_crawl

Work with Common Crawl data from Elixir.

Language: Elixir - Size: 187 KB - Last synced at: 11 days ago - Pushed at: 21 days ago - Stars: 2 - Forks: 0

michaelharms/comcrawl 📦

A python utility for downloading Common Crawl data

Language: Python - Size: 135 KB - Last synced at: 13 days ago - Pushed at: almost 2 years ago - Stars: 237 - Forks: 42

code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

Language: Shell - Size: 24.4 KB - Last synced at: 23 days ago - Pushed at: almost 4 years ago - Stars: 6 - Forks: 7

commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

Language: Rust - Size: 116 KB - Last synced at: 19 days ago - Pushed at: 29 days ago - Stars: 36 - Forks: 1

shjwudp/c4-dataset-script

Inspired by Google's C4, this is a series of Colossal Clean data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Language: Python - Size: 586 KB - Last synced at: 17 days ago - Pushed at: almost 2 years ago - Stars: 122 - Forks: 14

karust/gogetcrawl

Extract web archive data using Wayback Machine and Common Crawl

Language: Go - Size: 58.6 KB - Last synced at: 15 days ago - Pushed at: 6 months ago - Stars: 155 - Forks: 17

cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Language: Python - Size: 209 KB - Last synced at: 14 days ago - Pushed at: 4 months ago - Stars: 169 - Forks: 31

oscar-project/ungoliant

🕷 The pipeline for the OSCAR corpus

Language: Rust - Size: 4.72 MB - Last synced at: 17 days ago - Pushed at: over 1 year ago - Stars: 167 - Forks: 15

commoncrawl/nutch Fork of Aloisius/nutch

Common Crawl fork of Apache Nutch

Language: Java - Size: 132 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 32 - Forks: 2

commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

Language: Java - Size: 192 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 113 - Forks: 9

thunderpoot/cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

Language: Python - Size: 1.72 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 89.8 KB - Last synced at: 2 days ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

ahcm/tantivy_warc_indexer

Builds a tantivy index from Common Crawl warc.wet files

Language: Rust - Size: 26.4 KB - Last synced at: 22 days ago - Pushed at: 4 months ago - Stars: 11 - Forks: 1

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 155 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 416 - Forks: 88

openculinary/tardir

Time And Relative Dimensions In Recipes

Language: Python - Size: 14.6 KB - Last synced at: 12 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

commoncrawl/cc-warc-examples Fork of Smerity/cc-warc-examples

CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

Language: Java - Size: 30.3 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 38 - Forks: 19

CI-Research/KeywordAnalysis

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

Size: 27.9 MB - Last synced at: 7 months ago - Pushed at: about 1 year ago - Stars: 57 - Forks: 13

commoncrawl/cc-webgraph

Tools to construct and process webgraphs from Common Crawl data

Language: Java - Size: 159 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 63 - Forks: 4

generals-space/site-mirror-go

From [Gitee](https://gitee.com/generals-space/site-mirror-go): general-purpose crawler, site cloning tool, whole-site download

Language: Go - Size: 400 KB - Last synced at: 10 months ago - Pushed at: almost 6 years ago - Stars: 25 - Forks: 3

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 231 KB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 251 - Forks: 31

commoncrawl/cc-mrjob Fork of Smerity/cc-mrjob

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

Language: Python - Size: 1020 KB - Last synced at: 12 months ago - Pushed at: about 3 years ago - Stars: 166 - Forks: 65

isplab-unil/CommonCrawlSRI

Analysing SRI usage on CommonCrawl

Language: Python - Size: 33.2 MB - Last synced at: 6 months ago - Pushed at: almost 5 years ago - Stars: 1 - Forks: 0

generals-space/site-mirror-py

[Gitee](https://gitee.com/generals-space/site-mirror-py): general-purpose crawler, site cloning tool, whole-site download

Language: Python - Size: 403 KB - Last synced at: 12 months ago - Pushed at: almost 6 years ago - Stars: 54 - Forks: 19

ngc7292/query_of_cc

This project contains the dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".

Size: 674 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

networkdynamics/seldonite

A News Article Collection Library

Language: Python - Size: 268 KB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 3

ArtificialOSS/WebCrawl

Crawls the web to generate a huge dataset for training

Language: Python - Size: 18.6 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

vladserkoff/common-crawler

Load HTML pages from Common Crawl

Language: Python - Size: 6.84 KB - Last synced at: over 1 year ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

astralway/webindex 📦

Apache Fluo application that creates a web index using Common Crawl data

Language: Java - Size: 646 KB - Last synced at: over 1 year ago - Pushed at: about 7 years ago - Stars: 4 - Forks: 3

uhussain/WebCrawlerForOnlineInflation

Price Crawler - Tracking Price Inflation

Language: Python - Size: 387 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 155 - Forks: 47

imfht/super-Django-CC

super-Django-CC is a simple web interface for commoncrawl.org

Language: Python - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 12 - Forks: 4

lavafroth/gauf

Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, Common Crawl, and URLScan

Language: Rust - Size: 88.9 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

ChrisCates/CommonCrawler 📦

🕸 A simple way to extract data from Common Crawl

Language: Go - Size: 2.42 MB - Last synced at: 10 months ago - Pushed at: about 5 years ago - Stars: 33 - Forks: 12

lxucs/commoncrawl-warc-retrieval

Python tools to retrieve text from CommonCrawl WARC files based on cdx index.

Language: Python - Size: 11.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 17 - Forks: 3

Krisalyd/aws-s3-file-downloader

Testing file download from AWS's S3 Bucket with Python.

Language: Python - Size: 1.95 KB - Last synced at: about 2 years ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 0

Tarasa24/PWA-Store

The largest collection of publicly accessible Progressive Web Apps*

Language: HTML - Size: 25.5 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

vrkansagara/common-crawler 📦

Common Crawler Index

Language: PHP - Size: 74.2 KB - Last synced at: 5 months ago - Pushed at: about 7 years ago - Stars: 3 - Forks: 1

jgonsior/dwtc-table-manual-classificator

A tool for manual classification of DWTC tables. The result is then used as a training data set.

Language: Java - Size: 17.4 MB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 2 - Forks: 1

Damian89/commonCrawlParser

Simple multithreaded tool to extract domain-related data from commoncrawl.org

Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 29 - Forks: 10

BhagyashriT/DICLAB2-DataAggregationBigDataAnalysisAndVisualization

Collected data on the same topic or key phrase, over similar time periods, from three sources: opinion-based social media (Twitter), research data from the New York Times, and Common Crawl. Processed the three data sets individually using classical big data methods such as MapReduce on Google Dataproc clusters, then compared the outcomes using popular visualization methods in Tableau.

Language: Python - Size: 39.8 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 2 - Forks: 0

fabianmurariu/OfflineESIndexGenerator Fork of andybab/OfflineESIndexGenerator

Offline Elasticsearch index generator

Language: Scala - Size: 15.9 MB - Last synced at: about 2 years ago - Pushed at: almost 6 years ago - Stars: 1 - Forks: 0

umanlp/webisadb-extractor

Relation Extractor for WebIsADb

Language: Java - Size: 14.6 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

sara-nl/spark-warcutils-example

Example of using warcutils with Apache Spark

Language: Scala - Size: 55.7 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 1