commoncrawl | Topic | Ecosyste.ms: Repos

Topic: "commoncrawl"

fhamborg/news-please

news-please - an integrated web crawler and information extractor for news that just works

Language: Python - Size: 2.99 MB - Last synced at: 16 days ago - Pushed at: about 2 months ago - Stars: 2,211 - Forks: 432

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 155 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 416 - Forks: 88

flairNLP/fundus

A very simple news crawler with a funny name

Language: Python - Size: 21 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 376 - Forks: 88

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 231 KB - Last synced at: about 1 year ago - Pushed at: over 1 year ago - Stars: 251 - Forks: 31

michaelharms/comcrawl 📦

A Python utility for downloading Common Crawl data

Language: Python - Size: 135 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 237 - Forks: 42

commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python - Size: 356 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 178 - Forks: 11

cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Language: Python - Size: 209 KB - Last synced at: about 1 month ago - Pushed at: 4 months ago - Stars: 169 - Forks: 31

oscar-project/ungoliant

🕷 The pipeline for the OSCAR corpus

Language: Rust - Size: 4.72 MB - Last synced at: about 1 month ago - Pushed at: over 1 year ago - Stars: 167 - Forks: 15

commoncrawl/cc-mrjob (fork of Smerity/cc-mrjob)

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

Language: Python - Size: 1020 KB - Last synced at: about 1 year ago - Pushed at: about 3 years ago - Stars: 166 - Forks: 65

karust/gogetcrawl

Extract web archive data using Wayback Machine and Common Crawl

Language: Go - Size: 58.6 KB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 155 - Forks: 17

uhussain/WebCrawlerForOnlineInflation

Price Crawler - Tracking Price Inflation

Language: Python - Size: 387 KB - Last synced at: over 1 year ago - Pushed at: almost 5 years ago - Stars: 155 - Forks: 47

cloudtracer/paskto

Paskto - Passive Web Scanner

Language: JavaScript - Size: 53.2 MB - Last synced at: 3 days ago - Pushed at: over 6 years ago - Stars: 151 - Forks: 37

shjwudp/c4-dataset-script

Inspired by Google's C4, this is a series of Colossal Clean data-cleaning scripts focused on Common Crawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

Language: Python - Size: 586 KB - Last synced at: about 1 month ago - Pushed at: almost 2 years ago - Stars: 122 - Forks: 14

commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

Language: Java - Size: 192 KB - Last synced at: about 13 hours ago - Pushed at: about 13 hours ago - Stars: 119 - Forks: 10

commoncrawl/cc-webgraph

Tools to construct and process Common Crawl webgraphs

Language: Java - Size: 146 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 90 - Forks: 5

centic9/CommonCrawlDocumentDownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika

Language: Java - Size: 1000 KB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 65 - Forks: 17

CI-Research/KeywordAnalysis

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

Size: 27.9 MB - Last synced at: 8 months ago - Pushed at: over 1 year ago - Stars: 57 - Forks: 13

generals-space/site-mirror-py

[Gitee](https://gitee.com/generals-space/site-mirror-py) General-purpose crawler, site-cloning tool, whole-site downloader

Language: Python - Size: 403 KB - Last synced at: about 1 year ago - Pushed at: almost 6 years ago - Stars: 54 - Forks: 19

commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 51 - Forks: 9

commoncrawl/cc-downloader

A polite and user-friendly downloader for Common Crawl data

Language: Rust - Size: 123 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 43 - Forks: 1

rix4uni/uforall

uforall is a fast URL crawler that collects URLs from a number of different sources: AlienVault, Wayback Machine, urlscan, and Common Crawl

Language: Go - Size: 50.8 KB - Last synced at: 24 days ago - Pushed at: 5 months ago - Stars: 40 - Forks: 8

commoncrawl/cc-warc-examples (fork of Smerity/cc-warc-examples)

CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

Language: Java - Size: 30.3 MB - Last synced at: 10 months ago - Pushed at: 10 months ago - Stars: 38 - Forks: 19

commoncrawl/nutch (fork of Aloisius/nutch)

Common Crawl fork of Apache Nutch

Language: Java - Size: 132 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 33 - Forks: 2

ChrisCates/CommonCrawler 📦

🕸 A simple way to extract data from Common Crawl

Language: Go - Size: 2.42 MB - Last synced at: 11 months ago - Pushed at: about 5 years ago - Stars: 33 - Forks: 12

Damian89/commonCrawlParser

Simple multi-threaded tool to extract domain-related data from commoncrawl.org

Language: Python - Size: 13.7 KB - Last synced at: about 2 years ago - Pushed at: almost 7 years ago - Stars: 29 - Forks: 10

generals-space/site-mirror-go

From [Gitee](https://gitee.com/generals-space/site-mirror-go): general-purpose crawler, site-cloning tool, whole-site downloader

Language: Go - Size: 400 KB - Last synced at: 11 months ago - Pushed at: almost 6 years ago - Stars: 25 - Forks: 3

networkdynamics/seldonite

A News Article Collection Library

Language: Python - Size: 268 KB - Last synced at: about 1 year ago - Pushed at: about 2 years ago - Stars: 19 - Forks: 3

cisnlp/GlotCC

🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 18 - Forks: 0

lxucs/commoncrawl-warc-retrieval

Python tools to retrieve text from Common Crawl WARC files based on the CDX index.

Language: Python - Size: 11.7 KB - Last synced at: about 2 years ago - Pushed at: about 3 years ago - Stars: 17 - Forks: 3

imfht/super-Django-CC

super-Django-CC is a simple web interface for commoncrawl.org

Language: Python - Size: 16.6 KB - Last synced at: over 1 year ago - Pushed at: over 2 years ago - Stars: 12 - Forks: 4

ahcm/tantivy_warc_indexer

Builds a tantivy index from Common Crawl warc.wet files

Language: Rust - Size: 26.4 KB - Last synced at: 8 days ago - Pushed at: 5 months ago - Stars: 11 - Forks: 1

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 89.8 KB - Last synced at: 20 days ago - Pushed at: 7 months ago - Stars: 11 - Forks: 0

oritwoen/omnichron

Unified TypeScript interface for multiple web archive platforms.

Language: TypeScript - Size: 709 KB - Last synced at: 16 days ago - Pushed at: 16 days ago - Stars: 6 - Forks: 0

code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

Language: Shell - Size: 24.4 KB - Last synced at: 17 days ago - Pushed at: about 4 years ago - Stars: 6 - Forks: 7

astralway/webindex 📦

Apache Fluo application that creates a web index using Common Crawl data

Language: Java - Size: 646 KB - Last synced at: over 1 year ago - Pushed at: about 7 years ago - Stars: 4 - Forks: 3

ngc7292/query_of_cc

This project contains the dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".

Size: 674 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 3 - Forks: 0

ArtificialOSS/WebCrawl

Crawls the web to generate a huge dataset for training

Language: Python - Size: 18.6 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 3 - Forks: 0

Tarasa24/PWA-Store

The largest collection of publicly accessible Progressive Web Apps*

Language: HTML - Size: 25.5 MB - Last synced at: about 2 years ago - Pushed at: over 2 years ago - Stars: 3 - Forks: 0

vrkansagara/common-crawler 📦

Common Crawler Index

Language: PHP - Size: 74.2 KB - Last synced at: 5 months ago - Pushed at: about 7 years ago - Stars: 3 - Forks: 1

preciz/common_crawl

Work with Common Crawl data from Elixir.

Language: Elixir - Size: 188 KB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2 - Forks: 0

thunderpoot/cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

Language: Python - Size: 1.72 MB - Last synced at: about 2 months ago - Pushed at: 2 months ago - Stars: 2 - Forks: 0

jgonsior/dwtc-table-manual-classificator

A tool for manual classification of dwtc tables. The result is then used as a training data set.

Language: Java - Size: 17.4 MB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 1

BhagyashriT/DICLAB2-DataAggregationBigDataAnalysisAndVisualization

Collected data from three sources for the same topic or key phrase and from similar time periods: opinion-based social media (Twitter), research data (the New York Times), and Common Crawl data. Processed the three data sets individually using classical big data methods such as MapReduce on Google Dataproc clusters, then compared the outcomes using popular visualization methods in Tableau.

Language: Python - Size: 39.8 MB - Last synced at: over 1 year ago - Pushed at: over 5 years ago - Stars: 2 - Forks: 0