Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: common-crawl

ilyankou/cc-gpx

CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

Language: Jupyter Notebook - Size: 16.7 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 1 - Forks: 0

commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python - Size: 283 MB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 117 - Forks: 8

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 127 KB - Last synced: 17 days ago - Pushed: about 2 months ago - Stars: 379 - Forks: 84

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 231 KB - Last synced: 17 days ago - Pushed: 5 months ago - Stars: 251 - Forks: 31

oscar-project/ungoliant

🕷 The pipeline for the OSCAR corpus

Language: Rust - Size: 4.72 MB - Last synced: 8 days ago - Pushed: 5 months ago - Stars: 150 - Forks: 14

commoncrawl/cc-webgraph

Tools to construct and process webgraphs from Common Crawl data

Language: Java - Size: 159 KB - Last synced: 17 days ago - Pushed: 22 days ago - Stars: 63 - Forks: 4

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 85.9 KB - Last synced: 22 days ago - Pushed: 23 days ago - Stars: 5 - Forks: 0

commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook - Size: 2.39 MB - Last synced: 17 days ago - Pushed: almost 2 years ago - Stars: 40 - Forks: 8

michaelharms/comcrawl

A Python utility for downloading Common Crawl data

Language: Python - Size: 135 KB - Last synced: 19 days ago - Pushed: 12 months ago - Stars: 210 - Forks: 36

ashvardanian/StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc. 🦖

Language: C++ - Size: 7.91 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1,749 - Forks: 51

fizerkhan/KeywordAnalysis (fork of CI-Research/KeywordAnalysis)

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

Language: Python - Size: 27.9 MB - Last synced: about 1 month ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

fizerkhan/CommonCrawlDocumentDownload (fork of centic9/CommonCrawlDocumentDownload)

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types for mass-testing of frameworks like Apache POI and Apache Tika

Language: Java - Size: 482 KB - Last synced: about 1 month ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

fizerkhan/cdx-index-client (fork of CI-Research/cdx-index-client)

A command-line tool for using the Common Crawl Index API at http://index.commoncrawl.org/

Language: Python - Size: 30.3 KB - Last synced: about 1 month ago - Pushed: over 7 years ago - Stars: 1 - Forks: 0

crissyfield/troll-a

Drill into WARC web archives

Language: Go - Size: 199 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 84 - Forks: 9

oscar-project/goclassy 📦

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

Language: Go - Size: 377 KB - Last synced: about 1 month ago - Pushed: about 3 years ago - Stars: 84 - Forks: 6

oscar-project/oscar-website

The website of the Oscar Project

Language: TeX - Size: 32 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 9 - Forks: 13

seanbethard/corpuswork

Corpuswork

Language: Jupyter Notebook - Size: 2.09 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 1 - Forks: 0

connor-marchand/gau-python

This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corbin Leo's gau

Language: Python - Size: 24.4 KB - Last synced: 29 days ago - Pushed: 10 months ago - Stars: 2 - Forks: 0

code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

Language: Shell - Size: 24.4 KB - Last synced: 3 months ago - Pushed: about 3 years ago - Stars: 4 - Forks: 8

HRN-Projects/common_crawl_with_scrapy

Parsing huge web archive files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.

Language: Python - Size: 23.9 MB - Last synced: 10 months ago - Pushed: almost 3 years ago - Stars: 4 - Forks: 5

IBM/cc-dbp

A dataset for knowledge base population research using Common Crawl and DBpedia.

Language: Java - Size: 3.43 MB - Last synced: 10 days ago - Pushed: over 2 years ago - Stars: 28 - Forks: 19

alumik/common-crawl-downloader

Distributed download scripts for Common Crawl data

Language: Python - Size: 64.5 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 2 - Forks: 0

mwoss/mors

Application of topic models for information retrieval and search engine optimization.

Language: Python - Size: 39 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 2 - Forks: 0

siddheswarc/EDA-using-MapReduce

Exploratory Data Analysis using MapReduce with Hadoop is a project developed as partial fulfillment of the requirements for the Data Intensive Computing (CSE 587) course at the University at Buffalo

Language: Python - Size: 17.5 MB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 2 - Forks: 0

bminixhofer/gerpt2

German small and large versions of GPT2.

Language: Python - Size: 60.5 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 17 - Forks: 0

hadrianw/abracabra

Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

Language: Rust - Size: 9.77 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

Dahouabdelhalim/Discourse-marksers-and-Web-crawling

Discourse marker identification in the French language

Language: HTML - Size: 70.9 MB - Last synced: 12 months ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0

hrbrmstr/cc

⛏ Extract metadata of a specific target based on the results of "commoncrawl.org"

Language: R - Size: 7.81 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 7 - Forks: 0

socket-var/nyt-twitter-cc-hadoop

Perform big data analysis on New York Times, Twitter and Common Crawl APIs

Language: Jupyter Notebook - Size: 16.7 MB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 2 - Forks: 0

bottomless-archive-project/url-collector

An application that crawls the Common Crawl corpus for URLs with the specified file extensions.

Language: Java - Size: 175 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0

samye760/Common-Crawl-Analysis

Parsing the Common Crawl database using Scala and Spark

Language: Scala - Size: 1.06 MB - Last synced: 12 months ago - Pushed: about 3 years ago - Stars: 0 - Forks: 0

tokenmill/common-crawl-utils

Various Common Crawl utilities in Clojure.

Language: Clojure - Size: 54.7 KB - Last synced: 17 days ago - Pushed: 6 months ago - Stars: 6 - Forks: 1

bottomless-archive-project/common-crawl-client

This library is a very lightweight client to Common Crawl's WARC files.

Language: Java - Size: 55.7 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0

ggodreau/huhdewp

Hadoop streaming EMR job

Language: Python - Size: 27.3 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

srmocher/fake-science

Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)

Language: Python - Size: 351 KB - Last synced: 9 months ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0

ErikGartner/prometheus-cc-extractor

This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.

Language: Python - Size: 173 KB - Last synced: about 1 month ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0

Vzzarr/Common-Crawl-Client

Language: Java - Size: 33.4 MB - Last synced: 9 months ago - Pushed: almost 7 years ago - Stars: 0 - Forks: 2

Vikasg7/warc-reader

ES6 class to read a .warc or .warc.gz file member by member in Node.js

Language: TypeScript - Size: 5.86 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0