Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: common-crawl
ilyankou/cc-gpx
CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl
Language: Jupyter Notebook - Size: 16.7 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 1 - Forks: 0
commoncrawl/cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
Language: Python - Size: 283 MB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 117 - Forks: 8
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python - Size: 127 KB - Last synced: 17 days ago - Pushed: about 2 months ago - Stars: 379 - Forks: 84
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Language: Java - Size: 231 KB - Last synced: 17 days ago - Pushed: 5 months ago - Stars: 251 - Forks: 31
oscar-project/ungoliant
:spider: The pipeline for the OSCAR corpus
Language: Rust - Size: 4.72 MB - Last synced: 8 days ago - Pushed: 5 months ago - Stars: 150 - Forks: 14
commoncrawl/cc-webgraph
Tools to construct and process webgraphs from Common Crawl data
Language: Java - Size: 159 KB - Last synced: 17 days ago - Pushed: 22 days ago - Stars: 63 - Forks: 4
toimik/CommonCrawl
Common Crawl's processing tools
Language: C# - Size: 85.9 KB - Last synced: 22 days ago - Pushed: 23 days ago - Stars: 5 - Forks: 0
commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
Language: Jupyter Notebook - Size: 2.39 MB - Last synced: 17 days ago - Pushed: almost 2 years ago - Stars: 40 - Forks: 8
michaelharms/comcrawl
A Python utility for downloading Common Crawl data
Language: Python - Size: 135 KB - Last synced: 19 days ago - Pushed: 12 months ago - Stars: 210 - Forks: 36
ashvardanian/StringZilla
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc.
Language: C++ - Size: 7.91 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1,749 - Forks: 51
fizerkhan/KeywordAnalysis (fork of CI-Research/KeywordAnalysis)
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
Language: Python - Size: 27.9 MB - Last synced: about 1 month ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0
fizerkhan/CommonCrawlDocumentDownload (fork of centic9/CommonCrawlDocumentDownload)
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types for mass-testing of frameworks like Apache POI and Apache Tika
Language: Java - Size: 482 KB - Last synced: about 1 month ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0
fizerkhan/cdx-index-client (fork of CI-Research/cdx-index-client)
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
Language: Python - Size: 30.3 KB - Last synced: about 1 month ago - Pushed: over 7 years ago - Stars: 1 - Forks: 0
crissyfield/troll-a
Drill into WARC web archives
Language: Go - Size: 199 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 84 - Forks: 9
oscar-project/goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Language: Go - Size: 377 KB - Last synced: about 1 month ago - Pushed: about 3 years ago - Stars: 84 - Forks: 6
oscar-project/oscar-website
The website of the Oscar Project
Language: TeX - Size: 32 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 9 - Forks: 13
seanbethard/corpuswork
Corpuswork
Language: Jupyter Notebook - Size: 2.09 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 1 - Forks: 0
connor-marchand/gau-python
This library gets urls from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corbin Leo's gau
Language: Python - Size: 24.4 KB - Last synced: 29 days ago - Pushed: 10 months ago - Stars: 2 - Forks: 0
code402/warc-benchmark
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
Language: Shell - Size: 24.4 KB - Last synced: 3 months ago - Pushed: about 3 years ago - Stars: 4 - Forks: 8
HRN-Projects/common_crawl_with_scrapy
Parsing Huge Web Archive files from Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.
Language: Python - Size: 23.9 MB - Last synced: 10 months ago - Pushed: almost 3 years ago - Stars: 4 - Forks: 5
IBM/cc-dbp
A dataset for knowledge base population research using Common Crawl and DBpedia.
Language: Java - Size: 3.43 MB - Last synced: 10 days ago - Pushed: over 2 years ago - Stars: 28 - Forks: 19
alumik/common-crawl-downloader
Distributed download scripts for Common Crawl data
Language: Python - Size: 64.5 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 2 - Forks: 0
mwoss/mors
Application of topic models for information retrieval and search engine optimization.
Language: Python - Size: 39 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 2 - Forks: 0
siddheswarc/EDA-using-MapReduce
Exploratory Data Analysis using MapReduce with Hadoop is a project developed as partial fulfillment of the requirements for the Data Intensive Computing (CSE 587) course at the University at Buffalo
Language: Python - Size: 17.5 MB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 2 - Forks: 0
bminixhofer/gerpt2
German small and large versions of GPT2.
Language: Python - Size: 60.5 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 17 - Forks: 0
hadrianw/abracabra
Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
Language: Rust - Size: 9.77 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0
Dahouabdelhalim/Discourse-marksers-and-Web-crawling
Discourse Markers identification in French Language
Language: HTML - Size: 70.9 MB - Last synced: 12 months ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0
hrbrmstr/cc
Extract metadata of a specific target based on the results of "commoncrawl.org"
Language: R - Size: 7.81 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 7 - Forks: 0
socket-var/nyt-twitter-cc-hadoop
Perform big data analysis on New York Times, Twitter and Common Crawl APIs
Language: Jupyter Notebook - Size: 16.7 MB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 2 - Forks: 0
bottomless-archive-project/url-collector
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
Language: Java - Size: 175 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0
samye760/Common-Crawl-Analysis
Parsing the Common Crawl database using Scala and Spark
Language: Scala - Size: 1.06 MB - Last synced: 12 months ago - Pushed: about 3 years ago - Stars: 0 - Forks: 0
tokenmill/common-crawl-utils
Various Common Crawl utilities in Clojure.
Language: Clojure - Size: 54.7 KB - Last synced: 17 days ago - Pushed: 6 months ago - Stars: 6 - Forks: 1
bottomless-archive-project/common-crawl-client
This library is a very lightweight client to Common Crawl's WARC files.
Language: Java - Size: 55.7 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0
ggodreau/huhdewp
Hadoop streaming EMR job
Language: Python - Size: 27.3 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
srmocher/fake-science
Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)
Language: Python - Size: 351 KB - Last synced: 9 months ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0
ErikGartner/prometheus-cc-extractor
This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.
Language: Python - Size: 173 KB - Last synced: about 1 month ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0
Vzzarr/Common-Crawl-Client
Language: Java - Size: 33.4 MB - Last synced: 9 months ago - Pushed: almost 7 years ago - Stars: 0 - Forks: 2
Vikasg7/warc-reader
ES6 Class to read .warc or .warc.gz file member by member in nodejs
Language: TypeScript - Size: 5.86 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0