GitHub topics: common-crawl

Repositories

ashvardanian/StringZilla

Up to 10x faster strings for C, C++, Python, Rust, Swift & Go, leveraging NEON, AVX2, AVX-512, SVE, & SWAR to accelerate search, hashing, sort, edit distances, and memory ops 🦖

Language: C - Size: 8.69 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 2,530 - Forks: 88

commoncrawl/cc-crawl-statistics

Statistics of Common Crawl monthly archives mined from URL index files

Language: Python - Size: 350 MB - Last synced at: 11 days ago - Pushed at: 12 days ago - Stars: 177 - Forks: 11

cisnlp/GlotCC

🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

Language: Jupyter Notebook - Size: 2.31 MB - Last synced at: 19 days ago - Pushed at: 19 days ago - Stars: 18 - Forks: 0

commoncrawl/cc-notebooks

Various Jupyter notebooks about Common Crawl data

Language: Jupyter Notebook - Size: 3.01 MB - Last synced at: 24 days ago - Pushed at: 24 days ago - Stars: 51 - Forks: 9

michaelharms/comcrawl 📦

A Python utility for downloading Common Crawl data

Language: Python - Size: 135 KB - Last synced at: 17 days ago - Pushed at: almost 2 years ago - Stars: 237 - Forks: 42

oscar-project/oscar-website

The website of the Oscar Project

Language: TeX - Size: 32.1 MB - Last synced at: 29 days ago - Pushed at: 29 days ago - Stars: 11 - Forks: 14

alumik/common-crawl-downloader

Distributed download scripts for Common Crawl data

Language: Python - Size: 64.5 KB - Last synced at: 11 days ago - Pushed at: almost 4 years ago - Stars: 7 - Forks: 0

code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

Language: Shell - Size: 24.4 KB - Last synced at: 3 days ago - Pushed at: almost 4 years ago - Stars: 6 - Forks: 7

hrbrmstr/cc

⛏ Extract metadata of a specific target based on the results of "commoncrawl.org"

Language: R - Size: 7.81 KB - Last synced at: 17 days ago - Pushed at: over 6 years ago - Stars: 5 - Forks: 0

oscar-project/ungoliant

🕷 The pipeline for the OSCAR corpus

Language: Rust - Size: 4.72 MB - Last synced at: 22 days ago - Pushed at: over 1 year ago - Stars: 167 - Forks: 15

thunderpoot/cc-getpage

Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

Language: Python - Size: 1.72 MB - Last synced at: about 1 month ago - Pushed at: about 2 months ago - Stars: 2 - Forks: 0

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 89.8 KB - Last synced at: 7 days ago - Pushed at: 6 months ago - Stars: 11 - Forks: 0

tokenmill/common-crawl-utils

Various Common Crawl utilities in Clojure.

Language: Clojure - Size: 54.7 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 7 - Forks: 1

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 155 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 416 - Forks: 88

ilyankou/cc-gpx

CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

Language: Jupyter Notebook - Size: 17.2 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 4 - Forks: 0

bminixhofer/gerpt2

German small and large versions of GPT2.

Language: Python - Size: 60.5 KB - Last synced at: 13 days ago - Pushed at: almost 3 years ago - Stars: 20 - Forks: 0

crissyfield/troll-a

Drill into WARC web archives

Language: Go - Size: 241 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 133 - Forks: 11

Mgosi/Big-Data-Analysis-using-MapReduce-in-Hadoop

We explore data using Big Data analysis and visualization skills through three main operations: (i) data aggregation from different sources, (ii) Big Data analysis using MapReduce, and (iii) visualization through Tableau. Data analysis is critical to understanding the data and what we can do with it. Small datasets are easy to process, but big companies need to identify trends to decide what changes to make, and Big Data analysis addresses this problem. In this lab, we collect close to 20,000 tweets, 500 New York Times articles, and 500 Common Crawl articles about Entertainment, our main topic. We preprocess this data and feed it to MapReduce to compute Word Count and Word Co-Occurrence, which we use to find trends in the collected data. We used Python to perform the data analysis.

Language: Jupyter Notebook - Size: 16.8 MB - Last synced at: 5 months ago - Pushed at: over 5 years ago - Stars: 6 - Forks: 3

IBM/cc-dbp 📦

A dataset for knowledge base population research using Common Crawl and DBpedia.

Language: Java - Size: 3.43 MB - Last synced at: 4 months ago - Pushed at: about 3 years ago - Stars: 28 - Forks: 19

Dahouabdelhalim/Discourse-marksers-and-Web-crawling

Discourse Markers identification in French Language

Language: HTML - Size: 70.9 MB - Last synced at: 10 months ago - Pushed at: about 3 years ago - Stars: 0 - Forks: 0

commoncrawl/cc-webgraph

Tools to construct and process webgraphs from Common Crawl data

Language: Java - Size: 159 KB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 63 - Forks: 4

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 231 KB - Last synced at: 12 months ago - Pushed at: over 1 year ago - Stars: 251 - Forks: 31

oscar-project/goclassy 📦

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

Language: Go - Size: 377 KB - Last synced at: 11 months ago - Pushed at: about 4 years ago - Stars: 85 - Forks: 6

seanbethard/corpuswork

Corpuswork

Language: Jupyter Notebook - Size: 2.09 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 1 - Forks: 0

connor-marchand/gau-python

This library gets urls from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corbin Leo's gau

Language: Python - Size: 24.4 KB - Last synced at: about 1 year ago - Pushed at: almost 2 years ago - Stars: 2 - Forks: 0

HRN-Projects/common_crawl_with_scrapy

Parsing Huge Web Archive files from Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.

Language: Python - Size: 23.9 MB - Last synced at: over 1 year ago - Pushed at: almost 4 years ago - Stars: 4 - Forks: 5

fizerkhan/CommonCrawlDocumentDownload (fork of centic9/CommonCrawlDocumentDownload)

A small tool that uses the Common Crawl URL Index to download documents with certain file types or MIME types for mass-testing of frameworks like Apache POI and Apache Tika

Language: Java - Size: 482 KB - Last synced at: about 11 hours ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

fizerkhan/KeywordAnalysis (fork of CI-Research/KeywordAnalysis)

Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

Language: Python - Size: 27.9 MB - Last synced at: about 11 hours ago - Pushed at: over 7 years ago - Stars: 1 - Forks: 0

fizerkhan/cdx-index-client (fork of CI-Research/cdx-index-client)

A command-line tool for using the Common Crawl Index API at http://index.commoncrawl.org/

Language: Python - Size: 30.3 KB - Last synced at: about 11 hours ago - Pushed at: over 8 years ago - Stars: 1 - Forks: 0

mwoss/mors

Application of topic models for information retrieval and search engine optimization.

Language: Python - Size: 39 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 2 - Forks: 0

siddheswarc/EDA-using-MapReduce

Exploratory Data Analysis using MapReduce with Hadoop is a project developed as partial fulfillment of the requirements for the Data Intensive Computing (CSE 587) course at the University at Buffalo

Language: Python - Size: 17.5 MB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 2 - Forks: 0

hadrianw/abracabra

Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

Language: Rust - Size: 9.77 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

socket-var/nyt-twitter-cc-hadoop

Perform big data analysis on New York Times, Twitter and Common Crawl APIs

Language: Jupyter Notebook - Size: 16.7 MB - Last synced at: 11 months ago - Pushed at: about 6 years ago - Stars: 2 - Forks: 0

bottomless-archive-project/url-collector

An application that crawls the Common Crawl corpus for URLs with the specified file extensions.

Language: Java - Size: 175 KB - Last synced at: about 2 years ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

samye760/Common-Crawl-Analysis

Parsing the Common Crawl database using Scala and Spark

Language: Scala - Size: 1.06 MB - Last synced at: almost 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

bottomless-archive-project/common-crawl-client

This library is a very lightweight client to Common Crawl's WARC files.

Language: Java - Size: 55.7 KB - Last synced at: about 2 years ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

ggodreau/huhdewp

Hadoop streaming EMR job

Language: Python - Size: 27.3 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 0

srmocher/fake-science

Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)

Language: Python - Size: 351 KB - Last synced at: over 1 year ago - Pushed at: almost 7 years ago - Stars: 0 - Forks: 0

ErikGartner/prometheus-cc-extractor

This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.

Language: Python - Size: 173 KB - Last synced at: 26 days ago - Pushed at: about 8 years ago - Stars: 0 - Forks: 0

Vzzarr/Common-Crawl-Client

Language: Java - Size: 33.4 MB - Last synced at: over 1 year ago - Pushed at: almost 8 years ago - Stars: 0 - Forks: 2

Vikasg7/warc-reader

ES6 class to read a .warc or .warc.gz file member by member in Node.js

Language: TypeScript - Size: 5.86 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0

Related Keywords

common-crawl (41), commoncrawl (11), warc (9), big-data (5), crawler (5), corpus-linguistics (4), nlp (4), spark (3), python (3), deep-learning (3), web-crawling (3), common-crawl-data (3), language-model (2), machine-learning (2), hadoop-mapreduce (2), webgraph-framework (2), search-engine (2), dataset (2), scrapy (2), wet (2), tableau (2), wat-files (2), twitter-api (2), common-crawl-python (2), warc-files (2), language-classification (2), fasttext (2), common-crawl-with-python (2), beautifulsoup (1), common-crawl-with-scrapy (1), data-mining (1), web-scraping (1), parse-common-crawl (1), python3 (1), pagerank (1), webgraph (1), apache-storm (1), news (1), storm-crawler (1), web-crawler (1), author-identification (1), computational-linguistics (1), emotion-classification (1), leipzig-corpora-collection (1), linguistics (1), natural-language-processing (1), natural-language-understanding (1), openwebtext (1), postgresql (1), sentiment-analysis (1), speech-recognition (1), alienvault (1), gau-python (1), scraper (1), wayback-machine (1), common-crawl-scrapy (1), webarchive (1), url-crawler (1), emr (1), emr-cluster (1), s3 (1), s3-bucket (1), scala (1), bigdata (1), hadoop (1), hadoop-streaming (1), fake-news (1), data-extraction (1), mapreduce (1), generator (1), next (1), nodejs (1), warc-headers (1), warc-reader (1), warc-record (1), yield (1), webarchive-data-scraping (1), django (1), doc2vec (1), gensim (1), hacktoberfest (1), lda (1), search (1), tfidf (1), data-aggregation (1), data-analysis (1), data-cleaning (1), data-clustering (1), data-engineering (1), restful-api (1), adblock (1), adblocking (1), rust (1), rust-lang (1), nyt-api (1), glot (1), glotcc (1), glotlid (1), language-identification (1), multilingual-dataset (1)