An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: warc-files

commoncrawl/cc-pyspark

Process Common Crawl data with Python and Spark

Language: Python - Size: 157 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 430 - Forks: 89

toimik/WarcProtocol

Parser for WARC (aka WebArchive) files

Language: C# - Size: 181 KB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 12 - Forks: 3

datacoon/metawarc

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives

Language: Python - Size: 84 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 30 - Forks: 1

N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

Language: JavaScript - Size: 7.99 MB - Last synced at: 13 days ago - Pushed at: 4 months ago - Stars: 98 - Forks: 22

commoncrawl/arc2warc-conversion

Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format

Size: 24.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

hrbrmstr/warc

Tools to Work with the Web Archive Ecosystem in R

Language: R - Size: 2.52 MB - Last synced at: about 2 months ago - Pushed at: almost 8 years ago - Stars: 20 - Forks: 3

commoncrawl/ia-web-commons (fork of Aloisius/ia-web-commons)

Web archiving utility library

Language: Java - Size: 8.21 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 6

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 89.8 KB - Last synced at: 16 days ago - Pushed at: 8 months ago - Stars: 11 - Forks: 0

javieraespinosa/lifranum

Discovering French Digital Literature (LIFRANUM ANR project)

Language: Jupyter Notebook - Size: 871 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

sebastian-nagel/warc-crawler

Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr

Language: FLUX - Size: 44.9 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 6 - Forks: 1

nouranHisham/wget_warc_files

Part of my 2022 summer internship; mainly about web scraping.

Language: Jupyter Notebook - Size: 46.9 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

pierlauro/MDBubing

From WARC records to MongoDB documents

Language: Java - Size: 145 KB - Last synced at: 2 months ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0