GitHub topics: warc-files
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
Language: Python - Size: 157 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 430 - Forks: 89

toimik/WarcProtocol
Parser for WARC (aka WebArchive) files
Language: C# - Size: 181 KB - Last synced at: 4 days ago - Pushed at: 11 months ago - Stars: 12 - Forks: 3

datacoon/metawarc
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Language: Python - Size: 84 KB - Last synced at: 18 days ago - Pushed at: 18 days ago - Stars: 30 - Forks: 1

N0taN3rd/node-warc
Parse and create Web ARChive (WARC) files with Node.js
Language: JavaScript - Size: 7.99 MB - Last synced at: 13 days ago - Pushed at: 4 months ago - Stars: 98 - Forks: 22

commoncrawl/arc2warc-conversion
Experiences converting Common Crawl's ARC files from the 2008-2012 crawls to the WARC format
Size: 24.4 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

hrbrmstr/warc
Tools to Work with the Web Archive Ecosystem in R
Language: R - Size: 2.52 MB - Last synced at: about 2 months ago - Pushed at: almost 8 years ago - Stars: 20 - Forks: 3

commoncrawl/ia-web-commons (fork of Aloisius/ia-web-commons)
Web archiving utility library
Language: Java - Size: 8.21 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 11 - Forks: 6

toimik/CommonCrawl
Common Crawl's processing tools
Language: C# - Size: 89.8 KB - Last synced at: 16 days ago - Pushed at: 8 months ago - Stars: 11 - Forks: 0

javieraespinosa/lifranum
Discovering French Digital Literature (LIFRANUM ANR project)
Language: Jupyter Notebook - Size: 871 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

sebastian-nagel/warc-crawler
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
Language: FLUX - Size: 44.9 KB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 6 - Forks: 1

nouranHisham/wget_warc_files
Part of my 2022 summer internship; it is mainly about web scraping.
Language: Jupyter Notebook - Size: 46.9 KB - Last synced at: about 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

pierlauro/MDBubing
From WARC records to MongoDB documents
Language: Java - Size: 145 KB - Last synced at: 2 months ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 0
