Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: warc
forensic-toolkit/warc-browser
A CLI toolkit for working with web archives
Language: Go - Size: 469 KB - Last synced: about 13 hours ago - Pushed: 4 months ago - Stars: 2 - Forks: 0
AlexGustafsson/larch
A self-hosted service and toolset for managing, archiving, viewing and sharing bookmarks
Language: Go - Size: 256 KB - Last synced: about 13 hours ago - Pushed: almost 3 years ago - Stars: 4 - Forks: 0
harvard-lil/warc-gpt
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Language: Python - Size: 1.7 MB - Last synced: 1 day ago - Pushed: 3 days ago - Stars: 41 - Forks: 7
toimik/WarcProtocol
Parser for WARC (aka WebArchive) files
Language: C# - Size: 180 KB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 8 - Forks: 3
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
Language: TypeScript - Size: 52.4 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 549 - Forks: 68
internetarchive/heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Language: Java - Size: 10.5 MB - Last synced: 2 days ago - Pushed: about 1 month ago - Stars: 2,698 - Forks: 756
Florents-Tselai/WarcDB
WarcDB: Web crawl data as SQLite databases.
Language: Python - Size: 51.7 MB - Last synced: 3 days ago - Pushed: 3 months ago - Stars: 384 - Forks: 11
CorentinB/warc
Read and write WARC files in Go
Language: Go - Size: 3.69 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 8 - Forks: 2
elbosso/warc2sitemap
Turns a WARC file into a sitemap, or into a graph description from which a sitemap can be built. The first release only creates a Graphviz file, which can then be rendered, for example, as SVG.
Language: Java - Size: 1010 KB - Last synced: 5 days ago - Pushed: 5 months ago - Stars: 1 - Forks: 0
openzim/warc2zim
Command line tool to convert a file in the WARC format to a file in the ZIM format
Language: Python - Size: 21.3 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 34 - Forks: 5
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Language: Java - Size: 231 KB - Last synced: 5 days ago - Pushed: 5 months ago - Stars: 251 - Forks: 31
chatnoir-eu/chatnoir-resiliparse
A robust web archive analytics toolkit
Language: Cython - Size: 1.87 MB - Last synced: 4 days ago - Pushed: 13 days ago - Stars: 43 - Forks: 8
ArchiveBox/ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Language: Python - Size: 7.73 MB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 19,808 - Forks: 1,077
helgeho/ArchiveSpark
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at Internet Archive.
Language: Scala - Size: 1.15 MB - Last synced: 4 days ago - Pushed: about 1 month ago - Stars: 141 - Forks: 19
natliblux/warc-safe
A tool for detecting viruses and NSFW material in WARC files
Language: Python - Size: 487 KB - Last synced: 8 days ago - Pushed: 9 days ago - Stars: 2 - Forks: 0
internetarchive/cdx-summary
Summarize web archive capture index (CDX) files.
Language: Python - Size: 227 KB - Last synced: 3 days ago - Pushed: almost 2 years ago - Stars: 47 - Forks: 7
webrecorder/webrecorder-player 📦
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Language: JavaScript - Size: 6 MB - Last synced: 4 days ago - Pushed: over 3 years ago - Stars: 423 - Forks: 39
webrecorder/browsertrix
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
Language: TypeScript - Size: 9.84 MB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 121 - Forks: 26
maxcountryman/warc-parquet
🗄️ A simple CLI for converting WARC to Parquet.
Language: Rust - Size: 124 KB - Last synced: 10 days ago - Pushed: 11 days ago - Stars: 99 - Forks: 0
webrecorder/replayweb.page
Serverless replay of web archives directly in the browser
Language: TypeScript - Size: 76.2 MB - Last synced: 9 days ago - Pushed: 10 days ago - Stars: 624 - Forks: 50
openzim/zimit-frontend
Zimit Public Web UI
Language: Vue - Size: 466 KB - Last synced: 9 days ago - Pushed: 10 days ago - Stars: 7 - Forks: 8
toimik/CommonCrawl
Common Crawl's processing tools
Language: C# - Size: 85.9 KB - Last synced: 10 days ago - Pushed: 10 days ago - Stars: 5 - Forks: 0
cooljeanius/wget-warc Fork of alard/wget-warc 📦
This is an old version of the WARC patches. Wget v1.14 and higher has WARC support.
Size: 4.31 MB - Last synced: 10 days ago - Pushed: over 3 years ago - Stars: 1 - Forks: 0
CGamesPlay/chronicler
Offline-first web browser
Language: JavaScript - Size: 243 KB - Last synced: 4 days ago - Pushed: over 5 years ago - Stars: 83 - Forks: 5
machawk1/wail
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Language: Roff - Size: 832 MB - Last synced: 4 days ago - Pushed: 6 months ago - Stars: 343 - Forks: 32
cocrawler/cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Language: Python - Size: 183 KB - Last synced: 7 days ago - Pushed: 3 months ago - Stars: 153 - Forks: 29
oduwsdl/off-topic-memento-toolkit
This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Language: Python - Size: 93.7 MB - Last synced: 12 days ago - Pushed: over 2 years ago - Stars: 8 - Forks: 4
webrecorder/warcio
Streaming WARC/ARC library for fast web archive IO
Language: Python - Size: 285 KB - Last synced: 4 days ago - Pushed: 13 days ago - Stars: 345 - Forks: 54
trenton-telge/WebArchiveX
A more compressed alternative to WARC web archival. Command line tool built in Kotlin.
Language: Kotlin - Size: 12.7 KB - Last synced: 13 days ago - Pushed: almost 6 years ago - Stars: 1 - Forks: 0
mikwielgus/forum-dl
Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
Language: Python - Size: 391 KB - Last synced: 9 days ago - Pushed: 8 months ago - Stars: 60 - Forks: 1
oduwsdl/ipwb
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Language: Python - Size: 6.25 MB - Last synced: 6 days ago - Pushed: 18 days ago - Stars: 590 - Forks: 39
datacoon/metawarc
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Language: Python - Size: 74.2 KB - Last synced: 14 days ago - Pushed: 18 days ago - Stars: 23 - Forks: 0
jedireza/warc
⚙️ A Rust library for reading and writing WARC files
Language: Rust - Size: 71.3 KB - Last synced: 3 days ago - Pushed: 4 months ago - Stars: 40 - Forks: 10
internetarchive/scrapy-warcio
Support for writing WARC files with Scrapy
Language: Python - Size: 31.3 KB - Last synced: 3 days ago - Pushed: over 4 years ago - Stars: 14 - Forks: 6
centic9/CommonCrawlDocumentDownload
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
Language: Java - Size: 975 KB - Last synced: 10 days ago - Pushed: 21 days ago - Stars: 58 - Forks: 20
nlnwa/warchaeology
Command line tool for digging into WARC files
Language: Go - Size: 3.65 MB - Last synced: 28 days ago - Pushed: 30 days ago - Stars: 21 - Forks: 3
N0taN3rd/node-warc
Parse And Create Web ARChive (WARC) files with node.js
Language: JavaScript - Size: 7.99 MB - Last synced: 27 days ago - Pushed: over 1 year ago - Stars: 91 - Forks: 23
govau/warcraider
Convert WARC files into Avro for big data processing
Language: HTML - Size: 181 KB - Last synced: 28 days ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0
govau/wofg-web-filters
Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service
Language: Groovy - Size: 54.7 KB - Last synced: 28 days ago - Pushed: about 5 years ago - Stars: 1 - Forks: 0
ArchiveTeam/grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Language: Python - Size: 1.24 MB - Last synced: 28 days ago - Pushed: about 2 months ago - Stars: 1,257 - Forks: 121
Rhizome-Conifer/conifer
Collect and revisit web pages.
Language: Python - Size: 25.5 MB - Last synced: 28 days ago - Pushed: 6 months ago - Stars: 1,459 - Forks: 117
webrecorder/cdxj-indexer
CDXJ Indexing of WARC/ARCs
Language: Python - Size: 82 KB - Last synced: 11 days ago - Pushed: almost 2 years ago - Stars: 21 - Forks: 9
pirate/internet-archiving-talk
🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
Language: JavaScript - Size: 27.6 MB - Last synced: 10 days ago - Pushed: over 3 years ago - Stars: 47 - Forks: 5
ArchiveBox/DigestBox
DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.
Language: HTML - Size: 1.75 MB - Last synced: 11 days ago - Pushed: 3 months ago - Stars: 11 - Forks: 0
ArchiveTeam/wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Language: C - Size: 28.8 MB - Last synced: 28 days ago - Pushed: 3 months ago - Stars: 80 - Forks: 14
orottier/rust-warc
A high performance and easy to use Web Archive (WARC) file reader
Language: Rust - Size: 10.7 KB - Last synced: 3 days ago - Pushed: almost 5 years ago - Stars: 9 - Forks: 3
machawk1/warcreate
Chrome extension to "Create WARC files from any webpage"
Language: JavaScript - Size: 2.23 MB - Last synced: 28 days ago - Pushed: 5 months ago - Stars: 192 - Forks: 12
cocrawler/cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Language: Python - Size: 911 KB - Last synced: 25 days ago - Pushed: about 2 years ago - Stars: 176 - Forks: 25
N0taN3rd/wail Fork of machawk1/wail
🐋 One-Click User Instigated Preservation
Language: JavaScript - Size: 421 MB - Last synced: 28 days ago - Pushed: over 5 years ago - Stars: 119 - Forks: 9
crissyfield/troll-a
Drill into WARC web archives
Language: Go - Size: 199 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 84 - Forks: 9
ArchiveTeam/WebArchiver
Decentralized web archiving
Language: Python - Size: 323 KB - Last synced: 28 days ago - Pushed: almost 6 years ago - Stars: 19 - Forks: 3
wabarc/warcraft
A toolkit to help download webpages as WARC files
Language: Go - Size: 44.9 KB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 1 - Forks: 0
jasonmtroos/ccwarcs
R package to provide access to Common Crawl WARC files via Amazon Web Services
Language: R - Size: 566 KB - Last synced: 5 months ago - Pushed: over 4 years ago - Stars: 1 - Forks: 0
Mixnode/mixnode-warcreader-php
Read Web ARChive (WARC) files in PHP.
Language: PHP - Size: 7.81 KB - Last synced: 17 days ago - Pushed: about 7 years ago - Stars: 21 - Forks: 3
ruarxive/awesome-digital-preservation
Awesome list dedicated to digital and data preservation tools, sources, services and so on.
Size: 7.81 KB - Last synced: about 15 hours ago - Pushed: over 1 year ago - Stars: 14 - Forks: 2
sepastian/warc2corpus
Extract structured data from HTML pages in WARCs through CSS selectors.
Language: HTML - Size: 5.25 MB - Last synced: 18 days ago - Pushed: about 1 year ago - Stars: 4 - Forks: 0
bitextor/bitextor
Bitextor generates translation memories from multilingual websites
Language: Python - Size: 177 MB - Last synced: 6 months ago - Pushed: 8 months ago - Stars: 265 - Forks: 45
PromyLOPh/crocoite 📦
Web archiving using Google Chrome
Language: Python - Size: 424 KB - Last synced: 4 days ago - Pushed: over 4 years ago - Stars: 42 - Forks: 7
antiufo/iabak-sharp
A C# implementation for the INTERNETARCHIVE.BAK project
Language: C# - Size: 138 KB - Last synced: 7 months ago - Pushed: over 1 year ago - Stars: 4 - Forks: 0
wsdookadr/femtocrawl
minimalistic crawler
Language: Python - Size: 29.3 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 2 - Forks: 0
archivesunleashed/warclight 📦
A Rails engine supporting the discovery of web archives.
Language: Ruby - Size: 13.3 MB - Last synced: 4 days ago - Pushed: 11 months ago - Stars: 48 - Forks: 10
shawnmjones/OffTopic-Detection Fork of yasmina85/OffTopic-Detection
This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Language: Python - Size: 712 MB - Last synced: 9 months ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0
marhop/vim-warc
Vim syntax highlighting for WARC files
Language: Vim script - Size: 2.93 KB - Last synced: 9 months ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
datatogether/archive 📦
Go package for creating and working with WARC and CDXJ archives
Language: Go - Size: 20.5 KB - Last synced: 10 months ago - Pushed: about 6 years ago - Stars: 2 - Forks: 1
Mixnode/mixnode-warcreader-java
Read Web ARChive (WARC) files in Java.
Language: Java - Size: 17.6 KB - Last synced: 11 months ago - Pushed: about 7 years ago - Stars: 9 - Forks: 5
marinoandrea/wikidata-entity-linking
CLI to extract named entities from web pages and link them to potential entity candidates in the WikiData knowledge base.
Language: Python - Size: 3.5 MB - Last synced: 12 months ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0
code402/warc-benchmark
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
Language: Shell - Size: 24.4 KB - Last synced: 3 months ago - Pushed: about 3 years ago - Stars: 4 - Forks: 8
ukwa/ukwa-manage
Shepherding our web archives from crawl to access.
Language: Jupyter Notebook - Size: 122 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 10 - Forks: 5
archivetheweb/arweave-warc-renderer
ANS-108 implementation for Archive The Web (ATW). This allows gateways on Arweave to render ATW's WARC files.
Language: JavaScript - Size: 698 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0
edgi-govdata-archiving/eis-WARC-archiver 📦
ARCHIVED--Docker app to crawl URLs and generate WARCs
Language: Python - Size: 28.1 MB - Last synced: 27 days ago - Pushed: about 7 years ago - Stars: 10 - Forks: 5
datatogether/warc 📦
Golang WARC (Web ARChive) Library
Language: Go - Size: 229 KB - Last synced: 10 months ago - Pushed: almost 5 years ago - Stars: 29 - Forks: 7
laxika/java-warc Fork of Mixnode/mixnode-warcreader-java 📦
Read Web ARChive (WARC) files in Java.
Language: Java - Size: 130 KB - Last synced: 10 months ago - Pushed: over 4 years ago - Stars: 3 - Forks: 1
sebastian-nagel/warc-crawler
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
Language: FLUX - Size: 44.9 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 6 - Forks: 1
bottomless-archive-project/java-warc Fork of laxika/java-warc
Read Web ARChive (WARC) files in Java.
Language: Java - Size: 185 KB - Last synced: 11 months ago - Pushed: over 2 years ago - Stars: 5 - Forks: 0
antiufo/Shaman.Dokan.Warc
Mounts WARC files on Windows
Language: C# - Size: 241 KB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 17 - Forks: 1
pisa-engine/warcpp
A C++ parser for the Web Archive (WARC) format.
Language: C++ - Size: 43.9 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0
hadrianw/abracabra
Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
Language: Rust - Size: 9.77 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0
helgeho/WarcPartitioner
Partition (W)ARC Files by MIME Type and Year
Language: Java - Size: 8.79 KB - Last synced: 4 days ago - Pushed: about 7 years ago - Stars: 1 - Forks: 1
hrbrmstr/jwatr
📇 Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R
Language: R - Size: 38.6 MB - Last synced: 3 months ago - Pushed: over 6 years ago - Stars: 7 - Forks: 1
hrbrmstr/warc
📇 Tools to Work with the Web Archive Ecosystem in R
Language: R - Size: 2.52 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 21 - Forks: 3
helgeho/HadoopConcatGz
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Language: Java - Size: 51.8 KB - Last synced: 4 days ago - Pushed: over 6 years ago - Stars: 9 - Forks: 3
tokenmill/common-crawl-utils
Various Common Crawl utilities in Clojure.
Language: Clojure - Size: 54.7 KB - Last synced: 5 days ago - Pushed: 5 months ago - Stars: 6 - Forks: 1
antiufo/Shaman.Scraping
A C# library for reading/writing WARC files and scraping websites.
Language: C# - Size: 79.1 KB - Last synced: 3 months ago - Pushed: about 5 years ago - Stars: 7 - Forks: 3
austinfrey/pull-warc
pull-streaming WARC file operations
Language: JavaScript - Size: 19.5 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0
hadrianw/abracabra-legacy
A search engine, but currently a filtering pipeline for WARC files. Legacy repo, look for abracabra repo.
Language: Go - Size: 21.5 KB - Last synced: about 1 year ago - Pushed: almost 5 years ago - Stars: 0 - Forks: 0
pierlauro/MDBubing
From WARC records to MongoDB documents
Language: Java - Size: 145 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 1 - Forks: 0
miku/ttarc
Minimalistic TikTok trending archiver.
Language: HTML - Size: 5.27 MB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 2 - Forks: 0
ukwa/waybacks
This module builds our Waybacks in the various different configurations we require.
Language: Java - Size: 23.2 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 2 - Forks: 2
bottomless-archive-project/common-crawl-client
This library is a very lightweight client to Common Crawl's WARC files.
Language: Java - Size: 55.7 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0
hrbrmstr/jwatjars
Java '.jar' Files for 'jwatr'
Language: R - Size: 401 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 2 - Forks: 0
cldellow/gzip
A fork of java.util.zip.GZIPInputStream that emits the offsets of nested streams.
Language: Java - Size: 21.5 KB - Last synced: 11 months ago - Pushed: about 5 years ago - Stars: 1 - Forks: 0
bobpoekert/ocamlwarc
WARC parser for OCaml
Language: OCaml - Size: 5.72 MB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 0 - Forks: 0
info-labs/owlbot
WARC archive crawler
Language: Python - Size: 45.9 KB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 0 - Forks: 0
ggodreau/huhdewp
Hadoop streaming EMR job
Language: Python - Size: 27.3 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
VAle512/WarcExtractor
A simple WARC extractor that extracts HTML from WARC files!
Language: Java - Size: 23.4 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 6 - Forks: 0
dlrobertson/warc-c
A WIP WARC parser in C
Language: C - Size: 85 KB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 2 - Forks: 0
sara-nl/spark-warcutils-example
Example of using warcutils with Apache Spark
Language: Scala - Size: 55.7 KB - Last synced: about 1 year ago - Pushed: almost 7 years ago - Stars: 0 - Forks: 1
Vikasg7/warc-reader
ES6 class to read a .warc or .warc.gz file member by member in Node.js
Language: TypeScript - Size: 5.86 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0
Vikasg7/warc-stream
Transform stream to read a .warc or .warc.gz file member by member in Node.js
Language: TypeScript - Size: 5.86 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0