Ecosyste.ms: Repos
An open API service providing repository metadata for many open source software ecosystems.
GitHub topics: web-archiving
nla/pandas4
Web archive workflow system
Language: Java - Size: 2.07 MB - Last synced: about 12 hours ago - Pushed: 1 day ago - Stars: 4 - Forks: 2
programminghistorian/ph-submissions
The repository and website hosting the peer review process for new Programming Historian lessons
Language: Jupyter Notebook - Size: 964 MB - Last synced: about 20 hours ago - Pushed: about 22 hours ago - Stars: 135 - Forks: 109
harvard-lil/perma
Indelible links
Language: JavaScript - Size: 58.8 MB - Last synced: about 24 hours ago - Pushed: 1 day ago - Stars: 400 - Forks: 72
oduwsdl/MemGator
A Memento Aggregator CLI and Server in Go
Language: Go - Size: 15 MB - Last synced: 2 days ago - Pushed: 2 days ago - Stars: 53 - Forks: 11
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
Language: TypeScript - Size: 52.4 MB - Last synced: 2 days ago - Pushed: 2 days ago - Stars: 549 - Forks: 68
Florents-Tselai/WarcDB
WarcDB: Web crawl data as SQLite databases.
Language: Python - Size: 51.7 MB - Last synced: 2 days ago - Pushed: 3 months ago - Stars: 384 - Forks: 11
ArchiveBox/docs
Source for the GitHub Wiki / ReadTheDocs documentation for ArchiveBox, the self-hosted internet archiving solution.
Language: CSS - Size: 6.94 MB - Last synced: 4 days ago - Pushed: 4 days ago - Stars: 11 - Forks: 3
akamhy/waybackpy
Wayback Machine API interface & a command-line tool
Language: Python - Size: 575 KB - Last synced: 4 days ago - Pushed: 3 months ago - Stars: 435 - Forks: 33
nla/bamboo
Web archive collection manager
Language: Java - Size: 2.71 MB - Last synced: 4 days ago - Pushed: 5 days ago - Stars: 8 - Forks: 4
webrecorder/pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
Language: JavaScript - Size: 32.7 MB - Last synced: 10 days ago - Pushed: 15 days ago - Stars: 1,302 - Forks: 205
ArchiveBox/ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Language: Python - Size: 7.73 MB - Last synced: 10 days ago - Pushed: 10 days ago - Stars: 19,808 - Forks: 1,077
helgeho/ArchiveSpark
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
Language: Scala - Size: 1.15 MB - Last synced: 3 days ago - Pushed: about 1 month ago - Stars: 141 - Forks: 19
bellingcat/auto-archiver
Automatically archive links to videos, images, and social media content from Google Sheets (and more).
Language: Python - Size: 5.23 MB - Last synced: 4 days ago - Pushed: 21 days ago - Stars: 470 - Forks: 53
Own-Data-Privateer/pwebarc
A suite of tools for mirroring and hoarding the web pages you visit for later offline viewing. In other words, your own personal Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data. It follows the "archive everything now, figure out what to do with it later" philosophy.
Language: Python - Size: 637 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 22 - Forks: 0
knot126/WebWar
Really hacky proof-of-concept HTTP archival using mitmproxy
Language: Python - Size: 6.84 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 0 - Forks: 0
nla/outbackcdx
Web archive index server based on RocksDB
Language: Java - Size: 805 KB - Last synced: 3 days ago - Pushed: 26 days ago - Stars: 29 - Forks: 20
nla/nla-pywb
pywb config overlay for the Australian Web Archive
Language: HTML - Size: 34.2 KB - Last synced: 8 days ago - Pushed: 8 days ago - Stars: 2 - Forks: 0
webrecorder/webrecorder-player 📦
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Language: JavaScript - Size: 6 MB - Last synced: 4 days ago - Pushed: over 3 years ago - Stars: 423 - Forks: 39
webrecorder/browsertrix
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
Language: TypeScript - Size: 9.84 MB - Last synced: 10 days ago - Pushed: 11 days ago - Stars: 121 - Forks: 26
maxcountryman/warc-parquet
🗄️ A simple CLI for converting WARC to Parquet.
Language: Rust - Size: 124 KB - Last synced: 9 days ago - Pushed: 11 days ago - Stars: 99 - Forks: 0
webrecorder/replayweb.page
Serverless replay of web archives directly in the browser
Language: TypeScript - Size: 76.2 MB - Last synced: 8 days ago - Pushed: 9 days ago - Stars: 624 - Forks: 50
machawk1/wail
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Language: Roff - Size: 832 MB - Last synced: 4 days ago - Pushed: 6 months ago - Stars: 343 - Forks: 32
cocrawler/cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Language: Python - Size: 183 KB - Last synced: 6 days ago - Pushed: 3 months ago - Stars: 153 - Forks: 29
ArchiveBox/archivebox-proxy
Official ArchiveBox MITM proxy: saves URLs of all requests passing through to an ArchiveBox server for archival.
Language: Python - Size: 8.79 KB - Last synced: 10 days ago - Pushed: 4 months ago - Stars: 7 - Forks: 0
webrecorder/warcio
Streaming WARC/ARC library for fast web archive IO
Language: Python - Size: 285 KB - Last synced: 3 days ago - Pushed: 12 days ago - Stars: 345 - Forks: 54
oduwsdl/ipwb
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Language: Python - Size: 6.25 MB - Last synced: 6 days ago - Pushed: 17 days ago - Stars: 590 - Forks: 39
ArchiveBox/archivebox-browser-extension
Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.
Language: TypeScript - Size: 114 KB - Last synced: 10 days ago - Pushed: about 1 month ago - Stars: 159 - Forks: 13
webrecorder/archiveweb.page
A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!
Language: JavaScript - Size: 52.6 MB - Last synced: 10 days ago - Pushed: about 1 month ago - Stars: 731 - Forks: 51
internetarchive/scrapy-warcio
Support for writing WARC files with Scrapy
Language: Python - Size: 31.3 KB - Last synced: 3 days ago - Pushed: over 4 years ago - Stars: 14 - Forks: 6
oduwsdl/archivenow
A Tool To Push Web Resources Into Web Archives
Language: Python - Size: 20.4 MB - Last synced: 4 days ago - Pushed: 4 months ago - Stars: 391 - Forks: 41
ArchiveBox/pip-archivebox
Official Python package for ArchiveBox, the self-hosted internet archiving solution.
Size: 15.4 MB - Last synced: 10 days ago - Pushed: 18 days ago - Stars: 13 - Forks: 2
gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file
Language: JavaScript - Size: 2.94 MB - Last synced: 28 days ago - Pushed: 28 days ago - Stars: 468 - Forks: 49
sul-dlss-deprecated/openwayback Fork of iipc/openwayback 📦
(used on swap vm 6/2020) Stanford's fork of iipc/openwayback, which is used on our "swap" (Stanford Web Archiving Portal) machines. (See also sul-dlss/swap which is intended as a replacement)
Language: Java - Size: 29.1 MB - Last synced: 25 days ago - Pushed: over 3 years ago - Stars: 1 - Forks: 1
internetarchive/pdf_trio Fork of tralfamadude/pdf_trio
A PDF classifier ensemble with REST API service
Language: Python - Size: 15.5 MB - Last synced: 3 days ago - Pushed: about 3 years ago - Stars: 23 - Forks: 1
yuzhoumo/piazzabox
Piazza course archiver and viewer
Language: Python - Size: 2.46 MB - Last synced: 28 days ago - Pushed: 28 days ago - Stars: 0 - Forks: 0
N0taN3rd/node-warc
Parse And Create Web ARChive (WARC) files with node.js
Language: JavaScript - Size: 7.99 MB - Last synced: 26 days ago - Pushed: over 1 year ago - Stars: 91 - Forks: 23
Rhizome-Conifer/conifer
Collect and revisit web pages.
Language: Python - Size: 25.5 MB - Last synced: 27 days ago - Pushed: 6 months ago - Stars: 1,459 - Forks: 117
ArchiveBox/electron-archivebox
Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)
Language: JavaScript - Size: 156 KB - Last synced: 10 days ago - Pushed: about 1 year ago - Stars: 173 - Forks: 15
webrecorder/cdxj-indexer
CDXJ Indexing of WARC/ARCs
Language: Python - Size: 82 KB - Last synced: 10 days ago - Pushed: almost 2 years ago - Stars: 21 - Forks: 9
pirate/internet-archiving-talk
🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
Language: JavaScript - Size: 27.6 MB - Last synced: 9 days ago - Pushed: over 3 years ago - Stars: 47 - Forks: 5
internetarchive/fatcat
Perpetual Access To The Scholarly Record
Language: Python - Size: 8.4 MB - Last synced: 28 days ago - Pushed: 6 months ago - Stars: 109 - Forks: 19
nla/pywb Fork of webrecorder/pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
Language: JavaScript - Size: 23.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1 - Forks: 0
gwu-libraries/sfm-ui
Social Feed Manager user interface application.
Language: Python - Size: 44.6 MB - Last synced: 4 days ago - Pushed: 9 months ago - Stars: 150 - Forks: 27
internetarchive/sandcrawler
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
Language: HTML - Size: 2.55 MB - Last synced: 28 days ago - Pushed: over 1 year ago - Stars: 23 - Forks: 2
ArchiveBox/DigestBox
DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.
Language: HTML - Size: 1.75 MB - Last synced: 10 days ago - Pushed: 3 months ago - Stars: 11 - Forks: 0
ArchiveBox/debian-archivebox
Home of the official apt/deb package for Ubuntu/Debian-based systems.
Language: Python - Size: 3.34 MB - Last synced: 10 days ago - Pushed: about 1 month ago - Stars: 17 - Forks: 5
nla/warcquet
Language: Java - Size: 44.9 KB - Last synced: about 2 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0
nla/pandora-labs
Australian web archive tools and experiments
Language: Python - Size: 8.79 KB - Last synced: about 2 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0
q-m/replayweb.page-docker
Docker image for ReplayWeb.page
Language: Dockerfile - Size: 2.93 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 2 - Forks: 0
meequrox/flb-archiver
Flareboard web archiver in C using libcurl
Language: C - Size: 107 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0
webrecorder/dat-share
A prototype server to swarm multiple DATs for Webrecorder
Language: JavaScript - Size: 238 KB - Last synced: 10 days ago - Pushed: about 5 years ago - Stars: 12 - Forks: 4
ArchiveBox/homebrew-archivebox
Homebrew formula for the ArchiveBox self-hosted internet archiving solution.
Language: Ruby - Size: 61.8 MB - Last synced: 10 days ago - Pushed: 3 months ago - Stars: 24 - Forks: 3
machawk1/warcreate
Chrome extension to "Create WARC files from any webpage"
Language: JavaScript - Size: 2.23 MB - Last synced: 27 days ago - Pushed: 5 months ago - Stars: 192 - Forks: 12
rahiel/archiveror
Archiveror will help you preserve the webpages you love. 💾
Language: JavaScript - Size: 168 KB - Last synced: 2 months ago - Pushed: over 4 years ago - Stars: 384 - Forks: 43
TarekJor/bookmark-archiver Fork of ArchiveBox/ArchiveBox
🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...
Language: Python - Size: 2.65 MB - Last synced: 2 months ago - Pushed: over 5 years ago - Stars: 29 - Forks: 1
mrrfv/webArchive
Crawls websites and saves found URLs to a file.
Language: JavaScript - Size: 18.6 KB - Last synced: 17 days ago - Pushed: 3 months ago - Stars: 3 - Forks: 0
N0taN3rd/wail Fork of machawk1/wail
🐋 One-Click User Instigated Preservation
Language: JavaScript - Size: 421 MB - Last synced: 27 days ago - Pushed: over 5 years ago - Stars: 119 - Forks: 9
nla/wombat Fork of webrecorder/wombat
Wombat.js client-side rewriting library
Language: JavaScript - Size: 1.87 MB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0
nla/httrack2warc
Converts HTTrack crawls to WARC files
Language: Java - Size: 155 KB - Last synced: 3 days ago - Pushed: 22 days ago - Stars: 27 - Forks: 6
xarantolus/Collect
A server to collect & archive websites that also supports video downloads
Language: TypeScript - Size: 2.07 MB - Last synced: 3 months ago - Pushed: about 1 year ago - Stars: 75 - Forks: 10
wdhdev/web-archiver 📦
Easily scrape, download and preview websites.
Language: EJS - Size: 664 KB - Last synced: about 9 hours ago - Pushed: 4 months ago - Stars: 1 - Forks: 0
zytedata/web-snap
Create "perfect" snapshots of web pages
Language: JavaScript - Size: 501 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 24 - Forks: 2
httpreserve/linkstat
CLI implementation of httpreserve that can test links and retrieve Internet Archive replacements
Language: Go - Size: 35.2 KB - Last synced: 3 days ago - Pushed: 9 months ago - Stars: 7 - Forks: 0
ArchivingToolsForWBM/AdvancedInternetArchiving
Makes saving pages in bulk to the Wayback Machine much easier
Language: HTML - Size: 396 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 2 - Forks: 1
Rhizome-Conifer/conifer-deploy
Conifer setup and deployment via Ansible
Language: Shell - Size: 22.5 KB - Last synced: 7 months ago - Pushed: almost 4 years ago - Stars: 13 - Forks: 6
rybesh/capture-urls
Archive a list of URLs using the Wayback Machine
Language: Python - Size: 31.3 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 5 - Forks: 0
LouayMagdy/webarchive-commons-py
Python Implementation for iipc/webarchive-commons
Language: Python - Size: 300 KB - Last synced: about 2 months ago - Pushed: 8 months ago - Stars: 0 - Forks: 0
pebnn/AutoInternetArchive
AutoInternetArchive is a very simple program designed to automatically archive webpages to the Wayback Machine at hourly intervals. It is designed to be run through a console window and left open for days or even months.
Language: Python - Size: 22.5 KB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 2 - Forks: 0
nla/chronicrawl 📦
Experimental continuous web crawler for web archiving
Language: Java - Size: 329 KB - Last synced: about 2 months ago - Pushed: over 1 year ago - Stars: 9 - Forks: 0
oduwsdl/oduwsdl.github.io
ODU Web Science and Digital Libraries Research Group (WS-DL) home page.
Language: HTML - Size: 47.1 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 2 - Forks: 36
shawnmjones/OffTopic-Detection Fork of yasmina85/OffTopic-Detection
This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Language: Python - Size: 712 MB - Last synced: 9 months ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0
mkrzmr/mkrzmr.github.io
Michael Kurzmeier, 4th-year PhD, Digital Humanities @ Maynooth University
Size: 1.39 MB - Last synced: 9 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0
nla/jwebrenderer
Simple web service to render pages with headless chrome
Language: Java - Size: 17.6 KB - Last synced: about 2 months ago - Pushed: 11 months ago - Stars: 1 - Forks: 0
nla/chropro 📦
Chrome debugging protocol client for Java
Language: Java - Size: 115 KB - Last synced: about 2 months ago - Pushed: about 4 years ago - Stars: 10 - Forks: 2
dbeley/archiveboxmatic
ArchiveBoxMatic: configure ArchiveBox with the simplicity of a yaml file.
Language: Python - Size: 57.6 KB - Last synced: 2 months ago - Pushed: about 3 years ago - Stars: 14 - Forks: 3
webis-de/scriptor
Plug-and-play reproducible web analysis.
Language: JavaScript - Size: 1.59 MB - Last synced: 21 days ago - Pushed: 3 months ago - Stars: 6 - Forks: 1
nla/outbackproxy
HTTP/S proxy server which replays content from a web archive
Language: Java - Size: 26.4 KB - Last synced: about 2 months ago - Pushed: about 1 year ago - Stars: 3 - Forks: 0
nla/heritrixctl 📦
Heritrix runner and API client for Java
Language: Java - Size: 27.3 KB - Last synced: about 2 months ago - Pushed: over 4 years ago - Stars: 1 - Forks: 0
nla/heritrix3 Fork of internetarchive/heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Language: Java - Size: 10.3 MB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0
nla/butterflynet 📦
Streamlined single-document web archiving tool
Language: Java - Size: 163 KB - Last synced: about 2 months ago - Pushed: about 1 year ago - Stars: 1 - Forks: 0
oduwsdl/warrick
Recover lost websites from the Web Infrastructure
Language: HTML - Size: 2.66 MB - Last synced: about 1 year ago - Pushed: about 3 years ago - Stars: 78 - Forks: 10
ukwa/ukwa-manage
Shepherding our web archives from crawl to access.
Language: Jupyter Notebook - Size: 122 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 10 - Forks: 5
caltechlibrary/eprints2archives
Send records from an EPrints server to the Internet Archive and other web archives
Language: Python - Size: 504 KB - Last synced: about 1 month ago - Pushed: 12 months ago - Stars: 3 - Forks: 0
httpreserve/conventoarchiver
Repository for collecting scripts to help capture MyConvento newsroom press-releases from the MyConvento PR management suite. The README provides an analysis of the MyConvento URL architecture for users hoping to develop a solution for themselves.
Language: Python - Size: 23.4 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0
ukwa/ukwa-ui
A new user interface for the UK Web Archive
Language: Java - Size: 170 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 6
helgeho/WarcPartitioner
Partition (W)ARC Files by MIME Type and Year
Language: Java - Size: 8.79 KB - Last synced: 3 days ago - Pushed: about 7 years ago - Stars: 1 - Forks: 1
TarekJor/instaloader Fork of instaloader/instaloader
Download pictures (or videos) along with their captions and other metadata from Instagram.
Language: Python - Size: 662 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
mhucka/devonagent-hacks
Scripts and other things for working with DEVONagent.
Language: AppleScript - Size: 16.6 KB - Last synced: 13 days ago - Pushed: almost 4 years ago - Stars: 1 - Forks: 0
httpreserve/mementoqa
QA Mementos using Screenshots
Language: HTML - Size: 410 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0
httpreserve/wadl-2017
Resources for WADL 2017
Size: 4.84 MB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0
httpreserve/million-dollar-webpage Fork of ross-spencer/million-dollar-webpage
HTTPreserve Analysis of Million Dollar Web Page
Size: 299 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0
helgeho/HadoopConcatGz
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Language: Java - Size: 51.8 KB - Last synced: 3 days ago - Pushed: over 6 years ago - Stars: 9 - Forks: 3
austinfrey/pull-warc
pull-streaming WARC file operations
Language: JavaScript - Size: 19.5 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0
ukwa/waybacks
This module builds our Waybacks in the various different configurations we require.
Language: Java - Size: 23.2 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 2 - Forks: 2
ngeraci/ucr-covid-bing-search
Quick script using Bing Web Search API to retrieve list of URLs for web archiving
Language: Python - Size: 8.79 KB - Last synced: about 1 year ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0
ross-spencer/million-dollar-webpage
HTTPreserve Analysis of Million Dollar Web Page
Size: 299 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 2
TarekJor/wpull Fork of ArchiveTeam/wpull
Wget-compatible web downloader and crawler.
Language: HTML - Size: 3.87 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0
TarekJor/DiscordMediaLoader Fork of Serraniel/DiscordMediaLoader
Discord Media Loader - Simply download all attachments
Language: C# - Size: 842 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0
TarekJor/PixivUtil2 Fork of Nandaka/PixivUtil2
Download images from Pixiv and more!
Language: Python - Size: 11.4 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0
TarekJor/TumblThree Fork of johanneszab/TumblThree
A Tumblr Blog Backup Application
Language: C# - Size: 3 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0