Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: web-archiving

nla/pandas4

Web archive workflow system

Language: Java - Size: 2.07 MB - Last synced: about 12 hours ago - Pushed: 1 day ago - Stars: 4 - Forks: 2

programminghistorian/ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons

Language: Jupyter Notebook - Size: 964 MB - Last synced: about 20 hours ago - Pushed: about 22 hours ago - Stars: 135 - Forks: 109

harvard-lil/perma

Indelible links

Language: JavaScript - Size: 58.8 MB - Last synced: about 24 hours ago - Pushed: 1 day ago - Stars: 400 - Forks: 72

oduwsdl/MemGator

A Memento Aggregator CLI and Server in Go

Language: Go - Size: 15 MB - Last synced: 2 days ago - Pushed: 2 days ago - Stars: 53 - Forks: 11

webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Language: TypeScript - Size: 52.4 MB - Last synced: 2 days ago - Pushed: 2 days ago - Stars: 549 - Forks: 68

Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

Language: Python - Size: 51.7 MB - Last synced: 2 days ago - Pushed: 3 months ago - Stars: 384 - Forks: 11

ArchiveBox/docs

Source for the GitHub Wiki / ReadTheDocs documentation for ArchiveBox, the self-hosted internet archiving solution.

Language: CSS - Size: 6.94 MB - Last synced: 4 days ago - Pushed: 4 days ago - Stars: 11 - Forks: 3

akamhy/waybackpy

Wayback Machine API interface & a command-line tool

Language: Python - Size: 575 KB - Last synced: 4 days ago - Pushed: 3 months ago - Stars: 435 - Forks: 33

nla/bamboo

Web archive collection manager

Language: Java - Size: 2.71 MB - Last synced: 4 days ago - Pushed: 5 days ago - Stars: 8 - Forks: 4

webrecorder/pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

Language: JavaScript - Size: 32.7 MB - Last synced: 10 days ago - Pushed: 15 days ago - Stars: 1,302 - Forks: 205

ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Language: Python - Size: 7.73 MB - Last synced: 10 days ago - Pushed: 10 days ago - Stars: 19,808 - Forks: 1,077

helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

Language: Scala - Size: 1.15 MB - Last synced: 3 days ago - Pushed: about 1 month ago - Stars: 141 - Forks: 19

bellingcat/auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

Language: Python - Size: 5.23 MB - Last synced: 4 days ago - Pushed: 21 days ago - Stars: 470 - Forks: 53

Own-Data-Privateer/pwebarc

A suite of tools for mirroring and hoarding web pages you visit for later offline viewing, i.e. your own personal Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data. It follows the "archive everything now, figure out what to do with it later" philosophy.

Language: Python - Size: 637 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 22 - Forks: 0

knot126/WebWar

Really hacky proof-of-concept HTTP archival using mitmproxy

Language: Python - Size: 6.84 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 0 - Forks: 0

nla/outbackcdx

Web archive index server based on RocksDB

Language: Java - Size: 805 KB - Last synced: 3 days ago - Pushed: 26 days ago - Stars: 29 - Forks: 20

nla/nla-pywb

pywb config overlay for the Australian Web Archive

Language: HTML - Size: 34.2 KB - Last synced: 8 days ago - Pushed: 8 days ago - Stars: 2 - Forks: 0

webrecorder/webrecorder-player 📦

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

Language: JavaScript - Size: 6 MB - Last synced: 4 days ago - Pushed: over 3 years ago - Stars: 423 - Forks: 39

webrecorder/browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

Language: TypeScript - Size: 9.84 MB - Last synced: 10 days ago - Pushed: 11 days ago - Stars: 121 - Forks: 26

maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

Language: Rust - Size: 124 KB - Last synced: 9 days ago - Pushed: 11 days ago - Stars: 99 - Forks: 0

webrecorder/replayweb.page

Serverless replay of web archives directly in the browser

Language: TypeScript - Size: 76.2 MB - Last synced: 8 days ago - Pushed: 9 days ago - Stars: 624 - Forks: 50

machawk1/wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

Language: Roff - Size: 832 MB - Last synced: 4 days ago - Pushed: 6 months ago - Stars: 343 - Forks: 32

cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Language: Python - Size: 183 KB - Last synced: 6 days ago - Pushed: 3 months ago - Stars: 153 - Forks: 29

ArchiveBox/archivebox-proxy

Official ArchiveBox MITM proxy: saves URLs of all requests passing through to an ArchiveBox server for archival.

Language: Python - Size: 8.79 KB - Last synced: 10 days ago - Pushed: 4 months ago - Stars: 7 - Forks: 0

webrecorder/warcio

Streaming WARC/ARC library for fast web archive IO

Language: Python - Size: 285 KB - Last synced: 3 days ago - Pushed: 12 days ago - Stars: 345 - Forks: 54

oduwsdl/ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Language: Python - Size: 6.25 MB - Last synced: 6 days ago - Pushed: 17 days ago - Stars: 590 - Forks: 39

ArchiveBox/archivebox-browser-extension

Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.

Language: TypeScript - Size: 114 KB - Last synced: 10 days ago - Pushed: about 1 month ago - Stars: 159 - Forks: 13

webrecorder/archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

Language: JavaScript - Size: 52.6 MB - Last synced: 10 days ago - Pushed: about 1 month ago - Stars: 731 - Forks: 51

internetarchive/scrapy-warcio

Support for writing WARC files with Scrapy

Language: Python - Size: 31.3 KB - Last synced: 3 days ago - Pushed: over 4 years ago - Stars: 14 - Forks: 6

oduwsdl/archivenow

A Tool To Push Web Resources Into Web Archives

Language: Python - Size: 20.4 MB - Last synced: 4 days ago - Pushed: 4 months ago - Stars: 391 - Forks: 41

ArchiveBox/pip-archivebox

Official Python package for ArchiveBox, the self-hosted internet archiving solution.

Size: 15.4 MB - Last synced: 10 days ago - Pushed: 18 days ago - Stars: 13 - Forks: 2

gildas-lormeau/single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file

Language: JavaScript - Size: 2.94 MB - Last synced: 28 days ago - Pushed: 28 days ago - Stars: 468 - Forks: 49

sul-dlss-deprecated/openwayback Fork of iipc/openwayback 📦

(used on swap vm 6/2020) Stanford's fork of iipc/openwayback, which is used on our "swap" (Stanford Web Archiving Portal) machines. (See also sul-dlss/swap which is intended as a replacement)

Language: Java - Size: 29.1 MB - Last synced: 25 days ago - Pushed: over 3 years ago - Stars: 1 - Forks: 1

internetarchive/pdf_trio Fork of tralfamadude/pdf_trio

A PDF classifier ensemble with REST API service

Language: Python - Size: 15.5 MB - Last synced: 3 days ago - Pushed: about 3 years ago - Stars: 23 - Forks: 1

yuzhoumo/piazzabox

Piazza course archiver and viewer

Language: Python - Size: 2.46 MB - Last synced: 28 days ago - Pushed: 28 days ago - Stars: 0 - Forks: 0

N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

Language: JavaScript - Size: 7.99 MB - Last synced: 26 days ago - Pushed: over 1 year ago - Stars: 91 - Forks: 23

Rhizome-Conifer/conifer

Collect and revisit web pages.

Language: Python - Size: 25.5 MB - Last synced: 27 days ago - Pushed: 6 months ago - Stars: 1,459 - Forks: 117

ArchiveBox/electron-archivebox

Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)

Language: JavaScript - Size: 156 KB - Last synced: 10 days ago - Pushed: about 1 year ago - Stars: 173 - Forks: 15

webrecorder/cdxj-indexer

CDXJ Indexing of WARC/ARCs

Language: Python - Size: 82 KB - Last synced: 10 days ago - Pushed: almost 2 years ago - Stars: 21 - Forks: 9

pirate/internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

Language: JavaScript - Size: 27.6 MB - Last synced: 9 days ago - Pushed: over 3 years ago - Stars: 47 - Forks: 5

internetarchive/fatcat

Perpetual Access To The Scholarly Record

Language: Python - Size: 8.4 MB - Last synced: 28 days ago - Pushed: 6 months ago - Stars: 109 - Forks: 19

nla/pywb Fork of webrecorder/pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

Language: JavaScript - Size: 23.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 1 - Forks: 0

gwu-libraries/sfm-ui

Social Feed Manager user interface application.

Language: Python - Size: 44.6 MB - Last synced: 4 days ago - Pushed: 9 months ago - Stars: 150 - Forks: 27

internetarchive/sandcrawler

Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

Language: HTML - Size: 2.55 MB - Last synced: 28 days ago - Pushed: over 1 year ago - Stars: 23 - Forks: 2

ArchiveBox/DigestBox

DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.

Language: HTML - Size: 1.75 MB - Last synced: 10 days ago - Pushed: 3 months ago - Stars: 11 - Forks: 0

ArchiveBox/debian-archivebox

Home of the official apt/deb package for Ubuntu/Debian-based systems.

Language: Python - Size: 3.34 MB - Last synced: 10 days ago - Pushed: about 1 month ago - Stars: 17 - Forks: 5

nla/warcquet

Language: Java - Size: 44.9 KB - Last synced: about 2 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0

nla/pandora-labs

Australian web archive tools and experiments

Language: Python - Size: 8.79 KB - Last synced: about 2 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

q-m/replayweb.page-docker

Docker image for ReplayWeb.page

Language: Dockerfile - Size: 2.93 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 2 - Forks: 0

meequrox/flb-archiver

Flareboard web archiver in C using libcurl

Language: C - Size: 107 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0

webrecorder/dat-share

A prototype server to swarm multiple DATs for Webrecorder

Language: JavaScript - Size: 238 KB - Last synced: 10 days ago - Pushed: about 5 years ago - Stars: 12 - Forks: 4

ArchiveBox/homebrew-archivebox

Homebrew formula for the ArchiveBox self-hosted internet archiving solution.

Language: Ruby - Size: 61.8 MB - Last synced: 10 days ago - Pushed: 3 months ago - Stars: 24 - Forks: 3

machawk1/warcreate

Chrome extension to "Create WARC files from any webpage"

Language: JavaScript - Size: 2.23 MB - Last synced: 27 days ago - Pushed: 5 months ago - Stars: 192 - Forks: 12

rahiel/archiveror

Archiveror will help you preserve the webpages you love. 💾

Language: JavaScript - Size: 168 KB - Last synced: 2 months ago - Pushed: over 4 years ago - Stars: 384 - Forks: 43

TarekJor/bookmark-archiver Fork of ArchiveBox/ArchiveBox

🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...

Language: Python - Size: 2.65 MB - Last synced: 2 months ago - Pushed: over 5 years ago - Stars: 29 - Forks: 1

mrrfv/webArchive

Crawls websites and saves found URLs to a file.

Language: JavaScript - Size: 18.6 KB - Last synced: 17 days ago - Pushed: 3 months ago - Stars: 3 - Forks: 0

N0taN3rd/wail Fork of machawk1/wail

🐋 One-Click User Instigated Preservation

Language: JavaScript - Size: 421 MB - Last synced: 27 days ago - Pushed: over 5 years ago - Stars: 119 - Forks: 9

nla/wombat Fork of webrecorder/wombat

Wombat.js client-side rewriting library

Language: JavaScript - Size: 1.87 MB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

nla/httrack2warc

Converts HTTrack crawls to WARC files

Language: Java - Size: 155 KB - Last synced: 3 days ago - Pushed: 22 days ago - Stars: 27 - Forks: 6

xarantolus/Collect

A server to collect & archive websites that also supports video downloads

Language: TypeScript - Size: 2.07 MB - Last synced: 3 months ago - Pushed: about 1 year ago - Stars: 75 - Forks: 10

wdhdev/web-archiver 📦

Easily scrape, download and preview websites.

Language: EJS - Size: 664 KB - Last synced: about 9 hours ago - Pushed: 4 months ago - Stars: 1 - Forks: 0

zytedata/web-snap

Create "perfect" snapshots of web pages

Language: JavaScript - Size: 501 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 24 - Forks: 2

httpreserve/linkstat

CLI implementation of httpreserve that can test links and retrieve internet archive replacements

Language: Go - Size: 35.2 KB - Last synced: 3 days ago - Pushed: 9 months ago - Stars: 7 - Forks: 0

ArchivingToolsForWBM/AdvancedInternetArchiving

Makes saving pages in bulk to the Wayback Machine much easier

Language: HTML - Size: 396 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 2 - Forks: 1

Rhizome-Conifer/conifer-deploy

Conifer setup and deployment via Ansible

Language: Shell - Size: 22.5 KB - Last synced: 7 months ago - Pushed: almost 4 years ago - Stars: 13 - Forks: 6

rybesh/capture-urls

Archive a list of URLs using the Wayback Machine

Language: Python - Size: 31.3 KB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 5 - Forks: 0

LouayMagdy/webarchive-commons-py

Python Implementation for iipc/webarchive-commons

Language: Python - Size: 300 KB - Last synced: about 2 months ago - Pushed: 8 months ago - Stars: 0 - Forks: 0

pebnn/AutoInternetArchive

AutoInternetArchive is a very simple program designed to automatically archive webpages to the Wayback Machine at hourly intervals. AutoInternetArchive was designed to be run through a console window and left open for days or even months.

Language: Python - Size: 22.5 KB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 2 - Forks: 0

nla/chronicrawl 📦

Experimental continuous web crawler for web archiving

Language: Java - Size: 329 KB - Last synced: about 2 months ago - Pushed: over 1 year ago - Stars: 9 - Forks: 0

oduwsdl/oduwsdl.github.io

ODU Web Science and Digital Libraries Research Group (WS-DL) home page.

Language: HTML - Size: 47.1 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 2 - Forks: 36

shawnmjones/OffTopic-Detection Fork of yasmina85/OffTopic-Detection

This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

Language: Python - Size: 712 MB - Last synced: 9 months ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

mkrzmr/mkrzmr.github.io

Michael Kurzmeier, 4th year PhD Digital Humanities @Maynooth University

Size: 1.39 MB - Last synced: 9 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

nla/jwebrenderer

Simple web service to render pages with headless Chrome

Language: Java - Size: 17.6 KB - Last synced: about 2 months ago - Pushed: 11 months ago - Stars: 1 - Forks: 0

nla/chropro 📦

Chrome debugging protocol client for Java

Language: Java - Size: 115 KB - Last synced: about 2 months ago - Pushed: about 4 years ago - Stars: 10 - Forks: 2

dbeley/archiveboxmatic

ArchiveBoxMatic: configure ArchiveBox with the simplicity of a YAML file.

Language: Python - Size: 57.6 KB - Last synced: 2 months ago - Pushed: about 3 years ago - Stars: 14 - Forks: 3

webis-de/scriptor

Plug-and-play reproducible web analysis.

Language: JavaScript - Size: 1.59 MB - Last synced: 21 days ago - Pushed: 3 months ago - Stars: 6 - Forks: 1

nla/outbackproxy

HTTP/S proxy server which replays content from a web archive

Language: Java - Size: 26.4 KB - Last synced: about 2 months ago - Pushed: about 1 year ago - Stars: 3 - Forks: 0

nla/heritrixctl 📦

Heritrix runner and API client for Java

Language: Java - Size: 27.3 KB - Last synced: about 2 months ago - Pushed: over 4 years ago - Stars: 1 - Forks: 0

nla/heritrix3 Fork of internetarchive/heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Language: Java - Size: 10.3 MB - Last synced: about 2 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

nla/butterflynet 📦

Streamline single-document web archiving tool

Language: Java - Size: 163 KB - Last synced: about 2 months ago - Pushed: about 1 year ago - Stars: 1 - Forks: 0

oduwsdl/warrick

Recover lost websites from the Web Infrastructure

Language: HTML - Size: 2.66 MB - Last synced: about 1 year ago - Pushed: about 3 years ago - Stars: 78 - Forks: 10

ukwa/ukwa-manage

Shepherding our web archives from crawl to access.

Language: Jupyter Notebook - Size: 122 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 10 - Forks: 5

caltechlibrary/eprints2archives

Send records from an EPrints server to the Internet Archive and other web archives

Language: Python - Size: 504 KB - Last synced: about 1 month ago - Pushed: 12 months ago - Stars: 3 - Forks: 0

httpreserve/conventoarchiver

Repository for collecting scripts to help capture MyConvento newsroom press-releases from the MyConvento PR management suite. The README provides an analysis of the MyConvento URL architecture for users hoping to develop a solution for themselves.

Language: Python - Size: 23.4 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

ukwa/ukwa-ui

A new user interface for the UK Web Archive

Language: Java - Size: 170 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 6

helgeho/WarcPartitioner

Partition (W)ARC Files by MIME Type and Year

Language: Java - Size: 8.79 KB - Last synced: 3 days ago - Pushed: about 7 years ago - Stars: 1 - Forks: 1

TarekJor/instaloader Fork of instaloader/instaloader

Download pictures (or videos) along with their captions and other metadata from Instagram.

Language: Python - Size: 662 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

mhucka/devonagent-hacks

Scripts and other things for working with DEVONagent.

Language: AppleScript - Size: 16.6 KB - Last synced: 13 days ago - Pushed: almost 4 years ago - Stars: 1 - Forks: 0

httpreserve/mementoqa

QA Mementos using Screenshots

Language: HTML - Size: 410 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

httpreserve/wadl-2017

Resources for WADL 2017

Size: 4.84 MB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

httpreserve/million-dollar-webpage Fork of ross-spencer/million-dollar-webpage

HTTPreserve Analysis of Million Dollar Web Page

Size: 299 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

helgeho/HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

Language: Java - Size: 51.8 KB - Last synced: 3 days ago - Pushed: over 6 years ago - Stars: 9 - Forks: 3

austinfrey/pull-warc

pull-streaming WARC file operations

Language: JavaScript - Size: 19.5 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

ukwa/waybacks

This module builds our Waybacks in the various different configurations we require.

Language: Java - Size: 23.2 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 2 - Forks: 2

ngeraci/ucr-covid-bing-search

Quick script using Bing Web Search API to retrieve list of URLs for web archiving

Language: Python - Size: 8.79 KB - Last synced: about 1 year ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0

ross-spencer/million-dollar-webpage

HTTPreserve Analysis of Million Dollar Web Page

Size: 299 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 2

TarekJor/wpull Fork of ArchiveTeam/wpull

Wget-compatible web downloader and crawler.

Language: HTML - Size: 3.87 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0

TarekJor/DiscordMediaLoader Fork of Serraniel/DiscordMediaLoader

Discord Media Loader - Simply download all attachments

Language: C# - Size: 842 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

TarekJor/PixivUtil2 Fork of Nandaka/PixivUtil2

Download images from Pixiv and more!

Language: Python - Size: 11.4 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0

TarekJor/TumblThree Fork of johanneszab/TumblThree

A Tumblr Blog Backup Application

Language: C# - Size: 3 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 0 - Forks: 0