Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: warc

forensic-toolkit/warc-browser

A CLI toolkit for working with web archives

Language: Go - Size: 469 KB - Last synced: about 13 hours ago - Pushed: 4 months ago - Stars: 2 - Forks: 0

AlexGustafsson/larch

A self-hosted service and toolset for managing, archiving, viewing and sharing bookmarks

Language: Go - Size: 256 KB - Last synced: about 13 hours ago - Pushed: almost 3 years ago - Stars: 4 - Forks: 0

harvard-lil/warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

Language: Python - Size: 1.7 MB - Last synced: 1 day ago - Pushed: 3 days ago - Stars: 41 - Forks: 7

toimik/WarcProtocol

Parser for WARC (aka WebArchive) files

Language: C# - Size: 180 KB - Last synced: 1 day ago - Pushed: 1 day ago - Stars: 8 - Forks: 3

webrecorder/browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Language: TypeScript - Size: 52.4 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 549 - Forks: 68

internetarchive/heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Language: Java - Size: 10.5 MB - Last synced: 2 days ago - Pushed: about 1 month ago - Stars: 2,698 - Forks: 756

Florents-Tselai/WarcDB

WarcDB: Web crawl data as SQLite databases.

Language: Python - Size: 51.7 MB - Last synced: 3 days ago - Pushed: 3 months ago - Stars: 384 - Forks: 11

CorentinB/warc

Read and write WARC files in Go

Language: Go - Size: 3.69 MB - Last synced: 3 days ago - Pushed: 3 days ago - Stars: 8 - Forks: 2

elbosso/warc2sitemap

This project is intended to turn a WARC file into a sitemap or into something (a graph description) one could build a sitemap from. The first release only offers to create a Graphviz file that can then be rendered - for example into SVG.

Language: Java - Size: 1010 KB - Last synced: 5 days ago - Pushed: 5 months ago - Stars: 1 - Forks: 0

openzim/warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

Language: Python - Size: 21.3 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 34 - Forks: 5

commoncrawl/news-crawl

News crawling with StormCrawler - stores content as WARC

Language: Java - Size: 231 KB - Last synced: 5 days ago - Pushed: 5 months ago - Stars: 251 - Forks: 31

chatnoir-eu/chatnoir-resiliparse

A robust web archive analytics toolkit

Language: Cython - Size: 1.87 MB - Last synced: 4 days ago - Pushed: 13 days ago - Stars: 43 - Forks: 8

ArchiveBox/ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Language: Python - Size: 7.73 MB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 19,808 - Forks: 1,077

helgeho/ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

Language: Scala - Size: 1.15 MB - Last synced: 4 days ago - Pushed: about 1 month ago - Stars: 141 - Forks: 19

natliblux/warc-safe

A tool for detecting viruses and NSFW material in WARC files

Language: Python - Size: 487 KB - Last synced: 8 days ago - Pushed: 9 days ago - Stars: 2 - Forks: 0

internetarchive/cdx-summary

Summarize web archive capture index (CDX) files.

Language: Python - Size: 227 KB - Last synced: 3 days ago - Pushed: almost 2 years ago - Stars: 47 - Forks: 7

webrecorder/webrecorder-player 📦

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

Language: JavaScript - Size: 6 MB - Last synced: 4 days ago - Pushed: over 3 years ago - Stars: 423 - Forks: 39

webrecorder/browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

Language: TypeScript - Size: 9.84 MB - Last synced: 11 days ago - Pushed: 11 days ago - Stars: 121 - Forks: 26

maxcountryman/warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

Language: Rust - Size: 124 KB - Last synced: 10 days ago - Pushed: 11 days ago - Stars: 99 - Forks: 0

webrecorder/replayweb.page

Serverless replay of web archives directly in the browser

Language: TypeScript - Size: 76.2 MB - Last synced: 9 days ago - Pushed: 10 days ago - Stars: 624 - Forks: 50

openzim/zimit-frontend

Zimit Public Web UI

Language: Vue - Size: 466 KB - Last synced: 9 days ago - Pushed: 10 days ago - Stars: 7 - Forks: 8

toimik/CommonCrawl

Common Crawl's processing tools

Language: C# - Size: 85.9 KB - Last synced: 10 days ago - Pushed: 10 days ago - Stars: 5 - Forks: 0

cooljeanius/wget-warc Fork of alard/wget-warc 📦

This is an old version of the WARC patches. Wget v1.14 and higher have WARC support.

Size: 4.31 MB - Last synced: 10 days ago - Pushed: over 3 years ago - Stars: 1 - Forks: 0

CGamesPlay/chronicler

Offline-first web browser

Language: JavaScript - Size: 243 KB - Last synced: 4 days ago - Pushed: over 5 years ago - Stars: 83 - Forks: 5

machawk1/wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

Language: Roff - Size: 832 MB - Last synced: 4 days ago - Pushed: 6 months ago - Stars: 343 - Forks: 32

cocrawler/cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Language: Python - Size: 183 KB - Last synced: 7 days ago - Pushed: 3 months ago - Stars: 153 - Forks: 29

oduwsdl/off-topic-memento-toolkit

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

Language: Python - Size: 93.7 MB - Last synced: 12 days ago - Pushed: over 2 years ago - Stars: 8 - Forks: 4

webrecorder/warcio

Streaming WARC/ARC library for fast web archive IO

Language: Python - Size: 285 KB - Last synced: 4 days ago - Pushed: 13 days ago - Stars: 345 - Forks: 54

trenton-telge/WebArchiveX

A more compressed alternative to WARC web archival. Command line tool built in Kotlin.

Language: Kotlin - Size: 12.7 KB - Last synced: 13 days ago - Pushed: almost 6 years ago - Stars: 1 - Forks: 0

mikwielgus/forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

Language: Python - Size: 391 KB - Last synced: 9 days ago - Pushed: 8 months ago - Stars: 60 - Forks: 1

oduwsdl/ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Language: Python - Size: 6.25 MB - Last synced: 6 days ago - Pushed: 18 days ago - Stars: 590 - Forks: 39

datacoon/metawarc

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives

Language: Python - Size: 74.2 KB - Last synced: 14 days ago - Pushed: 18 days ago - Stars: 23 - Forks: 0

jedireza/warc

⚙️ A Rust library for reading and writing WARC files

Language: Rust - Size: 71.3 KB - Last synced: 3 days ago - Pushed: 4 months ago - Stars: 40 - Forks: 10

internetarchive/scrapy-warcio

Support for writing WARC files with Scrapy

Language: Python - Size: 31.3 KB - Last synced: 3 days ago - Pushed: over 4 years ago - Stars: 14 - Forks: 6

centic9/CommonCrawlDocumentDownload

A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika

Language: Java - Size: 975 KB - Last synced: 10 days ago - Pushed: 21 days ago - Stars: 58 - Forks: 20

nlnwa/warchaeology

Command line tool for digging into WARC files

Language: Go - Size: 3.65 MB - Last synced: 28 days ago - Pushed: 30 days ago - Stars: 21 - Forks: 3

N0taN3rd/node-warc

Parse And Create Web ARChive (WARC) files with node.js

Language: JavaScript - Size: 7.99 MB - Last synced: 27 days ago - Pushed: over 1 year ago - Stars: 91 - Forks: 23

govau/warcraider

Convert WARC files into Avro for big data processing

Language: HTML - Size: 181 KB - Last synced: 28 days ago - Pushed: about 4 years ago - Stars: 0 - Forks: 0

govau/wofg-web-filters

Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service

Language: Groovy - Size: 54.7 KB - Last synced: 28 days ago - Pushed: about 5 years ago - Stars: 1 - Forks: 0

ArchiveTeam/grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Language: Python - Size: 1.24 MB - Last synced: 28 days ago - Pushed: about 2 months ago - Stars: 1,257 - Forks: 121

Rhizome-Conifer/conifer

Collect and revisit web pages.

Language: Python - Size: 25.5 MB - Last synced: 28 days ago - Pushed: 6 months ago - Stars: 1,459 - Forks: 117

webrecorder/cdxj-indexer

CDXJ Indexing of WARC/ARCs

Language: Python - Size: 82 KB - Last synced: 11 days ago - Pushed: almost 2 years ago - Stars: 21 - Forks: 9

pirate/internet-archiving-talk

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

Language: JavaScript - Size: 27.6 MB - Last synced: 10 days ago - Pushed: over 3 years ago - Stars: 47 - Forks: 5

ArchiveBox/DigestBox

DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.

Language: HTML - Size: 1.75 MB - Last synced: 11 days ago - Pushed: 3 months ago - Stars: 11 - Forks: 0

ArchiveTeam/wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

Language: C - Size: 28.8 MB - Last synced: 28 days ago - Pushed: 3 months ago - Stars: 80 - Forks: 14

orottier/rust-warc

A high performance and easy to use Web Archive (WARC) file reader

Language: Rust - Size: 10.7 KB - Last synced: 3 days ago - Pushed: almost 5 years ago - Stars: 9 - Forks: 3

machawk1/warcreate

Chrome extension to "Create WARC files from any webpage"

Language: JavaScript - Size: 2.23 MB - Last synced: 28 days ago - Pushed: 5 months ago - Stars: 192 - Forks: 12

cocrawler/cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Language: Python - Size: 911 KB - Last synced: 25 days ago - Pushed: about 2 years ago - Stars: 176 - Forks: 25

N0taN3rd/wail Fork of machawk1/wail

🐋 One-Click User Instigated Preservation

Language: JavaScript - Size: 421 MB - Last synced: 28 days ago - Pushed: over 5 years ago - Stars: 119 - Forks: 9

crissyfield/troll-a

Drill into WARC web archives

Language: Go - Size: 199 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 84 - Forks: 9

ArchiveTeam/WebArchiver

Decentralized web archiving

Language: Python - Size: 323 KB - Last synced: 28 days ago - Pushed: almost 6 years ago - Stars: 19 - Forks: 3

wabarc/warcraft

A toolkit to help download a webpage as a WARC file

Language: Go - Size: 44.9 KB - Last synced: 9 days ago - Pushed: 9 days ago - Stars: 1 - Forks: 0

jasonmtroos/ccwarcs

R package to provide access to Common Crawl WARC files via Amazon Web Services

Language: R - Size: 566 KB - Last synced: 5 months ago - Pushed: over 4 years ago - Stars: 1 - Forks: 0

Mixnode/mixnode-warcreader-php

Read Web ARChive (WARC) files in PHP.

Language: PHP - Size: 7.81 KB - Last synced: 17 days ago - Pushed: about 7 years ago - Stars: 21 - Forks: 3

ruarxive/awesome-digital-preservation

Awesome list dedicated to digital and data preservation tools, sources, services and so on.

Size: 7.81 KB - Last synced: about 15 hours ago - Pushed: over 1 year ago - Stars: 14 - Forks: 2

sepastian/warc2corpus

Extract structured data from HTML pages in WARCs through CSS selectors.

Language: HTML - Size: 5.25 MB - Last synced: 18 days ago - Pushed: about 1 year ago - Stars: 4 - Forks: 0

bitextor/bitextor

Bitextor generates translation memories from multilingual websites

Language: Python - Size: 177 MB - Last synced: 6 months ago - Pushed: 8 months ago - Stars: 265 - Forks: 45

PromyLOPh/crocoite 📦

Web archiving using Google Chrome

Language: Python - Size: 424 KB - Last synced: 4 days ago - Pushed: over 4 years ago - Stars: 42 - Forks: 7

antiufo/iabak-sharp

A C# implementation for the INTERNETARCHIVE.BAK project

Language: C# - Size: 138 KB - Last synced: 7 months ago - Pushed: over 1 year ago - Stars: 4 - Forks: 0

wsdookadr/femtocrawl

minimalistic crawler

Language: Python - Size: 29.3 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 2 - Forks: 0

archivesunleashed/warclight 📦

A Rails engine supporting the discovery of web archives.

Language: Ruby - Size: 13.3 MB - Last synced: 4 days ago - Pushed: 11 months ago - Stars: 48 - Forks: 10

shawnmjones/OffTopic-Detection Fork of yasmina85/OffTopic-Detection

This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

Language: Python - Size: 712 MB - Last synced: 9 months ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

marhop/vim-warc

Vim syntax highlighting for WARC files

Language: Vim script - Size: 2.93 KB - Last synced: 9 months ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

datatogether/archive 📦

Go package for creating and working with WARC and CDXJ archives

Language: Go - Size: 20.5 KB - Last synced: 10 months ago - Pushed: about 6 years ago - Stars: 2 - Forks: 1

Mixnode/mixnode-warcreader-java

Read Web ARChive (WARC) files in Java.

Language: Java - Size: 17.6 KB - Last synced: 11 months ago - Pushed: about 7 years ago - Stars: 9 - Forks: 5

marinoandrea/wikidata-entity-linking

CLI to extract named entities from web pages and link them to potential entity candidates in the WikiData knowledge base.

Language: Python - Size: 3.5 MB - Last synced: 12 months ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0

code402/warc-benchmark

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

Language: Shell - Size: 24.4 KB - Last synced: 3 months ago - Pushed: about 3 years ago - Stars: 4 - Forks: 8

ukwa/ukwa-manage

Shepherding our web archives from crawl to access.

Language: Jupyter Notebook - Size: 122 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 10 - Forks: 5

archivetheweb/arweave-warc-renderer

ANS-108 implementation for Archive The Web. This allows gateways on Arweave to render ATW's WARC files.

Language: JavaScript - Size: 698 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

edgi-govdata-archiving/eis-WARC-archiver 📦

ARCHIVED--Docker app to crawl URLs and generate WARCs

Language: Python - Size: 28.1 MB - Last synced: 27 days ago - Pushed: about 7 years ago - Stars: 10 - Forks: 5

datatogether/warc 📦

Golang WARC (Web ARChive) Library

Language: Go - Size: 229 KB - Last synced: 10 months ago - Pushed: almost 5 years ago - Stars: 29 - Forks: 7

laxika/java-warc Fork of Mixnode/mixnode-warcreader-java 📦

Read Web ARChive (WARC) files in Java.

Language: Java - Size: 130 KB - Last synced: 10 months ago - Pushed: over 4 years ago - Stars: 3 - Forks: 1

sebastian-nagel/warc-crawler

Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr

Language: FLUX - Size: 44.9 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 6 - Forks: 1

bottomless-archive-project/java-warc Fork of laxika/java-warc

Read Web ARChive (WARC) files in Java.

Language: Java - Size: 185 KB - Last synced: 11 months ago - Pushed: over 2 years ago - Stars: 5 - Forks: 0

antiufo/Shaman.Dokan.Warc

Mounts WARC files on Windows

Language: C# - Size: 241 KB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 17 - Forks: 1

pisa-engine/warcpp

A C++ parser for the Web Archive (WARC) format.

Language: C++ - Size: 43.9 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

hadrianw/abracabra

Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

Language: Rust - Size: 9.77 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

helgeho/WarcPartitioner

Partition (W)ARC Files by MIME Type and Year

Language: Java - Size: 8.79 KB - Last synced: 4 days ago - Pushed: about 7 years ago - Stars: 1 - Forks: 1

hrbrmstr/jwatr

📇 Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R

Language: R - Size: 38.6 MB - Last synced: 3 months ago - Pushed: over 6 years ago - Stars: 7 - Forks: 1

hrbrmstr/warc

📇 Tools to Work with the Web Archive Ecosystem in R

Language: R - Size: 2.52 MB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 21 - Forks: 3

helgeho/HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

Language: Java - Size: 51.8 KB - Last synced: 4 days ago - Pushed: over 6 years ago - Stars: 9 - Forks: 3

tokenmill/common-crawl-utils

Various Common Crawl utilities in Clojure.

Language: Clojure - Size: 54.7 KB - Last synced: 5 days ago - Pushed: 5 months ago - Stars: 6 - Forks: 1

antiufo/Shaman.Scraping

A C# library for reading/writing WARC files and scraping websites.

Language: C# - Size: 79.1 KB - Last synced: 3 months ago - Pushed: about 5 years ago - Stars: 7 - Forks: 3

austinfrey/pull-warc

pull-streaming WARC file operations

Language: JavaScript - Size: 19.5 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

hadrianw/abracabra-legacy

A search engine, but currently a filtering pipeline for WARC files. Legacy repo, look for abracabra repo.

Language: Go - Size: 21.5 KB - Last synced: about 1 year ago - Pushed: almost 5 years ago - Stars: 0 - Forks: 0

pierlauro/MDBubing

From WARC records to MongoDB documents

Language: Java - Size: 145 KB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 1 - Forks: 0

miku/ttarc

Minimalistic TikTok trending archiver.

Language: HTML - Size: 5.27 MB - Last synced: about 1 year ago - Pushed: over 3 years ago - Stars: 2 - Forks: 0

ukwa/waybacks

This module builds our Waybacks in the various different configurations we require.

Language: Java - Size: 23.2 MB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 2 - Forks: 2

bottomless-archive-project/common-crawl-client

This library is a very lightweight client to Common Crawl's WARC files.

Language: Java - Size: 55.7 KB - Last synced: about 1 year ago - Pushed: over 4 years ago - Stars: 0 - Forks: 0

hrbrmstr/jwatjars

Java '.jar' Files for 'jwatr'

Language: R - Size: 401 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 2 - Forks: 0

cldellow/gzip

A fork of java.util.zip.GZIPInputStream that emits the offsets of nested streams.

Language: Java - Size: 21.5 KB - Last synced: 11 months ago - Pushed: about 5 years ago - Stars: 1 - Forks: 0

bobpoekert/ocamlwarc

WARC parser for OCaml

Language: OCaml - Size: 5.72 MB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 0 - Forks: 0

info-labs/owlbot

WARC archive crawler

Language: Python - Size: 45.9 KB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 0 - Forks: 0

ggodreau/huhdewp

Hadoop streaming EMR job

Language: Python - Size: 27.3 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

VAle512/WarcExtractor

A simple WARC extractor that extracts HTML from WARC files!

Language: Java - Size: 23.4 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 6 - Forks: 0

dlrobertson/warc-c

A WIP WARC parser in C

Language: C - Size: 85 KB - Last synced: about 1 year ago - Pushed: almost 6 years ago - Stars: 2 - Forks: 0

sara-nl/spark-warcutils-example

Example of using warcutils with Apache Spark

Language: Scala - Size: 55.7 KB - Last synced: about 1 year ago - Pushed: almost 7 years ago - Stars: 0 - Forks: 1

Vikasg7/warc-reader

ES6 Class to read .warc or .warc.gz file member by member in nodejs

Language: TypeScript - Size: 5.86 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0

Vikasg7/warc-stream

Transform stream to read .warc or .warc.gz file member by member in nodejs

Language: TypeScript - Size: 5.86 KB - Last synced: about 1 year ago - Pushed: over 6 years ago - Stars: 0 - Forks: 0