ArchiveBox

Open source self-hosted web archiving. Takes browser history, bookmarks, Pocket, Pinboard, and more as input, and saves HTML, JS, PDFs, media, and more.

ArchiveBox is a powerful self-hosted internet archiving solution written in Python. You feed it URLs of pages you want to archive, and it saves them to disk in a variety of formats, depending on your setup and the content of each page.

Run ArchiveBox via Docker Compose (recommended), Docker, Apt, Brew, or Pip (see below).


    pip3 install archivebox                              # or install via apt or brew (see below)

    archivebox init --setup                              # run this in an empty folder
    archivebox add 'https://example.com'                 # start adding URLs to archive
    curl https://example.com/rss.xml | archivebox add    # or add via stdin
    archivebox schedule --every=day https://example.com/rss.xml
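The stdin form of `archivebox add` shown above can also be driven from a script. The following is a minimal sketch, assuming `archivebox` is on your PATH and that the working directory is an initialized collection folder. The helper name `add_urls` and its parameters are hypothetical and exist only for illustration; they are not part of ArchiveBox.

```python
import subprocess
from typing import Iterable, Optional, Sequence


def add_urls(
    urls: Iterable[str],
    command: Sequence[str] = ("archivebox", "add"),  # assumed to be on PATH
    cwd: Optional[str] = None,                       # collection folder; None = current dir
) -> subprocess.CompletedProcess:
    """Pipe URLs, one per line, to the given command on stdin."""
    # Drop blank entries and surrounding whitespace before sending.
    lines = [u.strip() for u in urls if u.strip()]
    if not lines:
        raise ValueError("no URLs to add")
    payload = "\n".join(lines) + "\n"
    # check=True raises CalledProcessError if the command exits non-zero.
    return subprocess.run(
        list(command),
        input=payload,
        text=True,
        capture_output=True,
        check=True,
        cwd=cwd,
    )


if __name__ == "__main__":
    result = add_urls(["https://example.com"])
    print(result.stdout)
```

The `command` parameter is swappable so the helper can be exercised without ArchiveBox installed, for example by pointing it at a program that echoes stdin.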

For each URL added, ArchiveBox saves several types of HTML snapshot (wget, Chrome headless, singlefile), a PDF, a screenshot, a WARC archive, any git repositories, images, audio, video, subtitles, article text, and more.


    archivebox server --createsuperuser 0.0.0.0:8000     # use the interactive web UI
    archivebox list 'https://example.com'                # use the CLI commands (--help for more)
    ls ./archive/*/index.json                            # or browse directly via the filesystem

You can then manage your snapshots via the filesystem, CLI, Web UI, SQLite DB (./index.sqlite3), Python API (alpha), REST API (alpha), or desktop app (alpha).
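Because the index is a plain SQLite file, it can be inspected with any SQLite client. The sketch below uses Python's standard `sqlite3` module to open `./index.sqlite3` read-only and count the rows in each table. It deliberately discovers table names from `sqlite_master` instead of assuming any particular schema; the function name `table_row_counts` is hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Dict


def table_row_counts(db_path: str = "index.sqlite3") -> Dict[str, int]:
    """Return {table_name: row_count} for every table in the database."""
    # Open read-only via a file: URI so the index cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier; table names come from the database itself.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts


if __name__ == "__main__":
    for table, count in table_row_counts().items():
        print(f"{table}: {count} rows")
```

Opening the file read-only is a cautious default here, since the same database is used by the CLI and web UI.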

At the end of the day, the goal is to sleep soundly knowing that the part of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).

🖥 Supported OSs: Linux/BSD, macOS, Windows
🎮 CPU Architectures: x86, amd64, arm7, arm8 (raspi >=3)
📦 Distributions: docker/apt/brew/pip3/npm (in order of completeness)

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it’s to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010s Flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.

The balance between the permanent and the ephemeral nature of content on the internet is part of what makes it beautiful. I don’t think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives sites in several different formats, going beyond what public archiving services like Archive.org and Archive.is are capable of saving. Using multiple methods, and the market-dominant browser (Chrome, run headless) to execute JS, ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

All the archived links are stored by date bookmarked in ./archive/, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 to 100 years without needing to run ArchiveBox in a VM.
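Since each snapshot folder under ./archive/ carries its own index.json, the collection can be read without ArchiveBox at all. The following is a minimal sketch that walks `./archive/*/index.json` and loads each file. The field names `url` and `title` used in the printout are assumptions about the JSON contents, so they are read with `.get()` and a fallback; the function name `load_snapshot_indexes` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple


def load_snapshot_indexes(archive_dir: str = "archive") -> List[Tuple[str, Dict[str, Any]]]:
    """Return (folder_name, parsed_index) pairs for each readable index.json."""
    records = []
    for index_path in sorted(Path(archive_dir).glob("*/index.json")):
        try:
            data = json.loads(index_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed index files
        if isinstance(data, dict):
            records.append((index_path.parent.name, data))
    return records


if __name__ == "__main__":
    for folder, data in load_snapshot_indexes():
        # "url" and "title" are assumed keys; missing ones fall back to placeholders.
        print(folder, data.get("url", "?"), data.get("title", ""))
```

Only the standard library is used, in keeping with the goal that the archive stays readable with common software.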