
Posts about archive

Game Not Over has a good newsletter, and the newest edition contained a link to the video of their Game Not Over virtual event, which I really recommend watching.

They have a fun and interesting chat with John Carmack, followed by a panel discussion. Topics touched on in the first 30 minutes: open source, Wolfenstein 3D, Doom, Quake and "Hackers: Heroes of the Computer Revolution".

P.S. Please consider donating to the Internet Archive and Wikipedia.

wpull running on Ubuntu 20.10

As some of you might know, I tend to be an archivist at times. This makes me appreciate the online resources we have (many of which are not well known), such as the Internet Archive or the Wayback Machine. Luckily there are more archivists online, for example the Archive Team and like-minded people at DataHoarder (reddit). And apparently there is also an awesome web archiving list. anarcat also has a good post about Website mirroring and archival.

My current (unfinished) goal is to archive/back up a specific website while being a good netizen about it (e.g. not hammering the website or abusing the archived content). Of course you can use wget or httrack, but I also found wpull.
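
To make "not hammering the website" a bit more concrete, here is a minimal Python sketch of the idea. It is not how wpull works internally. The function name polite_crawl, the user agent string "my-archiver", the one-second delay and the injected fetch callable are all assumptions made up for this illustration. It only uses urllib.robotparser from the standard library to honour robots.txt rules and pauses between requests.

```python
import time
import urllib.robotparser


def polite_crawl(urls, fetch, robots_lines, user_agent="my-archiver",
                 delay=1.0, sleep=time.sleep):
    """Fetch allowed URLs one by one, pausing between requests.

    `fetch` is a hypothetical callable (url -> content) supplied by the
    caller; `robots_lines` are the lines of the site's robots.txt.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)

    results = {}
    for url in urls:
        # Skip anything the site asked crawlers not to fetch.
        if not parser.can_fetch(user_agent, url):
            continue
        # Wait before every request except the first one.
        if results:
            sleep(delay)
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    robots = ["User-agent: *", "Disallow: /private/"]
    pages = ["https://example.org/a", "https://example.org/private/b"]
    # A stand-in fetcher so the sketch runs without network access.
    print(polite_crawl(pages, fetch=lambda u: f"<content of {u}>",
                       robots_lines=robots, delay=0.1))
```

Passing the fetcher and the sleep function in as parameters keeps the sketch testable offline; in real use you would plug in an actual HTTP client and pick a delay the site can comfortably handle.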

To get wpull installed and running on my Ubuntu 20.10 machine, I had to do the following:

pip3 install wpull
pip3 install tornado==4.5.3
pip3 install html5lib==0.9999

Note: I found the last two commands in issue 384, thanks to m4ntic0r.
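
If you want to double-check that the pinned versions really ended up installed, something like the following sketch could help. It assumes Python 3.8 or newer (for importlib.metadata); the PINS dictionary and the check_pins helper are names invented for this example, and the version strings pip reports may be normalized slightly differently from what was typed on the command line.

```python
from importlib import metadata

# Versions pinned by the pip commands in this post.
PINS = {"tornado": "4.5.3", "html5lib": "0.9999"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINS)
    for name, (wanted, installed) in problems.items():
        print(f"{name}: wanted {wanted}, found {installed}")
    if not problems:
        print("all pinned versions match")
```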

Apparently Python >= 3.7 is not really supported (see issue 404), but luckily francisg-gc has a pull request with a fix. So edit ~/.local/lib/python3.8/site-packages/wpull/driver/ and change:

self._stderr_reader = asyncio.async(self._read_stderr())
self._stdout_reader = asyncio.async(self._read_stdout())

to:

_async = getattr(asyncio, 'async')
self._stderr_reader = _async(self._read_stderr())
self._stdout_reader = _async(self._read_stdout())

And off you go. wpull now works and creates WARC files.