
wpull running on Ubuntu 20.10

As some of you might know, I have a tendency to be an archivist sometimes. This makes me appreciate the online resources we have, many of which are not well known, such as the Internet Archive or the Wayback Machine. Luckily there are more archivists online, for example the Archive Team and like-minded people at DataHoarder (reddit). And apparently there is also an awesome web archiving list. anarcat also has a good post about Website mirroring and archival.

My current (unfinished) goal is to archive/back up a specific website while being a good netizen about it (e.g. not hammering the website or abusing the archived content). Of course you can use wget or httrack, but I also found wpull.

To get wpull installed and running on my Ubuntu 20.10 machine, I had to do the following:

pip3 install wpull
pip3 install tornado==4.5.3
pip3 install html5lib==0.9999
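A quick way to confirm that the pins actually took effect is to compare the installed versions against them. This is only a minimal sketch: the helper names `installed_version` and `find_mismatches` are made up for illustration, and it assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata  # standard library since Python 3.8

# Pins taken from the pip3 commands above.
PINS = {"tornado": "4.5.3", "html5lib": "0.9999"}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

If the script prints nothing, both packages match their pins. A `found` value of `None` means the package is not installed for the interpreter running the script.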

Note: I found the last two commands in issue 384, thanks to m4ntic0r.

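Apparently Python >= 3.7 is not really supported (see issue 404), but luckily francisg-gc has a pull request with a fix. The problem is that async became a reserved keyword in Python 3.7, so asyncio.async(...) no longer parses. So edit ~/.local/lib/python3.8/site-packages/wpull/driver/process.py and change: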

self._stderr_reader = asyncio.async(self._read_stderr())
self._stdout_reader = asyncio.async(self._read_stdout())

to:

_async = getattr(asyncio, 'async')
self._stderr_reader = _async(self._read_stderr())
self._stdout_reader = _async(self._read_stdout())
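For context, `asyncio.async` was a deprecated alias of `asyncio.ensure_future`, so calling `ensure_future` directly should be an equivalent way to schedule the two readers. The sketch below shows the idea in isolation and is untested against wpull itself; `read_stream` is a hypothetical stand-in for wpull's `_read_stderr` and `_read_stdout` coroutines.

```python
import asyncio


async def read_stream(label):
    """Stand-in for wpull's _read_stderr/_read_stdout coroutines (hypothetical)."""
    await asyncio.sleep(0)
    return f"{label} done"


async def main():
    # ensure_future schedules the coroutines as tasks, which is what
    # asyncio.async(...) did before `async` became a reserved word.
    stderr_reader = asyncio.ensure_future(read_stream("stderr"))
    stdout_reader = asyncio.ensure_future(read_stream("stdout"))
    return await asyncio.gather(stderr_reader, stdout_reader)


def is_valid_syntax(source):
    """Return True if this interpreter can parse the given source."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print(asyncio.run(main()))
    # On Python >= 3.7 the old spelling does not even parse.
    print(is_valid_syntax("asyncio.async(coro)"))
```

The `getattr` trick works only while the interpreter still exposes the old alias under that name, whereas `ensure_future` is the documented function.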

And off you go: wpull is working and creating WARCs.