Mild warning: much of this post is rather heavy on tech jargon. My $DAYJOB is in the tech industry, but if that isn't your thing, you may not find this post terribly interesting.
When the closure notice for the old death-valley.net was posted barely 24 hours ago, I rushed to try to capture 18 years' worth of posts. The site apparently came into existence in 2002, but the oldest post that I could find was dated December 23, 2002. I didn't register on the site until June 2006, when the site was already 3.5 years old.
Capturing ~18 years of content was a bit of an adventure. The only good way to capture this data is to take a real dump of the backend database that supports the site, and only the admins have that ability.
As a result, I was left to scrape the data, which quite literally means crawling the entire site (as you would manually in your web browser) and saving the resulting pages. It's not an elegant process: a lot more than the actual posts ends up being saved (tons of HTML markup and other bits and pieces that aren't worth keeping), plus it's time-consuming.
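To make "crawling" concrete: a scraper just fetches a page, saves it, collects the links on it, and repeats until it runs out of new pages. A minimal sketch in Python (illustrative only, this is not the tool I ended up using, and it assumes the third-party requests and beautifulsoup4 packages) looks something like this:
Code:
# Rough illustration of crawl-and-save; the starting URL is the forum's front page.
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START = "http://www.death-valley.net/forum/"
HOST = urlparse(START).netloc            # used to keep the crawl on this site only

seen, queue = set(), [START]
while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)

    page = requests.get(url, timeout=30)
    if "text/html" not in page.headers.get("Content-Type", ""):
        continue                         # skip images, attachments, etc.

    # save the raw HTML to disk (the naming scheme here is arbitrary)
    with open(f"page_{len(seen):06d}.html", "w", encoding="utf-8") as f:
        f.write(page.text)

    # queue every link that points back into the same site
    for a in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == HOST:
            queue.append(link)

    time.sleep(1)                        # be polite: one request per second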
My first attempt was a Python tool from GitHub called phpBB_scraper. phpBB is the forum software that the old site ran on, and this tool theoretically had the ability to extract just the posts, without all the other bits that no one cares about. It seemed to work OK, but after multiple attempts it had only captured about 2,100 of the 70,000+ posts. There were no errors; it just silently stopped, with no indication of why most of the posts failed to be captured.
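For what it's worth, the extraction step those tools attempt isn't conceptually hard; the hard part is doing it reliably across tens of thousands of pages. Assuming the stock phpBB3 markup, where each post body sits inside a <div class="content"> element, pulling just the post text out of a saved topic page looks roughly like this (the filename is a placeholder):
Code:
# Sketch of extracting just the post text from one saved phpBB topic page.
# Assumes the default phpBB3 theme markup; a customized theme may use other classes.
from bs4 import BeautifulSoup

with open("viewtopic_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for post in soup.find_all("div", class_="content"):
    print(post.get_text(separator="\n").strip())
    print("-" * 40)   # divider between posts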
When it became clear that approach wasn't going to pan out, I spent a chunk of time googling for alternatives. Sadly, there aren't a lot of options for scraping phpBB sites. There were a few other (poorly maintained) tools, and they were even less effective than the one I had already tried.
That left me with just one option: the 'wget' tool, which is effectively a command-line Swiss Army knife of web tooling. It's officially described as "The non-interactive network downloader", which is a rather vague way of saying that it's a command-line web browser. It can do just about everything your web browser can do, but requires an insane number of options for anything beyond the most basic file downloads. wget actually has an explicit option for mirroring an entire website, and that option alone did much of the heavy lifting to pull down about 259 MB representing all 18 years of content.
The command that I ended up using is as follows:
Code:
wget -w 1 -a www.death-valley.net.log -m -e robots=off --no-check-certificate --keep-session-cookies --adjust-extension --convert-links --page-requisites --reject-regex='(/?p=|&p=|mode=reply|view=|mode=post|mode=email|mode=quote|mode=newtopic|login.php|search.php|feed.php)' 'http://www.death-valley.net/forum/'
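For anyone attempting something similar, a quick rundown of what those options do:
-w 1 waits one second between requests so the server isn't hammered
-a appends all of wget's output to a log file
-m turns on mirror mode (recursive downloading with timestamping)
-e robots=off tells wget to ignore robots.txt
--no-check-certificate skips TLS certificate validation
--keep-session-cookies keeps session cookies that would normally be discarded
--adjust-extension saves HTML pages with a .html suffix
--convert-links rewrites links so the mirror is browsable offline
--page-requisites also grabs the images and stylesheets the pages need
--reject-regex skips URLs that aren't worth saving (login and search pages, feeds, and the various reply/quote/new-post forms)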
EDIT: redacted