death-valley.net shutdown & data scrape
#1
Mild warning: much of this post is rather heavy on tech jargon. My $DAYJOB is in the tech industry, but if this isn't your thing, you may not find this post terribly interesting.



When the closure notice for the old death-valley.net was posted barely 24 hours ago, I rushed to try to capture 18 years' worth of posts. The site apparently came into existence in 2002, but the oldest post that I could find was dated December 23, 2002. I didn't register on the site until June 2006, when the site was already 3.5 years old.



Capturing ~18 years of content was a bit of an adventure, as the only good way to capture this data is by taking a real dump of the backend database that supports the site, and only the admins have that ability.



As a result, I was left to scrape the data, which quite literally means crawling the entire site (as you would manually in your web browser) and saving the resulting data. It's not an elegant process, as a lot more than the actual posts ends up being saved (tons of HTML, and other bits and pieces that are not worth saving), plus it's time-consuming.
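For anyone curious what "crawling" means mechanically, here is a minimal sketch in Python (standard library only). It is not the tool that was actually used for the scrape. The names `LinkParser`, `fetch` and `crawl`, and the `max_pages` cap, are all made up for illustration. It fetches a page, pulls out every link, and queues up any link on the same host that it hasn't seen yet.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=1000):
    """Breadth-first crawl of same-host pages; returns {url: html}."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawl would also want a delay between requests, error handling for failed fetches, and some way to skip useless URLs (reply forms, search pages and so on). Note that everything on each page is kept, not just the posts, which is exactly the inelegance described above.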



My first attempt used a Python tool from GitHub called phpBB_scraper. phpBB is the software that the old forum ran on, and this tool theoretically had the ability to extract just the posts, without all the other bits that no one cares about. It seemed to work OK, but after multiple attempts, it only captured about 2100 of the 70,000+ posts. There were no errors; it just silently stopped, with no indication of why most of the posts failed to get captured.
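A silent shortfall like that is easy to miss, so a scrape is worth checking against the expected post count. The sketch below is a generic check and not part of phpBB_scraper. The function name `check_coverage` and the 95% threshold are assumptions chosen for illustration, and the numbers in the usage line are just the rough figures quoted above.

```python
def check_coverage(captured, expected, min_fraction=0.95):
    """Return captured/expected, raising if it falls below min_fraction."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    fraction = captured / expected
    if fraction < min_fraction:
        raise RuntimeError(
            f"only {captured} of {expected} posts captured ({fraction:.1%})"
        )
    return fraction
```

Calling `check_coverage(2100, 70000)` raises an error instead of quietly finishing, since 2100 of 70,000 is only about 3% of the posts.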



When it became clear that approach wasn't going to pan out, I spent a chunk of time googling for alternatives. Sadly, there aren't a lot of options for scraping phpBB sites. There were a few other (poorly maintained) tools, and they were even less effective than the one that I had already tried.



That left me with just one option, the 'wget' tool, which is effectively a command line Swiss Army knife of web tooling. It's officially described as "The non-interactive network downloader", which is a rather vague way of saying that it's a command line web browser. It can fetch nearly anything your web browser can (though it doesn't run JavaScript), but requires an insane number of options for anything beyond the most basic file download attempts. wget actually has an explicit option for mirroring an entire website (-m, or --mirror), and that option alone did much of the heavy lifting to pull down about 259 MB representing all 18 years of content.



The command that I ended up using is as follows:



Code:
wget -w 1 -a www.death-valley.net.log -m -e robots=off --no-check-certificate --keep-session-cookies --adjust-extension --convert-links --page-requisites --reject-regex='(/?p=|&p=|mode=reply|view=|mode=post|mode=email|mode=quote|mode=newtopic|login.php|search.php|feed.php)' 'http://www.death-valley.net/forum/'
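For readers who don't speak wget, here is the same command laid out as an annotated Python list, plus a small helper showing what the --reject-regex filter does to a few URLs. The flag descriptions are paraphrased from the wget manual. The sample URLs are hypothetical phpBB-style addresses, not ones taken from the real site. Python's `re` is used as a stand-in for wget's own regex engine, on the assumption that both treat this simple pattern the same way and that wget searches for the pattern anywhere in the URL.

```python
import re
import shlex

# The --reject-regex value, copied verbatim from the command above.
REJECT_PATTERN = (
    "(/?p=|&p=|mode=reply|view=|mode=post|mode=email|mode=quote"
    "|mode=newtopic|login.php|search.php|feed.php)"
)

WGET_ARGS = [
    "wget",
    "-w", "1",                         # wait 1 second between requests
    "-a", "www.death-valley.net.log",  # append log messages to this file
    "-m",                              # mirror: recursion, no depth limit, timestamping
    "-e", "robots=off",                # ignore robots.txt
    "--no-check-certificate",          # don't validate TLS certificates
    "--keep-session-cookies",          # per the manual, only matters with --save-cookies
    "--adjust-extension",              # add .html to saved HTML files that lack it
    "--convert-links",                 # rewrite links so the copy browses offline
    "--page-requisites",               # also fetch images, CSS, etc. that pages need
    "--reject-regex=" + REJECT_PATTERN,  # skip any URL matching this pattern
    "http://www.death-valley.net/forum/",
]

REJECT = re.compile(REJECT_PATTERN)


def is_rejected(url):
    """True if the reject pattern matches anywhere in the URL."""
    return REJECT.search(url) is not None


if __name__ == "__main__":
    # Print the command rather than run it; the old site is gone anyway.
    print(shlex.join(WGET_ARGS))
    base = "http://www.death-valley.net/forum/"
    for path in (
        "viewtopic.php?f=3&t=1234",           # a topic page: kept
        "viewtopic.php?f=3&t=1234&start=15",  # page 2 of a topic: kept
        "viewtopic.php?p=5678",               # single-post link: skipped
        "posting.php?mode=reply&f=3&t=1234",  # reply form: skipped
    ):
        print("skip" if is_rejected(base + path) else "keep", path)
```

The idea of the filter is to keep topic pages while skipping per-post permalinks, print views, and reply/quote/email forms, which would otherwise multiply the download with duplicate content. One caveat: as written, `/?p=` means "an optional slash followed by `p=`", so it matches any parameter whose name ends in `p`, and the unescaped dots match any character. A stricter version would use `\?p=` and `\.php`, but for typical phpBB URLs the looser pattern gives the same result.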



EDIT: redacted
Reply
#2
Actually, I found this quite interesting, thanks.

I'd like to help, but I'm crippled by 1) not having a high-speed internet connection at home, and 2) not being anywhere near as techie as you clearly are!
Reply
#3
Thanks for making your scrape of the site available for download. At the very least it's a great resource / archive for those of us that were there ...
Reply
#4
I'm happy to help reconstitute content if it comes to that.
Reply
#5
Nice!! I just downloaded the file and it works great on my Mac (I used Archive Utility to open the .tar file, then Safari to view index.html, and checked a few random posts). Thank you!
Link to my DV trip reports, and map of named places in DV (official and unofficial): http://kaurijacobphotography.yolasite.com
Reply
#6
A tidbit of history, if it helps.

I don’t recall exactly when D-V.net’s predecessor arrived, but it was DeathValley.us. It was run by someone who called himself Bighorn online. I found the site in its early days and joined. It was far more civilized than the original DV Talk boards, which had a difficult-to-follow, flowchart-style thread structure and were occasionally hostile and often contentious. A lot of us authors developed relationships and friendships via DV.us.

Around 2004 or 2005, Bighorn shut down DV.us for reasons I don’t clearly understand, and I’m not going to speculate based on what little I do know.

About 2006, Dan resurrected DV.us as the D-V.net that we’ve all come to know and love. I’m pretty sure he also obtained DV.us’s archives.

It sounds as if you managed to gather the vast majority of archived information. Not being tech savvy, nor having the means to store such, I won’t be able to take advantage of the archives. And in those days, I was as much a soapbox orator on all things corrupt in the park service, rampant wilderness creation, pilfering and stealing of artifacts, et cetera, as I was a poster of historical information. Working in Trona as I did until 2004 can do that to ya ...  Rolleyes
DAW
~When You Live in Nevada, "just down the road" is anywhere in the line of sight within the curvature of the earth.
Reply
#7
I've never been an especially prolific poster, but I've been around since dv.us.

Thanks to netllama for grabbing what you could of d-v.net content. Within that file is a great deal of historical information that would be a shame to lose.

Dan had his reasons for pulling the plug so quickly, and they are not likely to surface publicly. Nonetheless, now is as good a time as any to start fresh. Let us hope that this board will be just as civil as d-v.net has been for its lifespan.

David Bricker / SYR - ITO
Reply
#8
Wow, good old "wget" my long lost friend. In my youth (ahem) I did plenty of screen scraping off web sites (I was repurposing all sorts of stuff for voice access over phones in the 90s). I'll grab that tarball. Going through old posts locally will be a lot faster too Smile

I guess one advantage of not having local storage of images in the forums is that the archives are a whole lot smaller.

Thanks much netllama!
Reply
#9
Wow, netllama, if this is what you call "scraping by" then you deserve the Nobel Prize for Preservation for capturing the previous archives and for launching a fantastic new forum. Mucho appreciation and good karma to you! And the tech description of your process was quite fascinating and set a great context for me as a non-tech person to appreciate your efforts all the more. If MojaveGeek vouches for you then I'm all in as well.

Oh and thanks to our moderators as well, what a team of troopers you all are!
Reply
#10
(2020-09-17, 04:52 PM)DeathValleyDazed Wrote: Wow, netllama, if this is what you call "scraping by" then you deserve the Nobel Prize for Preservation for capturing the previous archives and for launching a fantastic new forum. Mucho appreciation and good karma to you! And the tech description of your process was quite fascinating and set a great context for me as a non-tech person to appreciate your efforts all the more. If MojaveGeek vouches for you then I'm all in as well.

Oh and thanks to our moderators as well, what a team of troopers you all are!

I'm thrilled to see that you've signed up here. Your contributions on the old forum were fantastic, and it would be a huge loss if you didn't join us here.
Reply

