Ben Ward

Building the Great Libraries of the Internet with a DNS time machine

San Francisco

Much is being written today about the apparently inevitable acquisition of Tumblr by Yahoo. Much of what is being written consists of joking-not-joking about when the Archive Team should start mirroring the service to get a head start on its deletion, and other playful-but-deadly-serious snark at the obvious and somewhat reasonable ‘Geocities 2.0’ narrative.

I worked for Yahoo when Geocities died, and I wrote a lengthy email plea to then-CEO Carol Bartz that we should aid in its archiving (I'm sure I wasn't alone in doing so). To her credit, she responded positively and (perhaps consequently, perhaps coincidentally) Yahoo did add a feature for people to download their own sites. Yet we all know this was insufficient to save internet history, especially on a network so abandoned by its users for so many years.

I share a lot of the attitudes of Jason Scott and Jeremy Keith toward archiving the web. I share in the desperate desire that we do it, I share in the frustration when we fail, and I detest as much as anyone I know the weasel words ‘sunsetting’ and other such indirection.

I also think that in the easy satisfaction of venting toward large companies, we're fighting against something inevitable.

Everybody dies.

We can try very, very hard to live by the mantra that Cool URIs don't change. We implement redirects, and for the lifetime of our sites it's viable. But there's a disconnect between the expectation of ongoing maintenance of URLs, and the actual likelihood of a service doing so after a product ceases to be commercially viable.

In the best case you have services like Tumblr and Posterous, where users could use their own domain with the service. Eventually I'll import my Posterous posts onto benward.me and write some redirect magic in nginx to have posts.benward.me take you to roughly the right place.
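As a rough sketch of that redirect mapping (written here in Python rather than nginx configuration, purely for illustration), the logic could look something like the following. The `/posts` prefix and the assumption that imported posts keep their old path are hypothetical; the real layout would depend on how the import is done.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "posts.benward.me"  # the old Posterous-hosted subdomain
NEW_HOST = "benward.me"        # where the imported posts would live
NEW_PREFIX = "/posts"          # hypothetical path prefix for imported posts


def redirect_target(old_url: str) -> Optional[str]:
    """Return the new URL for an old post URL, or None if the host is not ours."""
    parts = urlsplit(old_url)
    if parts.hostname != OLD_HOST:
        return None
    # Assumption: the imported post keeps its old path under the new prefix.
    path = parts.path.rstrip("/")
    new_path = f"{NEW_PREFIX}{path}" if path else f"{NEW_PREFIX}/"
    return urlunsplit(("https", NEW_HOST, new_path, parts.query, ""))
```

A web server would answer any request for the old host with a permanent (301) redirect to the value this function returns, which is the "roughly the right place" part.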

Of course, most users do not use their own domain, or take their own backups. Even if you could encourage it, plenty of blogs are hosted on Tumblr for the very purpose of disparate anonymity.

When Twitter shutters Posterous, or Six Apart close Pownce, are those companies obligated to maintain a server to point those user.example.com URLs to their owners' new homes or Archive.org mirrors? What about after SAYmedia bought Six Apart's assets? What about the next owner after that? You could argue they should; it wouldn't be the most expensive piece of infrastructure running at Twitter, certainly. But still, extrapolate: Companies have boards and shareholders and (B-Corporations notwithstanding) they have a legally enshrined responsibility to deliver maximal shareholder value and so not spend money unnecessarily.

Services also have genuine quandaries with regard to data ownership. Transferring a database of users' data to a third party, even one that holds only published content, is legally non-trivial, if not impermissible.

If you were starting a new service today you could try to preempt this by including clauses in your ToS promising the transfer of public content to an archive (or its release via BitTorrent) in the event of closure. Perhaps you could enshrine both consent from the user and obligation on the part of future owners. But that's not services today, and it won't be all services in the future either.

What happens when any company genuinely goes bust? Their assets end up in the hands of liquidators. Who is obliged to run that server, and pay ICANN for the domain renewal?

How do those obligations apply to personal sites? I spend $20 per domain every few years to maintain my registrations of ben-ward.co.uk, ben-ward.com, and bnwrd.me. If I ever get that benward.com domain I'll be renewing five domains, pretty much out of pedantic principle. How lucky for me, where dropping $60+ on domain renewals is a comfortable privilege. It's certainly not a frugal use of my money. Many people around the world are struggling to make that sum feed their family for a fortnight, with nothing left over. Those people also use Tumblr.

One great success of Tumblr, Twitter, MySpace, Facebook et al is a generation of tools embraced ever more by society as a whole. The free-to-use models that fund them have enabled internet publishing to move beyond the privileged classes. The archives of this increasingly diverse web must not be an intellectual indulgence of those same few.

The Wayback Machine exists as a quite excellent tool for researching our brief digitised history, but is not a natural part of the web. When domains die, their fate is slow decline: Incompatible redirects (photos.yahoo.com to flickr.com) and derelict, catch-all closure pages (Geocities), which will eventually all be squatted over by SEO shitbathers.

We must concede that though we can maintain the paths of URIs over the lifetime of a service, most domain names are inevitably ephemeral. A two-year registration to host a joke, a fifteen-year registration to build a company. All will be resold.

What to do? We need to not fight the fragility. We need to look at the very heart of the web, the directory that connects the names of our services to the servers they run on, and we need to apply the concept of the Wayback Machine to it. We need temporal DNS, maintainable by librarians to keep the domains of the past connected to their archived futures. Your DNS provider as Time Lord*; rather than searching for what Geocities was like, picking a date at the DNS level could route all of your internet traffic through 1998.
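To make the idea slightly more concrete, here is a minimal sketch of what a temporal lookup might look like: a name plus a date goes in, and the archive host holding the snapshot for that era comes out. Everything in it is an assumption for illustration: the `TemporalZone` class, the snapshot dates, and the addresses (which come from the reserved documentation ranges) are invented, and nothing like this exists in real DNS today.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple


class TemporalZone:
    """Toy 'DNS library': maps a name plus a date to an archive host."""

    def __init__(self) -> None:
        # name -> list of (snapshot_start, archive_host), kept sorted by date
        self._history: Dict[str, List[Tuple[date, str]]] = {}

    def add(self, name: str, snapshot_start: date, archive_host: str) -> None:
        """Record that archive_host serves this name from snapshot_start on."""
        entries = self._history.setdefault(name.lower(), [])
        entries.append((snapshot_start, archive_host))
        entries.sort()

    def resolve(self, name: str, as_of: date) -> Optional[str]:
        """Return the archive host in effect on as_of, or None if none is known."""
        entries = self._history.get(name.lower(), [])
        starts = [start for start, _ in entries]
        # Find the latest snapshot that starts on or before the requested date.
        index = bisect_right(starts, as_of)
        if index == 0:
            return None
        return entries[index - 1][1]


# Hypothetical records: the dates and addresses are placeholders, not history.
zone = TemporalZone()
zone.add("geocities.com", date(1996, 1, 1), "192.0.2.10")
zone.add("geocities.com", date(2004, 1, 1), "198.51.100.20")

# "Set the dial" to mid-1998 and ask where geocities.com lived then.
print(zone.resolve("geocities.com", date(1998, 6, 1)))
```

The interesting design question is who gets to call `add`: in this sketch that would be the librarians, with each library free to point the same era of the same name at its own mirror.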

Search engines would crawl this archived web in parallel to the live data, month by month. If your search results came annotated ‘geocities.com, circa 2004’, then you could set the dial and go.

I think I'm proposing that we break DNS to save the web. Which has a gazillion repercussions, not least of which is a dependency on responsible DNS Libraries that could maintain a legacy record to geocities.yahoo.com for as long as the service is numb but somehow handle the duality when Tumblr gets rebranded (Note: JOKE.)

We already are putting most of that trust in a single entity: Archive.org. If we put the capabilities of the Wayback Machine lower down the hierarchy of web infrastructure, we could facilitate the creation and interoperability of multiple Great Libraries. What if we moved from archiving being a desperate, retrospective exercise in saving content, to archives and snapshots being a native piece of the web platform?

Meanwhile, we continue to do the best with what we have, and you can help by running Archive Team's Warrior appliance on any spare machines you have.

* Also. We must name this service TARDNS
