Portico’s Response to the CNI Spring 2009 Task Force Plenary

At the opening plenary of the CNI Spring 2009 Task Force meeting, David Rosenthal, Chief Scientist, LOCKSS Program, Stanford University, offered his perspective on today’s greatest digital preservation challenges and how these differ from the challenges expected in 1995 by Jeff Rothenberg in his seminal Scientific American article “Ensuring the Longevity of Digital Documents”. Slides from David’s presentation are available at his blog, “dshr’s blog”.

We offer below some thoughts prepared by Sheila Morrissey, Senior Research Developer, Portico, in response to David’s perspective. Sheila is a member of Portico’s Data Team, which develops tools to move published scholarly content from one (often proprietary) format to an archival format, when needed, as we direct this content into the Portico archive for active long-term preservation. From her hands-on perspective, Sheila speaks to how the originally anticipated challenges created by a diverse hardware, software, and format landscape remain, even as new preservation challenges continue to arise. (This posting is a slightly abbreviated version of comments originally posted to dshr’s blog.)

We welcome your comments on the challenges of digital preservation. Feel free to post your thoughts below and add your perspective to the conversation.

Eileen Fenton
Executive Director, Portico

David Rosenthal invites us to consider whether some of the risks to long-term preservation of digital artifacts delineated in Jeff Rothenberg’s 1995 article have been conjured away by the market’s invisible hand. Just as in 1995, we still have multiple hardware architectures and operating systems – not just mainframes and servers and desktops, not just the Z10 and Windows XP and Windows Vista and Solaris and the many flavors of Linux – but also cell phones and PDAs, iPods and Walkmans and Zunes, navigation appliances like TomTom, eBook readers like Kindle, game consoles like the Xbox and Wii. And we have a whole zoo of new applications and new formats to go with them. Some are merely delivery formats, but some are actual content formats. Some are open; some are proprietary. We have the many variants of geospatial data (GDF, CARiN, S-Dal, PSF, and the myriad sub-species of shapefiles). We have formats for eBook readers (proprietary ones like Kindle’s MobiPocket-based AZW format, in addition to open ones like OPF). The many proprietary CAD tools and formats currently in use, which have no satisfactory non-proprietary open-source substitutes, already complicate the present communication, not to say the future preservation, of architectural design artifacts.

The fact is, new content formats are being created all the time – some of them proprietary; some with only proprietary rendition tools; many in wide use in both online and offline content repositories of interest to one digital preservation community or another. The technology market does not stand still. And the market will always recapitulate exactly the same process: get out there with your own product, do what you have to do to capture market share, and crowd out whomever you can, whether they got there before you or came in after you.

And after the market finishes strip-mining a particular application or content domain, what then? The digital preservation community will be left, then as now, to clean up the mess. There will always be artifacts in some defunct format for which there will have been no sufficient market to ensure freely available open-source renderers. And there will be artifacts, nominally in the format that prevails, that will be defective. So, for example, we’ll likely be dealing with “crufty” self-published eBooks that fly below Google’s content-acquisition radar, or with “crufty” geospatial databases, in whatever dominant format, just as today’s web harvesters have to contend with “crufty” HTML. Simply depositing contemporary open-source renderers in a SourceForge-like repository is no guarantee that those renderers will perform as needed in the future, even assuming a comprehensive emulation infrastructure. And if those tools fail then, when they are called on “just in time”, where will the living knowledge of these formats be found to make up the deficit?

The market is not going to provide us with a silver bullet that will solve all the problems the market itself creates. Nor will the market alone afford us a way around the unhappy fact that preservation assets have to be managed. And that means, especially for scholarly artifacts, that we’re still going to need metadata: technical metadata, descriptive metadata, rights metadata, provenance metadata. An earlier post explains why I don’t think an open-source renderer, even if one can be found, is a sufficient substitute for technical metadata. I think the library community, and the academic community in general, would at least want to consider how much less effective search tools will be without descriptive metadata (which, incidentally, need not be, and are not currently, necessarily hand-crafted). Even in an ideal world of open access, it is possible to imagine categories of digital assets for which we will be obliged to track rights metadata. And if you take the view, as David does, that the Web is Winston Smith’s dream machine, then provenance metadata, event metadata, and collection management metadata all still have important roles to play in digital preservation, too.

So we have to ask: Is the lesson of history, ancient and modern, that “formats and metadata are non-problems”? That “all we have to do is just collect and keep the bits, and all will be well”? Or does history caution us, with respect to formats and metadata, that it might be a little early yet to be standing in our flight suits on the deck of the good ship Digital Preservation, under a banner reading “Mission Accomplished”?

Are there perhaps some other lessons we can learn from a wider scan of the past and present of digital preservation?

Digital preservation is not free

This is one of David’s key points, and it’s a really crucial insight. Even if preservation can be accomplished with open-source tools, open-source projects, after all, are not magically exempt from economic laws. This is as true of the open-source tools that Portico uses, for example, as it is of the fascinating open-source emulators under development in the Dioscuri project, or of open-source software originally intended as a commercial product (such as OpenOffice, which entailed a ten-year development effort and significant ongoing subsidy from Sun Microsystems). Free software, as the saying goes, means “free as in speech, not free as in beer”. All vital, useful open-source projects entail costs to get them going, and costs to keep them alive. If we are to judge how economical a preservation solution is, or to amortize the cost of preserved content over its lifetime, we need to know

  • what the sunk costs are
  • what the institutional subsidy (in the form of what might be termed charge-backs) was, is, and is projected to be
  • what the ongoing costs are
  • what the projected costs are
  • what the incremental costs of adding both content nodes and subscribers to the network are
  • what fiscal reserves have to be set aside to ensure the continuance of the solution

Digital preservation is not one thing

It is many use cases, and many tiers of many solutions. It is many, and many different kinds of, participants, making many different kinds of preservation choices. Different content, different needs, different valuations, different scenarios dictate different cost/benefit analyses, different actions, different solutions.

Of course we have to consider the challenges of scale, as David notes. And of course this means we have to find ways to employ the economies of scale, and then to consider any risks entailed by the centralization inevitably required to capture those economies. But we would do well to remember that, in W. Brian Arthur’s analysis of the convergence of high-technology markets to a single market “solution”, that single solution is often neither the “best” nor the most “efficient” one.

The values of the marketplace, and the solutions of the marketplace, are not the values and solutions of the digital preservation community. The digital preservation community cannot afford a preservation monoculture, which could well be a single point of failure, whether that means a monolithic technology that could fail, or malignant political control of intellectual and cultural assets. We need CLOCKSS and Portico, the Stanford Digital Library and Florida’s DAITSS, the British Library and the Koninklijke Bibliotheek and the German National Library, the Internet Archive, LOCKSS private networks, consortia collections, “retail” preservation-in-the-cloud.

Digital preservation is not a technology

It is a human activity. It is a collective action. It is people fulfilling the social compact to preserve and to pass along the socially meaningful digital creations of our time, our place, our cultures, our communities. It means people making choices – choices about what to preserve and what not to preserve; choices about what level of investment is appropriate for what categories of objects; choices about how to “divide and conquer” the universe of potentially preservable digital objects; choices about what entities and what technologies are trustworthy agents of preservation.

The market has its place. But surely if we have learned anything from recent events, it is to distrust the glamour of “the new new thing”, the can’t-fail digital preservation appliance that will hoover up all the bits in the world, and collect them in a preservation dust bag. The complexity of the digital preservation solution will have to match the complexity of the problem.

Sheila Morrissey
Senior Research Developer, Portico