Today’s sessions began with Michael Nelson (Old Dominion University), who outlined the Synchronicity project, which will enable end users to recover data that has “vanished” from the Web. Synchronicity is a Firefox Web browser extension that catches the “File Not Found” messages that appear when a user attempts to access a page that no longer exists or no longer contains the information that the user seeks. It then searches Web search engine caches, the Internet Archive, and various research project data caches and retrieves copies of the information that the user seeks.
If these sources fail to provide the desired information, Synchronicity then generates a search engine query based upon what the missing page is “about” and attempts to find the information on the Web. These queries are based on “lexical signatures” (small sets of terms that best characterize a page’s content) and page titles, and preliminary research indicates that these searches are successful about 75 percent of the time. Nelson and his colleagues are currently exploring other methods of locating “lost” content and how to handle pages whose content has changed over time.
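For readers curious about how a lexical signature might be derived, here is a minimal sketch. It assumes a TF-IDF-style ranking, which is a common way to pick a page's most characteristic terms; Nelson did not describe Synchronicity's actual algorithm in this much detail, and the function names, the reference corpus, and the default signature length are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    corpus is a list of other documents used only to estimate how
    common each term is (hypothetical stand-in for a Web-scale index).
    """
    term_counts = Counter(tokenize(page_text))
    doc_term_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_term_sets)

    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for terms in doc_term_sets if term in terms)
        # Smoothed inverse document frequency: rare terms score higher.
        idf = math.log((n_docs + 1) / (doc_freq + 1)) + 1.0
        scores[term] = count * idf

    # Highest score first; ties are broken alphabetically for stability.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def build_query(title, signature):
    """Combine a quoted page title and signature terms into one query string."""
    return " ".join(['"{}"'.format(title)] + list(signature))
```

Given the last cached copy of a missing page, something like `build_query(title, lexical_signature(cached_text, corpus))` would produce a query string that could be handed to a search engine in the hope of finding the content at a new address.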
I’ve often used the Internet Archive and Google’s search caches to locate information that has vanished from the Web, and I’m really looking forward to installing the Synchronicity plug-in once it becomes available.
Michelle Kimpton of DuraSpace then discussed the DuraCloud project, which seeks to develop a trustworthy, non-proprietary cloud computing environment that will preserve digital information. In cloud computing environments, massively scalable and flexible IT-related capabilities are provided “as a service” over the Internet. They offer unprecedented flexibility and scalability, economies of scale, and ease of implementation. However, cloud computing is an emerging market, providers are motivated by profit, information about system architectures and protocols is hard to come by, and as a result cultural heritage institutions are rightfully reluctant to trust providers.
DuraCloud will enable institutions that maintain DSpace and FEDORA institutional repositories to preserve the materials in their repositories in a cloud computing environment; via a Firefox browser extension, it will also allow users to identify content that should be preserved. A Web interface will enable users to monitor their data and, possibly, run services.
DuraCloud will enable members to create and manage multiple, geographically distributed copies of their holdings, monitor their digital content and verify that it has not been inadvertently or deliberately altered, and take advantage of the cloud’s processing power when doing indexing and other heavy processing jobs. It will also provide search, aggregation, video streaming and file migration services and will enable institutions that don’t want to maintain their institutional repositories locally to maintain them within a cloud environment instead.
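Verifying that content has not been altered is usually done by comparing checksums over time. The sketch below illustrates that general idea only; Kimpton did not describe how DuraCloud implements it, so the function names and the in-memory "store" standing in for cloud storage are hypothetical.

```python
import hashlib


def digest(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(store):
    """Record a checksum for every item at ingest time.

    store maps an item name to its bytes (a hypothetical stand-in
    for one copy of a repository's holdings).
    """
    return {name: digest(data) for name, data in store.items()}


def verify(manifest, store):
    """Return the sorted names of items that are missing or whose
    current checksum no longer matches the manifest."""
    problems = []
    for name, expected in manifest.items():
        data = store.get(name)
        if data is None or digest(data) != expected:
            problems.append(name)
    return sorted(problems)
```

Running `verify` periodically against each geographically distributed copy would flag any copy that has drifted from the manifest, so that it could be repaired from a copy that still matches.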
The DuraCloud software, which is open source, will be released next month, and in a few months DuraSpace itself will conduct pilot testing with a select handful of cloud computing providers (Sun, Amazon, Rackspace, and EMC) and two cultural heritage institutions (the New York Public Library and the Biodiversity Heritage Library).
Fascinating project. We’ve known for some time that DSpace and FEDORA are really access systems, but lots of us have used them as interim preservation systems because we lack better options.
The next session was a “breakout” that consisted of simultaneous panels focusing on one or two NDIIPP projects. The Persistent Digital Archives and Library System (PeDALS) project was featured in a session that focused on digital preservation contracts and agreements. The first half of the session consisted of an overview of the contracts and agreements that support a variety of collaborative digital preservation initiatives:
- Vicki Reich discussed the CLOCKSS Archive, which brings together libraries and publishers on equal terms and provides free public access to materials in the archive that are no longer offered for sale.
- Julie Sweetkind-Singer detailed the provider agreements and content node agreements that govern the operations of the National Geospatial Digital Archive.
- Myron Gutman discussed the development of Data-PASS, which grew out of previous collaborations between the project’s partners and lengthy experience preserving social science data.
- Dwayne Buttler, an attorney who was instrumental in crafting the agreements that support the operations of the MetaArchive Cooperative, emphasized that contracts, which focus on enforceability, grow out of a lack of trust and allow for simultaneous sharing and control; in contrast, agreements articulate goals.
The second half of the session focused on PeDALS itself and on what its partners have learned about long-distance collaboration:
- People involved in long-distance collaborative projects need structured, consistent activities and expectations of involvement; both are key to fostering a sense of project ownership.
- Lack of face-to-face interaction makes it harder for people to feel engaged; conference calls and other tools can help bridge the gap, but nothing really takes the place of getting to know other people.
- Working in smaller teams capitalizes upon our strengths -- provided that we make sure that the right mix of IT, archival, and library personnel is involved.
- Team members must be creative, innovative, and open to learning as they go.
- Working on this project has brought to light a number of challenges: communication and collaboration over long distances and multiple time zones; differences in organizational cultures, responsibilities, and IT infrastructures; learning to speak each other’s languages; and finding the right IT consultant.
- We are nonetheless rowing in the same direction: we’ve learned to balance local practice with common requirements, and individual partners are beginning to incorporate PeDALS principles and standards into their current cataloging and other work.
The second “breakout” session took place after lunch, and the session I attended focused on building collaborative digital preservation partnerships:
- Bill Pickett discussed the Web History Center’s efforts to provide online access to archival materials documenting the development of the World Wide Web and the organization’s need for partners.
- David Minor outlined the work of the Chronopolis consortium, which is striving to build a national data grid that supports a long-term preservation (but not access) service.
- Martin Halbert detailed the work of the MetaArchive, a functioning distributed digital preservation network and non-profit collaborative.
- Beth Nichol discussed the Alabama Digital Preservation Network, which grew out of work with the MetaArchive and a strong history of informal statewide collaboration.
The end of the day brought all of the attendees back together. Abby Smith of NDIIPP provided an update on the work of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which focuses on materials in which there is clear public interest and seeks to frame digital preservation and access as a sustainable economic activity (i.e., a deliberate, ongoing resource allocation over long periods of time), articulate the problems associated with digital preservation, and provide practical recommendations and guidelines.
Economic sustainability requires recognition of the benefits of preservation, incentives for decision-makers to act, well-articulated criteria for selection of materials for preservation, mechanisms to support ongoing, efficient allocation of resources, and appropriate organization and governance. As a result, the task force’s work -- an interim report released last December and a forthcoming final report -- is directed at people who make decisions about the allocation of resources, not people who are responsible for the day-to-day work of preserving digital information.
Smith wrapped up by making a series of thought-provoking points:
- Preservation is a derived demand, and people will not pay for it. However, they will pay for the product itself. We need to think of digital information as being akin to a car: it’s something that has a long life but requires periodic maintenance.
- Everything in the digital preservation realm is dynamic and path-dependent: content changes over time, users change over time, and uses change over time. Decisions made now close off future options.
- Librarians and archivists are the defenders of the interests of future users, and we need to emphasize that we are accountable to future generations.
- Fear that digital preservation and access are too big to take on is a core problem.