Friday, April 13, 2012

MARAC Spring 2012: Preservation and Conservation of Captured and Born Digital Materials

I'm in Cape May, New Jersey for the Spring 2012 meeting of the Mid-Atlantic Regional Archives Conference and am temporarily closing the Electronic Records Archivists Local 0011000 Hiring Hall so that I can blog about some of the conference sessions and the loveliness of Cape May.

I was really looking forward to the first session, "Preservation and Conservation of Captured and Born Digital Materials," and it more than met my expectations. However, I must state upfront that I slept so wretchedly that I've been making dumb mistakes all day. The following post may contain a few more. Caveat lector!

Isiah Beard of Rutgers University's Scholarly Communications Center, which oversees the university's FEDORA-based institutional repository, kicked off the session by furnishing a definition of the still-mysterious concept of digital curation (per the Digital Curation Centre, it's the creation, preservation, maintenance, collection, and archiving of digital objects) and highlighting the factors that make digital objects more fragile than their analog counterparts:
  • the ease with which electronic files can be deleted or destroyed
  • file format and software dependence (a particular problem with the highly proprietary niche formats that house vast quantities of research data)
  • the speed with which storage media become technologically obsolete
  • the distance and disconnection with which many creators regard materials that don't have appreciable physical form (a pervasive and, in my opinion, all too often overlooked problem)
He then focused on the digital curation lifecycle, a multi-tiered, continuous, and iterative process in which digital objects are evaluated, preserved, maintained, verified, and re-evaluated as the hardware and software environment evolves. Beard and his colleagues often begin the evaluation process by meeting with the creators and asking them to discuss how the materials were created and used, and then engage in a "controlled chaos" (what an apt description of electronic records work!) of evaluating the materials and taking stock of the software, systems, and recording apparatus needed to keep them accessible. They also attempt to determine the file format that will best keep the content accessible over time (which sometimes means keeping files in industry-standard proprietary formats) and how users will access the materials. This work culminates in the production of file format-specific guides that outline how incoming materials encoded in a given file format will be handled. All of these guides are periodically reexamined and revised.

In keeping with emerging best practices, Beard and his colleagues migrate some files to new formats in order to increase the chance that they'll remain accessible over time, but always retain a preservation master of the file in its original format and do any needed migration work on derivative copies.
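
For readers who want something concrete, here is a minimal Python sketch of the "keep the master, migrate a copy" idea. It is not Rutgers' actual workflow, which the session didn't describe at this level of detail; the function names and the choice of SHA-256 checksums are assumptions made purely for illustration. The sketch records a checksum for the preservation master, makes a working copy for migration, and confirms that the copy is bit-identical before anyone touches it.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(master: Path, derivative_dir: Path) -> Path:
    """Copy the master into derivative_dir and verify the copy.

    The master is only ever read here; migration work would be done
    on the returned working copy. (Hypothetical helper, not a real tool.)
    """
    derivative_dir.mkdir(parents=True, exist_ok=True)
    working_copy = derivative_dir / master.name
    master_checksum = sha256_of(master)
    shutil.copy2(master, working_copy)
    if sha256_of(working_copy) != master_checksum:
        raise IOError(f"Copy of {master} does not match the master")
    return working_copy


def master_is_intact(master: Path, recorded_checksum: str) -> bool:
    """Re-verify the master against a checksum recorded at ingest."""
    return sha256_of(master) == recorded_checksum
```

Re-running master_is_intact on a schedule is one simple way to cover the "verified" step of the lifecycle: if the checksum ever changes, the master has been altered or damaged.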

Tim Pyatt of Pennsylvania State University's Special Collections Library highlighted some of the problems associated with current mechanisms for making digitized and born-digital materials accessible. At present, many archives provide access to some materials via their traditional research rooms and to others via their online catalogs, their own Web sites, Web sites hosted by creators, social media, and sites hosted by service providers such as the Internet Archive and OCLC; with the exception of linking to sites maintained by creators, my own institution is doing all of these things. As we all know, from an end user's perspective, the proliferation of information silos is mystifying and frustrating. He discussed some of Penn State's strategies for reducing the chaos -- ensuring that every image placed on Flickr has detailed metadata pointing back to Special Collections, and including, in the finding aid describing the relevant collection, links to an archival Web site now maintained on Penn State's servers -- and then identified several repositories that are doing a better job of unifying access:
  • "Good": Duke University's Rare Book and Manuscript Library pulls item-level metadata from finding aids and creates discovery pages that furnish access to digital surrogates of paper-based archival materials. However, at present, none of these discovery pages provide access to born-digital objects.
  • "Better": the University of North Carolina at Chapel Hill's Special Collections Library finding aid platform fully integrates digitized content into finding aids. Clicking on a folder listing in the finding aid will bring up any digital surrogates of items present in the physical folder.
  • "Best": Duraspace's Hypatia application, which is currently under development and which promises to provide a single application that will support accessioning, arrangement, description, discovery, delivery, and long-term preservation of born-digital archival collections.
Gretchen Gueguen of the University of Virginia's Special Collections Library discussed Born Digital Collections: An Inter-Institutional Model for Stewardship (AIMS), a two-year, Mellon-funded initiative to develop a framework for stewardship of born-digital materials found in personal papers held by collecting repositories (and which is also responsible for development of Hypatia). The framework focuses on collection development (i.e., policy and infrastructure), accessioning (physical and intellectual control), arrangement and description, and discovery and access; given that many other initiatives have focused on digital preservation, the project partners decided not to focus on this aspect of stewardship.

The University of Virginia is currently focusing on collection development and accessioning and is establishing policies and developing preliminary workflows. At present, it's revising its donor and depositor agreements to address copyright, access, and ownership issues; in a digital world in which numerous identical copies of a given file may exist, ownership issues are a particular challenge. It's also developing a feasibility testing procedure that addresses a lot of questions that will have to be answered in order to take in and care for digital materials (e.g., file formats, hardware and software needs, need for file format migration or normalization). It will then move on to developing transfer procedures.

While all of this work is going on, Gueguen and her colleagues are also taking steps to deal with the vast array of damaged and obsolete media currently lurking within their collections. They're in the midst of inventorying their legacy media and trying to get data off this media and into a safe and readily accessible (at least to staff) place. (Hunting down legacy media was one of the first things I did when I was an electronic records archivist, but my repository helped to pioneer the More Product, Less Process approach to processing paper records, and as a result my colleagues and I still find floppies and Zip disks lurking in boxes every now and then. We've also discovered that a sizable percentage of this newly discovered media contains non-record material such as retirement party fliers. However, we're a government archives; a special collections unit might have cause to keep similar files found within collections of personal papers.)

When pulling data off legacy and damaged media, Gueguen and her colleagues use a nifty Forensic Recovery of Evidence Device that has a host of SCSI and other ports, built-in drives (5.25" and 3.5" floppy disk, tape, CD/DVD/BluRay, and others), and 2 TB of storage. It runs Forensic Toolkit (FTK) digital forensics software, which they use to reveal hidden and deleted files (which the University of Virginia doesn't accession), look for possible Social Security Numbers, credit card numbers, and other sensitive data, and extract some metadata. Both the device and the software are expensive, and the software's output is encoded in proprietary XML. However, the enterprising archivist can build a similar (albeit far less elegant) hardware array out of component parts, and the Mellon-funded BitCurator project, which may result in creation of an open source, archivally oriented analytic tool, might prove to be an alternative to FTK and other proprietary digital forensics tools (I suspect that, for the time being, some of the Open Source Digital Forensics tools might be the best option for archives with limited budgets). They're also using Archivematica for creation of preservation metadata and access derivatives.
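
To give a rough sense of what "looking for possible Social Security Numbers and credit card numbers" can involve, here is a small Python sketch. It is emphatically not how FTK or any other forensics package works internally; the patterns, the function names, and the use of the Luhn checksum to weed out random digit strings are all assumptions chosen for illustration. It would also miss numbers written in other layouts and would flag some harmless strings, so a human would still need to review the hits.

```python
import re

# Assumed patterns: SSNs written as 123-45-6789, and card numbers of
# 13 to 19 digits that may be separated by single spaces or hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def passes_luhn(digits: str) -> bool:
    """Return True if a string of digits satisfies the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_sensitive_candidates(text: str) -> dict:
    """Return possible SSNs and card numbers found in a block of text."""
    ssns = SSN_PATTERN.findall(text)
    cards = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if passes_luhn(digits):
            cards.append(digits)
    return {"ssn": ssns, "card": cards}
```

A function like this would be run over text extracted from recovered files, with any hits routed to an archivist for review before the material is made accessible.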

Photo: The Dr. Henry Hunt House at 209 Congress Place, Cape May, New Jersey, 13 April 2012. Cape May is renowned for its Victorian architecture, and this George Stretch-built home, which was built in 1881 and augmented in the 1890s, is a fine example. Can you spot the bunny?
