Saturday, August 30, 2008

SAA: Second day of sessions

NB: Sessions occupied only one time slot today.

Digital Revolution, Archival Evolution: An Archival Web Capture Project
Dean Weber (Ford Motor Company), Judith Endelman (Henry Ford Museum), Pat Findlay, and Reagan Moore (University of North Carolina at Chapel Hill) discussed their joint effort to use Web crawling software to create preservation copies of the main Web site maintained by the Ford Motor Company.

As Findlay emphasized, this site is extremely large and complex: the site contains content created by many different Ford units, pulls content from a large number of different feeds, has Flash and non-Flash and high- and low-speed versions, and has features that allow people to view cars by color, passenger number, etc. As a result, there are literally millions of different page combinations. Moreover, it has strong anti-hacking protection and is hosted on geographically dispersed servers located throughout the world.

The Henry Ford Museum, which wanted to preserve periodic snapshots of the site, worked with the San Diego Supercomputer Center (SDSC, where Moore worked until a very short time ago) to conduct three crawls of the site and to store and furnish access to the results. In an effort to improve results, staff from the Henry Ford Museum and SDSC consulted with Ford's IT staff; as Endelman noted, everyone entered into this project thinking that it was about technology, but it was really about management, people, and relationships.

Moore furnished a great overview of the various challenges that the group encountered over the course of the project, and he explicitly linked them to the traditional archival functions of:
  • Appraisal--understanding what was actually present in the Web site and deciding what to preserve;
  • Accessioning--using a crawler to produce copies of the site and place the copies into a preservation environment;
  • Description--gathering the essential information needed to identify and access the crawl, plus system metadata guaranteeing authenticity, etc.;
  • Arrangement--preserving the intellectual arrangement of the files and determining their physical arrangement (SDSC actually bundles the files into a single TAR file, which means that it needs to maintain checksums, etc., for only one file per crawl; the iRODS software that SDSC developed can search within TAR files and pull up content as directed);
  • Preservation--determining whether to store, e.g., banners indicating the archival status of the files, with the files or in a separate location;
  • Access--enabling people using multiple browsers on multiple platforms to examine the files.
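
To make the arrangement point concrete, here is a minimal sketch of the one-TAR-per-crawl idea. It uses only Python's standard library, not iRODS, and it is not SDSC's actual workflow; the function names and the SHA-256 choice are my own assumptions for illustration.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_crawl(crawl_dir: Path, tar_path: Path) -> str:
    """Pack every file under crawl_dir into one TAR; return the TAR's checksum.

    Hypothetical helper: only this single checksum needs to be tracked
    per crawl, rather than one per captured file.
    """
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(crawl_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the crawl root so the site's
                # directory layout survives inside the bundle.
                tar.add(path, arcname=path.relative_to(crawl_dir).as_posix())
    return sha256_of(tar_path)


def read_member(tar_path: Path, member: str) -> bytes:
    """Pull one captured file out of the bundle without unpacking the rest."""
    with tarfile.open(tar_path, "r") as tar:
        fh = tar.extractfile(member)  # raises KeyError if member is absent
        if fh is None:
            raise ValueError(f"{member} is not a regular file")
        return fh.read()


if __name__ == "__main__":
    # Tiny stand-in for a crawl: two pages in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        crawl = Path(tmp) / "crawl"
        (crawl / "cars").mkdir(parents=True)
        (crawl / "index.html").write_bytes(b"<html>home</html>")
        (crawl / "cars" / "red.html").write_bytes(b"<html>red</html>")
        bundle = Path(tmp) / "crawl.tar"
        print(bundle_crawl(crawl, bundle))
        print(read_member(bundle, "cars/red.html"))
```

The tradeoff is that a fixity check covers the whole crawl at once: if the bundle's checksum ever changes, you know something is wrong, but not which page.
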
I've done quite a bit of Web crawling, and I'm glad to learn that Moore and other researchers are actively trying to figure out how to capture content that current crawlers can't preserve (e.g., database-driven content and Flash). The session was nonetheless a bit disheartening: even with the active cooperation of Ford's IT staff and the involvement of visionary computer scientists, Web crawling remains an imperfect technology. However, for those trying to preserve large sites or large numbers of sites, it is still the best of a bunch of bad options.
