MARAC: Will the Fruit Be Worth the Harvest? Pros and Cons of End of Presidential Term Web Harvesting Projects

We as a profession are still trying to figure out how to deal with Web sites, which exist in the netherworld separating archives and libraries and pose a host of preservation challenges, and this session furnished interesting insight into the contrasting approaches of the U.S. National Archives and Records Administration (NARA) and the Library of Congress (LC).

Session chair Marie Allen (NARA) noted that NARA’s handling of Web records has consistently engendered controversy. Its 2000-01 decision to compel federal agencies to copy their Web site files, at their expense, at the end of President Clinton’s second term of office and transfer them to NARA within eight days of doing so angered agencies, and its 2008 decision not to take a Web snapshot (i.e., a one-time copy) of federal agency sites at the end of President George W. Bush’s second term aroused public concern.

Susan Sullivan (NARA) pointed out that in 2004 NARA had contracted with the Internet Archive to copy the publicly accessible federal government Web sites that it had identified and to provide access to the copies. She then explained the rationale for NARA’s 2008 decision: NARA has determined that Web records are subject to the Federal Records Act and must be scheduled and managed appropriately. It issued Guidance on Managing Web Records in January 2005 and has since offered extensive training and assistance to agencies; some of this information is available on NARA’s Toolkit for Managing Electronic Records, an Internet portal to resources created by NARA and many other entities.

Sullivan emphasized that snapshots are expensive, have technical and practical shortcomings, and encourage the agency misperception that NARA is managing Web records. In fact, there is no authoritative list of federal government sites, which means that snapshots fail to capture at least some sites. Moreover, snapshots capture sites only as they existed at a given point in time, cannot capture intranet or “deep Web” content, and are plagued by broken links and other technical limitations. In sum, snapshots do not document agency actions or functions in a systematic and complete manner.

NARA is still copying Congressional and Presidential Web sites, which are not covered by the Federal Records Act. Although these snapshots have all of the problems outlined above, NARA regards them as permanent.

Abbie Grotke (LC) then outlined LC’s response to NARA’s 2008 decision: in partnership with the Internet Archive, the California Digital Library, the University of North Texas, and the Government Printing Office, it opted to take snapshots of publicly accessible federal government sites. All of the partners seek to collect and preserve at-risk born-digital government information, and all of them believed that the sites had significant research potential.

The partners developed a list of URLs of publicly accessible federal government sites in all three branches of government; they placed particular emphasis on identifying sites that were likely to disappear or change dramatically in early 2009. They then asked a group of volunteer government information specialists to identify sites that were out of scope (e.g., commercial sites) or particularly worthy of crawling (e.g., sites focusing on homeland security). This process ultimately yielded a list of approximately 500 sites.

The partners took a series of comprehensive snapshots and a number of supplemental snapshots focusing on high-priority sites. Much of this work centered on two key dates -- Election Day and Inauguration Day -- but some copying is still taking place.

Grotke outlined the project’s challenges, which will be familiar to any veteran of a multi-institutional collaborative project. The partners had no official funding for this project and thus have had to divert staff and resources from day-to-day operations. They have also had a difficult time managing researcher expectations: users want immediate access to copied sites, but the indexing process is time-consuming. The partners have also had to accept that, owing to the technical limitations of their software and the possibility that some sites escaped their notice, they could not fully capture every federal government site.

The snapshots have nonetheless captured a vast quantity of information that might otherwise be lost, and the project is also paving the way for future collaborations.

Thomas Jenkins (NARA) then explained how Web sites fit into NARA’s three-step appraisal process, which is guided by Directive 1441 (some of which is publicly accessible):
  • Data gathering. When appraising Web sites, an archivist visits each site and analyzes the information found on it, interviews agency Web administrators, assesses the recordkeeping culture of the creating agency, and determines how the site’s content relates to permanent records in NARA’s holdings.
  • Drafting of appraisal memorandum. The archivist prepares a detailed report that assesses the extent to which the site documents significant actions of federal officials, the rights of citizens, or the “national experience.” The report also examines the site’s relationship to other records identified as permanent (i.e., is the Web site the best and most comprehensive source of information?).
  • Stakeholder review. Each appraisal memorandum is circulated within NARA and then published in the Federal Register in order to solicit agency and public input.

Using a site created by the U.S. Department of Justice as an example, Jenkins highlighted how this process works and why NARA ultimately determined that this site, which contains only a fraction of the information contained within other series deemed archival, did not warrant permanent retention. In contrast, NARA has determined that the site of the U.S. Centennial of Flight Commission warrants permanent preservation because it contains significant information not found in other series.

In response to a comment concerning whether Web snapshots capture how an agency presents itself to the public, Jenkins stated that NARA assesses whether the information presented on a given site is unique. Moreover, NARA is aware that other entities are crawling federal government sites. Although there is a risk that this crawling activity will cease, a risk analysis indicated that archival records and other sources of information amply document agencies’ activities.

Although this session illuminated how and why NARA and LC reached such sharply contrasting decisions and highlighted some resources that had somehow escaped my attention, it also underscored precisely why the profession hasn't reached any sort of consensus and is unlikely to do so in the near future. Many if not most state and local government archives lack the degree of regulatory authority afforded by the Federal Records Act, and as a result many of them will not want to rely upon the kindness of site creators. Archivists working in repositories with broad collecting missions may have great difficulty ensuring that creators properly maintain, copy, and transfer site files. Moreover, some archivists will doubtless differ with NARA's conclusion that documenting how site creators presented themselves to the public is not sufficient reason to take periodic Web site snapshots or otherwise preserve sites comprehensively. As a result, many of us will likely find LC's approach to federal government sites or NARA's handling of Congressional and Presidential Web sites more relevant to our own circumstances than NARA's treatment of executive-branch agency sites.

