Monday, November 2, 2009

MARAC Fall 2009, S1: Solutions to Acquiring and Accessing Electronic Records

Pavonia Arcs, by Robert Pfitzenmeier (2004), Newport, Jersey City, 29 October 2009.

Along with Ricc Ferrante (Smithsonian Institution Archives) and Mark Wolfe (M.E. Grenander Department of Special Collections and Archives, University at Albany), I had the good fortune to participate in this session, which was graciously chaired by Sharmila Bhatia (U.S. National Archives and Records Administration).

Ricc Ferrante discussed the challenges of accessioning and preserving archival e-mail created by employees of the Smithsonian Institution's semi-autonomous museums and research institutes. His experience should resonate with many government and college and university archivists. Until late 2005, the Smithsonian's component facilities used a variety of e-mail applications, and retention guidelines were implemented in 2008. As a result, the archives is both actively soliciting transfers of cohesive groups (i.e., accounts) of documented and backed-up messages at predetermined intervals and passively accepting transfers of older groupings of records in a variety of formats.

Ricc then discussed the processing of these e-mails, which is performed on PC or Mac desktop computers. Incoming transfers are backed up, analyzed and documented, converted to a preservation format, and securely stored. The Smithsonian Institution Archives uses a tool to convert accounts or groupings of messages in formats other than MBOX to the MBOX format, and the Collaborative Electronic Records Project (CERP) parser then converts the MBOX files to an XML-based preservation format. Experimenting with the MBOX conversion tool and the CERP parser has been on my to-do list for some time, so I was really glad I got the chance to hear Ricc discuss these tools.
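For readers who have not worked with MBOX files, here is a rough sketch of the general idea behind the second step: reading an MBOX file and writing its messages out as XML. It uses only Python's standard library. It is not the CERP parser, and the element names (Account, Message, Body, and so on) are invented for illustration rather than taken from the CERP schema. The function name `mbox_to_xml` is likewise hypothetical.

```python
import mailbox
import xml.etree.ElementTree as ET


def mbox_to_xml(mbox_path):
    """Return an XML string summarizing each message in an MBOX file.

    The element names used here are assumptions made for this sketch;
    they are NOT the CERP preservation schema.
    """
    root = ET.Element("Account")
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            node = ET.SubElement(root, "Message")
            for header in ("From", "To", "Date", "Subject"):
                # Missing headers become empty elements
                ET.SubElement(node, header).text = str(msg.get(header, ""))
            payload = msg.get_payload()
            # Multipart messages return a list of parts; a real tool would
            # walk them and handle attachments. This sketch keeps only
            # simple single-part text bodies.
            body = payload if isinstance(payload, str) else ""
            ET.SubElement(node, "Body").text = body
    finally:
        box.close()
    return ET.tostring(root, encoding="unicode")
```

A production tool would also need to deal with attachments, character encodings, and characters that are not legal in XML, all of which this sketch ignores.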

Mark Wolfe discussed how the M.E. Grenander Department of Special Collections and Archives is using Google Mini, a modestly priced "plug and play" search appliance that will index up to 300,000 documents, to improve access to its student newspapers. Prior to the installation of Google Mini, a paper card file was the only access mechanism for these publications. Google Mini has made it possible for staff to find information about people who became prominent well after they left the university (e.g., gay rights activist Harvey Milk, '51), respond quickly to reference inquiries, and enhance access to the newspapers.

Mark also highlighted the shortcomings of Google Mini's indexing of digitized materials. When assigning titles, Google Mini looks for the most prominent text on a given page, which in a newspaper may be part of an ad, not a story. Dates are another problem. When sorting search results by date, it homes in on the date the digital file was created, not the date of the scanned original. The former problem can be corrected, albeit with considerable effort, by manually changing the author, title, etc. properties of the files, which are in text-based PDF format. However, the date properties, which help to safeguard the authenticity of born-digital files, cannot easily be changed and thus inhibit date-based access to scanned archival materials. There's been a lot of talk lately about how the management of born-digital and born-again digital materials will eventually converge, but Mark's talk is a good reminder that we're not quite there yet.

My presentation concerned our capture of New York State government sites and the redaction (i.e., removal of legally restricted information from records prior to making them accessible) of electronic records converted to PDF format. In lieu of giving an exhaustive recap, I'll just offer a few words of advice to people contemplating electronic redaction. At present, there are several good tools for redacting PDF files, including the built-in tool bundled with Adobe Acrobat 8 and 9, Redax, and Redact-It. If you are using an older version of Adobe Acrobat and can't or don't want to upgrade or purchase an add-on tool, the National Security Agency has produced a document that outlines a laborious but effective redaction procedure. If you commit to electronic redaction, you need to keep abreast of the relevant legal and digital forensics literature: people are trying to figure out how to crack these tools and techniques and recover redacted information, and one of them may eventually succeed.

There are also several really bad PDF redaction techniques. Never, ever use Adobe Acrobat's Draw or Annotate tool to place black, white, etc. boxes over information you wish to redact. Another spectacularly bad idea: "redacting" a word processing document by changing the font color to white or using a shading or highlighting feature to obscure the text and then converting the document to PDF format.
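To make the problem with drawn-on boxes concrete, here is a toy illustration in Python. It treats a page as a simplified, uncompressed PDF content stream in which text is shown with the `Tj` operator and a filled rectangle is painted with `re` and `f`. The sample data, the regular expression, and the function names are all assumptions made for this sketch. Real PDF files usually compress their streams and encode text in more complicated ways, so this is neither a forensic tool nor a redaction tool.

```python
import re

# Simplified, uncompressed page content stream (hypothetical sample data).
# The text is drawn first, then a filled black rectangle is painted over
# the second line, which is what a drawn-on "redaction" box amounts to.
PAGE_STREAM = (
    "BT /F1 12 Tf 72 720 Td (Name: Jane Doe) Tj ET\n"
    "BT /F1 12 Tf 72 700 Td (SSN: 000-00-0000) Tj ET\n"
    "0 0 0 rg 70 696 160 14 re f\n"
)

# Matches the string operand of each Tj (show text) operator
TEXT_SHOW = re.compile(r"\((.*?)\)\s*Tj")


def extract_text(stream):
    """Return every string shown with Tj, ignoring all graphics operators."""
    return TEXT_SHOW.findall(stream)


def redact(stream, secret):
    """Drop every line of the stream that contains `secret`."""
    kept = [line for line in stream.splitlines() if secret not in line]
    return "\n".join(kept) + "\n"


covered = extract_text(PAGE_STREAM)
removed = extract_text(redact(PAGE_STREAM, "000-00-0000"))
print(covered)   # the covered text is still present in the stream
print(removed)   # the text is gone only once it is deleted from the stream
```

The box changes what a person sees on screen, but anything that reads the text layer (copy and paste, a search index, a text extraction utility) still finds the covered string. Genuine redaction has to delete the underlying content, not paint over it.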

Want to know why these options are so bad? Read this. And this. And this. And this. And this. And this, too (thanks to John J. @ W&L for drawing my attention to this recent blunder).
