I'm leaving this BPE as I've left past BPEs: excited about the prospect of getting back to work yet so tired that I feel as if I'm surrounded by some sort of distortion field.
The last BPE session featured presentations given by Jason Pierson of FamilySearch and Joshua Harman of Ancestry.com, and I just want to pass along a few interesting tidbits and observations:
- Both firms view themselves as technology companies that focus on genealogy, not genealogy companies that make intensive use of technology. They work closely with archives and libraries, but their overall mission and orientation are profoundly different from those of cultural heritage institutions. And that's okay.
- Both firms have opted to encode the preservation masters of their digital surrogates in JPEG2000 format instead of the more popular TIFF format. They've discovered that, if necessary, they can create good TIFF images from JPEG2000 files and that JPEG2000 files are more resistant to bit rot than TIFF files. The loss of a single bit can make a TIFF file completely unrenderable, but JPEG2000 files may be fully renderable even if they're missing several bits. However, the relative robustness of JPEG2000 files can also be problematic: JPEG2000 files that are so badly corrupted that only blurs of color will be displayed may remain technically renderable (i.e., software that can read JPEG2000 files may open and display such files without notifying users that the files are corrupt). One firm discovered well after the fact that it had created tens of thousands of completely unusable yet ostensibly readable JPEG2000 files.
- Ancestry has developed some really neat algorithms that automatically adjust the contrast on sections of an image. Most contrast corrections lighten or darken entire images, but Ancestry's tool adjusts the contrast only on those sections of an image that are hard to read because they are either too light or too dark. Ancestry has also developed algorithms that automatically enhance images and facilitate optical character recognition (OCR) scanning of image files. As you might imagine, attendees were really interested in making use of these algorithms, and Harman and other Ancestry staffers present indicated that the company would be willing to share them provided that doing so wouldn't violate any patents. (I share this interest, but I think that archives owe it to researchers to document the use of such tools. Failure to do so can leave the impression that the original document or microfilm image is in much better shape than it is and cause researchers to suspect that the digital surrogate has also been subjected to other, more sinister manipulations.)
- FamilySearch and Ancestry may well have the largest corporate data troves in the world. FamilySearch is scanning vast quantities of microfilm and paper documents and generates approximately 40 terabytes (yes, terabytes) of data per day. They're currently using Tessella's Safety Deposit Box to process the files and a mammoth tape library to store all this data. At present, they're trying to determine whether Amazon Glacier is an appropriate storage option; if Glacier doesn't work out, FamilySearch will likely build a mammoth data center somewhere in the Midwest. Ancestry is also scanning mammoth quantities of paper and microfilmed records and currently has approximately 10 petabytes (yes, petabytes) of data in its Utah data center.
- After a lot of struggle, Ancestry learned that open source and commercial software work really well for tasks and processes that aren't domain-specific but not so well for unique, highly specialized functions. For example, Ancestry discovered that none of the available tools could handle a high-volume and geographically dispersed scanning operation involving roughly 1,400 discrete types of paper and microfilmed records, so it devoted substantial time and effort to developing its own workflow management system. Archives and libraries typically don't deal with such vast quantities or such varied originals, and I think it makes sense for cultural heritage professionals to focus on developing digitization workflow best practices and standards that are broadly applicable. However, Ancestry's broader point is well taken; sometimes, building one's own tools makes more sense than trying to make do with someone else's tools.
- FamilySearch and Ancestry have a lot more freedom to innovate -- and to cope with the accompanying risk of failure -- than state archives and state libraries. Pierson and Harman both emphasized the importance of taking big risks and treating failure as an opportunity to learn and grow, but, as one attendee pointed out, government entities tend to be profoundly risk-averse. In some respects, this is understandable: a private corporation that missteps has to answer only to its investors or shareholders, but a government agency or office that blunders is accountable to the news media and the taxpaying public. However, if we sit on our hands and wait for other people to solve our problems, we'll never get anywhere. I've long been of the opinion that those of us who work in government repositories and who are charged with preserving digital information need to keep reminding our colleagues and our managers that as far as digital preservation is concerned, we really have only two choices: do something and accept that we might fail, or do nothing and accept that we will fail. I'm now even more convinced that we need to keep doing so.
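
A footnote to the JPEG2000 point above: because a damaged file may still open without complaint, opening it is not a reliable test of integrity. One common safeguard is a fixity check, in which a checksum is recorded while a file is known to be good and recomputed later. The sketch below is my own illustration, not anything either presenter described; the function names (`sha256_of`, `build_manifest`, `find_changed`) are hypothetical, and it assumes the files sit on a locally readable file system. It can only flag files that have changed since the checksum was recorded, so it would not catch images that were already unusable when they were created.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large image files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record a checksum for each file at a time it is known to be good."""
    return {str(p): sha256_of(p) for p in paths}


def find_changed(manifest):
    """Return the paths that are missing or whose checksum no longer matches."""
    changed = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != recorded:
            changed.append(path)
    return changed
```

In use, the manifest would be built right after the images pass visual quality control and stored apart from the images themselves; running `find_changed` on a schedule then reports any file in which even a single bit has flipped, whether or not the file still renders.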