Thursday, November 21, 2013

Best Practices Exchange 2013: digital imaging, data management, and innovation

 The 2013 Best Practices Exchange (BPE) ended last Friday, and I wrote this entry as I was flying from Salt Lake City to Ohio, where I spent a few days tending to some family matters. I've been back in Albany for about 46 hours, but I haven't had the presence of mind needed to move this post off my iPad until just now.

I'm leaving this BPE as I've left past BPEs: excited about the prospect of getting back to work yet so tired that I feel as if I'm surrounded by some sort of distortion field.

The last BPE session featured presentations given by Jason Pierson of FamilySearch and Joshua Harman of Ancestry, and I just want to pass along a few interesting tidbits and observations:
  • Both firms view themselves as technology companies that focus on genealogy, not genealogy companies that make intensive use of technology. They work closely with archives and libraries, but their overall mission and orientation are profoundly different from those of cultural heritage institutions. And that's okay.
  • Both firms have opted to encode the preservation masters of their digital surrogates in JPEG2000 format instead of the more popular TIFF format. They've discovered that, if necessary, they can create good TIFF images from JPEG2000 files and that JPEG2000 files are more resistant to bit rot than TIFF files. The loss of a single bit can make a TIFF file completely unrenderable, but JPEG2000 files may be fully renderable even if they're missing several bits. However, the relative robustness of JPEG2000 files can also be problematic: JPEG2000 files that are so badly corrupted that only blurs of color will be displayed may remain technically renderable (i.e., software that can read JPEG2000 files may open and display such files without notifying users that the files are corrupt). One firm discovered well after the fact that it had created tens of thousands of completely unusable yet ostensibly readable JPEG2000 files.
  • Ancestry has developed some really neat algorithms that automatically adjust the contrast on sections of an image. Most contrast corrections lighten or darken entire images, but Ancestry's tool adjusts the contrast only on those sections of an image that are hard to read because they are either too light or too dark. Ancestry has also developed algorithms that automatically enhance images and facilitate optical character recognition (OCR) scanning of image files. As you might imagine, attendees were really interested in making use of these algorithms, and Harman and other Ancestry staffers present indicated that the company would be willing to share them provided that doing so wouldn't violate any patents. (I share this interest, but I think that archives owe it to researchers to document the use of such tools. Failure to do so can leave the impression that the original document or microfilm image is in much better shape than it is and cause researchers to suspect that the digital surrogate has also been subjected to other, more sinister manipulations.)
  • FamilySearch and Ancestry may well have the largest corporate data troves in the world. FamilySearch is scanning vast quantities of microfilm and paper documents and generates approximately 40 terabytes (yes, terabytes) of data per day. They're currently using Tessella's Safety Deposit Box to process the files and a mammoth tape library to store all this data. At present, they're trying to determine whether Amazon Glacier is an appropriate storage option; if Glacier doesn't work out, FamilySearch will likely build a mammoth data center somewhere in the Midwest. Ancestry is also scanning mammoth quantities of paper and microfilmed records and currently has approximately 10 petabytes (yes, petabytes) of data in its Utah data center. 
  • After a lot of struggle, Ancestry learned that open source and commercial software work really well for tasks and processes that aren't domain-specific but not so well for unique, highly specialized functions. For example, Ancestry discovered that none of the available tools could handle a high-volume and geographically dispersed scanning operation involving roughly 1,400 discrete types of paper and microfilmed records, so it devoted substantial time and effort to developing its own workflow management system. Archives and libraries typically don't deal with such vast quantities or such varied originals, and I think it makes sense for cultural heritage professionals to focus on developing digitization workflow best practices and standards that are broadly applicable. However, Ancestry's broader point is well taken; sometimes, building one's own tools makes more sense than trying to make do with someone else's tools.
  • FamilySearch and Ancestry have a lot more freedom to innovate -- and to cope with the accompanying risk of failure -- than state archives and state libraries. Pierson and Harman both emphasized the importance of taking big risks and treating failure as an opportunity to learn and grow, but, as one attendee pointed out, government entities tend to be profoundly risk-averse. In some respects, this is understandable: a private corporation that missteps has to answer only to its investors or shareholders, but a government agency or office that blunders is accountable to the news media and the taxpaying public. However, if we sit on our hands and wait for other people to solve our problems, we'll never get anywhere. I've long been of the opinion that those of us who work in government repositories and who are charged with preserving digital information need to keep reminding our colleagues and our managers that as far as digital preservation is concerned, we really have only two choices: do something and accept that we might fail, or do nothing and accept that we will fail. I'm now even more convinced that we need to keep doing so.
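
A practical footnote to the JPEG2000 point above: if a damaged file can still open, then "it opens" is not a reliable integrity check, and recording checksums is one common way to notice when the bits change. Here's a minimal Python sketch of that idea. The function names and the sample file name `page_0001.jp2` are my own illustrations, not anything FamilySearch or Ancestry described. Note that a checksum manifest only catches changes made after the manifest was built; it can't tell you that a file was already bad when it was created.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_changed(directory, manifest):
    """List files in the manifest that are now missing or altered."""
    current = build_manifest(directory)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )


def demo():
    """Flip one bit in a stand-in image file and report what changed."""
    with tempfile.TemporaryDirectory() as workdir:
        image = Path(workdir) / "page_0001.jp2"  # hypothetical file name
        image.write_bytes(b"stand-in bytes, not a real JPEG2000 image")
        manifest = build_manifest(workdir)

        damaged = bytearray(image.read_bytes())
        damaged[0] ^= 0x01  # simulate a single flipped bit
        image.write_bytes(bytes(damaged))

        return find_changed(workdir, manifest)


if __name__ == "__main__":
    print(demo())
```

In real use, the manifest would be saved somewhere apart from the files it describes and rechecked on a schedule.
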
Image: the Utah State Capitol, as seen from the 12th floor of the Radisson Salt Lake City Downtown, 14 November 2013.

Friday, November 15, 2013

Best Practices Exchange 2013: movies, partnerships, and stories

 I wasn't feeling particularly well yesterday, and when I walked into the closing discussion of the 2013 Best Practices Exchange (BPE), I found myself thinking I would have a really hard time explaining to my colleagues what I learned at this year's BPE; I loved every session I attended, but I didn't believe that I could pull together any coherent thoughts about them. Fortunately, Patricia Smith-Mansfield (State Archivist of Utah) and Ray Matthews (Utah State Library) were fantastic discussion moderators, and the questions they asked and the insights offered by several other BPE attendees really helped me to make sense of yesterday's events. Thanks, guys!

Yesterday, Milt Shefter of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences delivered an excellent lunchtime presentation that focused upon the Academy's efforts to address the digital preservation issues facing the motion picture industry and individual filmmakers. Filmmaking is becoming a digital enterprise, and filmmakers and production companies are facing a host of new challenges: file formats change so quickly that films produced as recently as five years ago may no longer be renderable, video files require vast quantities of storage space, there are no widely accepted preservation standards, and the need to migrate to newer storage media every five to ten years poses a particular risk to files that may be viewed more as products than as works of art.

Shefter asserted that the industry and filmmakers are keenly aware of the need for open, widely accepted standards and a storage medium robust and durable enough to withstand some benign neglect but lack the clout needed to push hardware and software manufacturers in this direction. Even in my stupor, I was struck by this assertion. The motion picture industry is a multibillion-dollar enterprise; surely it has more clout than the archival and library communities! However, I didn't put two and two together until someone pointed out this morning that perhaps the Academy and the cultural heritage community should consider establishing a formal partnership around storage, format, and preservation issues. This isn't exactly a new idea -- the Library of Congress's National Digital Information Infrastructure and Preservation Program brought together archives, libraries, the motion picture industry, the recording industry, the video game industry, and others seeking to preserve digital content -- but it's one that merits further exploration.

Immediately after Shefter's speech ended, Sundance Institute Archives Coordinator Andrew Rabkin introduced a screening of These Amazing Shadows, a documentary that traces the development of the (U.S.) National Film Registry, a Library of Congress-led initiative to identify and preserve motion pictures of cultural, historical, or aesthetic significance. Both Rabkin and the film itself stressed that motion pictures insinuate themselves into our collective consciousness because they tell emotionally compelling stories in visually arresting ways, and during this morning's closing discussion one attendee stated that the film left her convinced that we as a community need to identify compelling stories about the importance of digital preservation and to tell them in a vivid, attention-grabbing manner.

I couldn't agree more. All too often, people (a few archivists among them) think that electronic files lack the gripping content and emotional intensity found in paper records and personal papers.  However, electronic records and personal files can be as compelling as any paper document. We're talking not only about spreadsheets and databases -- both of which can be deeply compelling to someone who has a certain type of information need -- but also about photographs of babies and the remains of the World Trade Center site, videos documenting weddings and natural disasters, audio files capturing the oral histories of relatives who have since died and musical performances of world-class symphonies, geospatial data documenting real property boundaries and the location of hazardous waste sites, e-mail messages containing professions of love and evidence of criminal activity, and a whole bunch of other immensely valuable, emotionally resonant, practically useful things. Most people know this on some level, but they don't fully realize just how fragile these files are or how devastating their loss would be.

Several documentary filmmakers are currently working on films that focus on digital preservation initiatives and the loss of important digital content, but we need more effort on this front. I vividly recall seeing the Council on Library and Information Resources' Into the Future: On the Preservation of Information in the Digital Age (1998) on PBS, and this film -- more than any of the readings I did in graduate school -- kept popping into my head as I pondered whether I really wanted to make the jump from descriptive archivist to electronic records archivist. We need gripping, story-driven films that highlight the terrible risks to which digital content is subject and the ways in which we can ensure that important content is preserved. These films must speak not only to archivists and wannabe archivists but to the general public and to elected officials and other key stakeholders. (And Into the Future, which is now available only on VHS tape, needs to be transferred onto DVD or, better yet, placed online.)

Image: Snow on the Wasatch Mountains, as seen from Interstate 15 northbound between Lehi and Sandy, Utah, 14 November 2013.

Thursday, November 14, 2013

Best Practices Exchange 2013: Web archiving

I've been up for a long time and I'm a little under the weather, so I'm going to highlight a couple of cool things that Scott Reed of Archive-It shared this morning and then call it a day:

First, Scott highlighted a tool that's new to me: computer systems engineer Vangelis Banos has developed Archive Ready, a site that enables website creators and people archiving websites to assess the extent to which a given site can be archived using Web crawling software. Simply type the URL of the site into a text box and click a button, and Archive Ready assesses:
  • The extent to which the site can be accessed by Web crawlers
  • The cohesion of the site's content (i.e., whether content is hosted on a single resource or scattered across multiple resources and services) 
  • The degree to which the site was created according to recognized standards
  • The speed with which the host server responds to access requests
  • The extent to which metadata that supports appropriate rendering and long-term preservation is present
Archive Ready also runs basic HTML/CSS validity checks, analyzes HTTP headers, highlights the presence of Flash and QuickTime objects and externally hosted image files, looks for sitemap.xml files, RSS links, and robots.txt files, and determines whether the Internet Archive has crawled the site and converted the results to WARC format.
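
For anyone curious what the first of these checks might look like in practice, here's a minimal Python sketch of the robots.txt part, using the standard library's `urllib.robotparser`. This is not Archive Ready's code, and I don't know how it implements its checks; the sample robots.txt, the user agent name `example-crawler`, and the example.org URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks crawlers from one directory.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def crawler_access(robots_txt, user_agent, urls):
    """Report, for each URL, whether robots.txt lets this crawler fetch it."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    report = crawler_access(
        SAMPLE_ROBOTS,
        "example-crawler",
        [
            "https://example.org/index.html",
            "https://example.org/private/report.html",
        ],
    )
    for url, allowed in report.items():
        print(url, "allowed" if allowed else "blocked")
```

To test a live site, you would fetch its robots.txt and pass the text in. A blocked path matters only to crawlers that honor robots.txt.
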

Archive Ready has apparently engendered some controversy, but it's a handy resource for anyone seeking to capture and preserve Web content. Based upon my very limited experience with Archive Ready, which is still in beta testing, I suspect that it might not be able to locate streaming videos buried deep within large websites. However, its overall assessments seem pretty accurate; I entered the URLs of several sites that we've crawled repeatedly, and the sites that we've been able to capture without incident consistently received high scores from Archive Ready and the sites that have given us lots of problems consistently received low scores. If you're archiving websites, I encourage you to devote a little time to playing around with this nifty tool.

Scott also reported that the Internet Archive is looking to develop new tools to capture social media content and other types of media-heavy content that Heritrix, its Web crawler, simply can't capture properly. To the greatest extent possible, the Internet Archive will integrate these new tools and the content they capture into its existing capture and discovery mechanisms. Capturing social media content is a real challenge (and if I weren't so tired, I would blog about the great social media archiving presentation that Rachel Trent of the North Carolina State Archives gave this afternoon). It's good to see that new options may be on the horizon.

Image: quotation above the doorway to the library of the Church History Library, Salt Lake City, Utah, 14 November 2013. The Church History Library houses archival and library materials that chronicle the history of The Church of Jesus Christ of Latter-day Saints and its members from 1830 to the present.

Wednesday, November 13, 2013

Best Practices Exchange 2013: advocacy

The Best Practices Exchange (BPE), which brings together archivists, librarians, attorneys, information technology professionals, and other people seeking to preserve born-digital state government information, is my favorite archival professional event. The 2013 BPE, which is being held in Salt Lake City, Utah, began this morning, and today has -- for me, at least -- centered upon advocacy and working with stakeholders.

In the interest of keeping this post to a manageable length and getting to bed at a reasonable hour, I'm going to devote this post to this morning's plenary address. I attended a great session on working with stakeholders this afternoon, but I'm too worn-out to do it justice at this time.

Plenary speaker Bob Bennett, who represented Utah in the U.S. Senate from 1993 to 2010, offered some great advice for archivists and records managers who work with elected officials or who seek to obtain legislative support for their programs:
  • If you're seeking to acquire the records of legislators, approach them at the very beginning of their tenure, while they're still "blinky-eyed." Appeal to their ego and offer to help them set up their record-keeping systems. 
  • One of the most important things to understand about the word "lobbying" is that it's a constitutionally protected liberty. Every citizen has the right to petition the government for redress of grievances. 
  • Never ask a lawmaker to do something that is not in his or her best interest. Tailor your request in terms of what he or she needs, not what you need, and discuss how you're prepared to meet that need. You can almost always find a way to frame your request so that the legislator concludes that it would be good for him or her to do it. 
  • Be nice. People remember and respond to kindness, and you never know when someone will eventually end up in a position of power. Bennett, a conservative Republican, was able to persuade a liberal Democratic legislator to embrace one of his policy positions not only because he framed the issue in terms that appealed to her but also because, years ago, he had treated her courteously when she testified before a committee on which he served. He had forgotten that she had appeared before the committee, but she vividly remembered his civility -- and the harsh treatment she received from the other Republicans on the committee. 
  • Be mindful of the legislator's overall outlook and pet causes. If your political views differ from those of the legislator, don't draw attention to this fact. Focus on what you want the legislator to do and on how doing it will benefit him or her. 
  • Understand that you're always competing with someone else for money. Don't pick on someone else's budget item in an effort to obtain funding for your own cause. Instead, highlight how wise investment will save money in the long run. Bennett and other legislators garnered conservative support for Medicare Part D, which solid research showed would reduce hospitalization costs, by emphasizing that funding Part D would actually decrease Medicare outlays -- a key conservative goal. 
Bennett also gave insightful answers to a couple of questions that frequently confront archivists and librarians seeking to preserve digital content:
  • When asked how we can get legislators to understand that we need more money just to maintain the status quo, Bennett replied that former Librarian of Congress James Billington came to Congress with statistics regarding the volume of born-digital documents being created and the extremely short lifespan of digital files. Billington emphasized that if the Library of Congress didn't receive more funding, it would cease to be relevant within X amount of time. The library would continue to be a national treasure, but it would not remain a current resource for the nation and centuries of past investment would culminate in creation of a relic. He then asked whether the current members of Congress wanted the Library of Congress to become irrelevant on their watch. 
  • When asked how archives and libraries, which tend to focus on "quality of life" concerns, can make the case for investment in electronic records management, Bennett noted that Vietnam is the worst-documented of America's wars. Secretary of Defense Robert McNamara was the most data-driven individual ever to occupy a high-level government position, and the data that propelled his decision-making was stored on open-reel magnetic tapes that can no longer be read and encoded in formats that no one knows how to render properly. Military historians, the military academies, and the armed forces would all like to access this data, but they can't. Don't talk about quality of life. Talk about historical analysis that can inform future decisions and emphasize that libraries and archives ensure that the "seamless web of history" remains intact and accessible to future generations.
Image: Side view of the Salt Lake Temple, Salt Lake City, Utah, 13 November 2013. The Salt Lake Temple, which was completed in 1893, is the largest temple ever constructed by The Church of Jesus Christ of Latter-day Saints and is an international symbol of the Mormon faith. The building's style is rooted in Gothic and other classical forms but is unique and deeply symbolic; for example, the six spires represent the power of the church's priesthood.