Monday, November 14, 2011

BPE 2011: building digital repositories


Still playing catch-up re: the 2011 Best Practices Exchange (BPE). After the BPE ended, I spent a few days in Ohio with my parents, came back to Albany, prepped for and gave a presentation on salvaging and recovering data from electronic media, got sick, got well, got sick again, and got well again. Now I’m barreling through all kinds of personal and professional backlogs.

I took decent notes, but three weeks have elapsed. If you were there and your memory differs from mine, please let me know. I’ll update/correct as needed.


One of the most interesting BPE sessions I attended featured two speakers who focused on the creation of digital repositories. The first, Mitch Brodsky of the New York Philharmonic, discussed the creation and evolution of the Philharmonic’s repository. At present, staff are digitizing materials from the organization’s international era (1943-1970). This work will result in the digitization of 3,200 programs, 72 scrapbooks, 4,200 glass lantern slides (older but easy to do), 8,500 photographs, and 8,000 business record folders. By the end of 2012, 1.3 million pages of paper records will be digitized and the repository will house 15 terabytes of data. Digitization of audiovisual materials will add another 2 TB of data to the system. The organization also plans to add materials created during the first 98 years of the Philharmonic’s existence (1842-1940) and to incorporate late 20th and 21st century electronic records into the repository.

The project’s larger goals are equally impressive:
  • Accurate representation of originals. The Philharmonic’s archivists want the digital repository experience to match the research room experience as closely as possible. They don’t want to flatten curled records, disassemble bound volumes, or do anything else that would make the digital surrogates noticeably different from the originals. As a result, they’re using a digital camera (and the photographer who produced the digital surrogates of the Dead Sea Scrolls) to capture the originals, and many of the digital surrogates have a three-dimensional look. (Click here for an example.)
  • Comprehensiveness. Staff are sensitive to privacy concerns, but want the digital repository to be as complete as possible.
  • Easy and free accessibility. The Philharmonic expects that its digital repository Web site will be the public access mechanism for its archives.
  • A new, sharable model for digitizing large collections.
As you might expect, the repository’s technical infrastructure is pretty sophisticated -- and entirely open source:
  • ImageMagick is used to convert images delivered by the photographer into various formats and sizes.
  • OpenMigrate is used to channel data into and out of Alfresco.
  • Alfresco, the open source content management system, serves as the repository’s core. (At present, the New York Philharmonic may be the only institution using it to build a repository of archival materials, so this project really bears watching.)
  • Alfresco is not yet developed enough to meet the Philharmonic’s data entry standards. As a result, staff enter metadata into homegrown databases and then ingest the metadata into Alfresco.
  • The repository’s search functionality is handled by Solr, Apache’s search server.
  • The repository’s viewer component is a JavaScript tool developed by the Internet Archive.
  • A suggested materials component, based upon end user suggestions, ties together related materials of different types. Other end user input will be added via phpList.
  • Vanilla forums will promote end user discussions.
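As a rough illustration of the first item in that list (not the Philharmonic's actual workflow), here is a minimal Python sketch of scripting ImageMagick to turn a master image into smaller JPEG derivatives. The derivative names, pixel sizes, quality setting, and file paths are all hypothetical. The sketch also assumes the classic `convert` command is on the PATH; ImageMagick 7 installs typically use `magick` instead.

```python
import subprocess
from pathlib import Path

# Hypothetical derivative sizes; the post does not give the real settings.
DERIVATIVES = {"thumb": "200x200", "screen": "1600x1600"}


def build_convert_command(master, out_dir, label, box):
    """Build an ImageMagick command that writes a JPEG derivative of `master`."""
    target = Path(out_dir) / f"{Path(master).stem}_{label}.jpg"
    # "-resize WxH" scales the image to fit inside the box, keeping aspect ratio.
    return ["convert", str(master), "-resize", box, "-quality", "85", str(target)]


def make_derivatives(master, out_dir):
    """Run one convert command per derivative size (needs ImageMagick installed)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for label, box in DERIVATIVES.items():
        subprocess.run(build_convert_command(master, out_dir, label, box), check=True)
```

Keeping command construction separate from execution means the commands can be inspected or logged before anything is run, which fits the "do it manually before you automate" lesson below.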
Brodsky also shared a number of lessons learned. As far as I’m concerned, anyone thinking of undertaking any sort of large systems development project should devote a substantial amount of thought to each of them:
  • You don’t know what you don’t know. Brodsky never expected that he would learn PHP, become a bugtracker, or proof code. However, he’s an on-the-ground project manager, and the Philharmonic had problems with its vendor.
  • Do it manually before you automate. The Philharmonic started out doing a lot of manual review and dragging and dropping. However, doing lots of hands-on work before setting up an automated system revealed where errors pop up and enabled Brodsky to figure out how to correct them. Deep and intricate understanding of every phase of your project is a must.
  • Vendors need to earn it. Do not be laid back. The vendor is there to do right by you, and it’s their job to convince you that they can be trusted. (Hear, hear! Managing vendor relationships and retaining or taking control of projects on which vendors work was a recurring BPE 2011 theme.)
  • Archivists who develop systems are product developers. As Brodsky put it: “You are not the same sort of archivist you were before you went digital.” People are actively accessing your online resources from all over the world, and they expect that your system will be reliable.
John Sarnowski of the ResCarta Foundation then gave a demonstration of the ResCarta Toolkit, an open source, platform-independent bundle of tools that enables institutions to create digital collections that range from the very small to the very large.

The toolkit contains a variety of useful, easy-to-use tools:
  • Metadata creation: assigns institutional identifier, adds directory organization with aggregator/root identifiers, adds metadata to image files using forms, and writes Metadata Encoding and Transmission Standard (METS) XML files to root directory.
  • Data conversion: converts JPEG, PDF, TIFF, or existing ResCarta data to TIFF with embedded metadata, and writes a final object metadata.xml file with checksum. Archives and libraries have the option of using preconfigured METS XML (ResCarta metadata schemas are registered METS profiles) or applying a custom metadata template to all of the files in a given directory or tree.
  • Textual metadata editor: enables viewing and editing of OCR metadata and addition of descriptive metadata.
  • Collection manager: creates collections, manages digital objects, allows editing or augmenting object metadata, outputs METS collection level XML file, and can output Dublin Core or Open Archives Initiative Dublin Core (oai_dc) data from the collection-level metadata.
  • Indexer: creates a Lucene index of collection contents, indexes the collection level metadata, indexes all textual metadata from each TIFF, and offers rebuild and optimize options.
  • Checksum verification: creates a checksum and verifies against the original checksum.
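The checksum step in that list is a standard fixity check: record a digest when an object is created, then recompute it later and compare. Here is a rough Python sketch of the idea. It is not ResCarta's code; the function names and the choice of SHA-256 are assumptions, since the post does not say which algorithm the toolkit uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, recorded_checksum, algorithm="sha256"):
    """Return True if the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == recorded_checksum.lower()
```

A mismatch means the file has changed since the checksum was recorded, whether through corruption, a bad copy, or an edit, and the object should be restored from a known-good copy.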
A separate ResCarta Web Application facilitates Web publishing of ResCarta digital collections. Simply download the application and drop your ResCarta data directory into the application.

Libraries and archives can also use ResCarta to create metadata before adding objects to CONTENTdm, and the ResCarta Foundation is thinking of creating a tool that will enable METS and Metadata Object Description Schema (MODS) metadata to be moved into CONTENTdm in a streamlined, easy fashion.

I haven’t yet had the opportunity to play around with ResCarta -- I just bought a new computer but haven’t had the chance to get the thing hooked up to the Internet or do miscellaneous software installs -- but I was pretty intrigued and impressed. I’ll report back once I’ve spent a little time with it.

I would be remiss if I did not point out that ResCarta may not be an appropriate solution for everyone. At present, only images and textual files can be added to ResCarta repositories; the ResCarta Foundation is, understandably, waiting for the emergence of good, widely accepted metadata standards for audiovisual materials. However, if you want to build a simple digital repository to house digital images and textual records, by all means check ResCarta out.

Image: Mary Todd Lincoln Home, Lexington, Kentucky, 22 October 2011. William Palmateer built this two-story brick, late Georgian house, which originally served as an inn, in 1803-1806. It was soon purchased by Robert Smith Todd, one of Lexington's most affluent men, and became a home for the growing Todd family. Mary Todd Lincoln was born in 1818 and resided in this home, which is a stone's throw away from the hotel at which the BPE was held, until she married Abraham Lincoln in 1842.
