
Monday, November 14, 2011

BPE 2011: building digital repositories


Still playing catch-up re: the 2011 Best Practices Exchange (BPE). After the BPE ended, I spent a few days in Ohio with my parents, came back to Albany, prepped for and gave a presentation on salvaging and recovering data from electronic media, got sick, got well, got sick again, and got well again. Now I’m barreling through all kinds of personal and professional backlogs.

I took decent notes, but three weeks have elapsed. If you were there and your memory differs from mine, please let me know. I’ll update/correct as needed.


One of the most interesting BPE sessions I attended featured two speakers who focused on the creation of digital repositories. The first, Mitch Brodsky of the New York Philharmonic, discussed the creation and evolution of the Philharmonic’s repository. At present, staff are digitizing records from the organization’s international era (1943-1970), an effort that will encompass 3,200 programs, 72 scrapbooks, 4,200 glass lantern slides (older, but easy to digitize), 8,500 photographs, and 8,000 business record folders. By the end of 2012, 1.3 million pages of paper records will be digitized and the repository will house 15 terabytes of data. Digitization of audiovisual materials will add another 2 TB of data to the system. The organization also plans to add materials created during the first 98 years of the Philharmonic’s existence (1842-1940) and to incorporate late 20th and 21st century electronic records into the repository.

The project’s larger goals are equally impressive:
  • Accurate representation of originals. The Philharmonic’s archivists want the digital repository experience to match the research room experience as closely as possible. They don’t want to flatten curled records, disassemble bound volumes, or do anything else that would make the digital surrogates noticeably different from the originals. As a result, they’re capturing the originals with a digital camera -- operated by the photographer who produced the digital surrogates of the Dead Sea Scrolls -- and many of the digital surrogates have a three-dimensional look. (Click here for an example.)
  • Comprehensiveness. Staff are sensitive to privacy concerns, but want the digital repository to be as complete as possible.
  • Easy and free accessibility. The Philharmonic expects that its digital repository Web site will be the public access mechanism for its archives.
  • A new, sharable model for digitizing large collections.
As you might expect, the repository’s technical infrastructure is pretty sophisticated -- and entirely open source:
  • ImageMagick is used to convert images delivered by the photographer into various formats and sizes (a rough sketch of this kind of derivative generation follows this list).
  • OpenMigrate is used to channel data into and out of Alfresco.
  • Alfresco, the open source content management system, serves as the repository’s core. (At present, the New York Philharmonic may be the only institution using it to build a repository of archival materials, so this project really bears watching.)
  • Alfresco is not yet developed enough to meet the Philharmonic’s data entry standards, so staff enter metadata into homegrown databases and then ingest it into Alfresco.
  • The repository’s search functionality is handled by Solr, Apache’s search server.
  • The repository’s viewer component is a JavaScript tool developed by the Internet Archive.
  • A suggested-materials component, based upon end user suggestions, will tie together related materials of different types; other end user input will be added via phpList.
  • Vanilla Forums will host end user discussions.
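To make the ImageMagick step above a bit more concrete, here is a minimal sketch of the kind of derivative generation it handles -- batch-converting master TIFFs into web-friendly JPEGs. The directory names and sizes are my own assumptions, not the Philharmonic’s actual workflow:

    import subprocess
    from pathlib import Path

    # Hypothetical layout: "masters" holds the photographer's TIFFs,
    # "access" receives the derivatives destined for the repository.
    MASTERS = Path("masters")
    ACCESS = Path("access")
    ACCESS.mkdir(exist_ok=True)

    for tiff in sorted(MASTERS.glob("*.tif")):
        jpeg = ACCESS / (tiff.stem + ".jpg")
        # ImageMagick's convert: shrink anything larger than 2000px on its
        # long side (the ">" modifier) and write a compressed JPEG.
        subprocess.run(
            ["convert", str(tiff), "-resize", "2000x2000>", "-quality", "85", str(jpeg)],
            check=True,
        )

In practice a project of this scale would also generate thumbnails and other derivative sizes, but the pattern is the same.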
Brodsky also shared a number of lessons learned. As far as I’m concerned, anyone thinking of undertaking any sort of large systems development project should devote a substantial amount of thought to each of them:
  • You don’t know what you don’t know. Brodsky never expected that he would learn PHP, become a bug tracker, or proof code. But he is the on-the-ground project manager, and the Philharmonic had problems with its vendor, so he ended up doing all three.
  • Do it manually before you automate. The Philharmonic started out doing a lot of manual review and dragging and dropping. However, doing lots of hands-on work before setting up an automated system revealed where errors pop up and enabled Brodsky to figure out how to correct them. Deep and intricate understanding of every phase of your project is a must.
  • Vendors need to earn it. Do not be laid back. The vendor is there to do right by you, and it’s their job to convince you that they can be trusted. (Hear, hear! Managing vendor relationships and retaining or taking control of projects on which vendors work was a recurring BPE 2011 theme).
  • Archivists who develop systems are product developers. As Brodsky put it: “You are not the same sort of archivist you were before you went digital.” People are actively accessing your online resources from all over the world, and they expect that your system will be reliable.
John Sarnowski of the ResCarta Foundation then gave a demonstration of the ResCarta Toolkit, an open source, platform independent bundle of tools that enables institutions to create digital collections that range from the very small to the very large.

The toolkit contains a variety of useful, easy-to-use tools:
  • Metadata creation: assigns an institutional identifier, adds directory organization with aggregator/root identifiers, adds metadata to image files using forms, and writes Metadata Encoding and Transmission Standard (METS) XML files to the root directory.
  • Data conversion: converts JPEG, PDF, TIFF, or existing ResCarta data to TIFF with embedded metadata and writes a final object metadata.xml file with a checksum. Archives and libraries have the option of using preconfigured METS XML (ResCarta metadata schemas are registered METS profiles) or applying a custom metadata template to all of the files in a given directory or tree.
  • Textual metadata editor: enables viewing and editing of OCR metadata and addition of descriptive metadata.
  • Collection manager: creates collections, manages digital objects, allows editing or augmenting object metadata, outputs a collection-level METS XML file, and can output Dublin Core or Open Archives Initiative Dublin Core data from the collection-level metadata.
  • Indexer: creates a Lucene index of collection contents, indexes the collection-level metadata, indexes all textual metadata from each TIFF, and offers rebuild and optimize options.
  • Checksum verification: creates a checksum for each object and verifies it against the original checksum (a minimal sketch of this kind of fixity check follows this list).
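Since fixity checking is the piece of this list most of us will want to replicate elsewhere, here is a minimal sketch of the kind of comparison the checksum verification tool performs. I don’t know which digest algorithm ResCarta actually records, so the SHA-1 choice, the filename, and the stored checksum value below are purely illustrative:

    import hashlib
    from pathlib import Path

    def file_digest(path: Path, algorithm: str = "sha1") -> str:
        """Hash a file in chunks so large TIFFs never have to fit in memory."""
        digest = hashlib.new(algorithm)
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Compare a freshly computed digest against the value recorded at ingest.
    recorded = "0a1b2c..."  # placeholder for the checksum stored in the object metadata
    current = file_digest(Path("00000001.tif"))
    print("OK" if current == recorded else "MISMATCH -- investigate this object")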
A separate ResCarta Web Application facilitates Web publishing of ResCarta digital collections: simply download the application and drop your ResCarta data directory into it.

Libraries and archives can also use ResCarta to create metadata before adding objects to CONTENTdm, and the ResCarta Foundation is thinking of creating a tool that will enable METS and Metadata Object Description Schema (MODS) metadata to be moved into CONTENTdm in a streamlined, easy fashion.

I haven’t yet had the opportunity to play around with ResCarta -- I just bought a new computer, but haven’t had the chance to get the thing hooked up to the Internet or do miscellaneous software installs -- but I was pretty intrigued and impressed. I’ll report back after I’ve had the chance to work with it a little bit.

I would be remiss if I did not point out that ResCarta may not be an appropriate solution for everyone: at present, only images and textual files can be added to ResCarta repositories; the ResCarta Foundation is, understandably, waiting for the emergence of good, widely accepted metadata standards for audiovisual materials. However, if you want to build a simple digital repository to house digital images and textual records, by all means check ResCarta out.

Image: Mary Todd Lincoln Home, Lexington, Kentucky, 22 October 2011. William Palmateer built this two-story brick, late Georgian house, which originally served as an inn, in 1803-1806. It was soon purchased by Robert Smith Todd, one of Lexington's most affluent men, and became a home for the growing Todd family. Mary Todd Lincoln was born in 1818 and resided in this home, which is a stone's throw away from the hotel at which the BPE was held, until she married Abraham Lincoln in 1842.

Friday, October 21, 2011

BPE 2011: emerging trends


The 2011 Best Practices Exchange (BPE) proceeds apace, and today I’m going to focus upon yesterday’s plenary session, which featured Leslie Johnston, the Director of Repository Development at the Library of Congress (LC). Johnston devoted a lot of time to discussing ViewShare, LC’s new visualization and metadata augmentation tool, but I’ll discuss ViewShare in a forthcoming post about some of the new tools discussed at this year’s BPE. Right now, I simply want to furnish an overview of her exhilarating and somewhat unsettling assessment of the changing environment in which librarians and archivists work:
  • Users do not use digital collections in the same way as they use paper collections, and we cannot guess how digital collections will be used. For example, LC assumed that researchers would want textual records, but a growing number of researchers want image files of textual records.
  • Until recently, stewardship organizations have talked about collections, series, etc., but not data. Data is not just generated by satellites, experiments, or surveys; publications and archival records also contain data.
  • We also need to start thinking in terms of “Big Data.” Its definition -- data too voluminous to be easily manipulated with common tools or managed and stewarded by any one institution -- is rather fluid, but we need to be aware that Big Data may have commercial value, as evidenced by the increasing interest of firms such as Ancestry.com in the data found in our holdings.
  • More and more, researchers want to use collections as a whole and to mine and organize the collections in novel ways. They do so using algorithms and new tools that create visual images, transforming data into knowledge. For example, the Digging into Data project examined ways in which many types of information, including images, film, sound, newspapers, maps, art, archaeology, architecture, and government records, could be made accessible to researchers. One researcher wanted to digitally mine information from millions of digitized newspaper pages and see whether doing so could enhance our understanding of the past. LC’s experience with archiving Web sites also underscores this point. LC initially assumed that researchers would browse through the archived sites. However, researchers want access to all of the archived site files and to use scripts to search for the information they want. They don’t want to read Web pages. Owing to the large size of our collections, the lack of good tools, and the permissions we secured when LC crawled some sites, this is a challenge.
  • The sheer volume of the electronic data cultural stewardship organizations need to keep is a challenge. LC has acquired the Twitter archive, which currently consists of 37 billion individual tweets and will expand to approximately 50 billion tweets by year’s end. The archive grows by 6 million tweets an hour. LC is struggling to figure out how best to manage, preserve, and provide comprehensive access to this mass of data, which researchers have already used to study the geographic dissemination of news, the spread of epidemics, and the transmission of new uses of language.
  • We have to switch to a self-serve model of reference services. Growing numbers of researchers do not want to come to us, ask questions of us, and then use our materials in our environment. They want to find the materials they need and then pull them out of our environment and into their own workspaces. We need to create systems and mechanisms that make it easy for them to do so. As a result, we need to figure out how to support real-time querying of billions of full-text items and the frequent downloading by researchers of collections that may be over 200 TB each. We also need to think about providing tools that support various forms of collection analysis (e.g., visualization).
  • We can’t be afraid of cloud computing. Given the volumes of data coming our way and mounting researcher demands for access to vast quantities of data, the cloud is the only feasible mechanism for storing and providing access to the materials that will come our way. We need to focus on developing authentication, preservation, and other tools that enable us to keep records in the cloud.
There’s lots and lots of food for thought here -- including a few morsels that will doubtless induce indigestion in more than a few people -- and it’s just a taste of what’s coming our way. If we don’t come to terms with at least some of these changes, we as a profession will really suffer in the coming years. Let's hope that we have the will and the courage to do so.

A bottle of locally brewed Kentucky Bourbon Barrel Ale at Alfalfa Restaurant, Lexington, Kentucky, 20 October 2011. I highly recommend both the ale and the restaurant, but please note that Kentucky Bourbon Barrel Ale is approximately 8 percent alcohol. Just like the BPE, it's a little more intoxicating than one might expect.

Thursday, October 20, 2011

BPE 2011: ERA and the move to the cloud


This week, I’m spending a little time with my parents in Ohio and at the 2011 Best Practices Exchange (BPE) in Lexington, Kentucky. The BPE, which brings together state government, academic, and other archivists, librarians, and other people seeking to preserve state government information of enduring value, is my favorite archival conference. The Society of American Archivists annual meeting is always first-rate, but it’s gotten a little overwhelming, and I love the Mid-Atlantic Regional Archives Conference (MARAC), but nothing else has the small size, tight focus on state government records, informality, and openness that characterize the BPE.

Before I start detailing today’s highlights, I should say a few things about the content of these posts. For the past few years, those of us who have attended the BPE have tried to adhere to the principle that “what happens at BPE, stays at BPE.” This doesn’t mean that we don’t share what we’ve learned at the BPE (hey, I’m blogging about it!), but it does mean that we’re sensitive to the fact that candor is both essential and risky. The BPE encourages people to speak honestly about how and why projects or programs went wrong and what they learned from the experience. Openness of this sort is encouraging; all too often, we think that we’re alone in making mistakes. It's also helpful: pointing out hidden shallows and lurking icebergs helps other people avoid them. However, sometimes a lack of senior management commitment, conflicts with IT personnel, and other internal problems contribute to failure, and colleagues and supervisors occasionally regard discussion of internal problems as a betrayal. BPE attendees should therefore exercise some discretion, and those of us who blog about the BPE should be particularly careful; our posts are a single Web search away. For these reasons, in a few instances I may write about the insights and observations that attendees have shared but obscure identifying details.

Moving on to this year's BPE itself, I'm going to devote the rest of this post to the insights and predictions offered up by U.S. National Archives and Records Administration (NARA) Chief Information Officer Mike Wash, who spoke this morning about the Electronic Records Archives (ERA), NARA’s complex, ambitious, and at times troubled electronic records system, and some changes that are on the horizon.

At present, ERA sort of works: staff use it to take in, process, and store electronic records. The system currently holds approximately 130 TB of data, and the Office of Management and Budget wants NARA to take in 10 TB of data per quarter; NARA is working with agencies to meet this benchmark. However, ERA lacks an integrated access mechanism, and it consists of multiple modules: the Base module handles executive agency data, the EOP module handles presidential records (and includes some internal access mechanisms), the Classified module holds classified records, and several other modules were built to deal with specific problems.

Building ERA taught NARA several lessons:
  • Solution architecture is critical. ERA’s multiple modules are a sign of a failed system architecture. Anyone building such a system must consider the business and technical architecture carefully during the planning stage and must manage it carefully over time.
  • The governance process must be clear and should start with business stakeholders. What do they really need the system to do, and how do you ensure that everyone stays on the same page throughout the process? Information technology invariably challenges control and authority, but if you set up your governance process properly, you should be able to retain control over system development.
  • Over communicate. Funders and other powerful groups need frequent updates; failure to keep feeding information to them can be profoundly damaging.
  • You must manage the project. The federal government tends to hire contractors to develop IT systems, and contractor relationships tend to deteriorate about six months after the contract is awarded. Most federal agencies cede authority to contractors because they are loath to be seen as responsible in the event that a project fails, but staying in control of the project increases your chances that you'll get the system you want.
  • Watch costs closely. Cost-escalating provisions have a way of sneaking into contracts.
  • Be mindful of intellectual property issues. The federal government typically reserves the right to all intellectual property created as a result of contracts, but this doesn’t always happen, and the vendor that built the first iteration of ERA has asserted that it controls some of the technology that now makes the system work; NARA will be much more assertive in working with future ERA vendors.
Wash also made some intriguing observations about some of the challenges that NARA and other archives are confronting:
  • At present, our ability to acquire data is constrained by bandwidth. It takes more than three days to convey 20 TB of data over a 1 Gbps line and at least a month to convey it via the Internet (see the back-of-the-envelope arithmetic after this list). NARA recently took custody of 330 TB of 2010 Census data, and it did so by accepting a truckload of hardware; at present, there are no alternatives to this approach.
  • The rate of data creation continues to accelerate. The administration of George W. Bush created 80 TB of records over the course of 8 years, but the Obama administration likely created more than 80 TB of data during its first year.
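The bandwidth bullet above is easy to sanity-check. The arithmetic below is my own, not Wash’s: at the theoretical line rate, 20 TB takes just under two days to move over 1 Gbps, and once you assume more realistic effective throughput (60 percent here, purely as an assumption), you land right around his three-day figure -- and at roughly seven weeks for the 330 TB Census transfer, which makes the truckload of hardware look entirely sensible:

    # Back-of-the-envelope transfer times; the 60% efficiency figure is an assumption.
    TB = 10**12      # decimal terabyte, in bytes
    GBPS = 10**9     # 1 Gbps, in bits per second

    def transfer_days(size_bytes, bits_per_second, efficiency=1.0):
        """Days needed to move size_bytes at a given line rate and effective efficiency."""
        seconds = (size_bytes * 8) / (bits_per_second * efficiency)
        return seconds / 86_400

    print(round(transfer_days(20 * TB, GBPS), 1))        # ~1.9 days at the raw line rate
    print(round(transfer_days(20 * TB, GBPS, 0.6), 1))   # ~3.1 days at 60% effective throughput
    print(round(transfer_days(330 * TB, GBPS, 0.6)))     # ~51 days for the 330 TB Census transfer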
Wash indicated that NARA is starting to think that federal records should be created and maintained in a cloud computing environment and that transfer of custody from the creating agency to NARA should be effected by changing some of the metadata associated with the records being transferred.
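To make that idea concrete, here is a purely speculative sketch of what a metadata-only custody transfer might look like. Nothing here reflects NARA’s actual schema or ERA’s data model; it simply illustrates the notion that the records stay put in the cloud while a custodian field and a provenance trail change:

    from datetime import datetime, timezone

    # Illustrative record metadata only -- not NARA's schema.
    record = {
        "identifier": "record-0001",
        "custodian": "Creating Agency",
        "custody_history": [],
    }

    def transfer_custody(rec, new_custodian):
        """'Move' a record by rewriting custody metadata; the underlying bits never move."""
        rec["custody_history"].append({
            "from": rec["custodian"],
            "to": new_custodian,
            "transferred_at": datetime.now(timezone.utc).isoformat(),
        })
        rec["custodian"] = new_custodian
        return rec

    transfer_custody(record, "National Archives and Records Administration")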

Wash noted that the move to cloud computing will bring to the fore new preservation and authentication concerns. It also struck me that the transition that Wash envisions assumes the existence of a single federal government cloud that has adequate storage, security, and access controls and that, at least at this time, many states aren’t yet thinking of constructing such environments. Individual state agencies may be thinking of moving to the cloud, but most states don't seem to be preparing to move to a single, statewide cloud environment. Moreover, owing to its sheer size, the federal government is better able to negotiate favorable contract terms than state or local governments; the terms of service agreements that the feds hammered out with various social media providers are an excellent example. I have the uneasy feeling that some governments will accept, out of lack of knowledge, desperate financial straits, or inability to negotiate optimal terms, public cloud service contracts that prove problematic or outright disastrous.

It’s nonetheless apparent that government computing will move into the cloud, that this transition offers both new challenges and new opportunities for managing and preserving records, and that archivists and records managers are going to have to come to grips with these changes. The next decade promises to be most interesting.

The Lexington Laundry Company building on West Main Street, Lexington, Kentucky, 20 October 2011. This little gem was built ca. 1929, is an outstanding example of Art Deco architecture in the city, and is part of Lexington's protected Downtown Commercial District. It now houses an art gallery.

Thursday, August 4, 2011

Call for Proposals: Best Practices Exchange 2011

As a member of the 2011 Best Practices Exchange Program Committee, I am delighted to bring you the following announcement -- and I hope to see you in the Bluegrass State this fall!

The sixth annual Best Practices Exchange (BPE) will be held at the Hyatt Regency in Lexington, Kentucky, from 20-22 October 2011. The BPE focuses on the management of digital information in state government, and it brings together librarians, archivists, information technology professionals, and other practitioners to discuss their real-world experiences, including best practices and lessons learned.

Following the format of past Best Practices Exchanges, the 2011 Program Committee encourages you, the attendees, to present your projects and experiences, successes, failures, and lessons learned. This year's conference has four broad tracks, each enumerated below along with a list of the themes it embraces. We ask that potential speakers be guided, but not limited, by the themes indicated.

Each session will be 90 minutes long with two or more speakers per session. We ask that you keep presentations to 10-15 minutes to allow for discussion and engagement with the audience.

Proposals should include an abstract of 100 words or less, the proposed track (if applicable), and the name, title, email, phone number, and organization of each presenter. You may submit a proposal for a single speaker, who will then be paired with others by the Program Committee, or a proposal for a full session with multiple speakers (please contact and confirm the other speakers prior to submission).

For more information about proposals, please see the Presentations page on the BPE 2011 Web site.

1) Access: Online access; should everything be accessible; FOIA/Open Records issues; legal issues with access

2) Sustainability: Budget/funding issues; technology (IT consolidation, lack of IT support); life after the grant; evaluation, statistics, and user feedback.

3) Digital Projects: Lessons learned; what worked and what didn't; solutions; new tools or services

4) Collaboration and Community: Support groups and user communities; shared services; user services; library/archives crossovers

Proposals are due by September 15, 2011. Please send all session proposals to Mark Myers, Kentucky Department for Libraries and Archives, at mark.myers[at]ky.gov

The hotel cost will be $139/night and conference registration will be $125. Registration information will be posted to the BPE Web site soon.

Be sure to friend the new BPE Facebook page!