
Tuesday, April 13, 2010

New digital preservation video from the Library of Congress



Have you ever found yourself trying to explain to interested laypeople -- friends, relatives, elected officials -- precisely why it's so hard to keep electronic files intact and accessible over time? If so, be sure to check out Why Digital Preservation is Important for Everyone, the latest video from the Library of Congress's National Digital Information Infrastructure and Preservation Program. In less than three minutes, the video touches upon most of the big threats to digital materials (wait until you see someone attempting to insert a Zip disk and an open-reel data tape into a netbook!) and emphasizes the need for "active management" of electronic files. It's an accessible, non-technical introduction for people who aren't familiar with the challenges of preserving digital materials, and a great resource and model for those of us who must cultivate support for digital preservation. A full transcript is available.

Sunday, August 23, 2009

Library of Congress BagIt video



I saw this informative and fun video while I was at the Library of Congress's National Digital Information Infrastructure and Preservation Program grant partners meeting in June, and I've been waiting for it to appear on YouTube. The Library of Congress actually put this video online about a month ago, but I was so focused on work and on getting to Austin that I didn't notice it had been added to the Library's YouTube channel.

BagIt is an open-source file packaging specification, with accompanying tools, developed by the Library of Congress, Stanford University, and the California Digital Library; it facilitates the transfer of digital materials from creators' systems to those of archives and libraries. BagIt enables users to place materials slated for transfer into a "bag" and then automatically generates digital packing lists (manifests) and authenticity checks (i.e., checksums). When the bag, which can be transferred on portable media or over a computer network, reaches its destination, the recipient can verify that all of the files in the bag are present and that none of them have been changed.
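If you're curious about what's inside a bag, the core idea is simple enough to sketch. Here's a minimal Python illustration of the two pieces described above -- a checksum manifest that serves as the packing list, and a verification pass the recipient runs on arrival. This is my own toy sketch, not the LC tools; real bags carry additional tag files and support checksum algorithms other than MD5.

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path) -> None:
    """Turn files already placed in bag_dir/data/ into a minimal bag:
    a bagit.txt declaration plus an MD5 manifest (the 'packing list')."""
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    lines = []
    for f in sorted((bag_dir / "data").rglob("*")):
        if f.is_file():
            digest = hashlib.md5(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n")

def verify_bag(bag_dir: Path) -> bool:
    """On receipt, confirm every manifest entry is present and unchanged."""
    ok = True
    for line in (bag_dir / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split("  ", 1)
        f = bag_dir / rel_path
        if not f.is_file():
            print(f"MISSING: {rel_path}")
            ok = False
        elif hashlib.md5(f.read_bytes()).hexdigest() != digest:
            print(f"ALTERED: {rel_path}")
            ok = False
    return ok
```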

I haven't yet had the chance to play around with the BagIt software, which is available on SourceForge, but I know several other people who have, and they're all pretty impressed by it. Electronic records archivists and digital librarians need tools such as BagIt, and the Library of Congress and its partners deserve kudos for making BagIt freely available.

NB: be sure to ponder the "inscrutable digital voodoo." Heh.

Friday, June 26, 2009

NDIIPP project partners meeting, day three

Union Station, Washington, DC, at around 3:00 PM today.

The Library of Congress (LC) National Digital Information Infrastructure and Preservation Program (NDIIPP) partners meeting wrapped up this afternoon. This morning’s presentations concerned the Unified Digital Format Registry, the PREMIS metadata schema, the Federal Digitization Standards Working Group, and the LC’s proposed National Digital Stewardship Alliance.

Right after breakfast, Andrea Goethals (Harvard University) discussed the Unified Digital Format Registry (UDFR) and the importance of file format registries generally. One of the main goals of digital preservation is to ensure that digital information remains useful over time, and as a result we must determine whether a given resource has become, or is likely to become, unusable. In order to do so, we need to answer a series of questions:
  • Which file format is used to encode the information?
  • What current technologies can render the information properly?
  • Does the format have sustainability issues (e.g., intellectual property restrictions)?
  • How does the digital preservation community view the format?
  • What alternative formats could be used for this information?
  • What software can transform the information from its existing format to an alternative format?
  • Can emulation software provide access to the existing format?
  • Is there enough documentation to write a viewing/rendering application that can access this format?
Format registries avoid the need to reinvent the wheel: they pool knowledge so that repositories can make use of each other’s research and expertise.
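To make the first of those questions concrete: format identification is typically signature-based, matching a file’s leading bytes against a registry of known “magic numbers.” Here’s a toy Python sketch of the idea; the signature bytes are real, but the registry identifiers are illustrative placeholders, not authoritative PRONOM PUIDs.

```python
# Toy signature-based format identification, loosely in the spirit of
# PRONOM-style registries. The magic numbers are real; the registry IDs
# are illustrative placeholders, not authoritative PUIDs.
SIGNATURES = [
    (b"%PDF-", "example-id/pdf", "Portable Document Format"),
    (b"\x89PNG\r\n\x1a\n", "example-id/png", "Portable Network Graphics"),
    (b"PK\x03\x04", "example-id/zip", "ZIP (also wraps OOXML and ODF files)"),
    (b"GIF89a", "example-id/gif", "GIF 89a"),
]

def identify(path):
    """Match a file's leading bytes against the known signatures."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, reg_id, name in SIGNATURES:
        if header.startswith(magic):
            return reg_id, name
    return None, "unidentified -- deeper parsing or a registry lookup needed"

print(identify("thesis.pdf"))  # hypothetical file name
```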

The UDFR was created with the April 2009 merger of the two largest format registry initiatives: PRONOM, which has been publicly accessible for some time, and the Global Digital Format Registry (GDFR), which was still under development at the time of the merger. The UDFR will make use of PRONOM’s existing software and data and the GDFR’s use cases, data model, and distributed architectural model. Moreover, it will incorporate local registry extensions for individual repositories and support distributed data input. At present, it’s governed by an interim group of academic institutions, national archives, and national libraries; a permanent governing body will be established in November 2009.

I’ve used PRONOM quite a bit, so I’m really looking forward to seeing the UDFR.

Rebecca Guenther (LC) then furnished a brief overview of the Preservation Metadata: Implementation Strategies (PREMIS) schema and recent PREMIS-related developments.

The PREMIS schema, which was completed in 2005, is meant to capture all of the information needed to make sure that digital information remains accessible, comprehensible, and intact over time. It is also meant to be practical: it is system/platform neutral, and each metadata element is rigorously defined, supported by detailed usage guidelines and recommendations, and (with very few exceptions) meant to be system-generated, not human-created.

I’m sort of fascinated by PREMIS and have drawn from it while working on the Persistent Digital Archives and Library System (PeDALS) project, but I haven’t really kept up with recent PREMIS developments. It was interesting to learn that the schema is now extensible: externally developed metadata (e.g., XML-based electronic signatures, format-specific metadata schemes, environment information, other rights schemas) can now be contained within PREMIS.

I was also happy to learn that the PREMIS development group is working on incorporating controlled vocabularies for at least some of the metadata elements and that this work will be made available on the Web via id.loc.gov.

The group is also working on a variety of other things, including:
  • Draft guidelines for using PREMIS with the Metadata Encoding and Transmission Standard (METS)
  • A tool that will convert PREMIS to METS and vice versa
  • An implementers registry (www.loc.gov/premis/premis-registry.html)
  • Development of a tool (most likely a self-assessment checklist) that will verify that PREMIS implementers are using the schema correctly
  • A tool for extracting metadata and populating PREMIS XML schemas
Guenther also shared one tidbit of information that I found really interesting: although PREMIS allows metadata to be kept at the file, representation, and bitstream level, repositories may opt to maintain only file-level or file- and representation-level metadata. I hadn’t interpreted the schema in this manner, and someone else at the meeting was similarly surprised.
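For a sense of what file-level PREMIS metadata looks like in practice, here’s a hedged Python sketch that emits a skeletal PREMIS-style object record with fixity information. The element names come from the PREMIS data dictionary, but this is an illustrative subset, not a complete or schema-valid serialization.

```python
import hashlib
import os
import xml.etree.ElementTree as ET

# Skeletal, file-level PREMIS-style object record with fixity info.
# Element names follow the PREMIS data dictionary, but this is an
# illustrative subset, not a complete or schema-valid document.
PREMIS_NS = "info:lc/xmlns/premis-v2"

def file_object_entry(path):
    ET.register_namespace("premis", PREMIS_NS)
    obj = ET.Element(f"{{{PREMIS_NS}}}object")
    ident = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectIdentifier")
    ET.SubElement(ident, f"{{{PREMIS_NS}}}objectIdentifierType").text = "local"
    ET.SubElement(ident, f"{{{PREMIS_NS}}}objectIdentifierValue").text = path
    chars = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectCharacteristics")
    fixity = ET.SubElement(chars, f"{{{PREMIS_NS}}}fixity")
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    ET.SubElement(fixity, f"{{{PREMIS_NS}}}messageDigestAlgorithm").text = "SHA-256"
    ET.SubElement(fixity, f"{{{PREMIS_NS}}}messageDigest").text = digest
    ET.SubElement(chars, f"{{{PREMIS_NS}}}size").text = str(os.path.getsize(path))
    return obj

print(ET.tostring(file_object_entry("report.pdf"), encoding="unicode"))  # hypothetical file
```

Note that nearly everything in a record like this can be system-generated, which is exactly what the schema’s designers intended.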

A quick update on the work of the Federal Digitization Standards Working Group followed. Carl Fleischauer (LC) explained that the group, which consists of an array of federal government agencies, is assembling objectives and use cases for various types of digitization efforts (e.g., production of still image master copies). To date, the group’s work has focused largely on still images: it has put together a specification for TIFF header information and will look at the Extensible Metadata Platform (XMP) schema. In an effort to verify that scanning equipment faithfully reproduces original materials, it is also developing device and object targets and an accompanying software application, DICE (currently in beta).
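Since TIFF headers came up: a TIFF file opens with a small, rigidly structured header that points to a directory of tagged fields, and it’s those fields that a header specification standardizes. Here’s a small Python sketch of my own (not the working group’s tooling) that reads just the byte-order mark, magic number, and the first directory’s entry count.

```python
import struct

def read_tiff_header(path):
    """Read a TIFF file's byte-order mark, magic number, and the tag
    count of its first image file directory (IFD)."""
    with open(path, "rb") as f:
        byte_order = f.read(2)  # b'II' = little-endian, b'MM' = big-endian
        endian = "<" if byte_order == b"II" else ">"
        magic, ifd_offset = struct.unpack(endian + "HI", f.read(6))
        assert magic == 42, "not a TIFF file"
        f.seek(ifd_offset)
        (entry_count,) = struct.unpack(endian + "H", f.read(2))
    return byte_order.decode("ascii"), entry_count

# Each of the entry_count 12-byte IFD entries holds a tagged field such
# as ImageWidth (tag 256), BitsPerSample (258), or Artist (315).
print(read_tiff_header("scan-0001.tif"))  # hypothetical file name
```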

The group is also working on a specification for digitization of recorded sound and developing audio header standards. However, it is waiting for agencies to gain more experience before it tackles video.

The meeting ended with a detailed overview of LC’s plan to establish a group that will sustain NDIIPP's momentum. The program has just achieved permanent status in the federal budget, and all of the grant projects that it funded will end next year.

In an effort to sustain the partnerships developed during the grant-driven phase of NDIIPP’s existence, LC would like to create an organization that it is tentatively calling the National Digital Stewardship Alliance. Meg Williams of LC’s Office of Counsel outlined what the organization’s mission and governance might look like; before creating the final draft charter, LC will host a series of conference calls and develop an online mechanism that will enable the partners to provide input.

LC anticipates that this alliance, which is intended to be low-cost, flexible, and inclusive, will help to sustain existing partnerships and form new ones. In order to ensure that the organization remains viable, LC envisions that the organization will consist of LC itself, members, and associates:
  • Organizations willing to commit to making sustained contributions to digital preservation research would, at the invitation of LC, become full members of the alliance and would enjoy full voting rights. Member organizations would not have to agree to undertake specific actions or projects, but they would have to commit to remaining involved in the alliance over time.
  • Individuals and organizations that cannot commit to making ongoing, sustained contributions to digital preservation research but have an abiding interest in digital preservation, support the alliance’s mission, and are willing to share their expertise would, at LC’s invitation, become associates. Associates will not have voting status.
  • LC itself will serve as the alliance’s chair and secretariat and will use program funding to support its activities; there will be no fees for members or associates. It will also maintain a clearinghouse and registry of information about content, standards, practices and procedures, tools, services, and training resources. It will also facilitate connections between members and associates who have common interests, convene stakeholders to develop shared understanding of digital stewardship principles and practices, report periodically on digital stewardship, and provide grant funding if such monies are available.
LC projects that the alliance will have several standing committees responsible for researching specific areas of interest:
  • Content: contributing significant materials to the “national collection” to be preserved and made available to current and future generations.
  • Standards and practices: developing, following, and promoting effective methods for identifying, preserving, and providing access.
  • Infrastructure: developing and maintaining curation and preservation tools, providing storage, hosting, migration, and other services, and building a collection of open-source tools.
  • Innovation: encouraging and conducting research, periodically describing a research agenda.
  • Outreach and education: for hands-on practitioners, other stakeholders, funders, and the general public.
  • Identifying new roles: as needed.
LC also sees these committees as having a governance role: at present, it envisions that the alliance’s Governing Council will consist of the Librarian of Congress, the LC Director of New Initiatives, the chairs of all of the standing committees, and a few members at large.

Williams closed by asking everyone present to think about the proposal, consider how to define “success” and “failure” for the alliance, identify the benefits of participation for their own institutions and for others, and supply feedback to LC. LC hopes to have a final draft charter finished later this year.

At this point, I think that creating some sort of formal organization makes a lot of sense, but I don’t have any strong feelings one way or another about the specifics of LC’s proposal. The past few days have been jam-packed. Even though I relished the opportunity to hear about what’s happening with NDIIPP and to meet face-to-face with my PeDALS project colleagues -- several people told me that the PeDALS group struck them as really hard-working and really fun, and they’re right -- I’m really feeling the need to get home (I’m writing this post on the train), get some sleep, and reflect on everything that took place over the past couple of days. I’ll keep you posted . . . .

George Washington Bridge, Hudson River, as seen from Amtrak train no. 243 at around 8:30 PM.

Wednesday, June 24, 2009

NDIIPP project partners meeting, day one

I’m in Washington, DC for the National Digital Information Infrastructure and Preservation Program (NDIIPP) grant partners meeting. NDIIPP is a program of the Library of Congress (LC), and we learned today that it has just been awarded permanent status in the federal budget. As a result, the program should receive an annual appropriation and become a permanent part of the digital preservation landscape.

In response to this change, LC is thinking of creating a National Digital Stewardship Alliance (the name may change), which would allow current NDIIPP partners to continue working with LC and attract new partners. Organizations that are willing and able to direct resources to NDIIPP initiatives will have a voice in the operations of the alliance, and other interested institutions and individuals can become observers. I’ll be sure to post more information about this alliance as it becomes available.

Martha Anderson (LC) opened the meeting with a quick overview of NDIIPP’s progress to date, and in the process she highlighted a simple fact that reinforces the conclusions that the PeDALS project partners and many other people are reaching: “metadata is your worldview.” In other words, no two organizations use metadata in precisely the same way, and while there may be broad agreement as to how standards should be used, there must always be room for local practice.

The keynote speaker, Clay Shirky, noted that in many respects this persistence of local practice is a very good thing: the existence of different preservation regimes, practices, and approaches reduces the risk of catastrophic error.

Shirky’s work focuses on the social and economic impact of Internet technologies, and his address and follow-up comments highlighted the social nature of digital preservation problems.

The Internet has enabled just about anyone who wishes to create media to do so. Instead of the top-down, one-to-one or one-to-many models that have characterized the production of information since the invention of the printing press, we are seeing the emergence of a many-to-many model. Traditional media have high upfront costs, but the cost of disseminating information via the Internet is negligible. Instead of asking “why publish?” we now ask “why not publish?”

The profusion of Internet media has helped to popularize the notion of “information overload,” but our problem is in fact “filter failure.” Information overload has existed since the invention of the printing press, but we generally didn’t notice it because bookstores, libraries and other institutions created systems for facilitating access to printed information. However, on the Internet, information is now like knowledge itself: loosely arranged and variably available.

Shirky asserted that in the “why not publish?” era, librarians, archivists, and others seeking to preserve digital resources no longer need to decide which social uses for a given piece of information should be privileged by cataloging. Instead, they should actively seek to incorporate user-supplied information into the descriptive metadata (i.e., the filters) that they maintain. For example, user-created Flickr tags indicate that a given Smithsonian image of a fish is of demonstrated interest to both an ichthyologist and to a crafter who placed an image of the fish on a purse. Prior to the rise of the Internet, cataloguers would give the scientist’s use of this image more weight than that of the craftsperson. However, as long as metadata is creating value for some group of people, why not allow it to be applied as broadly as possible? In other words, the question we must answer is no longer “why label it this way?” but “why not label it this way?”

The incorporation of user-supplied metadata challenges librarians and archivists who fear losing control over the ways in which information is presented to researchers. However, as Shirky pointed out, this loss of control has already happened: it’s a mistake to believe that we can control how our institutions and holdings will be discussed. All we can really do is decide how to participate in these discussions. President Obama’s 2008 campaign provides a good example of active participation: the campaign understood right away that providing a really clear vision for Obama would empower supporters to talk about Obama without the campaign’s assistance. It then made use of the best user-generated content. In order to do so, it had to accept that some people would make critical and even bigoted use of the material it made available.

Shirky also noted that digital preservation itself has to be social. The “Invisible College,” a seventeenth-century group of intellectuals who established a set of principles for investigating the natural world and sharing their research, is a good model: its expectation that the results of research would be available for review and further inquiry gave rise to modern science, and we are now starting to think of digital preservation as an endeavor requiring collaboration, sharing, and community-building.

The social dimension of preservation also extends to end users. One of the big mental shifts in the NDIIPP project has been from “light” (i.e., totally open) and “dark” (i.e., completely inaccessible) archives to “dim” archives. The more secret something is, the harder it is to preserve. In some cases, we have no choice but to bear the costs of preserving inaccessible materials. However, to the degree that we can turn up the dimmer switch, we should do so. If we allow someone to view a movie -- even in five-minute snippets -- s/he will at least be able to tell us if a snippet has gone bad. Even a little exposure lowers the costs of preservation, and lowering costs increases the possibility that something will be preserved. Moreover, if we develop simple, low-cost tools that enable end users to take an active role in preserving information that is important to them, we’ll get a clearer picture of what they find important and increase the chance that information of enduring value is preserved.

All in all, a really vivid, thought-provoking presentation; this summary doesn’t do it justice.

After a lengthy break, Katherine Skinner and Gail MacMillan of the MetaArchive Cooperative furnished a fascinating overview of the results of two digital preservation surveys, one of which focused on electronic theses and dissertations and the other on cultural heritage institutions of all kinds.

The surveys were meant to identify which institutions were collecting digital materials, what types of materials they were collecting, how these materials are stored, what barriers to preservation exist, and which preservation offerings are most desired. Respondents self-selected to take these surveys.

As Skinner and MacMillan noted, the findings paint a mixed and in places unsettling picture:
  • Most institutions are actively collecting digital materials, and survey respondents hold an average of 2 TB of data.
  • Most respondents hold many different types of file formats and genres of information.
  • Storage protocols vary widely. Some respondents are using purpose-built preservation environments (e.g., iRODS), others are relying upon access systems to preserve materials, while others have home-grown systems. Some respondents simply store materials on creator-supplied portable media.
  • The manner in which materials are organized also varies widely, and in many instances organizational schemes (or the lack thereof) pose preservation challenges.
  • Respondents are actively engaging with these ideas and have a high level of knowledge about community-based approaches to digital preservation, but they still feel that responsibility for preservation rests with their own institutions.
  • Preservation readiness is low -- most institutions aren’t even backing up files, and most also lack preservation plans and policies -- but desire is high. People want training, independent assessments of their capacity, and the ability to manage their own digital preservation solutions. People don’t want to outsource digital preservation; however, some outsourcing will be needed, particularly for smaller institutions.
  • Respondents themselves identified insufficient preservation resources as the biggest threat; inadequate policies and plans, deteriorating storage media, and technological obsolescence were also mentioned.
  • Interestingly, the preservation offerings that respondents most desired did not address the threats that they identified. Cultural heritage institutions wanted training provided by professional organizations, independent study/assessment, local courses in computer or digital technology, new staff with digital knowledge and experience, consultants, and training from vendors. Colleges and universities responsible for electronic theses and dissertations wanted a cooperative preservation framework, standards, training on best practices, model policies, conversion or migration services, preservation services provided by third-party vendors, and access services.
Skinner and MacMillan concluded that the most effective preservation strategies incorporate replication of content, geographic distribution, secure storage locations, and private networks of trusted partners. However, most respondents seem to have fallen prey to “cowpath syndrome”: they have idiosyncratic, ad hoc data storage structures that grew out of pressing needs, but these structures are increasingly difficult to expand and maintain over time, and some sort of triage will eventually become necessary. Moreover, there is a disconnect between administrators and the people who are actually responsible for hands-on preservation work: administrators want to keep things in-house and under control, but hands-on people see the value of collaboration and distributed storage.

I suspect that everyone at this meeting faces at least some of these challenges and shortcomings and that many of us are going to go home and discuss at least some of these findings with our colleagues and managers . . . .