Showing posts with label PeDALS. Show all posts

Friday, June 26, 2009

NDIIPP project partners meeting, day three

Union Station, Washington, DC, at around 3:00 PM today.

The Library of Congress (LC) National Digital Information Infrastructure and Preservation Program (NDIIPP) partners meeting wrapped up this afternoon. This morning’s presentations concerned the Unified Digital Formats Registry, the PREMIS metadata schema, the Federal Digitization Standards Working Group, and the LC’s proposed National Digital Stewardship Alliance.

Right after breakfast, Andrea Goethals (Harvard University) discussed the Unified Digital Formats Registry (UDFR) and the importance of file format registries generally. One of the main goals of digital preservation is to ensure that digital information remains useful over time, and as a result we must determine whether a given resource has become or is likely to become unusable. In order to do so, we need to answer a series of questions:
  • Which file format is used to encode the information?
  • What current technologies can render the information properly?
  • Does the format have sustainability issues (e.g., intellectual property restrictions)?
  • How does the digital preservation community view the format?
  • What alternative formats could be used for this information?
  • What software can transform the information from its existing format to an alternative format?
  • Can emulation software provide access to the existing format?
  • Is there enough documentation to write a viewing/rendering application that can access this format?
Format registries avoid the need to reinvent the wheel: they pool knowledge so that repositories can make use of each other’s research and expertise.
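
The first question on that list, which format a file is in, is often answered by matching the file's leading bytes against known signatures. Here is a minimal sketch of the idea, assuming a tiny hand-built table; the table and the function name `identify_format` are hypothetical and are not PRONOM's data or software:

```python
# Toy signature table: leading bytes -> format name. These few "magic numbers"
# are well known, but the table is illustrative and is not registry data.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]


def identify_format(header):
    """Return the format whose signature starts the given bytes, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None
```

A shared registry would replace the hard-coded table, so that each repository doesn't have to research signatures on its own.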

The UDFR was created with the April 2009 merger of the two largest format registry initiatives: PRONOM, which has been publicly accessible for some time, and the Global Digital Format Registry, which was still under development at the time of the merger. The UDFR will make use of PRONOM’s existing software and data and the GDFR’s support use cases, data model, and distributed architectural model. Moreover, it will incorporate local registry extensions for individual repositories and support distributed data input. At present, it’s governed by an interim group of academic institutions, national archives, and national libraries; a permanent governing body will be established in November 2009.

I’ve used PRONOM quite a bit, so I’m really looking forward to seeing the UDFR.

Rebecca Guenther (LC) then furnished a brief overview of the Preservation Metadata: Implementation Strategies (PREMIS) schema and recent PREMIS-related developments.

The PREMIS schema, which was completed in 2005, is meant to capture all of the information needed to make sure that digital information remains accessible, comprehensible, and intact over time. It is also meant to be practical: it is system/platform neutral, and each metadata element is rigorously defined, supported by detailed usage guidelines and recommendations, and (with very few exceptions) meant to be system-generated, not human-created.

I’m sort of fascinated by PREMIS and have drawn from it while working on the Persistent Digital Archives and Library System (PeDALS) project, but I haven’t really kept up with recent PREMIS developments. It was interesting to learn that the schema is now extensible: externally developed metadata (e.g., XML-based electronic signatures, format-specific metadata schemes, environment information, other rights schemas) can now be contained within PREMIS.

I was also happy to learn that the PREMIS development group is working on incorporating controlled vocabularies for at least some of the metadata elements and that this work will be available via the Web (id.loc.gov).

The group is also working on a variety of other things, including:
  • Draft guidelines for using PREMIS with the Metadata Encoding and Transmission Standard (METS)
  • A tool that will convert PREMIS to METS and vice versa
  • An implementers’ registry (www.loc.gov/premis/premis-registry.html)
  • Development of a tool (most likely a self-assessment checklist) that will verify PREMIS implementers are using the schema correctly
  • A tool for extracting metadata and populating PREMIS XML schemas
Guenther also shared one tidbit of information that I found really interesting: although PREMIS allows metadata to be kept at the file, representation, and bitstream level, repositories may opt to maintain only file-level or file- and representation-level metadata. I hadn’t interpreted the schema in this manner, and someone else at the meeting was similarly surprised.

A quick update on the work of the Federal Digitization Standards Working Group followed. Carl Fleischauer (LC) explained that the group, which consists of an array of federal government agencies, is assembling objectives and use cases for various types of digitization efforts (e.g., production of still image master copies). To date, the group’s work has focused largely on still images, and it has put together a specification for TIFF header information and will look at the Extensible Metadata Platform (XMP) schema. In an effort to verify that scanning equipment faithfully reproduces original materials, it is also developing device and object targets and DICE, a software application (currently in beta form).

The group is also working on a specification for digitization of recorded sound and developing audio header standards. However, it is waiting for agencies to gain more experience before it tackles video.

The meeting ended with a detailed overview of LC’s plan to establish a group that will sustain NDIIPP's momentum. The program has just achieved permanent status in the federal budget, and all of the grant projects that it funded will end next year.

In an effort to sustain the partnerships developed during the grant-driven phase of NDIIPP’s existence, LC would like to create an organization that it is tentatively calling the National Digital Stewardship Alliance. Meg Williams of LC’s Office of Counsel outlined what the organization’s mission and governance might look like; before creating the final draft charter, LC will host a series of conference calls and develop an online mechanism that will enable the partners to provide input.

LC anticipates that this alliance, which is intended to be low-cost, flexible, and inclusive, will help to sustain existing partnerships and form new ones. In order to ensure that the organization remains viable, LC envisions that the organization will consist of LC itself, members, and associates:
  • Organizations willing to commit to making sustained contributions to digital preservation research would, at the invitation of LC, become full members of the alliance and would enjoy full voting rights. Member organizations would not have to agree to undertake specific actions or projects, but they would have to commit to remaining involved in the alliance over time.
  • Individuals and organizations that cannot commit to making ongoing, sustained contributions to digital preservation research but have an abiding interest in digital preservation, support the alliance’s mission, and are willing to share their expertise would, at LC’s invitation, become associates. Associates will not have voting status.
  • LC itself will serve as the alliance’s chair and secretariat and will use program funding to support its activities; there will be no fees for members or associates. It will also maintain a clearinghouse and registry of information about content, standards, practices and procedures, tools, services, and training resources. It will also facilitate connections between members and associates who have common interests, convene stakeholders to develop shared understanding of digital stewardship principles and practices, report periodically on digital stewardship, and provide grant funding if such monies are available.
LC projects that this alliance will have several standing committees responsible for researching specific areas of interest:
  • Content: contributing significant materials to the “national collection” to be preserved and made available to current and future generations.
  • Standards and practices: developing, following, and promoting effective methods for identifying, preserving, and providing access.
  • Infrastructure: developing and maintaining curation and preservation tools, providing storage, hosting, migration and other services, building a collection of open source tools.
  • Innovation: encouraging and conducting research, periodically describing a research agenda.
  • Outreach and education: for hands-on practitioners, other stakeholders, funders, and the general public.
  • Identifying new roles: as needed.
LC also sees these committees as having a governance role: at present, it envisions that the alliance’s Governing Council will consist of the Librarian of Congress, the LC Director of New Initiatives, the chairs of all of the standing committees, and a few members at large.

Williams closed by asking everyone present to think about this proposal, consider how to define “success” and “failure” for the alliance, identify benefits of participation for their own institutions and for others, and supply feedback to LC. LC hopes to have a final draft charter finished later this year.

At this point, I think that creating some sort of formal organization makes a lot of sense but don’t have any strong ideas one way or another about the specifics of LC’s proposal. The past few days have been jam-packed. Even though I relished the opportunity to hear about what’s happening with NDIIPP and to meet face-to-face with my PeDALS project colleagues -- several people told me that the PeDALS group struck them as really hard-working and really fun, and they’re right -- I’m really feeling the need to get home (I’m writing this post on the train), get some sleep, and reflect on everything that took place over the past couple of days. I’ll keep you posted . . . .

George Washington Bridge, Hudson River, as seen from Amtrak train no. 243 at around 8:30 PM.

Thursday, June 25, 2009

NDIIPP project partners meeting, day two

The National Digital Information Infrastructure and Preservation Program (NDIIPP) partners meeting enables recipients of NDIIPP grant monies to discuss their work, seek feedback, and identify areas of common interest.

Today’s sessions began with Michael Nelson (Old Dominion University), who outlined the Synchronicity project, which will enable end users to recover data that has “vanished” from the Web. Synchronicity is a Firefox Web browser extension that catches the “File Not Found” messages that appear when a user attempts to access a page that no longer exists or no longer contains the information that the user seeks. It then searches Web search engine caches, the Internet Archive, and various research project data caches and retrieves copies of the information that the user seeks.

If these sources fail to provide the desired information, Synchronicity then generates a search engine query based upon what the missing page is “about” and attempts to find the information on the Web. These queries are based on “lexical signatures” (small sets of terms that characterize a page’s content) and page titles, and preliminary research indicates that these searches are successful about 75 percent of the time. Nelson and his colleagues are currently exploring other methods of locating “lost” content and how to handle pages whose content has changed over time.
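
The order of operations described above (try the caches and archives first, then fall back to a Web search) can be sketched roughly as follows. This is only an illustration: the function `recover`, its parameters, and the stand-in lookup functions are all hypothetical, the query is built from the page title alone, and none of it is the extension's actual code.

```python
def recover(url, title, sources, web_search):
    """Try each cache/archive lookup in order; fall back to a title-based search.

    sources    -- list of (name, lookup) pairs; lookup(url) returns a copy or None
    web_search -- function that takes a query string and returns search results
    """
    for name, lookup in sources:
        copy = lookup(url)
        if copy is not None:
            return name, copy
    # No cached or archived copy was found, so search the live Web instead.
    query = '"' + title + '"'
    return "web search", web_search(query)
```

In practice each lookup would wrap a request to a search engine cache or a Web archive; here they are just placeholders so that the control flow is visible.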

I’ve often used the Internet Archive and Google’s search caches to locate information that has vanished from the Web, and I’m really looking forward to installing the Synchronicity plug-in once it becomes available.

Michelle Kimpton of DuraSpace then discussed the DuraCloud project, which seeks to develop a trustworthy, non-proprietary cloud computing environment that will preserve digital information. In cloud computing environments, massively scalable and flexible IT-related capabilities are provided “as a service” over the Internet. They offer unprecedented flexibility and scalability, economies of scale, and ease of implementation. However, cloud computing is an emerging market, providers are motivated by profit, information about system architectures and protocols is hard to come by, and as a result cultural heritage institutions are rightfully reluctant to trust providers.

DuraCloud will enable institutions that maintain DSpace and FEDORA institutional repositories to preserve the materials in their repositories in a cloud computing environment; via a Firefox browser extension, it will also allow users to identify content that should be preserved. A Web interface will enable users to monitor their data and, possibly, run services.

DuraCloud will enable members to create and manage multiple, geographically distributed copies of their holdings, monitor their digital content and verify that it has not been inadvertently or deliberately altered, and take advantage of the cloud’s processing power when doing indexing and other heavy processing jobs. It will also provide search, aggregation, video streaming and file migration services and will enable institutions that don’t want to maintain their institutional repositories locally to maintain them within a cloud environment.
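
Verifying that content has not been altered usually comes down to fixity checking: record a message digest when a file is ingested, then recompute it for each copy later on and compare. Here is a minimal sketch, assuming SHA-256 digests and local file paths; the function names are hypothetical, and this is not DuraCloud's API:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(recorded_digest, copies):
    """Return the paths whose current digest differs from the recorded one."""
    return [path for path in copies if sha256_of(path) != recorded_digest]
```

An empty list from `audit_copies` means that every copy still matches the digest recorded at ingest; any path that is returned needs to be repaired from a good copy.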

The DuraCloud software, which is open source, will be released next month, and in a few months DuraSpace itself will conduct pilot testing with a select handful of cloud computing providers (Sun, Amazon, Rackspace, and EMC) and two cultural heritage institutions (the New York Public Library and the Biodiversity Heritage Library).

Fascinating project. We’ve known for some time that DSpace and FEDORA are really access systems, but lots of us have used them as interim preservation systems because we lack better options.

The next session was a “breakout” that consisted of simultaneous panels focusing on one or two NDIIPP projects. The Persistent Digital Archives and Library System (PeDALS) project was featured in a session that focused on digital preservation contracts and agreements. The first half of the session consisted of an overview of the contracts and agreements that support a variety of collaborative digital preservation initiatives:
  • Vicki Reich discussed the CLOCKSS Archive, which brings together libraries and publishers on equal terms and provides free public access to materials in the archive that are no longer offered for sale.
  • Julie Sweetkind-Singer detailed the provider agreements and content node agreements that govern the operations of the National Geospatial Digital Archive.
  • Myron Gutman discussed the development of Data-PASS, which grew out of previous collaborations between the project’s partners and lengthy experience preserving social science data.
  • Dwayne Buttler, an attorney who was instrumental in crafting the agreements that support the operations of the MetaArchive Cooperative, emphasized that contracts, which focus on enforceability, grow out of a lack of trust and allow for simultaneous sharing and control; in contrast, agreements articulate goals.
The second half of the session focused solely on PeDALS. Richard Pearce-Moses (Arizona State Library and Archives; principal investigator), Matt Guzzi (South Carolina Department of Archives and History), Alan Nelson (Florida State Library and Archives), Abbie Norderhaug (Wisconsin Historical Society), and Yours Truly (New York State Archives) informally discussed some of the lessons that we’ve learned as the project unfolded. Among them:
  • People involved in long-distance collaborative projects need structured, consistent activities and expectations of involvement; both are key to fostering a sense of project ownership.
  • Lack of face-to-face interaction makes it harder for people to feel engaged; conference calls and other tools can help bridge the gap, but nothing really takes the place of getting to know other people.
  • Working in smaller teams capitalizes upon our strengths -- provided that we make sure that the right mix of IT, archival, and library personnel are involved.
  • Team members must be creative, innovative, and open to learning as they go.
  • Working on this project has brought to light a number of challenges: communication and collaboration over long distances and multiple time zones, differences in organizational cultures, responsibilities, and IT infrastructures, learning to speak each other’s languages, and finding the right IT consultant.
  • We are nonetheless rowing in the same direction: we’ve learned to balance local practice with common requirements, and individual partners are beginning to incorporate PeDALS principles and standards into their current cataloging and other work.
In addition, Alan Nelson discussed how the IT personnel involved in the group have adopted the Agile Scrum process . . . and illustrated the difference between involvement and commitment.

The second “breakout” session took place after lunch, and the session I attended focused on building collaborative digital preservation partnerships:
  • Bill Pickett discussed the Web History Center’s efforts to provide online access to archival materials documenting the development of the World Wide Web and the organization’s need for partners.
  • David Minor outlined the work of the Chronopolis consortium, which is striving to build a national data grid that supports a long-term preservation (but not access) service.
  • Martin Halbert detailed the work of the MetaArchive, a functioning distributed digital preservation network and non-profit collaborative.
  • Beth Nichol discussed the Alabama Digital Preservation Network, which grew out of work with the MetaArchive and a strong history of informal statewide collaboration.
During the follow-up discussion, Martha Anderson (Library of Congress) made a really interesting point: according to an IBM study that LC commissioned, the strongest digital preservation organizations are focused on content; weaker groups are focused on tools. The study also found that tool-building works really well when there is a community interest in a tool and a central development team and that natural networks that grow out of years of other collaborative work also lead to the creation of strong organizations; however, there are other ways to build trust.

The end of the day brought all of the attendees back together. Abby Smith of NDIIPP provided an update on the work of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, which focuses on materials in which there is clear public interest and seeks to frame digital preservation and access as a sustainable economic activity (i.e., a deliberate, ongoing resource allocation over long periods of time), articulate the problems associated with digital preservation, and provide practical recommendations and guidelines.

Economic sustainability requires recognition of the benefits of preservation, incentives for decision-makers to act, well-articulated criteria for selection of materials for preservation, mechanisms to support ongoing, efficient allocation of resources, and appropriate organization and governance. As a result, the task force’s work -- an interim report released last December and a forthcoming final report -- is directed at people who make decisions about the allocation of resources, not people who are responsible for the day-to-day work of preserving digital information.

Smith wrapped up by making a series of thought-provoking points:
  • Preservation is a derived demand, and people will not pay for it. However, they will pay for the product itself. We need to think of digital information as being akin to a car: it’s something that has a long life but requires periodic maintenance.
  • Everything in the digital preservation realm is dynamic and path-dependent: content changes over time, users change over time, and uses change over time. Decisions made now close off future options.
  • Librarians and archivists are the defenders of the interests of future users, and we need to emphasize that we are accountable to future generations.
  • Fear that digital preservation and access are too big to take on is a core problem.
IMHO, Smith's last point succinctly identifies the biggest barrier to digital preservation.

Tuesday, December 9, 2008

New PeDALS Web site

The Persistent Digital Archives and Library System (PeDALS) project has a new -- and sleekly minimalist -- Web site: http://pedalspreservation.org

PeDALS is a multi-state digital preservation project funded by the National Digital Information Infrastructure and Preservation Program and the Institute of Museum and Library Services, and it aims:
First, to develop a curatorial rationale to support an automated, integrated workflow to process collections of digital publications and records. Second, to implement "digital stacks" using an inexpensive, storage network that can preserve the authenticity and integrity of the collections.

In addition to those technical goals, PeDALS seeks to build a community of shared practice so that the system meets the needs of a wide range of repositories that could then support the ongoing development of the system and promote best practices. To further that end, PeDALS hopes to remove barriers to adopting the technology by keeping costs as low as possible.
PeDALS is led by the Arizona State Library, Archives and Public Records, and the other partners are the state libraries and state archives of Florida, New York, South Carolina, and Wisconsin.
At present, the new site has information about the project's metadata schema, upon which we're putting the finishing touches, project-based papers and presentations, and lots of background information. Once all of the partners get their hardware and software up and running, I'm sure that lots of technical information about the project will be available as well.

Thursday, October 23, 2008

Desert Botanical Garden

After the third day of the PeDALS project partners meeting ended, I went with three of my colleagues -- Mark from the Florida State Library and Archives, Matt from the South Carolina Department of Archives and History, and Lynne of the New York State Library -- to the Desert Botanical Garden.

The DBG has cacti and succulents from around the world. We spent most of our time in its exhibits concerning the Sonoran Desert, in which the city of Phoenix is located.

The sun went down while we were there. Watching as the blues and purples of dusk swept over the saguaros, organ pipes, prickly pears, agaves, and mesquites was a tranquil way to end an intense and eventful day.

Afterward, we stopped for some great Thai food, talked about the differences in outlook between IT and library/archives folks, the challenges of internal and external collaboration, all of the work before us, and travel, movies, TV, and alligators. We then went back to our hotel, wished each other safe journeys, and parted ways.

PeDALS partners meeting: day three

For most of us, the meeting of the PeDALS project partners -- Arizona, Florida, New York, South Carolina and Wisconsin -- ended today at noon. We spent the morning discussing development of a pamphlet and expanded Web site publicizing the project, going over some remaining metadata issues so that our database designer can start working in earnest, and reviewing the project timeline and divvying up responsibility for tasks that need to be finished soon. We also discussed how the partners collaborate -- internally and across state lines -- and started thinking, in a very preliminary way, about what will happen when our grant funding ends.

After we wrapped up, most of the non-technical people (i.e., librarians and archivists) left. However, the IT people at this meeting reconvened this afternoon and will continue working until around noon tomorrow. My colleague Lynne from the New York State Library and I are flying out tomorrow morning, so we sat in on the technical meeting that took place this afternoon.

Until this afternoon, we were meeting at our hotel, but the IT folks traveled to the 1900 Capitol Building, which houses the Arizona Capitol Museum; until a few weeks ago, when the Polly Rosenbaum Archives and History Building opened, it was also home to the holdings and staff of the State Library and State Archives.

We met in what was once the courtroom of the Arizona Supreme Court. Miranda v. Arizona was heard in this very room, which has awesome light fixtures. However, we weren't there to discuss constitutional law or interior design: we were busy figuring out how to set up a LOCKSS box and configure Ubuntu operating system software. I was more than a bit out of my depth, but I'm looking forward to learning more.

Tuesday, October 21, 2008

PeDALS partners meeting: day two

The meeting of the five PeDALS partner states -- Arizona, Florida, New York, South Carolina, and Wisconsin -- continued today. This morning, we reviewed some of the preservation metadata elements included in our draft metadata schema. We also created a subgroup responsible for identifying content standards (e.g., AACR2) and data value standards (e.g., LCSH) that might apply to specific elements.

We then discussed some of the Web interfaces that we will need in order to facilitate the records transfer process, ingest into LOCKSS, etc. We ended up revisiting issues of workflow and identifying specific steps that should be taken as records move through the PeDALS system. Our deliberations were a bit chaotic at times, but they did force us to bring to the surface some unexamined assumptions and address some unanticipated issues. I realize that, again, I'm being really vague, but our discussions will continue after we return home and I don't want to report any erroneous information.

We then had a quick (and daunting) demonstration of BizTalk -- which was chiefly valuable, at least from my point of view, because it highlighted how it could pull out and act upon information entered into, e.g., an online library catalog -- and LOCKSS. However, none of this stuff looks impossibly complex; once we start getting our hands dirty, I think we'll be fine.

PeDALS partners meeting: day one

My Arizona vacation came about because I needed to travel to Phoenix for a Persistent Digital Archives and Library System (PeDALS) project partners meeting. PeDALS, which is funded by the Library of Congress's National Digital Information Infrastructure and Preservation Program and the Institute of Museum and Library Services, is designed to allow the state libraries and state archives of Arizona (the project lead), Florida, New York, South Carolina, and Wisconsin to:
  • Develop a highly automated workflow for acquiring, describing, and providing access to electronic government publications and records.
  • Build a storage network that inexpensively and reliably preserves the authenticity and integrity of records and publications.
I'm being purposely vague about the day's proceedings because our discussions are still in progress (and may not be resolved for some time), but I think I can say that we spent some time this morning reviewing project progress and timelines and devoted most of the day to a review of the draft metadata schema that we developed several months ago. We also engaged in some real-world exploration of two metadata extraction tools: JHOVE and the National Library of New Zealand's MetaExtractor. We got a lot of work done today, and tomorrow promises to be just as intense and fruitful.