
Monday, August 24, 2015

SAA 2015: Cleveland Digital Public Library

Main lobby, Cleveland Public Library, East 3rd Street and Superior Avenue, Cleveland, Ohio, 2015-08-22. This is what a library should look like.
The annual meeting of the Society of American Archivists ended early Saturday afternoon. I then visited the main branch of the Cleveland Public Library. I fell in love with this library as an undergraduate, and I was pleased to see that the original building, a Beaux Arts beauty, has received some much needed care and that a sparkling 21st century addition now sits immediately to the east of the library's reading garden.

I was particularly pleased to discover that earlier this year, the library launched the Cleveland Digital Public Library, which supports digitization of historically significant materials owned by the Cleveland Public Library, other cultural heritage institutions, and organizations and individuals in the Cleveland area. Cleveland was one of four large public libraries that received Library Services and Technology Act and Ohio Public Library Information Network funding that supported the purchase of high-resolution scanning equipment and storage, and Cleveland's program is unique in that it allows community members to use its scanning equipment and to add copies of the resulting image files to the library's permanent digital collections.

I know that the Cleveland Public Library isn't the first institution to create and maintain digital images of manuscript and archival materials that remain in the hands of their creators, but it may be unique in that it puts community members in charge of determining whether their materials should be added to the library's collections and enables them to create and donate copies of their materials at their convenience. Almost all of the other "scan and add" projects with which I'm familiar have sought to collect copies of materials that focus on a given event (e.g., the Civil War) and have made their scanning services available to community members for only a few hours or a few days at a time.

I imagine that, in at least a few instances, the community-created images added to the Cleveland Digital Public Library's collections will strike archivists, librarians, and other members of the community as less than preservation-worthy. However, judging from the videos embedded in the Cleveland Digital Public Library's web page, this program will help to ensure that some fascinating Cleveland lives and stories are preserved and made broadly accessible. It pleases me deeply that the Cleveland Public Library is taking a 21st century approach to collecting and facilitating access to the city's historical record.

Monday, November 14, 2011

BPE 2011: building digital repositories


Still playing catch-up re: the 2011 Best Practices Exchange (BPE). After the BPE ended, I spent a few days in Ohio with my parents, came back to Albany, prepped for and gave a presentation on salvaging and recovering data from electronic media, got sick, got well, got sick again, and got well again. Now I’m barreling through all kinds of personal and professional backlogs.

I took decent notes, but three weeks have elapsed. If you were there and your memory differs from mine, please let me know. I’ll update/correct as needed.


One of the most interesting BPE sessions I attended featured two speakers who focused on the creation of digital repositories. The first, Mitch Brodsky of the New York Philharmonic, discussed the creation and evolution of the Philharmonic's repository. At present, staff are digitizing materials from the organization's international era (1943-1970), an effort that will result in the digitization of 3,200 programs, 72 scrapbooks, 4,200 glass lantern slides (older but easy to do), 8,500 photographs, and 8,000 folders of business records. By the end of 2012, 1.3 million pages of paper records will be digitized and the repository will house 15 terabytes of data. Digitization of audiovisual materials will add another 2 TB of data to the system. However, the organization also plans to add materials created during the first 98 years of the Philharmonic's existence (1842-1940) and to incorporate late 20th and 21st century electronic records into the repository.

The project’s larger goals are equally impressive:
  • Accurate representation of originals. The Philharmonic’s archivists want the digital repository experience to match the research room experience as closely as possible. They don’t want to flatten curled records, disassemble bound volumes, or do anything else that would make the digital surrogates noticeably different from the originals. As a result, they’re using a digital camera (and the photographer who produced the digital surrogates of the Dead Sea Scrolls) to capture the originals, and many of the digital surrogates have a three-dimensional look. (Click here for an example.)
  • Comprehensiveness. Staff are sensitive to privacy concerns, but want the digital repository to be as complete as possible.
  • Easy and free accessibility. The Philharmonic expects that its digital repository Web site will be the public access mechanism for its archives.
  • A new, sharable model for digitizing large collections.
As you might expect, the repository’s technical infrastructure is pretty sophisticated -- and entirely open source:
  • ImageMagick is used to convert images delivered by the photographer into various formats and sizes (a sketch of this kind of derivative generation follows this list).
  • OpenMigrate is used to channel data into and out of Alfresco.
  • Alfresco, the open source content management system, serves as the repository’s core. (At present, the New York Philharmonic may be the only institution using it to build a repository of archival materials, so this project really bears watching.)
  • Alfresco is not yet developed enough to meet the Philharmonic's data entry standards, so staff enter metadata into homegrown databases and then ingest it into Alfresco.
  • The repository’s search functionality is handled by Solr, Apache’s search server.
  • The repository's viewer component is a JavaScript tool developed by the Internet Archive.
  • A suggested-materials component, driven by end user suggestions, ties together related materials of different types; other end user input will be handled via phpList.
  • Vanilla forums will promote end user discussions.
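Not part of Brodsky's talk, but for readers unfamiliar with the ImageMagick step mentioned above, here is a minimal sketch of the kind of derivative generation such a pipeline performs. The directory names, sizes, and quality settings are my own illustrative assumptions, not the Philharmonic's actual specifications.

```python
import subprocess
from pathlib import Path

# Illustrative only: derive a web-sized JPEG and a thumbnail from each master
# image. Paths, sizes, and quality settings are assumptions for this sketch.
MASTERS = Path("masters")          # hypothetical directory of delivered TIFFs
DERIVATIVES = Path("derivatives")
DERIVATIVES.mkdir(exist_ok=True)

for tiff in sorted(MASTERS.glob("*.tif")):
    web_copy = DERIVATIVES / (tiff.stem + "_web.jpg")
    thumbnail = DERIVATIVES / (tiff.stem + "_thumb.jpg")
    # ImageMagick's convert: shrink to fit the given box (">" = only shrink,
    # never enlarge) and set JPEG quality.
    subprocess.run(["convert", str(tiff), "-resize", "2048x2048>",
                    "-quality", "85", str(web_copy)], check=True)
    subprocess.run(["convert", str(tiff), "-resize", "300x300>",
                    "-quality", "75", str(thumbnail)], check=True)
```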
Brodsky also shared a number of lessons learned. As far as I’m concerned, anyone thinking of undertaking any sort of large systems development project should devote a substantial amount of thought to each of them:
  • You don't know what you don't know. Brodsky never expected that he would learn PHP, track bugs, or proof code. However, he's an on-the-ground project manager, and the Philharmonic had problems with its vendor.
  • Do it manually before you automate. The Philharmonic started out doing a lot of manual review and dragging and dropping. However, doing lots of hands-on work before setting up an automated system revealed where errors pop up and enabled Brodsky to figure out how to correct them. Deep and intricate understanding of every phase of your project is a must.
  • Vendors need to earn it. Do not be laid back. The vendor is there to do right by you, and it’s their job to convince you that they can be trusted. (Hear, hear! Managing vendor relationships and retaining or taking control of projects on which vendors work was a recurring BPE 2011 theme).
  • Archivists who develop systems are product developers. As Brodsky put it: “You are not the same sort of archivist you were before you went digital.” People are actively accessing your online resources from all over the world, and they expect that your system will be reliable.
John Sarnowski of the ResCarta Foundation then gave a demonstration of the ResCarta Toolkit, an open source, platform independent bundle of tools that enables institutions to create digital collections that range from the very small to the very large.

The toolkit contains a variety of useful, easy-to-use tools:
  • Metadata creation: assigns an institutional identifier, adds directory organization with aggregator/root identifiers, adds metadata to image files using forms, and writes Metadata Encoding and Transmission Standard (METS) XML files to the root directory.
  • Data conversion: converts JPEG, PDF, TIFF, or existing ResCarta data to TIFF with embedded metadata and writes a final object metadata.xml file with a checksum. Archives and libraries have the option of using preconfigured METS XML (ResCarta metadata schemas are registered METS profiles) or of applying a custom metadata template to all of the files in a given directory or tree.
  • Textual metadata editor: enables viewing and editing of OCR metadata and addition of descriptive metadata.
  • Collection manager: creates collections, manages digital objects, allows editing or augmenting object metadata, outputs METS collection level XML file, and can output Dublin Core or Open Archives Initiative_Dublin Core data from the collection-level metadata.
  • Indexer: creates a Lucene index of collection contents, indexes the collection-level metadata, indexes all textual metadata from each TIFF, and offers rebuild and optimize options.
  • Checksum verification: creates a checksum and verifies it against the original checksum (a minimal sketch of this sort of fixity check follows this list).
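The fixity step is conceptually simple, and a stand-alone sketch may be useful for readers who haven't scripted it before. The hash algorithm and chunk size below are my assumptions, not a description of ResCarta's internals.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Compute a hex digest of a file, reading it in chunks to spare memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Return True if the file still matches the digest recorded at ingest."""
    return file_digest(path, algorithm) == recorded_digest
```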
A separate ResCarta Web Application facilitates Web publishing of ResCarta digital collections. Simply download the application and drop your ResCarta data directory into the application.

Libraries and archives can also use ResCarta to create metadata before adding objects to CONTENTdm, and the ResCarta Foundation is thinking of creating a tool that will enable METS and Metadata Object Description Schema (MODS) metadata to be moved into CONTENTdm in a streamlined, easy fashion.

I haven't yet had the opportunity to play around with ResCarta -- I just bought a new computer, but haven't had the chance to get the thing hooked up to the Internet or do miscellaneous software installs -- but I was pretty intrigued and impressed. I'll report back after I get the chance to experiment with it a little bit.

I would be remiss if I did not point out that ResCarta may not be an appropriate solution for everyone: at present, only images and textual files can be added to ResCarta repositories; the ResCarta Foundation is, understandably, waiting for the emergence of good, widely accepted metadata standards for audiovisual materials. However, if you want to build a simple digital repository to house digital images and textual records, by all means check ResCarta out.

Image: Mary Todd Lincoln Home, Lexington, Kentucky, 22 October 2011. William Palmateer built this two-story brick, late Georgian house, which originally served as an inn, in 1803-1806. It was soon purchased by Robert Smith Todd, one of Lexington's most affluent men, and became a home for the growing Todd family. Mary Todd Lincoln was born in 1818 and resided in this home, which is a stone's throw away from the hotel at which the BPE was held, until she married Abraham Lincoln in 1842.

Wednesday, June 22, 2011

Archival mysteries

Yesterday, Der Spiegel and Lens, the visual and multimedia reporting blog of the New York Times, published a lengthy, unsettling, and thought-provoking post about a Nazi photo album that has recently surfaced. The album is unusual in that it depicts both Nazi leaders -- Hitler among them -- and victims of Nazi persecution. No one, including the elderly garment industry executive who received it in lieu of cash repayment of a loan, knew who created it, but it documents the travels of a Nazi Party unit responsible for planning mass rallies from Berlin to Minsk and Smolensk -- via Danzig (now Gdansk), Königsberg (now Kaliningrad), and Barysaw -- and to Munich.

The owner of the album, who has pressing health and financial problems, wants to sell it, and he hoped that pinpointing its provenance would increase its selling price. The New York Times, intrigued by the historical mystery, researched the album, digitized some of its images, and put them online in hopes that readers would be able to shed light on its origins; however, they made it plain to the owner that their findings might decrease the album's monetary value and that they would not ask any of the experts that they consulted to furnish an estimate of the album's selling price.

Lens author David Dunlap and his colleagues consulted with staff at the United States Holocaust Memorial Museum, Yad Vashem, and New York University and with professors at New York University and Columbia and learned the following:
  • As even a cursory glance at the well-composed and well-executed photographs reveals, the photographer was a skilled professional. Moreover, the photographer may have been attached to the Propagandakompanie of the Wehrmacht, and this album may have been his personal property.
  • The pictures were taken in 1941, as evidenced by images of a meeting between Hitler and Admiral Miklós Horthy, the regent of Hungary, in what was then the East Prussian city of Rastenburg (now Kętrzyn, Poland).
  • The album contains a number of images of prisoners of war, including several of prisoners who wore yellow Stars of David. Photographs of Jewish P.O.W.'s are relatively rare: in most instances, Jewish P.O.W.'s were swiftly turned over to the S.S. and executed, a fate that almost certainly befell the Jews depicted in this album.
  • One of the images of prisoners is identical to photograph No. 1907/15 held by Yad Vashem's Steven Spielberg Jewish Film Archive.
  • The photographer himself appears in several of the images, most notably those taken in Bavaria, where he wore civilian clothes. Moreover, many of the Bavarian photographs, including several in which the photographer is depicted, feature an anonymous woman.
In a textbook example of the value of crowdsourcing, the mystery of the album was solved a few hours after the New York Times and Der Spiegel published the images. Harriet Scharnberg, a German doctoral student who is researching German propaganda photographs depicting Jews, recognized the images of Jewish prisoners and identified the photographer, Franz Krieger (1914-1993), a native of Salzburg, Austria who left the S.S. to join the Propagandakompanie in 1941. Dr. Peter Kramml, the author of a book on Krieger's work, also identified Krieger and supplied confirming evidence.

Scharnberg and Kramml shed light on the circumstances that led to the album's creation and, to a lesser extent, its arrival in the United States:
  • Krieger traveled to the Eastern Front in August 1941, and the album documents his journey. He photographed the meeting between Hitler and Horthy on his way home from the front.
  • The woman in the Bavarian photos is Krieger's wife Frieda, who died in the 17 November 1944 Allied bombing raid on Salzburg; the couple's two-year-old daughter, Heidrun, also perished.
  • Shortly after Krieger returned from the Eastern Front, he left the Propagandakompanie and became a regular soldier, and when the war ended he became a businessman. He never again worked as a professional photographer.
  • The album, which might have been among the photographs that his mother apparently gave away at one point, was most likely brought to the United States by an American soldier, but its postwar chain of custody may always remain a mystery.
I strongly encourage you to check out the Lens posts (or, if you're fluent in German, the Der Spiegel post, which is available here). The images are compelling and disturbing, and the speed with which the album's creator was identified ought to be an inspiration to archivists and other people seeking to learn more about records that have a tantalizingly incomplete provenance.

We should nonetheless keep in mind the counsel of Marvin Taylor, the head of New York University's Fales Library, who noted before the images were published that the photographs were printed on two different types of paper and thus may have been the work of more than one person, that the album might have been assembled by a third party, and that the photographs might not have been in chronological order: "We think we can get so close to these people [i.e., records creators], but we can’t. They are not the same people we are. We come up with assumptions -- and the material always undermines what we think." Although it's heartening to see an archival mystery solved with such speed and accuracy, we archivists should always keep in mind that some of our mysteries resist solution and that our own assumptions and conclusions may lead us astray.

And if you believe that the digital age will be devoid of archival mystery, let me assure you that, thanks to missing and incorrect metadata, corrupted files, ill-advised migrations and conversions, murky transfers of custody, and a host of other problems, we are on the cusp of a most mysterious age. Earlier today, I was looking through a series of born-digital photographs in an effort to find exhibit-worthy images and started scrutinizing their internal timestamps, which are visible only when the images are displayed at 10 times their original size and which aren't included in the metadata that accompanied these images. I quickly realized that when sorted by file name, these images, which were taken seconds apart and run through a variety of systems before they were transferred to my repository, are actually in reverse chronological order -- something that escaped me when I initially processed these files several years ago. This isn't the first digital mystery I've encountered, and it most certainly won't be the last.

Monday, October 25, 2010

More catching up

Happy Monday. I'm not at the office today -- I worked on Saturday, so this is the second day of my "weekend" -- and I've rounded up some stuff that may interest you:
  • Vermont Public Radio's Vermont Edition interviewed Vermont State Archivist Gregory Sanford and Terry Cook and Wendy Smith of the University of Manitoba about the value of government archives, the Vermont State Archives' efforts to document the perspectives of citizens as well as the workings of state government, functional appraisal, and the new archival challenges of the digital era. You can catch this excellent episode, which aired on 18 October, here.
  • Last year, Google began working with archivists to add digitized aerial photographs of major European cities that were taken in 1945 to its popular Google Earth application, which allows users to view present-day aerial images of the entire planet. Last week, Google added additional historical photographs, including photographs of London taken in 1945, to Google Earth. As a result, users can easily see how London, Paris, Warsaw, and several other major European cities -- some of which were heavily bombed during the Second World War -- looked in 1945 and how they look today. Cool.
  • On 1 October, George Mason University hosted an Archiving Social Media conference that addressed the following topics: potential uses of archived social media content, institutional responsibility for preserving social media content, the ethics of archiving social media, capture and preservation tools, types of content that are being overlooked, and copyright issues. Notes are available on the Archiving Social Media conference Web site, Travis Kaya at Wired Campus and Kate Theimer at ArchivesNext have posted about it, there's an Archiving Social Media Zotero group, and all of the conference-related tweets (#asome) are here. (NB: the #asome hashtag apparently has multiple uses, so you'll need to zero in on tweets sent on or around 1 October.)
  • The state of Texas recently recovered an 1858 state Supreme Court document that concerned a slave-related case and that somehow fell into private hands -- and one of my own colleagues at the New York State Archives helped to make this recovery possible. She traveled, on her own time and at her own expense, to the upstate New York home of the man who held the document and calmly explained how she knew it was a Texas government record. The collector, who had reacted angrily when a police officer aggressively sought to recover the document, quickly agreed to turn over the record. There's a lesson here, folks: most collectors want to do the right thing, and civility and a willingness to explain the value of government records will often result in the return of an alienated record. Calling in law enforcement probably shouldn't be the first step.

Tuesday, September 7, 2010

Digitization can save lives



Archivists are accustomed to asserting that creating digital surrogates of paper records and analog recordings will increase access and facilitate genealogical and other types of historical research. We're not used to thinking about the ways in which digitization can save lives, but a new short film, Saving Data, Saving Lives, highlights how digitizing historical weather data can save the lives of millions of people who might otherwise perish as a result of floods, drought, and other catastrophic weather events.

The film, which is an entry in LinkTV's ViewChange Online Film Contest, focuses on the work of the International Environmental Data Rescue Organization (IEDRO), which digitizes developing nations' weather observations and makes the results freely available to scientists who can identify areas that are particularly flood-prone and predict the frequency of catastrophic weather events. Governments and individuals living within these nations can then plan accordingly.

Saving Data, Saving Lives is a little more than 5 minutes long, and I strongly encourage you to take the time needed to watch it (note: the audio sounded muffled when played through my speakers, but was okay when played through headphones). I also encourage you to register at LinkTV and affirm the importance of IEDRO's work by voting for Saving Data, Saving Lives. You may cast one vote every 24 hours until 12:00 PM Pacific Time, 15 September 2010; details are available here.

Thanks to Chris Muller for alerting me to this video!

Monday, November 2, 2009

MARAC Fall 2009, S1: Solutions to Acquiring and Accessing Electronic Records

Pavonia Arcs, by Robert Pfitzenmeier (2004), Newport, Jersey City, 29 October 2009.

Along with Ricc Ferrante (Smithsonian Institution Archives) and Mark Wolfe (M.E. Grenander Department of Special Collections and Archives, University at Albany), I had the good fortune to participate in this session, which was graciously chaired by Sharmila Bhatia (U.S. National Archives and Records Administration).

Ricc Ferrante discussed the challenges of accessioning and preserving archival e-mail created by employees of the Smithsonian Institution's semi-autonomous museums and research institutes. His experience should resonate with many government and college and university archivists. Until late 2005, the Smithsonian's component facilities used a variety of e-mail applications, and retention guidelines were implemented in 2008. As a result, the archives is both actively soliciting transfers of cohesive groups (i.e., accounts) of documented and backed-up messages at predetermined intervals and passively accepting transfers of older groupings of records in a variety of formats.

Ricc then discussed the processing of these e-mails, which is performed on PC or Mac desktop computers. Incoming transfers are backed up, analyzed and documented, converted to a preservation format, and securely stored. The Smithsonian Institution Archives uses a tool to convert accounts or groupings of messages in formats other than MBOX to the MBOX format, and the Collaborative Electronic Records Project (CERP) parser then converts the MBOX files to an XML-based preservation format. Experimenting with the MBOX conversion tool and the CERP parser has been on my to-do list for some time, so I was really glad I got the chance to hear Ricc discuss these tools.
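I haven't used these tools myself yet, but the MBOX format they consume is easy to inspect with Python's standard library, which makes it a little less mysterious. A quick sketch (the file name is hypothetical):

```python
import mailbox

# Walk a (hypothetical) MBOX file of the sort the CERP parser ingests and print
# a minimal inventory of its messages and attachments.
mbox = mailbox.mbox("transferred_account.mbox")
for i, message in enumerate(mbox):
    print(i, message.get("Date"), message.get("From"), message.get("Subject"))
    attachments = [part.get_filename() for part in message.walk()
                   if part.get_filename()]
    if attachments:
        print("    attachments:", ", ".join(attachments))
```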

Mark Wolfe discussed how the M.E. Grenander Department of Special Collections and Archives is using Google Mini, a modestly priced "plug and play" search appliance that will index up to 300,000 documents, to improve access to its student newspapers. Prior to the installation of Google Mini, a paper card file was the only access mechanism for these publications, and Google Mini has made it possible for staff to find information about people who became prominent well after they left the university (e.g., gay rights activist Harvey Milk, '51), respond quickly to reference inquiries, and enhance access to the newspapers.

Mark also highlighted the shortcomings of Google Mini's indexing of digitized materials. When assigning titles, it looks for the most prominent text on a given page, which in a newspaper may be part of an ad, not a story. Dates are another problem. When sorting search results by date, it homes in on the date the digital file was created, not the date of the scanned original. The former problem can be corrected, albeit with considerable effort, by manually changing the author, title, etc. properties of the files, which are in text-based PDF format. However, the date properties, which help to safeguard the authenticity of born-digital files, cannot easily be changed and thus inhibit date-based access to scanned archival materials. There's been a lot of talk lately about how the management of born-digital and born-again digital materials will eventually converge, but Mark's talk is a good reminder that we're not quite there yet.
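To make the title problem concrete: the fix Mark described amounts to rewriting the document information dictionary of each PDF. Below is a hedged sketch of one way to do that with the pypdf library; the file name and title are illustrative, and this is not the workflow the Grenander Department actually used. (The date problem is harder, for the authenticity reasons noted above.)

```python
from pypdf import PdfReader, PdfWriter

# Illustrative only: correct the title of one scanned newspaper issue so that a
# search appliance indexes something meaningful instead of ad copy.
reader = PdfReader("student_newspaper_1951_03_02.pdf")   # hypothetical file
print(reader.metadata)                                    # inspect existing properties

writer = PdfWriter()
writer.append(reader)                                     # copy all pages
writer.add_metadata({"/Title": "Student newspaper, 2 March 1951"})  # illustrative value

with open("student_newspaper_1951_03_02_fixed.pdf", "wb") as f:
    writer.write(f)
```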

My presentation concerned our capture of New York State government Web sites and the redaction (i.e., removal of legally restricted information from records prior to making them accessible) of electronic records converted to PDF format. In lieu of giving an exhaustive recap, I'll just offer a few words of advice to people contemplating electronic redaction. At present, there are several good tools for redacting PDF files, including the built-in tool bundled with Adobe Acrobat 8 and 9, Redax, and Redact-It. If you are using an older version of Adobe Acrobat and can't or don't want to upgrade or purchase an add-on tool, the National Security Agency has produced a document that outlines a laborious but effective redaction procedure. If you commit to electronic redaction, you need to keep abreast of the relevant legal and digital forensics literature: people are trying to figure out how to crack these tools and techniques and recover redacted information, and one of them may eventually succeed.

There are also several really bad PDF redaction techniques. Never, ever use Adobe Acrobat's Draw or Annotate tool to place black, white, etc. boxes over information you wish to redact. Another spectacularly bad idea: "redacting" a word processing document by changing the font color to white or using a shading or highlighting feature to obscure the text and then converting the document to PDF format.

Want to know why these options are so bad? Read this. And this. And this. And this. And this. And this, too (thanks to John J. @ W&L for drawing my attention to this recent blunder.)
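And if you want to test your own files, extracting whatever text a PDF still carries is a one-liner with the pdfminer.six library. The file name and search strings below are hypothetical; if supposedly redacted content shows up in the output, the redaction failed.

```python
from pdfminer.high_level import extract_text

# Pull every text string the PDF still contains, including text hidden under
# drawn boxes or rendered in white.
text = extract_text("redacted_document.pdf")
for needle in ("123-45-6789", "CONFIDENTIAL NAME"):   # illustrative search terms
    if needle in text:
        print(f"Supposedly redacted content is still present: {needle}")
```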

Saturday, October 31, 2009

MARAC Fall 2009, plenary session

The Hudson & Manhattan Railroad Powerhouse, Jersey City, 30 October 2009. Built in 1906-1908, this beautiful Romanesque Revival structure -- which by night looks as if it could serve as a set for a Frankenstein film -- once powered what is now the PATH train system. It ceased operations in 1929 and has stood vacant ever since, and its owner, the Port Authority of New York and New Jersey, recently sought to tear it down and build a parking deck on the site. Owing to the determined efforts of the Jersey City Landmarks Conservancy and other citizens, this remarkably sturdy building has been declared a National Historic Site and is on the verge of redevelopment.

MARAC got off to a roaring start yesterday morning: during the plenary session, Ellen Fleurbaay and Marc Holtman of the Amsterdam City Archives discussed the Archiefbank, the repository's on-demand scanning program, and the institutional changes required to make it work.

The archives, which holds a wide array of municipal government records and other materials documenting the history of the city, experienced substantial declines in in-person visitors during the early 21st century; at the same time, the number of visitors to its Web site increased steadily. Visitor statistics are the measure of Dutch cultural institutions' success, and the archives realized that it needed to reinvent itself in order to survive. To that end, it articulated two main goals:
  • In-person visitors will experience the look and feel of authentic archival documents and the pleasure of doing historical research.
  • Everyone should be able to access all archival collections at home at all times.
In support of the first goal, the archives changed its name and logo, developed a new facility in the city center, mounted a permanent exhibit, and offered evening events and weekend hours. It also transformed its research room into a wired "information center" in which people were encouraged to discuss their work with others; this idea intrigues me, but the security-minded archivist and the tranquility-loving researcher within me have a few doubts.

In support of the second, it radically expanded its digitization program. The archives holds more than 20 miles of records -- which would take an estimated 406 years to scan -- but quickly realized that it should first focus on its most heavily used documents.

It also developed a stunning new program that allows users to request scanning of specific records. Online researchers browse the EAD-encoded finding aids in the Archiefbank and, with a simple click of a button, request scanning of specific records. The Archiefbank then generates an order number that is used to track the order throughout the scanning process and to generate file names for the scans. Staff retrieve the records, quickly examine them for copyright and preservation issues, and do some minimal prep work (e.g., removing staples), then convey the materials to a scanning vendor. The resulting images are added to the archives' electronic repository and are then transferred to its Web site. The requester then purchases the scans s/he wants; if a researcher wants materials that have already been scanned and added to the archives' Web site, he or she can purchase them instantly. The more scans one purchases, the lower the cost per scan.

The archives aims for a turnaround time of 2-3 weeks and a total of 10,000 scans per week. A distinctive mix of circumstance and policy makes this prodigious activity possible:
  • Dutch law. There is no fee for consulting original records or viewing digital images at the archives, but charges for reproduction are allowed, so the archives can assess fees for scanning materials for online researchers -- and the archives has carefully calibrated its fees so that it breaks even.
  • Focus on legibility, not preservation-quality scanning. Instead of the high-resolution TIFFs produced for preservation/conservation purposes, on-demand scans are created as low-resolution JPEGs. This policy dramatically reduces the archives' storage costs: the cost of storing 1 TB of 300 dpi TIFFs in a digital repository with remote backup is $7,000 per year, but the cost of storing equivalent 200 dpi JPEG images is $77 (see the back-of-envelope sketch after this list).
  • Emphasis on high volume. The archives' in-house scanning facilities support preservation/conservation scanning, and on-demand scanning is outsourced. In order to reduce the amount of manual processing needed, the archives scans entire files, not individual documents; researchers pay only for those scans that they want.
  • An efficient back-office operation. The archives has developed a barcode-driven management system that enables staff to identify precisely where each group of records slated for scanning is located and which current and succeeding tasks are to be performed on each group.
  • A well-developed IT infrastructure. Although Fleurbaay and Holtman didn't emphasize this point, it's pretty evident that without robust and seamlessly integrated systems, high-volume on-demand scanning wouldn't be possible. Image ordering and purchasing functionality meshes neatly with the archives' EAD finding aids, and the archives' document viewer has a built-in filter that enables users to increase contrast -- a real help when inks have faded over time.
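The storage figures above imply an enormous size difference between the two kinds of files, and a back-of-envelope calculation shows why. The page size, bit depth, and compression ratio below are my own rough assumptions, not figures from the presentation; the exact ratio depends heavily on the JPEG settings chosen, but the master ends up one to two orders of magnitude larger than the access copy.

```python
# Rough per-page storage comparison: uncompressed 300 dpi TIFF master versus a
# 200 dpi JPEG access copy. All parameters are assumptions for illustration.
def page_bytes(dpi, width_in=8.3, height_in=11.7, bytes_per_pixel=3, compression=1.0):
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bytes_per_pixel / compression

tiff_master = page_bytes(300)                     # uncompressed 300 dpi TIFF
jpeg_access = page_bytes(200, compression=15.0)   # 200 dpi JPEG at ~15:1 compression

print(f"300 dpi TIFF : {tiff_master / 1e6:5.1f} MB per page")
print(f"200 dpi JPEG : {jpeg_access / 1e6:5.2f} MB per page")
print(f"ratio        : {tiff_master / jpeg_access:4.0f} : 1")
```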
Everyone present was wowed by the Amsterdam City Archives' efforts, which by every measure are a rousing success: visits to the repository have increased five-fold, 15,000 registered online users have requested scans, and after two years of high-volume scanning more than 7 million images are available online.

I have the feeling that just about everyone who attended this presentation is going to devote a lot of time to thinking about how their repositories can emulate the example set by the Amsterdam City Archives. Most of us probably won't be able to establish programs as sophisticated or as large as that of the Amsterdam City Archives -- because we lack the needed IT infrastructure, hold tons of copyrighted or restricted materials, or work in government archives that are legally barred from charging for online access -- but many of us will likely reassess some of our digitization practices and priorities. And that's a good thing.

Friday, October 23, 2009

Archives and Web 2.0

Here's a tasty tidbit: Library Journal has a great overview of a recent Association of Research Libraries (ARL) and Coalition for Networked Information (CNI) session on the pluses and minuses of adding digitized archival material to Wikipedia and other popular Web 2.0 sites.

Friday, August 14, 2009

SAA 2009: Seeing the Forest: Environmental Sustainability and Archives (Session 406)

Texas Lottery drawing, as seen through the grate on the front door of the Texas Lottery Building, 611 East 6th Street, Austin, Texas, at about 10:30 PM on 13 August 2009. Refusing to address the varied sustainability issues that impinge upon our lives and our work is sort of like relying on the lottery to fund our retirements.

I went to this session largely because of my nagging concerns about the environmental impact of electronic records work -- and because I wanted to meet Terry Baxter, with whom I’ve collaborated via e-mail and who is a Facebook friend. However, all of the panelists were full of good ideas about how to reduce the environmental impact of our work.

Heather Soyka of Texas Tech University focused on archival facilities. Buildings consume 66 percent of all of the energy used in the United States, 80 percent of that energy is derived from coal, and 10-20 percent of it is wasted. Archivists can reduce their facilities’ energy demands by making intelligent design choices (e.g., separating collections storage and work spaces, installing shared printers, situating staff offices on the sunny side of the building), collecting environmental data, and understanding the basics of their HVAC systems and establishing good relationships with maintenance staff. At the same time, they should avoid becoming overly fixated on climate targets; the increasing precision of environmental data-gathering tools can result in an energy-intensive fixation on minor variations in temperature and relative humidity. Proactively “greening” one’s facilities and implementing other sustainability initiatives (e.g., offering campus or community shredding days and recycling the shreds) demonstrates one’s seriousness, commitment to saving money, and interest in one’s community.

Kristen Yarmey-Tylutki of the University of Scranton (presentation here) highlighted the role of people in promoting sustainability. Conservation psychologists have determined that the chief reason people don't engage in sustainable behaviors is that they perceive the disincentives to be greater than the incentives. Moreover, simply providing information isn't enough to propel behavior changes. Individuals need to make an active commitment to sustainable behavior, and institutions and individuals need to establish new social norms (e.g., reserving the best parking spaces for carpoolers, retooling Web sites to highlight public transportation and bicycling and to downplay driving), provide feedback, and furnish visual and auditory prompts (e.g., prominently placed recycling receptacles). Archivists can also question vendors about their sustainability practices, work with local and state government to green their archives, cultivate private donors receptive to "green" initiatives, and promote sustainability via their professional organizations and online social networks.

Terry Baxter of the Multnomah County (Oregon) Records Program ended the session with a presentation that first identified the core characteristics of sustainable systems (eliminating our contribution to the progressive buildup of substances extracted from the earth, the progressive buildup of human-made chemicals and compounds, the progressive physical deterioration of nature and natural processes, and conditions that undermine people’s ability to meet their basic needs) and then detailed the environmental, technical, and recordkeeping dimensions of electronic records sustainability.

Electronic storage media and hardware have short lifespans and require energy, petroleum, heavy metals, and toxic compounds, and some of our digital preservation solutions (e.g., LOCKSS, which Terry memorably described as Lots of Copies Keeps Servers Stuffed) are highly resource intensive. Although digitizing paper materials might conserve energy by enabling users to conduct research at home, not everyone wants to be online, and we need to determine at what point the economic and environmental costs of electronic storage and migration outweigh the benefits of keeping information in digital form. In some instances, opting to retain information on paper might be the more sustainable choice. Moreover, we must take sustainability issues into account when performing conversions and migrations: each action we perform is likely to produce something that needs to be disposed of, and we can't defer responsibility for assessing the impact of our choices upon the world.

During the discussion section, the issue of how to talk to IT people about sustainability issues arose, and Terry noted that he emphasized cost savings: good electronic records management frees up large quantities of server space, thus eliminating the need to purchase more storage. Kristen Yarmey-Tylutki noted that it's important to learn at least a little about green computing options and to discuss these options with vendors and IT people.

All in all, a thought-provoking session that I suspect will remain on my mental back burner for some time. However, I can't shake the nagging thought that flying to and from Austin is, in all likelihood, the least green thing I'll do all year . . . .

Monday, July 13, 2009

Catching up: mid-summer edition

Sorry for the light blogging as of late. Between work, helping to plan the 2009 Best Practices Exchange, and getting some stuff done around the house, I've been spending more time than usual away from the computer (which isn't always a bad thing!)

Here are a few things that have caught my eye recently:
  • The New York Times' City Room blog posted a fascinating piece on federal Standard Form 152, which federal agencies use when creating, modifying, or doing away with "standard" or "optional" federal forms (and, yes, there are other types of federal forms not covered by Standard Form 152). Archivists and records managers should ponder the fact that the electronic era has witnessed an increase, not a decrease, in the number of forms in use.
  • Finally, on a lighter note, the City of Vancouver Archives and the Florida State Library and Archives have recently placed digitized moving images on the Web. If you're interested in seeing how Vancouverites amused themselves in the 1920s, watching World War I flying aces soar over British Columbia, watching a young Jim Morrison (yes, the Jim Morrison) learn about Florida's public university system, or observing the on-the-job training of Weeki Wachee mermaids, you're in luck! Seriously, these clips are enough to make any archivist question why s/he opted against specializing in moving image materials.

Friday, June 26, 2009

NDIIPP project partners meeting, day three

Union Station, Washington, DC, at around 3:00 PM today.

The Library of Congress (LC) National Digital Information Infrastructure and Preservation Program (NDIIPP) partners meeting wrapped up this afternoon. This morning's presentations concerned the Unified Digital Format Registry, the PREMIS metadata schema, the Federal Digitization Standards Working Group, and the LC's proposed National Digital Stewardship Alliance.

Right after breakfast, Andrea Goethals (Harvard University) discussed the Unified Digital Format Registry (UDFR) and the importance of file format registries generally. One of the main goals of digital preservation is to ensure that digital information remains useful over time, and as a result we must determine whether a given resource has become or is likely to become unusable. In order to do so, we need to answer a series of questions:
  • Which file format is used to encode the information?
  • What current technologies can render the information properly?
  • Does the format have sustainability issues (e.g., intellectual property restrictions)?
  • How does the digital preservation community view the format?
  • What alternative formats could be used for this information?
  • What software can transform the information from its existing format to an alternative format?
  • Can emulation software provide access to the existing format?
  • Is there enough documentation to write a viewing/rendering application that can access this format?
Format registries avoid the need to reinvent the wheel: they pool knowledge so that repositories can make use of each other’s research and expertise.
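Much of what registries such as PRONOM record boils down to "magic number" signatures and the sustainability facts attached to them. The toy sketch below shows signature-based identification with a tiny, hand-picked signature table; real tools (DROID, the Unix file command) consult far richer registries that also track versions and container relationships.

```python
# Toy format identification by magic-number signature. Illustration only; a
# real registry records many more formats, versions, and internal signatures.
SIGNATURES = {
    b"%PDF":              "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff":      "JPEG",
    b"II*\x00":           "TIFF (little-endian)",
    b"MM\x00*":           "TIFF (big-endian)",
}

def identify(path):
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown -- consult a format registry"

print(identify("mystery_file.bin"))   # hypothetical file
```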

The UDFR was created with the April 2009 merger of the two largest format registry initiatives: PRONOM, which has been publicly accessible for some time, and the Global Digital Format Registry (GDFR), which was still under development at the time of the merger. The UDFR will make use of PRONOM's existing software and data and the GDFR's use cases, data model, and distributed architectural model. Moreover, it will incorporate local registry extensions for individual repositories and support distributed data input. At present, it's governed by an interim group of academic institutions, national archives, and national libraries; a permanent governing body will be established in November 2009.

I've used PRONOM quite a bit, so I'm really looking forward to seeing the UDFR.

Rebecca Guenther (LC) then furnished a brief overview of the Preservation Metadata: Implementation Strategies (PREMIS) schema and recent PREMIS-related developments.

The PREMIS schema, which was completed in 2005, is meant to capture all of the information needed to make sure that digital information remains accessible, comprehensible, and intact over time. It is also meant to be practical: it is system/platform neutral, and each metadata element is rigorously defined, supported by detailed usage guidelines and recommendations, and (with very few exceptions) meant to be system-generated, not human-created.
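To give a flavor of what file-level PREMIS metadata records, here is a minimal, hand-rolled sketch built with Python's ElementTree. The element names follow the PREMIS data dictionary (object identifier, fixity, size, format), but I have simplified freely and have not validated the output; consult the published schema before treating anything like this as a real PREMIS instance.

```python
import xml.etree.ElementTree as ET

# A simplified, file-level preservation metadata record in the spirit of PREMIS.
# Not schema-valid; element names and the namespace should be checked against
# the published PREMIS XSD before real use.
PREMIS = "http://www.loc.gov/premis/v3"   # assumed PREMIS 3 namespace
ET.register_namespace("premis", PREMIS)

obj = ET.Element(f"{{{PREMIS}}}object")
ident = ET.SubElement(obj, f"{{{PREMIS}}}objectIdentifier")
ET.SubElement(ident, f"{{{PREMIS}}}objectIdentifierType").text = "local"
ET.SubElement(ident, f"{{{PREMIS}}}objectIdentifierValue").text = "acc2009-001/file0001.tif"

chars = ET.SubElement(obj, f"{{{PREMIS}}}objectCharacteristics")
fixity = ET.SubElement(chars, f"{{{PREMIS}}}fixity")
ET.SubElement(fixity, f"{{{PREMIS}}}messageDigestAlgorithm").text = "SHA-256"
ET.SubElement(fixity, f"{{{PREMIS}}}messageDigest").text = "9f86d081884c7d65..."  # truncated example
ET.SubElement(chars, f"{{{PREMIS}}}size").text = "26214400"
fmt = ET.SubElement(chars, f"{{{PREMIS}}}format")
desig = ET.SubElement(fmt, f"{{{PREMIS}}}formatDesignation")
ET.SubElement(desig, f"{{{PREMIS}}}formatName").text = "TIFF"

print(ET.tostring(obj, encoding="unicode"))
```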

I’m sort of fascinated by PREMIS and have drawn from it while working on the Persistent Digital Archives and Library System (PeDALS) project, but I haven’t really kept up with recent PREMIS developments. It was interesting to learn that the schema is now extensible: externally developed metadata (e.g., XML-based electronic signatures, format-specific metadata schemes, environment information, other rights schemas) can now be contained within PREMIS.

I was also happy to learn that the PREMIS development group is also working on incorporating controlled vocabularies for at least some of the metadata elements and that this work will be made available via the Web at id.loc.gov.

The group is also working on a variety of other things, including:
  • Draft guidelines for using PREMIS with the Metadata Encoding and Transmission Standard (METS)
  • A tool that will convert PREMIS to METS and vice versa
  • An implementers registry (www.loc.gov/premis/premis-registry.html)
  • Development of a tool (most likely a self-assessment checklist) that will verify that PREMIS implementers are using the schema correctly
  • A tool for extracting metadata and populating PREMIS XML schemas
Guenther also shared one tidbit of information that I found really interesting: although PREMIS allows metadata to be kept at the file, representation, and bitstream level, repositories may opt to maintain only file-level or file- and representation-level metadata. I hadn’t interpreted the schema in this manner, and someone else at the meeting was similarly surprised.

A quick update on the work of the Federal Digitization Standards Working Group followed. Carl Fleischhauer (LC) explained that the group, which consists of an array of federal government agencies, is assembling objectives and use cases for various types of digitization efforts (e.g., production of still image master copies). To date, the group's work has focused largely on still images: it has put together a specification for TIFF header information and will look at the Extensible Metadata Platform (XMP) schema. In an effort to verify that scanning equipment faithfully reproduces original materials, it is also developing device and object targets and DICE, a software application (currently in beta form).
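A header specification of this sort ultimately comes down to which TIFF tags a master file must carry and with what values. A quick, hedged way to see what a given scanner actually writes is to dump the tags with the Pillow library (the file name is hypothetical):

```python
from PIL import Image
from PIL.TiffTags import TAGS

# Dump the TIFF header tags of a (hypothetical) scanned master so they can be
# compared against whatever header specification your program adopts.
with Image.open("scan_0001.tif") as img:
    for tag_id, value in img.tag_v2.items():
        print(f"{TAGS.get(tag_id, tag_id)}: {value!r}")
```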

The group is also working on a specification for digitization of recorded sound and developing audio header standards. However, it is waiting for agencies to gain more experience before it tackles video.

The meeting ended with a detailed overview of LC’s plan to establish a group that will sustain NDIIPP's momentum. The program has just achieved permanent status in the federal budget, and all of the grant projects that it funded will end next year.

In an effort to sustain the partnerships developed during the grant-driven phase of NDIIPP’s existence, LC would like to create an organization that it is tentatively calling the National Digital Stewardship Alliance. Meg Williams of LC’s Office of Counsel outlined what the organization’s mission and governance might look like; before creating the final draft charter, LC will host a series of conference calls and develop an online mechanism that will enable the partners to provide input.

LC anticipates that this alliance, which is intended to be low-cost, flexible, and inclusive, will help to sustain existing partnerships and form new ones. In order to ensure that the organization remains viable, LC envisions that the organization will consist of LC itself, members, and associates:
  • Organizations willing to commit to making sustained contributions to digital preservation research would, at the invitation of LC, become full members of the alliance and would enjoy full voting rights. Member organizations would not have to agree to undertake specific actions or projects, but they would have to commit to remaining involved in the alliance over time.
  • Individuals and organizations that cannot commit to making ongoing, sustained contributions to digital preservation research but have an abiding interest in digital preservation, support the alliance’s mission, and are willing to share their expertise would, at LC’s invitation, become associates. Associates will not have voting status.
  • LC itself will serve as the alliance’s chair and secretariat and will use program funding to support its activities; there will be no fees for members or associates. It will also maintain a clearinghouse and registry of information about content, standards, practices and procedures, tools, services, and training resources. It will also facilitate connections between members and associates who have common interests, convene stakeholders to develop shared understanding of digital stewardship principles and practices, report periodically on digital stewardship, and provide grant funding if such monies are available.
LC projects that the alliance will have several standing committees responsible for researching specific areas of interest:
  • Content: contributing significant materials to the “national collection” to be preserved and made available to current and future generations.
  • Standards and practices: developing, following, and promoting effective methods for identifying, preserving, and providing access.
  • Infrastructure: developing and maintaining curation and preservation tools, providing storage, hosting, migration, and other services, and building a collection of open source tools.
  • Innovation: encouraging and conducting research, periodically describing a research agenda.
  • Outreach and education: for hands-on practitioners, other stakeholders, funders, and the general public.
  • Identifying new roles: as needed.
LC also sees these committees as having a governance role: at present, it envisions that the alliance’s Governing Council will consist of the Librarian of Congress, the LC Director of New Initiatives, the chairs of all of the standing committees, and a few members at large.

Williams closed by asking everyone present to think about this proposal, consider how to define "success" and "failure" for the alliance, identify benefits of participation for their own institutions and for others, and supply feedback to LC. LC hopes to have a final draft charter finished later this year.

At this point, I think that creating some sort of formal organization makes a lot of sense but don't have any strong ideas one way or another about the specifics of LC's proposal. The past few days have been jam-packed. Even though I relished the opportunity to hear about what's happening with NDIIPP and to meet face-to-face with my PeDALS project colleagues -- several people told me that the PeDALS group struck them as really hard-working and really fun, and they're right -- I'm really feeling the need to get home (I'm writing this post on the train), get some sleep, and reflect on everything that took place over the past couple of days. I'll keep you posted . . . .

George Washington Bridge, Hudson River, as seen from Amtrak train no. 243 at around 8:30 PM.

Saturday, June 6, 2009

New York Archives Conference, day two

The 2009 New York Archives Conference wrapped up yesterday afternoon, and everyone in attendance seemed to have a great time.

The first morning session I attended, “Exploring the Possibilities of Web 2.0 for Cultural Heritage Websites,” gave attendees an introduction to the world of Web 2.0 and some of the ways in which archivists could make use of it.

Greg Bobish (University at Albany, SUNY) provided an overview of some of Web 2.0's core concepts and then noted the characteristics that Web 2.0 technologies such as blogs, wikis, and social networking sites share: they are available online from almost any computer (or other device), require minimal technical skills, and encourage participation and the creation and editing of content. Bobish's presentation, which is a great introduction to Web 2.0 principles, is available online.

Nancy Cannon and Kay Benjamin (both from the SUNY College at Oneonta) then outlined how Web 2.0 technology could be used to make primary source materials freely available to students, teachers, and researchers. They obtained permission from the Delaware County Historical Association to reproduce materials that shed light on life in the county prior to the Civil War, and Cannon drafted historical essays that placed the primary source materials in context. Cannon and Benjamin then used basic HTML coding to create their site, Voice of the People: Daily Life in the Antebellum Rural Delaware County New York Area.

Cannon and Benjamin used Google Maps to add interactivity to sections of the site documenting an 1851 sea voyage from New York to California and a Delhi family's 1823 journey through upstate New York. Benjamin then gave a practical demonstration of how to set up a Google Maps account and then combine maps with text, images, and multimedia materials. As she noted, Google Maps can be of great use to archivists and librarians who want to create interactive online content on a shoestring.

I next attended “Digitizing Audio and Video Materials.” My colleague Monica Gray opened the session by explaining how the New York State Archives used a one-time allocation of $25,000 to outsource the digitization of 53 motion picture films, 98 video recordings, and 34 audio recordings.

In preparation for digitization, Gray conducted an inventory of holdings, did a lot of background research into digitization standards and best practices, and worked with colleagues and vendors to select materials that were of interest to researchers or in formats on the verge of obsolescence. She stressed that archivists need to specify exactly what they want from their vendors, determine in advance whether to add title frames, etc., and anticipate the need to provide access to the resulting files.

As a result of this project, the State Archives now manages preservation master copies (.wav format, 44.1 kHz, 16 bit) and access copies (.mp3 format) of audio recordings, and preservation master copies (.avi format) and access copies (.wmp format) of moving image materials. It is now focusing on providing access to its use copies.
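Producing .mp3 access copies from .wav masters is the sort of step that is easy to script once the masters exist. Here is a hedged sketch that shells out to ffmpeg from Python; the directory names and bitrate are my assumptions, not the State Archives' specifications.

```python
import subprocess
from pathlib import Path

MASTERS = Path("preservation_masters")   # hypothetical directory of .wav masters
ACCESS = Path("access_copies")
ACCESS.mkdir(exist_ok=True)

for wav in sorted(MASTERS.glob("*.wav")):
    mp3 = ACCESS / (wav.stem + ".mp3")
    # Encode an MP3 access copy at a fixed 192 kbps bitrate.
    subprocess.run(
        ["ffmpeg", "-i", str(wav), "-codec:a", "libmp3lame", "-b:a", "192k", str(mp3)],
        check=True,
    )
```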

Gray also outlined some easy preservation measures that all archivists can undertake:
  • Store media vertically, not horizontally.
  • Rewind all recordings to the start.
  • Remove all record tabs from video and audio cassettes.
  • Remove papers from film canisters (dust is the great enemy of tape and film).
  • Use film strips that measure the extent of vinegar syndrome in motion picture film.
Andrea Buchner (Gruss Lipper Digital Laboratory, Center for Jewish History) then discussed the results of her repository's year-long, grant-funded pilot digitization project. Staff digitized 94 oral histories on 142 audio cassettes and 79 hours' worth of recordings on 193 reel-to-reel tapes; they also produced transcripts of 23 oral history interviews. Each hour of preservation master recordings comprises 1 GB of data, and Buchner determined that it cost $80 to produce, catalog, and store one hour of digital audio data.

Library staff created preservation master files of each recording (PCM .wav format, 2-channel stereo, 48.1 kHz, 24 bit). Derivative access copies were produced in .mp3 format. They also created a MARC21 catalog record for each recording and incorporated data captured during the digitization process into each record.
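The 1 GB-per-hour figure is easy to sanity-check from the master specification. I'm assuming a 48 kHz sample rate here (the standard rate closest to the figure in my notes); the arithmetic is just sample rate times bytes per sample times channels.

```python
# Uncompressed PCM data rate for stereo, 24-bit, 48 kHz preservation masters.
sample_rate = 48_000      # samples per second (assumed)
bit_depth = 24            # bits per sample
channels = 2              # stereo

bytes_per_second = sample_rate * (bit_depth // 8) * channels
gb_per_hour = bytes_per_second * 3600 / 1e9
print(f"{gb_per_hour:.2f} GB per hour")   # roughly 1 GB/hour, as Buchner reported
```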

Buchner noted that the digitization process itself was easy compared to other challenges that staff encountered:
  • Unreliable metadata: people hadn’t listened to these tapes in decades, and existing catalog records weren’t always accurate.
  • Copyright: in some instances, staff had to make use of the “library exception” in U.S. copyright law; i.e., they made a limited number of copies and must restrict access to onsite users, include a copyright notice, and inform users that they should not exceed the fair use provision of U.S. copyright law.

Melinda Dermody (Belfer Audio Laboratory and Archive, Syracuse University) then outlined how her repository digitized some of its approximately 22,000 cylinder recordings, 12,000 of which are unique titles. The Belfer Audio Archive received a $25,000 grant for this ongoing three-year project; a gift that made possible the purchase of a new digital soundboard has made it much easier for staff to work on this project.

The project's core team includes Dermody, a music librarian, the core metadata librarian, the digital initiatives librarian, and the Belfer's sound engineer. The group's goal was to make available online 6,000 audio files (300 are currently available) and to create preservation master (.wav format, 44.1 kHz, 24 bit) and access (.mp3 format) copies of each recording.

The group determined which cylinders had already been digitized by another university, identified cylinders in fragile condition, and assessed the interests of music faculty and researchers. The digitization of selected recordings is being done by Belfer Audio Archive staff, and staff have created or revised a MARC record for each recording. They use a MARC-to-Dublin Core crosswalk to populate the metadata fields of CONTENTdm, which is being used to provide access to the use copies of the recordings.
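A MARC-to-Dublin Core crosswalk is, at bottom, a field mapping. The sketch below illustrates the idea with a handful of common mappings drawn from the Library of Congress crosswalk, drastically simplified; the Belfer's actual mapping and tooling will differ.

```python
# Drastically simplified MARC-to-Dublin Core crosswalk. Real crosswalks handle
# indicators, subfields, and repeatability; this is illustration only.
MARC_TO_DC = {
    "245": "dc:title",
    "100": "dc:creator",
    "260": "dc:publisher",     # imprint; subfield $c carries the date
    "520": "dc:description",
    "650": "dc:subject",
}

def crosswalk(marc_fields):
    """marc_fields: iterable of (tag, value) pairs; returns a dict of DC elements."""
    record = {}
    for tag, value in marc_fields:
        dc_element = MARC_TO_DC.get(tag)
        if dc_element:
            record.setdefault(dc_element, []).append(value)
    return record

print(crosswalk([("245", "Over there [sound recording]"),
                 ("650", "Songs, English")]))
```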

After the second session ended, all of the attendees convened for lunch and a great talk by Syracuse University Archivist Ed Galvin, who outlined how the Syracuse University Archives was drawn into the production of The Express (2008), a film about the life of alumnus Ernie Davis, the first African-American winner of the Heisman Trophy.

Preparations for the filming of The Express brought Universal's production designers and other Hollywood personnel to the SU campus, and Galvin and his staff spent the next 18 months responding to their requests. The filmmakers were intent upon reconstructing Davis's life on campus as faithfully as they could, and they developed a wide-ranging and sometimes surprising list of items they sought and questions they wished to have answered. Galvin and his colleagues supplied detailed information about uniforms and other aspects of campus life, and they gave production staff access to yearbooks, copies of the student newspaper and football programs, other campus publications and memorabilia, and images of the coach's office and other SU facilities.

The SU Archives also led licensing negotiations with Universal on behalf of the university as a whole; however, much of the SU material in the film came from departments other than the archives.

Completion of the film, most of which was shot in Chicago, brought additional challenges. The film’s world premiere was held in Syracuse, prompting SU’s marketing unit and development office and a California film marketing firm to request additional materials from the SU Archives. Three days after the film’s premiere, Universal asked the archives to locate footage that could be used to produce a bonus featurette for the film’s DVD release. The archives also received requests for materials from alumni, politicians, History Day students, and other interested individuals.

Galvin made it plain that he and his staff often enjoyed working on this project, but also emphasized that archives approached by film studios should draw up detailed contracts and specify fees before any work begins; SU received only $4,000-$5,000 -- which did not even cover reproduction costs -- for 18 months of intense work on The Express.

NYAC conferences typically don't have overarching themes, but it struck me on the way home that just about every presentation I heard at this year's meeting centered upon clearly articulating one's expectations -- about security measures, vendor deliverables, project specifications and outcomes -- and documenting whether or not they have been met. We as a profession haven't always excelled at doing so, and it was really heartening to hear so many colleagues assert the need for this sort of activity.

Tuesday, April 28, 2009

MARAC: Flickr: An Image is Worth a Thousand Views

Flickr is an online photo sharing site that enables users to "tag" (i.e., provide descriptive and other information about) their images. In this great session, archivists working in a variety of settings highlighted its practical value to archives.

Barbara Natonson discussed a pilot project undertaken by the Library of Congress (LC), which wanted to learn how social tagging could help cultural institutions and participate in an online community. LC chose Flickr because of its popularity and because its application programming interface (API) facilitated batch loading of photos. LC’s experience should be of interest to many larger repositories.

LC determined at the outset that every image it placed on Flickr would be available via its own site and that it would post only those images that lacked known copyright restrictions. It then did some custom programming that made batch loading practical and made its copyright statement (developed in consultation with the U.S. Copyright Office) appear whenever one of its photos was displayed. It also purchased a Flickr Pro account ($24/year) that allowed it to add large numbers of images and view access statistics.
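I don't know what LC's custom code actually looks like, but batch loading via the Flickr API is straightforward in principle. The sketch below assumes the third-party Python flickrapi package; the API keys, file list, titles, and rights note are placeholders, and LC's real programs were surely more elaborate.

    # Minimal batch-upload sketch using the third-party flickrapi package.
    # Keys, paths, titles, and the rights note are placeholders, not LC's actual code.
    import flickrapi

    API_KEY = "your-api-key"
    API_SECRET = "your-api-secret"
    RIGHTS_NOTE = "No known copyright restrictions."   # stand-in for a repository's copyright statement

    flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET)
    flickr.authenticate_via_browser(perms="write")      # one-time OAuth authorization

    photos = [("scans/3a12345u.jpg", "Bain News Service photograph")]   # hypothetical batch
    for path, title in photos:
        flickr.upload(filename=path,
                      title=title,
                      description=RIGHTS_NOTE,
                      is_public=1)

The point of the sketch is simply that once images and their captions are sitting in a list or database, pushing an entire batch to Flickr is a short loop rather than a manual, image-by-image chore.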

LC's first photos went online in early 2008, and LC adds new photos on a weekly basis. As of mid-March 2009, LC's Flickr images had received roughly 15 million views. Most of the traffic comes from Flickr itself, but some of it arrives via search engines, which index user comments.

To date, approximately 4,500 users have commented on at least one LC image. However, 40 percent of the tags are supplied by a small group of people, and most of the comments on images that are already accompanied by good descriptive information simply repeat that information or record emotional/aesthetic responses. Images that lack such information tend to attract the informative tags and comments that LC seeks.

A core group of approximately 20 "power commenters" corrects place names, supplies additional descriptive information, does historical detective work, and incorporates LC images into Wikipedia entries and other resources. These commenters have also highlighted how places have changed over time; photos documenting changes and links to Google Earth accompany some of these discussions.

LC actively monitors its Flickr photosets for uncivil discourse, and staff incorporate user-supplied information into LC’s descriptive resources and periodically update Flickr users on LC’s work; this work takes about 15-20 hours per week, and staff rotate responsibility for it. LC has also started incorporating links to Flickr versions of its images into its online catalog.

Natonson noted that there are some risks to Flickr (and, by extension, other Web 2.0 technologies):
  • Disrespect for collections -- Flickr privileges the individual image
  • Loss of meaning/contextual information -- LC links Flickr images to its descriptive information in an effort to remedy this
  • Reduced revenue from photo sales
  • Undigitized collections are by definition excluded
However, there are also substantial benefits:
  • Collections are made more widely available
  • LC gets additional information about its collections
  • The visibility of specific photos is increased
  • LC’s Flickr presence helps win support for cultural heritage institutions
  • Users can mix past and present -- thus leading to a more informed world
Natonson also discussed The Commons, which Flickr developed specifically for cultural heritage institutions wishing to provide access to images lacking known copyright restrictions and which attempts to address the fact that Flickr's standard terms of service are geared toward individual users. At present, 24 institutions are members of The Commons.

The other presenters highlighted how smaller repositories could make use of Flickr. Judy Silva discussed how the Slippery Rock University Archives, which uses CONTENTdm to manage its digital collections, has used Flickr to reach out to new audiences and experiment with Web 2.0 technology. Slippery Rock's Flickr project, which made use of the university library's existing Flickr account, centered on 41 digitized photographs taken by an alumnus during his military service in the Second World War.

It took Silva one afternoon to load the images into Flickr and do some very basic (i.e., non-LC) tagging, and the rewards have been substantial: to date, Slippery Rock has gotten over 700 comments on these photographs, and one commenter forwarded the obituary of a person depicted in one of the images.

Owing to the success of this project, Silva is thinking of adding more recent images in an effort to get information from people who might Google themselves or their friends.

Malinda Triller was not able to come to Charleston, so her colleague Jim Gerenscer discussed how Dickinson College's Archives and Special Collections department, which also uses CONTENTdm, is using Flickr to publicize and obtain more information about its photographic holdings.

By design, the archives’ Flickr project was simple enough to be completed largely by undergraduates. The archivists identified images that lacked copyright restrictions, had appeal outside of the Dickinson community, and had basic contextual metadata, and students scanned the images and added them to Flickr.

Unlike LC and many other repositories, which create high-resolution master images in TIFF format and mount lower-resolution JPEG derivatives on Flickr and their own Web sites, Dickinson didn’t want to manage TIFF files. Students thus scanned the images in JPG format, at 100 dpi, and in grayscale or color as appropriate; in the future, the archives will rescan the images as needed. Project work is documented in a basic spreadsheet that contains the unique identifier, description (collection-derived or student-supplied), and title of each image.
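A scan log of the sort Dickinson describes needs nothing fancier than a spreadsheet, but for what it's worth, here is a minimal sketch of the same record-keeping done programmatically; the column names and sample values are hypothetical, not Dickinson's actual spreadsheet.

    # A minimal scan log written out as a CSV file; columns and values are hypothetical.
    import csv

    rows = [
        {"identifier": "ww2-001", "title": "Camp barracks, 1943", "description": "student-supplied caption"},
        {"identifier": "ww2-002", "title": "Unit group portrait", "description": "from collection inventory"},
    ]

    with open("scan_log.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["identifier", "title", "description"])
        writer.writeheader()
        writer.writerows(rows)

The virtue of keeping even this much documentation is that the archives can rescan or re-describe images later without losing track of which file corresponds to which original.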

To date, Dickinson's Flickr photosets -- which consist of images of an 1895 family trip to Europe, the 1893 Columbian Exposition, a school for Native American children, and the construction of a major road in Alaska -- have received 66,000 hits, a remarkable amount of exposure for a college archives. However, the archives recently learned that its Flickr account settings had greatly limited the number of people who could comment upon the images; it corrected this error a short time ago. The archives is really pleased with the project and is planning to add another set of images to Flickr.

I think that a lot of archivists are hesitant to embrace Flickr and other interactive Web technologies because they either don't grasp their potential or fear that they'll find themselves in the midst of a digital Wild West. This session highlighted how repositories of varying sizes can use Web 2.0 technology without being consumed by it or losing physical or intellectual control of their holdings, and many of the attendees seemed really intrigued by these presentations. I suspect that The Commons will grow as a result of this session . . .

Wednesday, April 22, 2009

MARAC: There and Back Again: Nazi Anthropological Data at the Smithsonian

I wrote this post during a long layover at the Detroit Metro Airport on 21 April 2009, and finished around 8:35 PM, but simply wasn't prepared to pay $8.00 for the privilege of accessing DTW's wireless connection.

I attended this session simply because the topic seemed interesting, and I'm glad I did: the records at the center of this session are inherently interesting (albeit in a disturbing sort of way) and have a complicated, transnational provenance, and processing them, reformatting them, and determining where they should be housed posed real challenges. Although most of us will never encounter a situation quite as complex, many of us will eventually encounter records of uncertain or disputed provenance, materials that lack discernible order, or multi-stage reformatting projects. The decisions that the Smithsonian made and the lessons that it learned thus ought to be of interest to many archivists.

The records in question were created by the Institut für Deutsche Ostarbeit (IDO; Institute for German Work in the East), which the Nazis established in 1940 to settle all questions relating to the occupation of Eastern Europe. Edie Hedlin (Smithsonian Institution Archives), Beth Schuster (Thomas Balch Library), and Ruth Selig (Smithsonian) took turns discussing the records' complicated custodial history and the Smithsonian's involvement with them.

The IDO had many sections, including one that focused on "racial and national traditions" and researched Polish ethnic groups; however, apart from one study completed in the Tarnow ghetto, the IDO's racial and national section did not study Jews. The section gathered or created data forms (e.g., personal and family histories), photographs of people and objects, and bibliographic and reference cards, and it published articles based on some of this research.

U.S. and British troops captured the IDO’s records in 1945, and the U.S. Army brought the records to the United States in 1947. The War Department’s intelligence division and the Surgeon General’s medical intelligence unit went through the records (in the process destroying whatever original order may have existed) and then offered them to the Smithsonian. The Smithsonian accepted the records, but then transferred some of them to the Library of Congress, the National Gallery of Art, and the Pentagon (which then sent some of the records to the National Archives). As a result, there are small pieces of the collection all over Washington, DC.

The IDO records held by the Smithsonian were not used for research until 1997, when a cultural anthropologist reorganized some of them, created the collection’s first detailed finding aid, and eventually published a book based on her research.

In 2003, the Polish Embassy requested that the IDO records be returned to Poland. It took the Smithsonian about five years to figure out how to respond to this request, and its response was the product of repeated consultation among various units of the Smithsonian, the State Department's Holocaust studies unit, and the Library of Congress, which had received competing requests from the German and Polish governments for materials that had been created by German authorities but concerned Poland. The State Department, which noted that the Smithsonian's decision might set a precedent, wanted the two governments to reach some sort of agreement concerning the materials in LC's possession.

In order to determine how it would respond to the Polish government’s request, the Smithsonian set up a task force that examined:
  • Accepted archival principles and guidelines;
  • Whether the U.S. Army had acted legally when it took the records and gave them to the Smithsonian;
  • Whether the other Allied nations had any legal claim to the records;
  • The Smithsonian’s authority to acquire, hold, and de-accession archival collections;
  • The records’ unique characteristics and potential research uses;
  • Whether various other parties—the U.S. Army, the Bundesarchiv and other German government agencies, the U.S. National Archives and Records Administration, the U.S. Holocaust Memorial Museum, the Polish government, and the U.S. State Department—had any interest in the records;
  • The impact of any precedents that the Smithsonian’s actions would establish upon the Smithsonian itself, the Library of Congress, the Hoover Institution (which holds most of the records of the Polish government in exile), and U.S. government agencies.
The process of determining whether other parties had any interest in the records required tact and discretion. However, the Smithsonian eventually determined that neither the U.S. Army nor the Bundesarchiv objected to returning the records to Poland, and the State Department, which was extremely helpful throughout the process, determined that the German government had no interest in the records.

In September 2005, the Smithsonian decided that it would make copies of the records and then transfer the originals to the Jagiellonian University Archives, which agreed to make them publicly accessible. It opted to digitize the records and then produce microfilm from the scans, and needed to raise a lot of money to do so. It initially requested funding from a private foundation, which deferred giving an answer for approximately a year. When the Polish Embassy inquired about the status of the project, the Smithsonian seized the opportunity to cc: approximately 20 other people and institutions in its response. As a result of this e-mail exchange, the U.S. Holocaust Memorial Museum offered funding for digitization and for conservation and allowed the Smithsonian to use its standing digitization contract; the Polish university to which the records were headed also offered some support.

The Smithsonian engaged Schuster, an archival intern fluent in German, to process the records and oversee their digitization. Schuster humidified, flattened, and cleaned the records, which were trifolded and covered in coal dust and other contaminants, and rehoused them in boxes suitable for A4-sized paper. She imposed order upon them, which was no small challenge. The anthropologist who prepared the initial finding aid had attempted to arrange the records geographically; however, she was chiefly interested in the IDO’s Tarnow ghetto and Krakow studies, and as a result most of the collection was unarranged. Schuster ultimately organized the records by type. In order to preserve the initial arrangement of the records (which was reflected in the anthropologist’s published citations), she created an Access database that tracked the original and new order of each document in the collection and generated container lists that contained crosswalks between the two arrangements.
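Schuster's Access database is a nice example of preserving an "original order" virtually even as the physical arrangement changes. As a rough illustration (not her actual database design), the same idea can be expressed as a single table that records both locations for each document, plus a query that generates the crosswalked container list. The sketch below uses SQLite in place of Access, and the column names and sample entry are invented.

    # Sketch of an original-order/new-order crosswalk, using SQLite in place of Access.
    # Table structure and sample values are invented for illustration.
    import sqlite3

    conn = sqlite3.connect("ido_crosswalk.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id          INTEGER PRIMARY KEY,
            original_box    TEXT,   -- location cited in the earlier finding aid
            original_folder TEXT,
            new_series      TEXT,   -- arrangement by record type
            new_box         TEXT,
            new_folder      TEXT
        )
    """)
    conn.execute(
        "INSERT INTO documents (original_box, original_folder, new_series, new_box, new_folder) "
        "VALUES (?, ?, ?, ?, ?)",
        ("12", "7", "Data forms", "3", "15"),   # hypothetical entry
    )
    conn.commit()

    # Container list that crosswalks the two arrangements.
    for row in conn.execute(
            "SELECT new_series, new_box, new_folder, original_box, original_folder "
            "FROM documents ORDER BY new_series, new_box, new_folder"):
        print(row)

Because the anthropologist's published citations referred to the old arrangement, a lookup like this lets researchers follow a citation to the document's new home.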

Schuster also shared a couple of lessons she learned during the digitization phase of the project:
  • Digitization should begin only after a collection is completely conserved and reprocessed. Project deadlines led the Smithsonian to start digitizing as soon as possible, and as a result, the image files had to be renamed after processing.
  • Do not underestimate the amount of time and effort needed for good quality control. The Smithsonian needed accurate, complete surrogates and had to ensure that every original had been scanned, so Schuster examined each image and counted the number of pages in each folder (a bare-bones version of this sort of count check is sketched below). She had to send back to the vendor many originals that had been scanned crookedly or missed altogether, and she has a jaundiced view of outsourcing as a result.
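Here is what such a count check might look like in its simplest form -- a sketch only, with hypothetical paths and a hypothetical manifest, not the Smithsonian's actual workflow.

    # Bare-bones quality-control check: compare the number of images delivered per folder
    # against the page counts recorded during processing. Paths and manifest are hypothetical.
    from pathlib import Path

    expected_pages = {"box01_folder01": 42, "box01_folder02": 17}   # from a processing manifest

    for folder, expected in expected_pages.items():
        delivered = len(list(Path("scans", folder).glob("*.tif")))
        if delivered != expected:
            print(f"{folder}: expected {expected} images, received {delivered} -- return to vendor")

A count check catches missed pages, but it cannot catch crooked or illegible scans; that still requires looking at every image, which is why the quality-control phase took Schuster so long.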
The project wrapped up in late September 2007, when the records were sent to Poland via diplomatic pouch; however, Schuster continued to rename the image files and correct the finding aid, and the Smithsonian finished producing microfilm from the digital surrogates in April 2009. The transfer deeply pleased the Polish government: within a few months of the transfer, it tracked down people who had taken part in IDO studies as children and completed a short film highlighting their recollections.

Ruth Selig concluded by making a very important point: the transfer was successful because the Smithsonian committed to working through a complicated process in a very deliberate, step-by-step manner. Many different institutions were brought together in interesting and unanticipated ways, and everyone was pleased with the outcome. Even the State Department was pleased; the initial request was technically issued by Jagiellonian University and directed to the Smithsonian, which is not a government agency, so the Smithsonian’s transfer decision really isn't precedent-setting.

All in all, a good session full of practical tips for dealing with a wide array of complex issues.

Friday, December 5, 2008

DCAPE meeting: day two

Today was the second day of the inaugural meeting of the Distributed Custodial Archival Preservation Environments group. We went over the project timeline (which needs some revision) and spent some time discussing the specifics of the archivist partners' first task: outlining, with reference to the Open Archival Information System Reference Model, the specific functional requirements of the preservation system. We then went over some practical matters (travel reimbursements, etc.), talked about the records that each partner was thinking of contributing to the project, and wrapped up at around 11:30 AM. We continued talking informally over lunch and then broke up and headed our separate ways.

Right before lunch, Rich Szary, the director of the Wilson Library, gave me a quick tour of the Carolina Digital Library and Archives, which is housed in Wilson Library and which has an astounding array of scanning equipment -- including a high-capacity, autofeed paper scanner that UNC's conservation staff have approved for use with archival records -- and is digitizing archival materials and rare books with immense speed.

Immediately after lunch, I had to leave for RDU. I had another long layover at EWR. I'm getting really familiar with Concourse C, and got to watch the sun set over the "Airtrain" connecting it to the other concourses; unfortunately, the window glass through which I shot this picture reflected some of the light in the concourse's interior.

Although the good people at Continental no doubt wanted a different outcome, I was really happy with the way my flight arrangements worked out: none of the planes I was on was full, and I didn't have any seatmates on any of my flights. Owing to the clear night sky, I also got a stunning view of Manhattan, Brooklyn, and Queens on the flight from EWR to ALB.

Nonetheless, I am really, really glad to be home.

Saturday, August 16, 2008

Your photos, finally off the shelf . . .

. . . and onto portable media. A couple of days ago, a David Pogue piece in the New York Times focused on ScanMyPhotos.com, which will, for $50.00, digitize 1,000 of your home photos and place them on DVD.

ScanMyPhotos.com isn't the only company providing such services, which might be useful to people who want digital copies of their photos but don't want to take the time to scan thousands of images at home. However, Pogue doesn't discuss the file format and dpi/ppi that ScanMyPhotos uses or the file naming conventions that it employs. According to its Photo Scanning FAQs, bulk scanning customers can receive their images in only one format and resolution (300 dpi JPEG), and those who want their photos scanned in a specific order, with vertical/horizontal orientation preserved, etc., will have to pay extra. Detailed information about file naming conventions isn't available, but it's pretty evident from the FAQs that the first image on each DVD is no. 1, the second is no. 2, and so on. Anyone sifting through 1,000 arbitrarily numbered files in search of that wonderful picture of Great Aunt Oona at the 1972 family reunion will have some work to do . . . .

Moreover, although he notes in passing that one advantage of digital photos is that they are "easily backed up," Pogue doesn't explain that the long-term survival of these images will require periodic intervention. Pogue's a sharp guy, and his readership is most likely more technologically savvy than the public as a whole. However, even people who understand, in a general way, that DVDs will eventually become obsolete, that storing backup media right next to the computer isn't a good idea, and that electronic storage media have wildly unpredictable lifespans sometimes fail to plan for the preservation of their data. A paragraph or two about the importance of creating multiple backups, storing backups well away from the computer (in a safe deposit box, at the home of a trusted relative or friend, or at the office), copying files from old media to new in accordance with a predetermined schedule or when replacing one's computer, and remaining abreast of changes in storage technology would have been extremely helpful.
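To make that advice concrete: even a simple checksum comparison will tell you whether the copies you just made on new media are byte-for-byte identical to the originals. A minimal sketch, with placeholder directory paths:

    # Verify that copied image files match the originals by comparing SHA-256 checksums.
    # Directory paths are placeholders.
    import hashlib
    from pathlib import Path

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    originals = Path("D:/photo_dvd")      # e.g., the vendor-supplied DVD
    backups = Path("E:/photo_backup")     # e.g., the copy on newer media

    for src in originals.rglob("*.jpg"):
        dst = backups / src.relative_to(originals)
        status = "OK" if dst.exists() and sha256(src) == sha256(dst) else "MISSING OR CORRUPT"
        print(f"{src.name}: {status}")

Nothing about this is beyond Pogue's readership; the hard part is remembering to do it on a schedule and every time the files move to new media.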

Pogue's absolutely right that digital copies provide an added layer of protection for photos. The parents of a close friend of mine have spent the past few weeks cleaning up after a major house fire, and their family photos were either badly water-damaged or became what Pogue vividly calls "Toxic Photo Soup." They're heartbroken, and I'm sure that they would love to have digital copies of those images--even if said copies were in no discernable order, needed extensive descriptive/indexing work to be truly useful and accessible, and couldn't be used to produce good-quality enlargements. They would also doubtless appreciate some of the extra services (e.g., ability to create albums) that ScanMyPhotos.com and other vendors offer.

Pogue's piece concludes with a brief but very welcome discussion of preservation of photographic prints, and I'm glad that he recognizes the value of keeping the paper originals. I just wish that he had devoted a little attention to the preservation of digital files as well.