The first session of the Spring 2011 Mid-Atlantic Regional Archives Conference focused on three electronic records research projects sponsored by the U.S. National Archives and Records Administration. All of them are intriguing, and all of them promise to help many electronic records archivists do their work.
Peter Bajcsy (National Center for Supercomputing Applications) detailed the cloud-based solutions that he and his colleagues have developed to address the challenges posed by the increasing number and complexity of file formats, the growing volume of electronic records, mounting hardware and software complexity, and ephemeral support for proprietary software. I haven’t had the opportunity to check out these tools, but I certainly will do so as soon as I get the chance:
Conversion Software Registry: A registry and freely accessible search tool that enables users seeking to convert files from one format to another to specify the format of the records with which they’re working and the desired preservation format and then review a list of appropriate conversion tools. Over 2,000 software packages are documented in the registry.
Polyglot: a cloud-based, open source conversion tool suitable for classified and proprietary information.
Versus [in development]: a tool that can compare original and converted versions of the same digital object -- simple and complex -- and evaluate resulting information losses. The results of these comparisons can be used to determine which preservation approach results in the least loss.
Bajcsy and his team are also interested in developing a Universal File Viewer, a cloud-based service that could provide a preview of files encoded in any format.
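To make the registry idea concrete: finding a route from one format to another is essentially a path search over a table of known conversions. Here’s a minimal sketch in Python -- the formats, tools, and table structure are all invented for illustration and bear no relation to the CSR’s actual schema:

```python
from collections import deque

# Hypothetical conversion table: (source format, target format, tool).
# The real Conversion Software Registry documents over 2,000 packages.
CONVERSIONS = [
    ("doc", "odt", "LibreOffice"),
    ("odt", "pdf", "LibreOffice"),
    ("wpd", "doc", "libwpd"),
    ("tiff", "pdf", "ImageMagick"),
]

def conversion_chain(source, target):
    """Breadth-first search for the shortest chain of tools that
    converts `source` into `target`; returns None if no chain exists."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for src, dst, tool in CONVERSIONS:
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [(tool, src, dst)]))
    return None

# WordPerfect to PDF requires chaining two tools through .doc and .odt.
print(conversion_chain("wpd", "pdf"))
```

A registry lookup like this also makes it obvious when no migration path exists at all, which is itself useful appraisal information.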
Bajcsy also posed a few questions for the audience to consider:
- His team can deliver, on average, 1,537.41 file conversions per hour (assuming 50 percent utilization of a single-CPU virtual machine and 50 percent uptime for that machine). Does this conversion rate meet archival needs?
- How many file formats have you personally encountered in your work?
- Would the Universal File Viewer provide an added value?
- Is data-driven file format selection for preservation a viable approach?
- Is software robustness evaluation a viable approach to determining whether a given file is well-formed? (i.e., determining how many applications can open a given file might be a more practical means of determining well-formedness than comparing the file to the format specification.)
- What is the value of data-driven evaluation of quality of software input/output functionality?
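The robustness question in particular lends itself to a quick sketch: treat each application that can open a file as a vote for its well-formedness. The snippet below stands in two stdlib XML parsers for “applications”; real robustness evaluation would, of course, use many independent programs rather than two parsers from one library:

```python
import io
import xml.etree.ElementTree as ET

def try_openers(data, openers):
    """Return the fraction of `openers` that accept `data` without
    raising -- a rough, data-driven proxy for well-formedness."""
    successes = 0
    for opener in openers:
        try:
            opener(data)
            successes += 1
        except Exception:
            pass
    return successes / len(openers)

# Two hypothetical "applications" that both read XML.
XML_OPENERS = [
    lambda b: ET.fromstring(b),
    lambda b: ET.parse(io.BytesIO(b)),
]

print(try_openers(b"<memo><to>J. Doe</to></memo>", XML_OPENERS))  # 1.0
print(try_openers(b"<memo><to>J. Doe</memo>", XML_OPENERS))       # 0.0
```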
Archivists must identify file formats for a variety of reasons: assessing compliance with submission agreements/transfer memoranda, reading/playing files, converting files to standard or preservation formats, extracting information from archive files (e.g., .zip, .arc), recovering passwords and decrypting files, and repairing damaged files. In some instances, it may be possible to use external indicators (e.g., file extensions, MIME types) to identify unknown formats. In other instances, however, external indicators are not sufficient, and the most popular analytical tools, the Linux file command and magic file, have some limitations: their output is sometimes ambiguous, they output metadata as well as file types, and their tests for the character set and language of text files are less than perfect.
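A minimal illustration of internal (magic-byte) identification, as opposed to trusting external indicators such as extensions -- the signature table below is a tiny invented subset of what file, magic file, and DROID actually know:

```python
# Hypothetical signature table: magic bytes at offset 0 -> format name.
# Real tools use far richer patterns, including signatures at nonzero
# offsets and at the end of the file.
MAGIC = {
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF89a": "GIF image",
}

def identify(data):
    """Internal identification: match leading bytes against known magic
    numbers, regardless of what the file extension claims."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"

# A file named report.pdf that is actually a ZIP archive -- the external
# indicator (the extension) and the internal one (magic bytes) disagree.
print(identify(b"PK\x03\x04...payload..."))  # ZIP archive
```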
Underwood and his colleagues are refining the Linux file command and magic file so that they produce file format signatures that can be compared to signatures of known file formats. To date, they have defined roughly 850 file format signatures and have collected examples of approximately 700 different file format types. They have also created a file signature database and, as moderator Mark Conrad noted afterward, contributed file signatures to the National Archives of the United Kingdom (NAUK) PRONOM file format registry; these signatures have been incorporated into DROID, NAUK’s open source file format identification tool.
Underwood and his colleagues are also testing new techniques for recognizing document types and extracting descriptive metadata. Their focus is on legacy documents that do not conform to XML document type definitions. They examine the intellectual form (i.e., structure) of these documents and then construct “intellectual grammars” for each document type (e.g., memoranda) and use intellectual extraction techniques to pull out names, dates, and other metadata elements.
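To illustrate what grammar-driven extraction might look like in miniature, here’s a toy “grammar” for memoranda expressed as labeled-header patterns; Underwood’s actual grammar formalism is far richer, and the element names and patterns here are my own invention:

```python
import re

# A toy "document grammar" for memoranda: each metadata element is a
# labeled header line.
MEMO_GRAMMAR = {
    "to":      re.compile(r"^TO:\s*(.+)$", re.M),
    "from":    re.compile(r"^FROM:\s*(.+)$", re.M),
    "date":    re.compile(r"^DATE:\s*(.+)$", re.M),
    "subject": re.compile(r"^SUBJECT:\s*(.+)$", re.M),
}

def extract_metadata(text, grammar):
    """Apply each element's pattern and collect whatever matches."""
    record = {}
    for element, pattern in grammar.items():
        match = pattern.search(text)
        if match:
            record[element] = match.group(1).strip()
    return record

memo = """TO: Records Management Staff
FROM: J. Smith
DATE: 14 April 2011
SUBJECT: Transfer of case files

Please see the attached schedule."""
print(extract_metadata(memo, MEMO_GRAMMAR))
```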
Underwood noted in passing that after he and his colleagues have extracted this metadata, they can write rules that enable them to create item-level descriptions. From those item-level descriptions, they can write rules that enable them to create file-level and then series-level descriptions. I was really struck by this statement, which suggests that automation is going to lead to some really intriguing -- and, to many people, unsettling -- changes in archival descriptive practice.
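The rollup Underwood described -- item-level descriptions feeding higher-level ones -- might look something like this in miniature; the records and the aggregation rule (collect creators, derive a date span) are invented for illustration:

```python
# Hypothetical item-level records, as might come out of metadata
# extraction; the rollup rule here is an illustration, not the
# project's actual rule set.
items = [
    {"file_unit": "Case files A-D", "creator": "J. Smith", "year": 2003},
    {"file_unit": "Case files A-D", "creator": "J. Smith", "year": 2005},
    {"file_unit": "Case files E-H", "creator": "R. Jones", "year": 2004},
]

def rollup(items):
    """Derive file-unit-level descriptions from item-level ones:
    one record per file unit, with its creators and a date range."""
    units = {}
    for item in items:
        unit = units.setdefault(item["file_unit"],
                                {"creators": set(), "years": []})
        unit["creators"].add(item["creator"])
        unit["years"].append(item["year"])
    return {name: {"creators": sorted(u["creators"]),
                   "date_range": (min(u["years"]), max(u["years"]))}
            for name, u in units.items()}

print(rollup(items))
```

The same aggregation step, applied again to the file-unit records, would yield a series-level description.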
Underwood and his team hope to apply induction techniques to examples of a particular document type and generate a “document grammar” automatically and to expand their extraction techniques to include physical elements of documentary form (e.g., fonts) and document grammars of physical layouts. Cool stuff.
I really can’t do justice to the third presentation. Maria Esteva (Texas Advanced Supercomputing Center) and her colleagues are exploring possible archival uses of visualization technology, and, not surprisingly, her presentation included a lot of illustrations and multimedia material. If you want to get a sense of what these materials look like -- and I recommend that you do so -- some of them are featured on the Texas Advanced Supercomputing Center site and in this month’s issue of Discover magazine; the team has also outlined its findings here.
Visualization tools can be used to depict, compare, and contrast many different types of data. Esteva and her colleagues hope that visualization, which is often easier for the mind to grasp than lengthy textual or statistical analyses, will ultimately help to guide archival processing decisions, facilitate analysis of large quantities of electronic records that consist of multiple document types and complex digital objects, and enhance access to large, complex groupings of electronic records.
Using an electronic records testbed supplied by NARA, Esteva and her colleagues use a variety of automated techniques to identify groupings of files and data objects related by provenance and extract information about content and organization, and then place the resulting data in a relational database. They then use data mining, alignment algorithms, natural language processing, data distributions, and information classes to compare, contrast, and identify intellectual relationships between records and use visualization tools to create graphic representations of the results of their analyses: pie charts, network graphs, and, in particular, tree diagrams.
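As a text-mode stand-in for the team’s tree diagrams, here’s a sketch that nests a (made-up) file inventory into a directory tree and renders it with indentation; the real visualizations are, of course, graphical and far richer:

```python
# Hypothetical inventory of a records transfer: (path, size in bytes).
files = [
    ("corr/2003/memo1.doc", 24000),
    ("corr/2003/memo2.doc", 31000),
    ("corr/2004/memo3.doc", 18000),
    ("reports/annual.pdf", 410000),
]

def build_tree(files):
    """Nest directories into dicts; leaves hold file sizes."""
    root = {}
    for path, size in files:
        node = root
        *dirs, name = path.split("/")
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = size
    return root

def render(node, indent=0):
    """Render the nested dicts as an indented text tree."""
    lines = []
    for name, child in node.items():
        if isinstance(child, dict):
            lines.append("  " * indent + name + "/")
            lines.extend(render(child, indent + 1))
        else:
            lines.append("  " * indent + f"{name} ({child} bytes)")
    return lines

print("\n".join(render(build_tree(files))))
```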
Esteva presented two visualization case studies that drew upon the testbed. The first showed how visualization could help archivists process electronic records by highlighting intellectual content and relationships that were not immediately apparent, assess preservation needs, and identify other salient characteristics of the records. The other showed how visualization could help users identify collections that were particularly relevant to their research needs. Researchers searching for materials that contain specific intellectual content, have a specific provenance, were created at a specific time, exhibit specific patterns, or have some combination of these and other characteristics could visually assess which collections would be the most fruitful.
Maybe it’s a sign of advancing age (not to mention my fondness for the written word), but at this point I’m not quite convinced that researchers seeking records that possess a specific characteristic or cluster of characteristics will invariably prefer analyzing a tree diagram. However, I am intrigued by visualization technology and I think that we’ll soon come to accept that it can help us identify materials that have particular preservation needs, are responsive to specific freedom of information requests, or have specific traits or patterns that might have otherwise escaped our attention. In addition, I think that researchers will embrace visualization technology as an analytical tool; for example, many researchers will likely use relationship graphs to highlight patterns of interaction embedded within large clusters of e-mail messages.