Showing posts with label MARAC Spring 2011. Show all posts

Saturday, May 7, 2011

MARAC Spring 2011: New Tools to Address Electronic Records Challenges

Fifth order Fresnel lens used in the Jones Point lighthouse, Alexandria, Virginia, during the 19th century and now held by The Lyceum, Alexandria, Virginia, as seen on 5 May 2011.

The first session of the Spring 2011 Mid-Atlantic Regional Archives Conference focused on three electronic records research projects sponsored by the U.S. National Archives and Records Administration. All of them are intriguing, and all of them promise to help many electronic records archivists do their work.

Peter Bajcsy (National Center for Supercomputing Applications) detailed the cloud-based solutions that he and his colleagues have developed in order to address the challenges associated with the increasing number and complexity of file formats, the increasing volume of electronic records, growing hardware and software complexity, and ephemeral support for proprietary software. I haven’t had the opportunity to check out these tools, but I certainly will do so as soon as I get the chance:

Conversion Software Registry: A registry and freely accessible search tool that enables users seeking to convert files from one format to another to specify the format of the records with which they’re working and the desired preservation format and then review a list of appropriate conversion tools. Over 2,000 software packages are documented in the registry.
Polyglot: a cloud-based, open source conversion tool suitable for classified and proprietary information.
Versus [in development]: a tool that can compare original and converted versions of the same digital object -- simple and complex -- and evaluate resulting information losses. The results of these comparisons can be used to determine which preservation approach results in the least loss.

Bajcsy and his team are also interested in developing a Universal File Viewer, a cloud-based service that could provide a preview of files encoded in any format.

Bajcsy also posed a few questions for the audience to consider:
  • His team can deliver, on average, 1537.41 file conversions in one hour (50% utilization of a single CPU virtual machine and 50% virtual uptime of the virtual machine). Does this conversion rate meet archival needs?
  • How many file formats have you personally encountered in your work?
  • Would the Universal File Viewer provide an added value?
  • Is data-driven file format selection for preservation a viable approach?
  • Is software robustness evaluation a viable approach to determining whether a given file is well-formed? (i.e., determining how many applications can open a given file might be a more practical means of determining well-formedness than comparing the file to the format specification.)
  • What is the value of data-driven evaluation of quality of software input/output functionality?
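
To put the first question in perspective, here is a rough back-of-the-envelope sketch in Python. Only the 1537.41 conversions-per-hour figure comes from the talk; the one-million-file accession and the assumption that throughput scales linearly with the number of machines are mine, purely for illustration.

```python
# Rate quoted in the talk: average conversions per hour on one single-CPU virtual machine.
CONVERSIONS_PER_HOUR = 1537.41


def seconds_per_conversion(rate_per_hour: float = CONVERSIONS_PER_HOUR) -> float:
    """Average wall-clock seconds per conversion at the given hourly rate."""
    return 3600.0 / rate_per_hour


def days_to_convert(file_count: int,
                    rate_per_hour: float = CONVERSIONS_PER_HOUR,
                    machines: int = 1) -> float:
    """Days needed to convert file_count files.

    Assumes (hypothetically) that throughput scales linearly with machines.
    """
    hours = file_count / (rate_per_hour * machines)
    return hours / 24.0


if __name__ == "__main__":
    # One million files is a made-up accession size, not a figure from the talk.
    print(f"{seconds_per_conversion():.2f} seconds per file")
    print(f"{days_to_convert(1_000_000):.1f} days on one machine")
    print(f"{days_to_convert(1_000_000, machines=10):.1f} days on ten machines")
```

At the quoted rate, each conversion takes a little over two seconds, and a hypothetical million-file accession would occupy a single machine for roughly four weeks. Whether that is fast enough depends on how large and how frequent your transfers are.
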
William Underwood (Georgia Tech Research Institute) then discussed his work on new tools for identifying file formats, identifying document types, and extracting metadata.

Archivists must identify file formats for a variety of reasons: assessing compliance with submission agreements/transfer memoranda, reading/playing files, conversion to standard or preservation formats, extracting information from archive files (e.g., .zip, .arc), password recovery and decryption, and repairing damaged files. In some instances, it may be possible to use external identifiers (e.g., file extensions, MIME types) to identify unknown formats. However, in some instances, external indicators are not sufficient, and the most popular analytical tools, the Linux file command and magic file, have some limitations: their output is sometimes ambiguous, they test output metadata as well as file types, and their tests for character set and language of text files are less than perfect.

Underwood and his colleagues are refining the Linux file command and magic file so that they produce file format signatures that can be compared to signatures of known file formats. To date, they have defined roughly 850 file format signatures and have collected examples of approximately 700 different file format types. They have also created a file signature database and, as moderator Mark Conrad noted afterward, contributed file signatures to the National Archives of the United Kingdom (NAUK) PRONOM file format registry; these signatures have been incorporated into DROID, NAUK’s open source file format identification tool.
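
For readers who haven't worked with file format signatures, here is a minimal Python sketch of the general idea: compare the first few bytes of a file against a table of known byte sequences. This is not Underwood's method, nor how the file command or DROID is implemented. The five signatures below are widely documented ones, and the function names are my own.

```python
from typing import Optional

# A few well-known leading byte sequences ("magic numbers").
# This tiny table is illustrative only; it is not the project's signature database.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP archive (or a ZIP-based format)"),
]


def identify_format(data: bytes) -> Optional[str]:
    """Return the name of the first signature matching the start of data, else None."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def identify_file(path: str) -> Optional[str]:
    """Read only as many leading bytes as the longest signature needs, then identify."""
    longest = max(len(magic) for magic, _ in SIGNATURES)
    with open(path, "rb") as handle:
        return identify_format(handle.read(longest))
```

Because the check looks at content rather than the file name, a PDF saved with a misleading extension would still be recognized. The ZIP entry also hints at the ambiguity problem: many distinct formats are ZIP containers underneath, so a single leading signature cannot always tell them apart.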

Underwood and his colleagues are also testing new techniques for recognizing document types and extracting descriptive metadata. Their focus is on legacy documents that do not conform to XML document type definitions. They examine the intellectual form (i.e., structure) of these documents and then construct “intellectual grammars” for each document type (e.g., memoranda) and use intellectual extraction techniques to pull out names, dates, and other metadata elements.
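
To make the idea a bit more concrete, here is a deliberately simplified Python sketch of pulling metadata out of a memorandum by relying on its form. It uses plain regular expressions rather than the grammars Underwood described, and the header labels, sample memo, and function names are all hypothetical.

```python
import re
from typing import Dict

# Hypothetical header labels for a memorandum; real legacy documents vary widely.
HEADER_PATTERN = re.compile(
    r"^(?P<label>TO|FROM|DATE|SUBJECT)[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

REQUIRED_FIELDS = {"to", "from", "date", "subject"}


def extract_memo_metadata(text: str) -> Dict[str, str]:
    """Collect the first value found for each recognized header label."""
    metadata: Dict[str, str] = {}
    for match in HEADER_PATTERN.finditer(text):
        label = match.group("label").lower()
        metadata.setdefault(label, match.group("value"))
    return metadata


def looks_like_memo(text: str) -> bool:
    """Treat a document as a memo only if all four header fields are present."""
    return REQUIRED_FIELDS <= set(extract_memo_metadata(text))


# Made-up sample document for illustration.
SAMPLE = """MEMORANDUM

TO: Records Staff
FROM: J. Smith
DATE: May 5, 2011
SUBJECT: Transfer schedule

Please review the attached schedule.
"""

if __name__ == "__main__":
    print(looks_like_memo(SAMPLE))
    print(extract_memo_metadata(SAMPLE))
```

A real document grammar would need to cope with far messier input (missing labels, multi-line recipients, letterhead), which is presumably why the team is interested in inducing grammars from examples rather than hand-writing patterns like these.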

Underwood noted in passing that after he and his colleagues have extracted this metadata, they can write rules that enable us to create item-level descriptions. From those item-level descriptions, they can write rules that enable us to create file-level and then series-level descriptions. I was really struck by this statement, which suggests that automation is going to lead to some really intriguing -- and to many people unsettling -- changes in archival descriptive practice.

Underwood and his team hope to apply induction techniques to examples of a particular document type and generate a “document grammar” automatically and to expand their extraction techniques to include physical elements of documentary form (e.g., fonts) and document grammars of physical layouts. Cool stuff.

I really can’t do justice to the third presentation. Maria Esteva (Texas Advanced Supercomputing Center) and her colleagues are exploring possible archival uses of visualization technology, and, not surprisingly, her presentation included a lot of illustrations and multimedia material. If you want to get a sense of what these materials look like -- and I recommend that you do so -- some of them are featured on the Texas Advanced Supercomputing Center site and in this month’s issue of Discover magazine; the team has also outlined its findings here.

Visualization tools can be used to depict, compare, and contrast many different types of data. Esteva and her colleagues hope that visualization, which is often easier for the mind to grasp than lengthy textual or statistical analyses, will ultimately help to guide archival processing decisions, facilitate analysis of large quantities of electronic records that consist of multiple document types and complex digital objects, and enhance access to large, complex groupings of electronic records.

Using an electronic records testbed supplied by NARA, Esteva and her colleagues use a variety of automated techniques to identify groupings of files and data objects related by provenance and extract information about content and organization, and then place the resulting data in a relational database. They then use data mining, alignment algorithms, natural language processing, data distributions, and information classes to compare, contrast, and identify intellectual relationships between records and use visualization tools to create graphic representations of the results of their analyses: pie charts, network graphs, and, in particular, tree diagrams.

Esteva presented two visualization case studies that drew upon the testbed. The first highlighted how visualization could help archivists process electronic records by highlighting intellectual content and relationships that were not immediately apparent, assessing preservation needs, and identifying other salient characteristics of the records. The other showed how visualization could help users identify collections that were particularly relevant to their research needs. Researchers searching for materials that contain specific intellectual content, have a specific provenance, were created at a specific time, exhibit specific patterns, or have some combination of these and other characteristics could visually assess which collections would be the most fruitful.

Maybe it’s a sign of advancing age (not to mention my fondness for the written word), but at this point I’m not quite convinced that researchers seeking records that possess a specific characteristic or cluster of characteristics will invariably prefer analyzing a tree diagram. However, I am intrigued by visualization technology and I think that we’ll soon come to accept that it can help us identify materials that have particular preservation needs, are responsive to specific freedom of information requests, or have specific traits or patterns that might have otherwise escaped our attention. In addition, I think that researchers will embrace visualization technology as an analytical tool; for example, many researchers will likely use relationship graphs to highlight patterns of interaction embedded within large clusters of e-mail messages.

Friday, May 6, 2011

MARAC Spring 2011: Archival Ethics and the Call of Justice

1315 Duke Street, Alexandria, Virginia, 5 May 2011. Between 1828 and 1861, this unassuming brick building was used as a holding pen for slaves awaiting sale in Natchez, New Orleans, or elsewhere; neighboring structures were also part of the city's slave trade district. It is now home to the Northern Virginia Urban League and its Freedom House Museum, which documents the lives of the men, women, and children who were imprisoned here.

The Spring 2011 meeting of the Mid-Atlantic Regional Archives Conference got off to a roaring start with Rand Jimerson’s thought-provoking plenary address, "Archival Ethics and the Call of Justice." Jimerson’s words have been bouncing around my head since this morning, and this post is an effort to nail some of them down. First, however, a disclaimer is in order. I’m a little sleep-deprived at the moment, and as a result some of the first half of Jimerson’s address bounced right off my benumbed skull. In other words, this post may not be fully faithful to his remarks. However, what I heard (or think I heard) got at least a few of my mental wheels spinning.

Jimerson began by summarizing several propositions put forth at a 2005 colloquium sponsored by the Nelson Mandela Foundation:
  • Archivists must avoid allowing normative conceptions of society to color the ways in which they select, acquire, and furnish access to materials.
  • Archivists must fight against destruction or neglect of records that document oppression. (Oppressive regimes tend to be really good at documenting their crimes but attempt to destroy their records when their demise is imminent.)
  • Archivists must proactively create archives that reflect the full diversity of their societies.
  • Archivists should not be passive documenters of society but active participants in efforts to achieve social justice.
All of these propositions are new (or relatively new) to the archival profession, which has traditionally seen itself as objective and neutral. However, archives have traditionally served and reinforced the interests of entrenched power: their holdings reflect the words and deeds of the powerful, the successful, and the educated, and people and groups lacking one or more of these characteristics have either remained undocumented or documented only by records creators opposed to or indifferent to their experiences and perspectives. In recent years, archivists have consciously started making an effort to make the documentary record more inclusive, but our emphasis upon provenance and upon the written word ensures that we are subtly biased toward the powerful and the influential.

Jimerson noted that many archivists might have trouble accepting that their work and their holdings reflect and perpetuate existing relations of power and might be deeply wary of the "call to justice" articulated in Johannesburg in 2005. However, he noted that it is possible to maintain professional standards of objectivity while at the same time accepting the impossibility of being personally neutral: as historian Thomas Haskell has asserted, a commitment to telling the truth does not prevent one from engaging in advocacy, but it does place certain intellectual limits on one’s advocacy. Moreover, answering the "call to justice" does not mandate that one adopt a particular partisan affiliation. However, it does mandate that one embrace and defend democratic values (e.g., government openness and transparency, the right of all citizens to participate fully in the life of their society and to have their histories and perspectives documented).

Jimerson then offered a variety of ways in which archivists can answer this "call of justice":
  • Ensure diversity in the archival record. The Society of American Archivists has recently identified the need for diversity in the record and in the profession as one of three key priorities, and this is a step in the right direction.
  • Welcome the stranger into the archives. We seek to include previously marginalized groups in archival documentation and ensure that they are full partners in the recordkeeping process. In the end, the entire community must be the provenance.
  • Base selection and appraisal decisions upon clearly articulated and widely accessible criteria. We need to document our decisions.
  • Listen for oral testimony. Many peoples throughout the world -- including some residing in Canada and the U.S. -- do not write down their histories. If we do not seek out oral testimony and conduct oral histories, we will not know large parts of the world from the inside.
  • Make archival description sensitive to power relationships and conscious of the coded language that describes the social dynamics that led to the records' creation.
  • Make records accessible freely and openly, within the bounds established by privacy concerns and cultural concerns (e.g., access to tribal records).
  • Embrace new technologies. Social media and electronic records make it easier to make information widely available. Moreover, we need to embrace Kate Theimer’s conception of Archives 2.0: open, flexible, user-centered, efficient, and assessment-oriented.
  • Support open government, transparency, and democratic values.
  • Engage in public advocacy, which may include becoming whistleblowers when powerful people and groups try to destroy or alter records.
As noted above, Jimerson’s address was provocative. First, it made me painfully aware of the manner in which I still privilege the written word and literary aptitude. I came to archives as an aspiring labor historian seeking to recover the experiences of men and women who created few written records. My earliest work in archives focused on increasing the inclusivity of the documentary record, and I will argue to death the importance of ensuring the comprehensiveness of the historical record. I am nonetheless unduly impressed by people who “write well” and can be quite uncharitable toward people who are not proficient writers (especially if they’re hard-partying or unfocused undergraduates -- hence my decision not to finish my Ph.D. and go into academe).

I don’t think I will ever overcome this bias -- and in some respects I don’t really want to -- but Jimerson’s words were a stinging reminder that I need to be aware of it and to ensure that I go out of my way to treat with respect records creators, researchers, and other people who don’t embrace the written word as I do, to understand how they understand the world and document their histories, and to do what I can to ensure that they are equitably represented in the documentary record.

I also started thinking about the ways in which Jimerson’s ideas seem to be rooted in relatively recent developments in historical scholarship. The historians who pioneered the "new social history" -- "history from the bottom up" -- in the 1970s and 1980s began scouring records created by elites for information about the lives and perspectives of non-elite people: slaves, laborers, women of all classes, and racial, ethnic, and religious minorities. Barbara Hanawalt’s superb The Ties that Bound: Peasant Families in Medieval England, which mines records of royal inquiries into unnatural deaths for evidence of everyday peasant life, is a superb example of this sort of reading against the grain: Hanawalt was able to reconstruct how these largely illiterate men and women bathed and washed their clothes (yes, they did these things!), cared for children and the elderly, attempted to regulate sexual relationships and negotiate internal social hierarchies, distributed food and other essential resources, and grew crops, tended animals, and produced various necessities of life. Charles Joyner’s Down by the Riverside: A South Carolina Slave Community, which draws upon plantation owners’ diaries and records in addition to oral histories of and narratives written by former slaves, is another stellar example.

I can easily envision a scenario in which this sort of historical inquiry might be viewed as oppressive in and of itself. For example, one person whose life is partially documented in the records of government social service agencies might welcome the sort of inquiry undertaken by a social historian intent upon treating his or her subjects respectfully, but another might view it as yet another unwelcome and painful intrusion perpetrated by yet another educationally, socially, and economically privileged person. However, it strikes me that the philosophical commitments of the new social historians (e.g., belief in the inherent dignity and value of all persons, desire for a comprehensive and equitable historical record) are closely related to those of Jimerson’s justice-focused archivists. The new social history is still reshaping the archival worldview -- and, in my view, that’s a very good thing.