
Friday, July 28, 2017

SAA 2017: records management, the web, and open data

Courtyard of Tranquility, Lan Su Chinese Garden, Portland, Oregon, 26 July 2017.
What follows is a quick stab at outlining a few ideas that came to the fore during two sessions -- one in which I was a participant and one in which I was an audience member -- and during the Government Records Section's annual meeting. Some are my own, some are other people's, and all of them concern in some way our profession's inability to explain the value of records management programs, and in particular government records management programs, to the broader public:

  • Government archivists and records managers have tried for decades to get public officials, policymakers, journalists, and the public at large to understand that government records management and archives programs are essential to ensuring government accountability, efficiency, and transparency. We haven't gotten a lot of traction, and I'm increasingly convinced that our lack of success is because we frame our arguments in ways that make sense to us but not to the vast majority of our fellow citizens. Why do we keep doing the same thing and expecting different results? Why aren't we working with public relations professionals and other people who are adept at crafting simple, resonant messages and communicating them to broad audiences? How would Don Draper sell records management? 
  • As one archivist in a session I attended this morning noted, governments that release the data they gather or create as open data -- data that third parties can use, reuse, and redistribute subject only, at most, to the requirement that the source of the data be identified -- may not pose much of a records management challenge. For example, this archivist's public sector employer, which has begun sharing datasets it has created with the public in an effort to be proactively transparent, treats the versions of the datasets it posts on its open data website as convenience copies. However, as other archivists pointed out during the annual meeting of the Government Records Section, the controversy and wave of "citizen archiving" initiatives that ensued when the new presidential administration removed certain types of information from federal government websites suggest that at least some members of the public have come to expect that information posted online will remain readily accessible in perpetuity. I have the feeling that, in the coming years, we're going to devote a lot of energy to coming to grips with this expectation. Will we give in to it and focus on harvesting and preserving web content, or will we ramp up our efforts to explain that managing government records appropriately may mean removing and disposing of data that was once freely available online? Or will we preserve tons of web content and explain that, in some instances, we work with agencies to identify and acquire additional, related records that are not available online and that, in others, we capture only snapshots of web content? 


Thursday, November 14, 2013

Best Practices Exchange 2013: Web archiving

I've been up for a long time and I'm a little under the weather, so I'm going to highlight a couple of cool things that Scott Reed of Archive-It shared this morning and then call it a day:

First, Scott highlighted a tool that's new to me: computer systems engineer Vangelis Banos has developed Archive Ready, a site that enables website creators and people archiving websites to assess the extent to which a given site can be archived using Web crawling software. Simply type the URL of the site into a text box and click a button, and Archive Ready assesses:
  • The extent to which the site can be accessed by Web crawlers
  • The cohesion of the site's content (i.e., whether content is hosted on a single resource or scattered across multiple resources and services) 
  • The degree to which the site was created according to recognized standards
  • The speed with which the host server responds to access requests
  • The extent to which metadata that supports appropriate rendering and long-term preservation is present
Archive Ready also runs basic HTML/CSS validity checks, analyzes HTTP headers, highlights the presence of Flash and QuickTime objects and externally hosted image files, looks for sitemap.xml files, RSS links, and robots.txt files, and determines whether the Internet Archive has crawled the site and converted the results to WARC format.
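If you're curious about what a few of these checks involve under the hood, here's a minimal sketch of some of the simpler ones -- robots.txt rules, the presence of a sitemap.xml file, and HTTP response headers -- using only the Python standard library. This is not Archive Ready's actual code, and the URL it probes is just a placeholder.

```python
# Minimal archivability probe: a rough sketch of a few of the checks a
# service like Archive Ready performs, using only the Python standard
# library. The target URL below is a placeholder, not a real assessment.
import urllib.parse
import urllib.request
import urllib.robotparser


def probe(url):
    parts = urllib.parse.urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    report = {}

    # 1. Does robots.txt allow a generic crawler to fetch the page?
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    try:
        rp.read()
        report["robots_allows_crawl"] = rp.can_fetch("*", url)
    except OSError:
        report["robots_allows_crawl"] = None  # robots.txt not reachable

    # 2. Is there a sitemap.xml to guide the crawler?
    try:
        with urllib.request.urlopen(root + "/sitemap.xml", timeout=10) as resp:
            report["has_sitemap"] = resp.status == 200
    except OSError:
        report["has_sitemap"] = False

    # 3. What do the HTTP response headers say about the page itself?
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            report["content_type"] = resp.headers.get("Content-Type")
            report["last_modified"] = resp.headers.get("Last-Modified")
    except OSError as exc:
        report["fetch_error"] = str(exc)

    return report


if __name__ == "__main__":
    for key, value in probe("https://example.org/").items():
        print(f"{key}: {value}")
```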

Archive Ready has apparently engendered some controversy, but it's a handy resource for anyone seeking to capture and preserve Web content. Based upon my very limited experience with Archive Ready, which is still in beta testing, it might not be able to locate streaming videos buried deep within large websites. However, its overall assessments seem pretty accurate; I entered the URLs of several sites that we've crawled repeatedly, and the sites that we've been able to capture without incident consistently received high scores from Archive Ready, while the sites that have given us lots of problems consistently received low scores. If you're archiving websites, I encourage you to devote a little time to playing around with this nifty tool.

Scott also reported that the Internet Archive is looking to develop new tools to capture social media content and other types of media-heavy content that Heritrix, its Web crawler, simply can't capture properly. To the greatest extent possible, the Internet Archive will integrate these new tools and the content they capture into its existing capture and discovery mechanisms. Capturing social media content is a real challenge (and if I weren't so tired, I would blog about the great social media archiving presentation that Rachel Trent of the North Carolina State Archives gave this afternoon). It's good to see that new options may be on the horizon.

Image: quotation above the doorway to the library of the Church History Library, Salt Lake City, Utah, 14 November 2013. The Church History Library houses archival and library materials that chronicle the history of The Church of Jesus Christ of Latter-day Saints and its members from 1830 to the present.

Wednesday, October 2, 2013

Federal government shutdown: Internet Archive needs your help

Today is the second day that all federal government facilities deemed non-essential have been closed and all federal employees whose jobs have been deemed non-essential have had to stay home. The shutdown has extended -- sometimes in ways that seem to defy common sense -- to the federal government's Web presence. In response to this situation, the tireless souls at the Internet Archive have sprung into action: they're "creating a collection of websites affected by the government shutdown."

At the present time, the only publicly accessible content on many sites is a notice indicating that the agency responsible for the site has suspended operations.

Other federal government entities continue to make large quantities of site content available but stress that the information may not be current, that offices are closed, and that inquiries will not receive prompt attention.

The Internet Archive wants to capture and preserve all of these shutdown notices, contingency plans, and other shutdown-related Web content, and its archivists need your help. If you run across a federal government site that features a shutdown notice or contains information about how the agency will operate during the shutdown, please consider going to the Internet Archive's Government Shutdown Seed URL Nomination Form, entering some basic information about the site and the URL of its home page, and clicking "submit."

If you would like to see whether the Internet Archive has already captured a given site or simply want to check out the collection, the Internet Archive is already making the 2013 Government Shutdown collection available to researchers.

Friday, August 16, 2013

CoSA-SAA 2013: The Web of Sites

 I had the good fortune to attend three great hour-long sessions today:
  • Session 304, Training in Place: Upgrading Staff Capabilities to Manage and Preserve Electronic Records, in which Richard Pearce-Moses (Clayton State University) discussed how online graduate education programs can benefit working archival professionals, Lori Lindberg (San Jose State University) highlighted SAA's new Digital Archives Specialist program, and Sarah Grimm (Wisconsin Historical Society) discussed the educational offerings developed by CoSA's State Electronic Records Initiative project.
  • Session 407,  The Web of Sites: Creating Effective Web Archiving and Collection Development Polices, which is discussed in greater detail below.
  • Session 504, Records Management Training Gumbo for the Digital Age, in which Cheryl Stadel-Bevans (Office of the Inspector General, U.S. Dept. of Housing and Urban Development) facilitated a series of lightning talks given by Jane Zhang (Catholic University of America), Donna Baker (Middle Tennessee State University), Daniel Noonan (Ohio State University), and Lorraine Richards (University of North Carolina at Chapel Hill).
However, I'm desperately in need of sleep, so this post is going to focus solely on Session 407. The Web of Sites: Creating Effective Web Archiving and Collection Development Policies drew a standing-room-only crowd, and with good reason. The three panelists represented three very different institutions with three very different goals: 
  • Olga Virakhovskaya discussed how one collecting repository, the University of Michigan's Michigan Historical Collections (MHC), devised a Web archiving policy that dovetails with its collecting policy, which calls for aggressive collecting and broad documentation of the state's history and culture. In an effort to balance topical importance and the quality of information found on a given site, MHC staff identify sites that are created by individuals and organizations that MHC seeks to document, fill gaps in its holdings, or contain material that falls outside MHC's collecting priorities but nonetheless warrants preservation, and then determine whether the sites contain content that is rich, unique, or new. If a site meets all of these requirements, MHC will archive it. MHC, which uses the California Digital Library's Web Archiving Service, stops archiving sites when no new content has been added for three consecutive years; it will also cease archiving sites upon creator request.
  • Jennifer Wright of the Smithsonian Institution Archives discussed the Archives' efforts to ensure that the 257 websites, 10 mobile sites, 89 blogs, 26 apps, and 578 social media accounts maintained by various Smithsonian entities are managed and preserved appropriately.  The archives is responsible for providing retention guidance to creators, maintaining periodic snapshots of Smithsonian Web resources, and maintaining a registry of Smithsonian social media accounts. It has developed distinct approaches to preserving websites, Intranet sites, and social media accounts:
    • Public websites are generally treated as permanent records, and the Archives tries to crawl them annually, before and after major redesigns, and on days of major events. However, it will attempt to configure Archive-It's crawler to exclude content that is being transferred to the Archives in other formats, is the responsibility of other Smithsonian units, consists of collections (as opposed to organizational records), or merely points to other Web content (see the sketch after this list). Crawls of public sites are made publicly accessible almost immediately after completion.
    • Intranet sites are appraised individually. Given that most Intranet sites block Web crawlers, Intranet content is transferred to the Archives via FTP, hard drive, or other non-crawling mechanism.
    • Most social media accounts are captured once in order to document their existence and show how they are used. After this initial capture, staff reappraise each account annually and recapture it if significant new content is present. Social media content often resists capture, so the Archives uses multiple tools (Archive-It, export tools, and screenshots) as needed.  These captures are not made available online.
  • Rachel Taketa discussed how she created the California Tobacco Control Web Archive (CTCWC), a topical collection of archived sites that complements the University of California, San Francisco's Legacy Tobacco Documents Library (LTDL), which consists of 14 million internal business records created by major tobacco companies. The archive consists of about 90 sites that were captured with the California Digital Library's Web Archiving Service and complement materials found within the LTDL, but most focus on the other side of the tobacco control movement: they were created by public health advocacy organizations and anti-smoking campaigns or relate to proposed tobacco control legislation. A written scope statement establishes the archive's geographic focus (California and anti-smoking campaigns in the state's large metropolitan counties) and collecting priorities (original and/or unique content found in blog posts, interviews, multimedia, sites of established tobacco control groups, and local government sites). Site captures cease when a given site hasn't been updated for a year or when a given issue is no longer relevant; as one might expect, reappraising sites consumes a lot of time.
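A quick aside on the Smithsonian crawler exclusions mentioned above: the underlying idea is URL scoping, i.e., filtering discovered URLs against exclusion rules before they are fetched. The sketch below illustrates the concept in plain Python with invented hosts and patterns; it is not Archive-It's actual scoping syntax or the Smithsonian's real rule set.

```python
# Hypothetical sketch of crawl scoping: filter discovered URLs against
# exclusion rules before they are queued for capture. Hosts and patterns
# below are invented examples, not any institution's actual rules.
import re
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"www.example.edu"}          # hypothetical: the unit's own site
EXCLUDE_PATTERNS = [
    re.compile(r"/collections/"),            # collection content, not org records
    re.compile(r"\.pdf$"),                   # records transferred in other formats
]


def in_scope(url: str) -> bool:
    """Return True if a discovered URL should be queued for capture."""
    if urlsplit(url).netloc not in ALLOWED_HOSTS:
        return False                          # content owned by other units
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


discovered = [
    "https://www.example.edu/about/history.html",
    "https://www.example.edu/collections/object-12345",
    "https://www.example.edu/reports/annual-report.pdf",
    "https://othermuseum.example.org/exhibit",
]
for url in discovered:
    print(url, "->", "capture" if in_scope(url) else "skip")
```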
 My key takeaways from this session:
  • Your Web archiving policy should, to the extent that your resources and Web archiving tools allow, align with your main collecting policy.
  • Just as collecting policies vary from one institution to another, Web archiving policies will vary from one institution to another.
  • Given the speed with which sites change and the frequency with which once-active sites become dormant, reappraisal is a must.  However, it's incredibly time-consuming and we need some tools that will help us analyze the evolution (or lack thereof) of site content over time.
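On that last point, one crude building block for such tooling is comparing content hashes across successive captures of the same site and flagging sites that haven't changed. Here's a minimal sketch that assumes two local snapshot directories from different crawl dates; the paths are hypothetical.

```python
# Minimal sketch of capture-to-capture change detection, to support
# reappraisal of crawled sites. Assumes two local snapshot directories
# (e.g., mirrored copies of a site from different crawl dates).
import hashlib
from pathlib import Path


def hash_snapshot(snapshot_dir: Path) -> dict:
    """Map each file's path (relative to the snapshot root) to a SHA-256 digest."""
    digests = {}
    for path in sorted(snapshot_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(snapshot_dir).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_snapshots(old_dir: str, new_dir: str) -> None:
    old, new = hash_snapshot(Path(old_dir)), hash_snapshot(Path(new_dir))
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = {p for p in old.keys() & new.keys() if old[p] != new[p]}
    if not (added or removed or changed):
        print("No change between captures -- candidate for reappraisal.")
    else:
        print(f"{len(added)} added, {len(removed)} removed, {len(changed)} changed files.")


if __name__ == "__main__":
    # Hypothetical snapshot directories from two crawl dates.
    compare_snapshots("crawls/example-site/2012-11", "crawls/example-site/2013-11")
```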
Image: traces of a rainbow over the West Bank Crescent City Connection, the twin cantilever bridges that span the Mississippi River, in New Orleans, Louisiana, 16 August 2013. Thanks to my friend S.G. for pointing it out to me; I never would have noticed it otherwise.

Wednesday, March 10, 2010

Vancouver Archives, UK Web crawling, Preserving Virtual Worlds, Yamasaki Associates records

Just a few things that have flown over the transom as of late:
  • The City of Vancouver Archives is soon going to have a fully functioning digital archive that makes use of the open source Archivematica digital archiving system. It's also actively reaching out to the open source/hacker community, which it sees as an emerging user group for its datasets and other electronic records. This is pretty cool -- to the best of my knowledge, no one in the American archival community is really seeking to find out how the coding community wants to make use of archival data. (If I'm overlooking someone, please let me know!)
  • The online UK edition of Wired recently posted a great article outlining the copyright and other legal challenges that the British Library faces as it attempts to capture and preserve the nation's Web presence -- a daunting challenge in and of itself. (h/t: Resource Shelf)
  • This (probably) doesn't have much to do with electronic records, but I really want to commend my colleagues at the Archives of Michigan for taking quick, decisive action to save the records of the now-defunct architectural firm of Yamasaki Associates. The firm was founded by Minoru Yamasaki, who designed, among many other important buildings, the World Trade Center complex in New York City and the U.S. National Archives and Records Administration's National Personnel Records Center (Military Personnel Records) in St. Louis, Missouri, which suffered a devastating fire in 1973. Kudos to Michigan State Archivist Dave Harvey and all of the Michigan state government personnel and Society of Architectural Historians administrators who helped to save the firm's records. They were given less than a day to do so, and they rose to the challenge spectacularly well.

Tuesday, August 25, 2009

Preserving GeoCities: Internet Archive needs your help

Remember GeoCities? It was one of the first Web site hosting services, and in the mid-1990s people flocked to it because it enabled non-techies to develop and maintain small, publicly accessible Web sites. Most of them were badly designed or abandoned after a short time, but a few of them were -- and are -- gems. More importantly, when examined as a group, they help to document an important phase in the early history of the World Wide Web.

GeoCities has never been a money maker, and most of its users have moved on to other, more sophisticated hosting services. Earlier this year, Yahoo, which has owned GeoCities since 1999, announced that it will shut down GeoCities on 26 October 2009. GeoCities site creators do have options: they can move their sites into Yahoo! Web Hosting Service or download their GeoCities files and recreate their sites via another hosting service. However, in all likelihood, a lot of GeoCities sites, particularly those that haven't been well-tended as of late, are going to disappear in a couple of months.

In order to ensure that this chapter in the Web's history is properly documented, the good people at the Internet Archive, which has for years been copying GeoCities sites and providing access to its copies, are asking creators and fans of GeoCities sites to submit the sites' URLs. Doing so will allow the Internet Archive to identify GeoCities sites that it hasn't captured during its past sweeps of cyberspace and to copy them before they disappear. If you're a creator of a GeoCities site, a fan of one, or just happen to stumble across one as you make your way through the wilds of the Web, you can help to save digital history with just a few mouse clicks.

The Internet Archive's effort to document GeoCities is taking place in tandem with that of the Archive Team, an alliance of volunteers seeking to preserve at-risk information on the Web, and it's great to see an established repository work with a community-based initiative. Moreover, I'm a huge fan of the Archive Team's succinct (but not necessarily safe for work) mission statement/logo.

Hat tip: the relentless librarians of Resource Shelf.

Friday, June 5, 2009

New York Archives Conference, day one

Grewen Hall, LeMoyne College

Yesterday was the first day of the New York Archives Conference (NYAC), which is being held at LeMoyne College in Syracuse. One of the things I really like about NYAC is its informality: many people either know each other or know of each other’s work, and the atmosphere is intimate and convivial as a result.

Today was jam-packed with sessions and other activities. It started with a plenary session led by Maria Holden (New York State Archives), who outlined how the State Archives has responded to a recent internal theft and left the attendees with the following advice:
  • It is up to you to take ownership of security. It isn’t something that just happens or is the concern of a handful of people.
  • Do not wait until something bad happens. Addressing security issues before trouble occurs helps to avert problems and makes it easier to manage change and secure staff support.
  • Become familiar with security standards and guidelines relating to cultural heritage institutions.
  • Do your due diligence: develop policies and procedures, and document what you have done to improve security.
  • Remember that security is as much about protecting the innocent as it is about protecting collections. Employees need to understand that good security practices help to ensure that they will not become suspects in the event that a theft takes place.
The “Everyday Ethics (or What Do I Do Now?)” session touched on a host of ethics-related issues, and all of the panelists made some great points.
  • Geoff Williams (University at Albany, SUNY) asserted that archivists need to question whether they should use their own holdings when conducting their own scholarly research; any archivist who does so will have to figure out what to do when other scholars want access to the results of their research and determine what to do when other users want to see the records that they’re using.
  • Kathleen Roe (New York State Archives) discussed the thorny issue of collecting manuscripts, ephemera, and artifacts that fall within their own institution’s collection parameters and concluded that the safest course of action is to avoid collecting anything that, broadly defined, falls within the collecting scope of one’s employer; this approach avoids both the actuality and the appearance of impropriety -- and frees one to develop new collecting interests.
  • Trudy Hutchinson (Bellevue Alumnae Center for Nursing History, Foundation of New York State Nurses) discussed how her nursing background informs her understanding of archival ethics and how, as an undergraduate majoring in public history, she had been required to develop a written personal code of ethics. She has since updated and expanded this code, which she discussed with her current employer during the interview process, and would like to see all archives students develop such codes. (This is a great idea for current professionals, too.)
  • Patrizia Sione (Kheel Center, Cornell University) discussed a variety of ethical issues that she has confronted, and noted that she would like to see employers develop written policies relating to scholarly research undertaken by staff. She also emphasized the importance of working with donors to ensure that the privacy of correspondents, etc., is appropriately protected; doing so will ensure that appropriate access restrictions are spelled out in deeds of gift. Finally, she noted that archivists need to be sensitive to the ways in which the pressure to assist researchers with ties to high-level administrators can conflict with their ethical obligation to treat all users equitably.
In the next session, “Can We Afford Not to Act? Strategies for Collection Security in Hard Times,” Richard Strassberg (independent archival consultant) and Maria Holden (New York State Archives) outlined a wide array of low-cost security measures.

Richard Strassberg noted that the recession makes protection of collections particularly important: instances of shoplifting and employee theft are on the rise, and archivists and researchers face the same financial pressures as everyone else. He also noted that the increasing prevalence of online finding aids and digitized images has had mixed results: although they make it easier for honest dealers and collectors to identify stolen materials, they also make it easier for dishonest individuals to home in on valuable materials.

He then outlined what he called “minimal level protection” strategies for cultural institutions, all of which require staff time but don’t cost much:
  • Have a crime prevention specialist employed by the local or state police do an assessment of your facility.
  • Establish links with the local police so that they know that you hold valuable materials.
  • Have a fire inspection conducted (but make sure that your management knows in advance that you’re planning to do so -- the fire department will close your facility if it finds serious problems that management isn’t able to fix).
  • Get a security equipment quote; even if you don’t have the money, the cost might be lower than you expect, and having the quote will give you a fundraising target.
  • Do an insurance review and have your holdings appraised; doing so will help you in the event that you suffer a loss.
  • Protect your perimeter by tightly controlling keys and, if possible, screwing window sashes shut.
  • Avoid drawing attention to valuable materials. Don’t put up red-flag labels (e.g., “George Washington letter”) in your stacks and be cautious about what you display to VIPs and other visitors.
  • Tighten up on hiring. Conduct background checks if you can, and carefully check references by phone.
Strassberg emphasized that these measures will protect collections from “conditionally honest visitors,” but will not guard against thefts by staff. Moreover, they are not sufficient for repositories that hold materials of particular interest to thieves (e.g., collections relating to politics, sports, Native Americans, African Americans, and literary figures); such institutions will likely have to invest in electronic anti-theft technology.

In the event that a theft occurs or is suspected, contact, in the following order: your supervisor (or, if s/he is the suspect, his/her boss), the police, the donor (if applicable and he/she is still around), and your staff. Staff must be cautioned not to talk about the theft with family, friends, or co-workers. Also, develop a local phone tree -- external thieves tend to hit all of the repositories in a region within a short amount of time, and your colleagues will appreciate being informed. Avoid sending out e-mail alerts; you don’t want to document suspicions that might be unfounded.

Strassberg concluded by noting that librarians and archivists must be trained to confront suspected thieves in a legal and appropriate manner -- or to set the process of confrontation in motion by contacting security or the police. They also need to know that they cannot physically prevent anyone from leaving the research room; in New York State, they might be guilty of battery if they attempt to do so.

Maria Holden then focused upon internal theft, which is the most common security threat that archives face. Employee theft is a complex problem, and full understanding of it is hard to come by. Theft is motivated by a variety of factors: personality disorders, gambling or substance abuse problems, retaliation for actual or perceived slights, and feelings of being unvalued.

We need to create a work environment that discourages theft and to control when, where, and how people interact with records; doing so protects not only the records but also innocent people who might otherwise be suspected of wrongdoing. There are several ways we can do so:
  • Hiring should be done carefully and with due diligence. The references of prospective employees should be screened carefully, and their collecting habits should be scrutinized carefully; the results of these checks should be documented. Many archives compel staff to adhere to codes of ethics and sign disclosure statements re: their collecting and dealing habits. The code of ethics developed by the Association of Research Libraries might be a good model.
  • A number of recent thefts have been perpetrated by interns and volunteers. Develop a formal application process for interns and volunteers, document the process, and supervise interns and volunteers at all times.
  • Keep order in your house. There is growing evidence in the literature that disordered environments can encourage delinquent behavior. Order begets respect for collections.
  • Keep collections in the most restricted space possible. The State Archives has looked at every space in which records might be found (research room, scanning lab, etc.) and then figured out when it’s appropriate to bring records into a given space and how long they should remain in it. Develop overarching rules governing removal and return of records to the stacks.
  • Keep collections in the most secure space possible, grant access rights thoughtfully, designate spaces for storage, work, and research, and establish parameters for working hours; many internal thefts occur during off-hours.
During the question and answer period, Kathleen Roe made an important point: Sometimes, people start out honest, then fall prey to gambling or other addictions or personal problems. We have to make it difficult for desperate people to steal from our holdings.

Richard Strassberg also emphasized that the research shows that most people are conditionally honest, i.e., they won’t steal from their friends. We need to create work environments that make people feel valued.

I took part in one of the late afternoon sessions, “The Challenge of the New: Archivists and Non-Traditional Records,” which focused on various electronic records projects at the New York State Archives. Ann Marie Przybyla discussed our new e-mail management publication, Michael Martin detailed our Web crawling activities, and I discussed the processing and description of a series of records relating to the “Troopergate” scandal.

At the end of the day, we went to a reception and a great tour of the LeMoyne College Archives led by College Archivist Fr. Bill Bosch. Afterward, I went out to dinner with my State Archives colleagues Monica Gray and Pamela Cooley, Capital Region Documentary Heritage Program Archivist Susan D’Entremont, and Nathan Tallman, who just graduated from the University at Buffalo’s library school and is a project archivist at the Herschell Carousel Museum. We had a great time, and all of us would recommend Phoebe’s to anyone visiting Syracuse.

Sunday, April 26, 2009

MARAC: Will the Fruit Be Worth the Harvest? Pros and Cons of End of Presidential Term Web Harvesting Projects

We as a profession are still trying to figure out how to deal with Web sites, which exist in the netherworld separating archives and libraries and pose a host of preservation challenges, and this session furnished interesting insight into the contrasting approaches of the U.S. National Archives and Records Administration (NARA) and the Library of Congress (LC).

Session chair Marie Allen (NARA) noted that NARA’s handling of Web records has consistently engendered controversy. Its 2000-01 decision to compel Federal agencies to copy their Web site files, at their expense, at the end of the Clinton administration and transfer them to NARA within eight days of doing so angered agencies, and its 2008 decision not to take a Web snapshot (i.e., a one-time copy) of federal agency sites at the end of President George W. Bush’s second term aroused public concern.

Susan Sullivan (NARA) pointed out that in 2004 NARA had contracted with the Internet Archive to copy publicly accessible federal government Web sites that it had identified and to provide access to the copies, then explained the rationale for NARA’s 2008 decision: it has determined that Web records are subject to the Federal Records Act and must be scheduled and managed appropriately. It issued Guidance on Managing Web Records in January 2005 and has since offered a lot of training and assistance to agencies; some of this information is available on NARA’s Toolkit for Managing Electronic Records, an internet portal to resources created by NARA and many other entities.

Sullivan emphasized that snapshots are expensive, have technical and practical shortcomings, and encourage the agency misperception that NARA is managing Web records. In fact, there is no authoritative list of federal government sites, which means that snapshots fail to capture at least some sites. Moreover, snapshots capture sites as they existed at a given point in time, cannot capture Intranet or “deep Web” content, and are plagued by broken links and other technical limitations. In sum, snapshots do not document agency actions or functions in a systematic and complete manner.

NARA is still copying Congressional and Presidential Web sites, which are not covered by the Federal Records Act. Although these snapshots have all of the problems outlined above, NARA regards them as permanent.

Abbie Grotke (LC) then outlined LC’s response to NARA’s 2008-09 decision: in partnership with the Internet Archive, the California Digital Library, the University of North Texas, and the Government Printing Office, it opted to take snapshots of publicly accessible federal government sites. All of the partners seek to collect and preserve at-risk born-digital government information, and all of them believed that the sites had significant research potential.

The partners developed a list of URLs of publicly accessible federal government sites in all three branches of government; they placed particular emphasis on identifying sites that were likely to disappear or change dramatically in early 2009. They then asked a group of volunteer government information specialists to identify sites that were out of scope (e.g., commercial sites) or particularly worthy of crawling (e.g., sites focusing on homeland security). This process ultimately yielded a list of approximately 500 sites.

The partners took a series of comprehensive snapshots and a number of supplemental snapshots focusing on high-priority sites. Much of this work centered on two key dates -- Election Day and Inauguration Day -- but some copying is still taking place.

Grotke outlined the project’s challenges, which will be familiar to any veteran of a multi-institutional collaborative project. The partners had no official funding for this project and thus have had to divert staff and resources from day-to-day operations. They have also had a difficult time managing researcher expectations: users want immediate access to copied sites, but the indexing process is time-consuming. The partners have also had to accept that, owing to the technical limitations of their software and the possibility that some sites escaped their notice, they could not fully capture every federal government site.

The snapshots have nonetheless captured a vast quantity of information that might otherwise be lost, and the project is also paving the way for future collaborations.

Thomas Jenkins (NARA) then explained how Web sites fit into NARA’s three-step appraisal process, which is guided by Directive 1441 (some of which is publicly accessible):
  • Data gathering. When appraising Web sites, an archivist visits each site and analyzes the information found on it, interviews agency Web administrators, assesses the recordkeeping culture of the creating agency, and determines how the site’s content relates to permanent records in NARA’s holdings.
  • Drafting of appraisal memorandum. The archivist prepares a detailed report that assesses the extent to which the site documents significant actions of federal officials, the rights of citizens, or the “national experience.” The report also examines the site’s relationship to other records identified as permanent (i.e., is the Web site the best and most comprehensive source of information?)
  • Stakeholder review. Each appraisal memorandum is circulated within NARA and then published in the Federal Register in order to solicit agency and public input.
Using a site created by the U.S. Department of Justice as an example, Jenkins highlighted how this process works and why NARA ultimately determined that this site, which contains only a fraction of the information contained within other series deemed archival, did not warrant permanent retention. In contrast, NARA has determined that the site of the U.S. Centennial of Flight Commission warrants permanent preservation because it contains significant information not found in other series.

In response to a comment concerning whether Web snapshots capture how an agency presents itself to the public, Jenkins stated that NARA assesses whether the information presented on a given site is unique. Moreover, NARA is aware that other entities are crawling federal government sites. Although there is a risk that this crawling activity will cease, a risk analysis indicated that archival records and other sources of information amply document the agency’s activities.

Although this session illuminated how and why NARA and LC reached such sharply contrasting decisions and highlighted some resources that somehow escaped my attention, it underscored precisely why the profession hasn't reached any sort of consensus and is unlikely to do so in the near future. Many if not most state and local government archives lack the degree of regulatory authority afforded by the Federal Records Act, and as a result many of them will not want to rely upon the kindness of site creators. Archivists working in repositories with broad collecting missions may have great difficulty ensuring that creators properly maintain, copy, and transfer site files. Moreover, some archivists will doubtless differ with NARA's conclusion that documenting how site creators presented themselves to the public is not sufficient reason to take periodic Web site snapshots or otherwise preserve sites comprehensively. As a result, many of us will likely find LC's approach to federal government sites or NARA's handling of Congressional and Presidential Web sites more relevant to our own circumstances than NARA's treatment of executive-branch agency sites.

Monday, January 26, 2009

Preserving Web sites

I really wish that I had been able to find more time to blog about all of the stuff that happened last week . . . .

First, we had some momentous changes at the federal level. As any archivist who hasn't had his or her head in the sand knows, the very first executive order that President Obama signed overturns President Bush's dreaded E.O. 13233 and should facilitate the timely release of presidential records. President Obama also signed a memorandum reminding the heads of all federal agencies that "the Freedom of Information Act should be administered with a clear presumption: In the face of doubt, openness prevails." The archival blogosphere and listservs have been chock-full of commentary about these developments, and I really don't have much to add to the discussion at this point; suffice it to say that I'm really, really glad that E.O. 13233 is gone.

Yesterday, an opinion piece penned by Lynne Brindley, the head of the British Library, appeared in the Guardian. Noting that "personal digital disorder" -- our unwillingness or inability to save the digital photos and other electronic materials we create in a way that ensures their long-term survival -- threatens "to leave our grandchildren bereft," she asserts:
As chief executive of the British Library, it's my job to ensure that this does not extend to our national memory. At the exact moment Barack Obama was inaugurated, all traces of President Bush vanished from the White House website, replaced by images of and speeches by his successor. Attached to the website had been a booklet entitled 100 Things Americans May Not Know About the Bush Administration - they may never know them now. When the website changed, the link was broken and the booklet became unavailable.

The 2000 Sydney Olympics was the first truly online games with more than 150 websites, but these sites disappeared overnight at the end of the games and the only record is held by the National Library of Australia.

These are just two examples of a huge challenge that faces digital Britain. There are approximately 8 million .uk domain websites and that number grows at a rate of 15-20% annually. The scale is enormous and the value of these websites for future research and innovation is vast, but online content is notoriously ephemeral.

If websites continue to disappear in the same way as those on President Bush and the Sydney Olympics - perhaps exacerbated by the current economic climate that is killing companies - the memory of the nation disappears too. Historians and citizens of the future will find a black hole in the knowledge base of the 21st century.

Brindley goes on to point out that, popular assumptions to the contrary, Google and other commercial entities are simply not capturing and preserving the "notoriously ephemeral" but immensely valuable information found on the Web:
. . . . The task of capturing our online intellectual heritage and preserving it for the long term falls, quite rightly, to the same libraries and archives that have over centuries systematically collected books, periodicals, newspapers and recordings and which remain available in perpetuity, thanks to these institutions.
She then details the British Library's efforts to digitize some of its paper treasures, ensure the preservation of Web sites relating to the 2010 Olympic Games in London, and, "with appropriate regulation . . . create a comprehensive archive of materials from the UK Web domain."

Brindley is absolutely right, but there really is something missing from this article: an explanation of why libraries and archives must carry out this particular mission. I'm not faulting Brindley for this omission. The editors of the Guardian no doubt had a substantial amount of say in determining the length and overall content of this piece, and Brindley is using her ration of words to link the British Library's activities to a highly anticipated government report on the future of "digital Britain."

However, without explicit discussion of the role that libraries and archives play in preserving cultural heritage materials over very long periods of time and ensuring that the materials in their possession are authentic and unaltered, many people simply won't grasp why it's important that institutions such as the British Library are seeking to preserve electronic materials. A cursory glance at the comment section associated with this article illustrates precisely why this focus on the long term and on authenticity is needed: one of the commenters states that a copy of "100 Things Americans May Not Know About the Bush Administration" is currently available on another publicly accessible Web site and that the British Library simply doesn't know how to use Google. However, the commenter doesn't seem to have thought about whether the copy found on this Web site has been altered or whether the site itself will be around in 10 years, let alone 100 or 1000 years; given that s/he also links to a parody of this booklet, s/he may simply have tongue planted firmly in cheek, but there is no way to tell from the comment alone.

Unless librarians and archivists do more to jolt people out of their present-minded view of the Web and digital materials generally and to underscore the importance of safeguarding the integrity of digital information that warrants long-term preservation, we're going to find it harder and harder to secure the resources we need to preserve and provide access to it. There simply is no alternative to emphasizing -- loudly and insistently -- that we seek both to serve today's researchers and to lay the foundation needed to ensure that future generations of archivists and librarians will be able to serve future generations of researchers.

Saturday, August 30, 2008

SAA: Second day of sessions

NB: Sessions occupied only one time slot today.

Digital Revolution, Archival Evolution: An Archival Web Capture Project
Dean Weber (Ford Motor Company), Judith Endelman (Henry Ford Museum), Pat Findlay (Ford.com), and Reagan Moore (University of North Carolina at Chapel Hill) discussed their joint effort to use Web crawling software to create preservation copies of the main Web site (www.ford.com) maintained by the Ford Motor Company.

As Findlay emphasized, this site is extremely large and complex: the site contains content created by many different Ford units, pulls content from a large number of different feeds, has Flash and non-Flash and high- and low-speed versions, and has features that allow people to view cars by color, passenger number, etc. As a result, there are literally millions of different page combinations. Moreover, it has strong anti-hacking protection and is hosted on geographically dispersed servers located throughout the world.

The Henry Ford Museum, which wanted to preserve periodic snapshots of the site, worked with the San Diego Supercomputer Center (where Moore worked until a very short time ago) to conduct three crawls of the site and store and furnish access to the results. In an effort to improve results, staff from the Henry Ford Museum and SDSC consulted with Ford's IT staff; as Endelman noted, everyone entered into this project thinking that it was about technology, but it was really about management, people, and relationships.

Moore furnished a great overview of the various challenges that the group encountered over the course of the project, and he explicitly linked them to the traditional archival functions of:
  • Appraisal--understanding what was actually present in the Web site and deciding what to preserve;
  • Accessioning--using a crawler to produce copies of the site and place the copies into a preservation environment;
  • Description--gathering essential information needed to identify and access the crawl and system metadata guaranteeing authenticity, etc.
  • Arrangement--preserving the intellectual arrangement of the files and determining their physical arrangement (SDSC actually bundles the files into a single TAR file, which means that it needs to maintain checksums, etc., for only one file per crawl; see the sketch after this list. The iRODS software that SDSC developed can search within TAR files and pull up content as directed);
  • Preservation--determining whether to store, e.g., banners indicating the archival status of the files, with the files or in a separate location;
  • Access--enabling people using multiple browsers on multiple platforms to examine the files.
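To make the TAR-bundling point above a bit more concrete, here's a minimal sketch of packing one completed crawl into a single tarball and recording a single checksum for the bundle, using only the Python standard library. The paths are hypothetical, and this illustrates the idea rather than SDSC's or iRODS's actual implementation.

```python
# Minimal sketch of bundling one crawl into a single TAR file and
# recording a single checksum for the bundle. Paths are hypothetical;
# this illustrates the idea, not SDSC's or iRODS's implementation.
import hashlib
import tarfile
from pathlib import Path


def bundle_crawl(crawl_dir: str, bundle_path: str) -> str:
    """Pack a crawl directory into an uncompressed TAR and return its SHA-256."""
    Path(bundle_path).parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(bundle_path, "w") as bundle:
        bundle.add(crawl_dir, arcname=Path(crawl_dir).name)

    digest = hashlib.sha256()
    with open(bundle_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    checksum = digest.hexdigest()

    # One fixity record per crawl, instead of one per captured file.
    Path(bundle_path + ".sha256").write_text(f"{checksum}  {Path(bundle_path).name}\n")
    return checksum


if __name__ == "__main__":
    # Hypothetical crawl directory and bundle destination.
    print(bundle_crawl("crawls/ford-com-2008-06", "bundles/ford-com-2008-06.tar"))
```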
I've done quite a bit of Web crawling, and I'm glad to learn that Moore and other researchers are actively trying to figure out how to capture content that current crawlers can't preserve (e.g., database-driven content and Flash). The session was nonetheless a bit disheartening: even with the active cooperation of Ford's IT staff and the involvement of visionary computer scientists, Web crawling remains an imperfect technology. However, for those trying to preserve large sites or large numbers of sites, it nonetheless remains the best of a bunch of bad options.