Thursday, November 14, 2013

Best Practices Exchange 2013: Web archiving

I've been up for a long time and I'm a little under the weather, so I'm going to highlight a couple of cool things that Scott Reed of Archive-It shared this morning and then call it a day:

First, Scott highlighted a tool that's new to me: computer systems engineer Vangelis Banos has developed Archive Ready, a site that enables website creators and people archiving websites to assess the extent to which a given site can be archived using Web crawling software. Simply type the URL of the site into a text box and click a button, and Archive Ready assesses:
  • The extent to which the site can be accessed by Web crawlers
  • The cohesion of the site's content (i.e., whether content is hosted on a single resource or scattered across multiple resources and services) 
  • The degree to which the site was created according to recognized standards
  • The speed with which the host server responds to access requests
  • The extent to which metadata that supports appropriate rendering and long-term preservation is present
Archive Ready also runs basic HTML/CSS validity checks, analyzes HTTP headers, highlights the presence of Flash and QuickTime objects and externally hosted image files, looks for sitemap.xml files, RSS links, and robots.txt files, and determines whether the Internet Archive has crawled the site and converted the results to WARC format.
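
For readers who want a feel for what a few of these checks involve, here is a minimal Python sketch. It is my own illustration, not Archive Ready's code, and I don't know how the tool is actually implemented. The names `ArchivabilityHints` and `crawler_allowed` are hypothetical, and the sketch uses only the Python standard library. It covers just three of the checks listed above: whether robots.txt rules let a crawler fetch a URL, whether a page embeds plug-in objects such as Flash or QuickTime, and whether it pulls in externally hosted images or advertises an RSS feed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class ArchivabilityHints(HTMLParser):
    """Hypothetical helper: collects a few crawler-relevant facts from one page."""

    FEED_TYPES = ("application/rss+xml", "application/atom+xml")

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.plugin_objects = []   # <object>/<embed> sources (Flash, QuickTime, ...)
        self.external_images = []  # images served from a different host
        self.feed_links = []       # RSS/Atom feeds advertised via <link>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("object", "embed"):
            self.plugin_objects.append(attrs.get("data") or attrs.get("src") or "")
        elif tag == "img" and attrs.get("src"):
            src = urljoin(self.page_url, attrs["src"])
            if urlparse(src).netloc != self.host:
                self.external_images.append(src)
        elif tag == "link" and attrs.get("type") in self.FEED_TYPES:
            self.feed_links.append(urljoin(self.page_url, attrs.get("href") or ""))


# Example with made-up inputs (example.org and cdn.example.net are placeholders).
robots = "User-agent: *\nDisallow: /private/\n"
page = (
    '<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '</head><body><img src="/logo.png"><img src="http://cdn.example.net/pic.jpg">'
    '<embed src="movie.swf"></body></html>'
)

hints = ArchivabilityHints("http://example.org/")
hints.feed(page)
print(crawler_allowed(robots, "http://example.org/private/report.html"))
print(hints.plugin_objects, hints.external_images, hints.feed_links)
```

A real assessment would fetch these files over the network and weigh many more factors; the sketch works on strings you supply so that it can be tried without touching a live site.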

Archive Ready has apparently engendered some controversy, but it's a handy resource for anyone seeking to capture and preserve Web content. Based upon my very limited experience with Archive Ready, which is still in beta testing, it might not be able to locate streaming videos located deep within large websites. However, its overall assessments seem pretty accurate: I entered the URLs of several sites that we've crawled repeatedly, and the sites that we've been able to capture without incident consistently received high scores from Archive Ready, while the sites that have given us lots of problems consistently received low scores. If you're archiving websites, I encourage you to devote a little time to playing around with this nifty tool.

Scott also reported that the Internet Archive is looking to develop new tools to capture social media content and other types of media-heavy content that Heritrix, its Web crawler, simply can't capture properly. To the greatest extent possible, the Internet Archive will integrate these new tools and the content they capture into its existing capture and discovery mechanisms. Capturing social media content is a real challenge (and if I weren't so tired, I would blog about the great social media archiving presentation that Rachel Trent of the North Carolina State Archives gave this afternoon). It's good to see that new options may be on the horizon.

Image: quotation above the doorway to the library of the Church History Library, Salt Lake City, Utah, 14 November 2013. The Church History Library houses archival and library materials that chronicle the history of The Church of Jesus Christ of Latter-day Saints and its members from 1830 to the present.


kntonas said...

Interesting article. Could you please elaborate more about the following two sentences of your post?
a) "Archive Ready has apparently engendered some controversy". What's the point of disagreement?
b) "it might not be able to locate streaming videos located deep within large websites". Could you please explain further?

l'Archivista said...

kntonas, I suggest you contact Scott Reed re: the "controversy" surrounding Archive Ready; I was merely reporting what he said. As far as Archive Ready's ability to locate streaming media embedded within large sites, my comments were based on my own very limited testing: Archive Ready indicated that did not contain any streaming video, but the site contains dozens if not hundreds of them (see, e.g.,