Showing posts with label redaction. Show all posts

Friday, February 13, 2015

Jeb Bush's e-mail, continued

Earlier this week, Jeb Bush made available online hundreds of thousands of emails he sent and received during his tenure as Florida's governor. As I noted yesterday, the emails Bush's organization placed online contained Social Security numbers, home addresses and phone numbers, and a wealth of other personal information about private citizens. In the wake of the controversy, the Bush camp pledged to review and redact the e-mails, which are identical to the unredacted e-mails held and made accessible to researchers by the Florida State Archives.

Earlier today, Fortune reported that the e-mails contained approximately 13,000 Social Security numbers and that roughly 12,500 of these numbers were housed within a spreadsheet embedded within a PowerPoint presentation attached to a message that Governor Bush and approximately 50 other people received in October 2003. The other 500 are scattered throughout the correspondence. The Bush team has been able to use software to identify and redact approximately 400 of them, but as of earlier today approximately 100 were still available online because they don’t conform to the usual XXX-XX-XXXX pattern and thus can’t be easily found.
Fortune also reported that a spokesperson for the Florida Department of State, of which the Florida State Archives is part, stated that: “the Department of State is currently reviewing our process for redacting confidential information from documents given to the State Archives.” Ouch.
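
The gap between the numbers that software caught and the ones it missed comes down to pattern matching. Here is a minimal Python sketch of the problem, not a description of whatever software the Bush team actually used. The looser pattern and the sample text are my own assumptions for illustration. A strict pattern finds only the XXX-XX-XXXX layout, while a looser one also flags nine-digit runs with spaces, dots, or no separators, at the cost of more false positives that a human has to review.

```python
import re

# Strict pattern: the usual XXX-XX-XXXX layout.
STRICT_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Looser pattern (an assumption for illustration): allow a hyphen, dot,
# space, or nothing at all between the three groups of digits.
LOOSE_SSN = re.compile(r"(?<!\d)\d{3}[-. ]?\d{2}[-. ]?\d{4}(?!\d)")


def flag_candidates(text):
    """Return (strict_hits, loose_only_hits) found in text."""
    strict = STRICT_SSN.findall(text)
    loose = LOOSE_SSN.findall(text)
    # Candidates the strict pattern would have missed entirely.
    loose_only = [hit for hit in loose if hit not in strict]
    return strict, loose_only


# Made-up sample text; none of these are real numbers.
sample = "SSN 123-45-6789 on file; spouse 987 65 4321; child 111223333."
strict_hits, loose_only_hits = flag_candidates(sample)
print(strict_hits)
print(loose_only_hits)
```

Anything the looser pattern flags is only a candidate: nine digits in a row could just as easily be an account or case number, so the output is a review list, not a redaction list.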

To add insult to injury, ComputerWorld notes that the Microsoft Personal Storage Table (PST) versions of the Bush e-mails that the Florida State Archives disclosed to researchers and that were, for a short time, made available for downloading on the Bush e-mail site contain a number of old viruses and Trojan Horse applications. Most of them pose little threat to anyone who has a newer computer and up-to-date anti-virus software, but they might cause problems for people who have older machines or don't have anti-virus software installed.

Thursday, February 12, 2015

Jeb Bush's e-mail

On Monday, former Florida governor Jeb Bush placed online copies of hundreds of thousands of e-mails he sent and received while in office. Bush is actively exploring the possibility of running for President and has stated that he released the messages to show his commitment to transparency and his embrace of information technology; many political observers have concluded that the release is also meant to prove that he's a dedicated, responsive, and effective executive. Things did not go quite as planned, and the resulting uproar ought to be of interest to any government archivist who might accession electronic records that contain legally restricted information, respond to FOI requests for born-digital or digitized records, or confront the sweeping records requests that invariably occur whenever a former official seeks higher office.

As soon as the e-mails were released, tech journalists and bloggers began exploring the search interface that Bush's staff created and the contents of the messages their searches yielded. They found thousands of Social Security numbers, home addresses, and tons of other personal data that had not been redacted. The Verge, Ars Technica, Buzzfeed, and a host of other media outlets quickly redacted and published copies of numerous e-mails that contained such information, and Bush and his staff quickly promised that they would remove Social Security numbers and other personal data. However, the e-mails – in searchable database form as well as downloadable Microsoft Personal Storage Table (PST) files – were freely available online for almost a day before the Bush team decided to take action.

Bush and his staff were also quick to point fingers. Yesterday, Bush told reporters in Tallahassee that the messages were public records held by the Florida State Archives (which is part of the state's Department of State) and that he and his staff had merely "released what the government gave us." The Bush team also revealed that in May 2014, an attorney representing Bush sent a letter to an unidentified state official asserting that the state was responsible for redacting any legally restricted information found within the e-mails:
"We hope these emails will be available permanently to the public, provided the records are first reviewed by state officials in accordance with Florida Statute to ensure information exempt from public disclosure is redacted before release, including social security numbers of Florida citizens who contacted Governor Bush for assistance; personal identifying information related to victims of crime or abuse; confidential law enforcement intelligence; and other information made confidential or exempt by applicable law."
The Florida State Archives holds 26.2 gigabytes of Bush's gubernatorial e-mail, and the catalog record describing the correspondence indicates that the records consist of "PST files" that "must be loaded onto user's hard drive and opened using MS Outlook software." The catalog record makes no mention of access restrictions, and unredacted copies of the files have evidently been disclosed to other researchers. Yesterday, National Public Radio (NPR) reported that many of the e-mails Bush released on Monday had first been disclosed to reporters shortly after they were created or received and that several media organizations, NPR among them, had previously obtained copies of the full set from the Florida State Archives.

At this point in time, I am not going to second-guess or condemn the Florida State Archives. I simply don't know enough about Florida's Sunshine Law, which is more expansive than many other state freedom of information laws, or the Florida State Archives' disclosure protocols to come to any sort of informed conclusion. I do know that the Sunshine Law for the most part bars the disclosure of Social Security numbers, but many freedom of information laws mandate that previously disclosed information cannot be withheld for any reason; given that many of these e-mails had been disclosed to reporters while Bush was in office, the Florida State Archives might have no choice but to release them without redacting them. To date, no one from the Florida State Archives or Florida Department of State has commented upon this matter, but I hope that some sort of explanation will eventually be made.

I am more willing to second-guess Jeb Bush and his associates. As the Miami Herald has pointed out, the May 2014 letter written by Bush's attorney strongly suggests that Bush has been seriously thinking about running for president for quite some time. To my way of thinking, it also suggests that Bush or, at the very least, his lawyers knew that the e-mail contained legally restricted information, decided that the State of Florida was solely responsible for redacting it prior to disclosure, and figured that it was ethically okay to make information that Florida couldn’t or wouldn’t redact a lot easier to find. Requesting a PST file from the Florida State Archives and importing it into Microsoft Outlook doesn’t require a ton of effort or technical know-how, but at least some of the people who are now idly rummaging through the searchable Web database of e-mails created by the Bush campaign probably wouldn’t feel the need to make the effort. Manual redaction and review of e-mail is a pain – trust me on this – but there are numerous tools that will flag and facilitate redaction of Social Security numbers, telephone numbers, and other consistently formatted data. Why didn't the Bush camp make even a modest attempt to weed out the Social Security numbers?
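
To give a sense of how modest a "modest attempt" could be, here is a minimal Python sketch of pattern-based redaction. It is not any particular commercial tool, and the two patterns, the placeholder labels, and the sample message are assumptions I made up for illustration. Real tools use far broader rule sets, and the output should always be reviewed by a person.

```python
import re

# Illustrative patterns only (assumptions, not a complete rule set).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b"),
}


def redact(text):
    """Replace each match with a labeled placeholder and count the hits."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label} REDACTED]", text)
        counts[label] = n
    return text, counts


# Made-up message; the numbers are placeholders, not real data.
message = "My SSN is 123-45-6789; call me at (850) 555-0123 or 850-555-0199."
clean, counts = redact(message)
print(clean)
print(counts)
```

The per-label counts are there so a reviewer can spot-check the results: a message with zero hits may still contain restricted information in a format the patterns don't cover.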

Finally, I must be a bit skeptical about the Bush camp's claims of transparency: the Tampa Bay Times recently reported that Bush used a private e-mail account to conduct all state business and transferred only some of the messages associated with this account to the archives when he left office. Specifically, all messages relating to “politics, fundraising, and personal matters” were removed prior to transfer. I have no problem with purging messages relating to purely personal matters, but the removal of messages relating to political affairs and fundraising efforts raises a few questions in my mind. How were these messages identified? Were they identified as they were sent or received, or was there a massive end-of-term review effort? If the latter, who was involved in the review and what criteria were employed? And, of course, why didn't Bush use a state government e-mail account to conduct state business?

Friday, September 16, 2011

SAA 2011: Skeletons in the Closet

This just about beats the record for tardy posting, but below you'll find the slides from my Society of American Archivists presentation, which was part of Session 101, "Skeletons in the Closet: Addressing Privacy and Confidentiality Issues for Born-Digital Materials." In it, I outline the current climate in which government archives operate, discuss how my repository responded to two sweeping freedom of information requests, and detail some of the lessons we learned as a result of these experiences.

Personal Privacy and Freedom of Information in the Digital Age: Challenges and Strategies for Government A...

I'll have a post concerning session 705, "Theft Transparency in the Digital Age: Stakeholder Perspectives," up later this weekend.

Tuesday, June 1, 2010

CNN's bad e-redaction

Earlier today, Al and Tipper Gore announced that they were separating after 40 years of marriage. As if this weren't awful enough, CNN posted online a PDF copy of the e-mail announcement that they sent to their friends and supporters . . . but without properly redacting their private e-mail address. Someone at CNN apparently drew a black box over the e-mail address, but didn't remove the underlying metadata. As a result, CNN readers who clicked on the black box were able to view the e-mail address hidden underneath. The post has since been revised to exclude the e-mail address, but the original version of the post was up on CNN's site for a couple of hours. Ugh.

Regular readers of this blog will recognize that proper redaction of PDF files is one of my pet causes. If you ever need to redact a PDF file, here are a few tips that should help ensure that you won't end up like CNN . . . or the U.S. Transportation Security Administration, the U.S. Department of Defense, the Washington Post, the New York Times, Google, Facebook, or any of the other organizations that have been stung by their own (or their lawyers') lack of technical knowledge.

Tuesday, December 8, 2009

TSA's bad PDF redaction . . . and tips on redacting PDFs properly

The Transportation Security Administration (TSA) is the latest in a long line of Fortune 500 companies and federal government agencies to discover that information can all too easily be recovered from an improperly redacted PDF document. On Sunday, blogger The Wandering Aramean announced that the TSA had posted a copy of its Screening Management Standard Operating Procedure manual, which provides detailed information about how TSA personnel screen passengers and luggage, on a federal contract solicitation Web site.

Portions of the manual, which is identified as containing Sensitive Security Information, were redacted, but . . . whoever did the redactions simply used Adobe Acrobat or other PDF-compatible software to draw black boxes over the information that should have been redacted. As I've noted before, it doesn't take tons of computer know-how to recover the information hiding under those black boxes, and The Wandering Aramean and lots of other people were able to do so. The TSA has pulled the manual off the federal contract site, but you can find a complete and unredacted copy here and on lots of other sites.

The TSA has stated that the version of the manual it posted has been superseded repeatedly, that it was never actually used by TSA personnel, and that TSA security procedures have changed substantially since it was written. However, the damage has been done: the blogosphere and the news media are having a field day, and Congress is demanding an investigation. I know that beating up on the TSA is something of a sport (and, believe me, I have some issues with its 3-1-1 policy), but I really do feel for the folks at TSA HQ who have to clean up this mess.

Putting poorly redacted PDFs on the Web seems to be something of a fad these days -- Google did it a few weeks ago -- but I don't want to see archivists or records managers fall prey to the pitfalls that have ensnared so many others. If you're trying to figure out how to provide access to PDFs that contain information restricted by law or donor agreement, here are a few pointers:
  • If you're working with a PDF file, never, ever use Adobe Acrobat's Draw or Annotate tools (or comparable tools in other programs) to place black, white, etc. boxes over the information you wish to redact. All a savvy user needs to do is to copy the PDF in its entirety and paste it into a word processing document. Moreover, someone with ready access to Adobe Acrobat or comparable software can skip the copying and pasting and simply open the PDF and remove the boxes that you drew. Don't think that locking your PDF will keep this from happening: shareware that promises to unlock PDFs is all over the Interwebs.
  • If you're working with a word processing document that you plan to convert to PDF format, never, ever attempt to redact information by changing the font color to white or using a shading or highlighting feature to obscure the text and then converting the document to PDF format. The copy-and-paste technique outlined above will reveal the hidden text; users might have to play with the font colors a bit, but doing so won't take them more than a few seconds.
At present, there are several good tools for redacting PDF files, and you'll need to assess your current software setup, the amount of redaction work you'll have to do, and your budget in order to decide which one works best for you.
  • If you've got an older version of Acrobat, two third-party plug-ins for Adobe Acrobat, Redax and Redact-It, are time-tested and have substantial followings in the legal community.
  • If you are using an older version of Adobe Acrobat and can't or don't want to upgrade or purchase an add-on tool, the National Security Agency has produced a document that outlines a laborious but effective redaction procedure.
  • If you've got an old version of Acrobat, no money for an upgrade or a plug-in, and only a handful of documents to redact, you might want to consider printing out the documents, whipping out a black magic marker, and redacting information the old-fashioned way. Photocopy the redacted printouts to reduce the chance that the text can be read through the marker, then scan the photocopies.
If you do commit to redacting documents electronically:
  • Make sure you know how to use your chosen redaction tool. Most of them are pretty straightforward, but slip-ups are possible, and you don't want slip-ups circulating on the Web. All of the software tools listed above are well-documented, so take the time needed to review and digest said documentation.
  • Prepare a test file and familiarize yourself with your chosen software tool before you start working with real live documents. If you can get a disinterested third party (preferably one with lots of IT or digital forensics experience) to review your test file and verify that the information you've redacted really is gone, by all means do so.
  • This may seem a bit obvious, but someone once asked me, so I'm going to come right out and say it: don't redact your original e-documents. Chances are, your documents will one day be fully disclosable, so make electronic copies of them, redact the copies, and keep both the copies and the originals. Doing so increases your storage and preservation commitments, but there really aren't any good alternatives, particularly for records warranting permanent retention.
  • Keep abreast of the relevant legal and digital forensics literature: people are trying to figure out how to "break" all of the tools listed above and recover information redacted with these tools. One of them may eventually succeed, at which point all bets are off.
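
The test-file check described above can be partly automated. The following is a minimal Python sketch, assuming you have already pulled the text out of your redacted copy (for example, with the select-all, copy, and paste trick) and that you have a list of the values you meant to remove. The function name and sample strings are hypothetical.

```python
def find_leaks(extracted_text, sensitive_values):
    """Return the sensitive values that still appear in extracted_text."""
    # Collapse whitespace and ignore case so that line wraps in the
    # pasted text don't hide a match.
    haystack = " ".join(extracted_text.split()).lower()
    leaks = []
    for value in sensitive_values:
        needle = " ".join(value.split()).lower()
        if needle and needle in haystack:
            leaks.append(value)
    return leaks


# Made-up text standing in for what was pasted out of a "redacted" PDF.
text_from_copy = "Screening procedures\nContact: jane.doe@example.org\nfor details."
leaks = find_leaks(text_from_copy, ["jane.doe@example.org", "123-45-6789"])
print(leaks)
```

An empty result only means that these particular strings didn't turn up in the extracted text. It says nothing about document metadata, embedded files, or scanned images, so it supplements a knowledgeable human reviewer instead of replacing one.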
Finally, a gentle disclaimer: the above information is . . . simply information, not legal, financial, medical, dental, or any other kind of advice. As is the case with everything on this blog, it's not necessarily reflective of the opinions and policies of my employer, either. It does reflect my own knowledge at the time of this writing, but, as is the case with all things electronic, electronic redaction technology and best practices change rapidly. It's really up to you to investigate the options for yourself and to make sure that the electronic information you redact really can't be recovered.

Happy redacting!

Monday, November 2, 2009

MARAC Fall 2009, S1: Solutions to Acquiring and Accessing Electronic Records

Pavonia Arcs, by Robert Pfitzenmeier (2004), Newport, Jersey City, 29 October 2009.

Along with Ricc Ferrante (Smithsonian Institution Archives) and Mark Wolfe (M.E. Grenander Department of Special Collections and Archives, University at Albany), I had the good fortune to participate in this session, which was graciously chaired by Sharmila Bhatia (U.S. National Archives and Records Administration).

Ricc Ferrante discussed the challenges of accessioning and preserving archival e-mail created by employees of the Smithsonian Institution's semi-autonomous museums and research institutes. His experience should resonate with many government and college and university archivists. Until late 2005, the Smithsonian's component facilities used a variety of e-mail applications, and retention guidelines were implemented in 2008. As a result, the archives is both actively soliciting transfers of cohesive groups (i.e., accounts) of documented and backed-up messages at predetermined intervals and passively accepting transfers of older groupings of records in a variety of formats.

Ricc then discussed the processing of these e-mails, which is performed on PC or Mac desktop computers. Incoming transfers are backed up, analyzed and documented, converted to a preservation format, and securely stored. The Smithsonian Institution Archives uses a tool to convert accounts or groupings of messages in formats other than MBOX to the MBOX format, and the Collaborative Electronic Records Project (CERP) parser then converts the MBOX files to an XML-based preservation format. Experimenting with the MBOX conversion tool and the CERP parser has been on my to-do list for some time, so I was really glad I got the chance to hear Ricc discuss these tools.
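
For readers who haven't worked with MBOX files, here is a minimal Python sketch of the general idea of walking an MBOX file and writing its messages out as XML. This is not the CERP parser, and the element names below are invented for illustration and are not the CERP schema. It uses only Python's standard `mailbox` and `xml.etree.ElementTree` modules and ignores attachments and multipart bodies.

```python
import mailbox
import os
import tempfile
import xml.etree.ElementTree as ET


def mbox_to_xml(mbox_path):
    """Build a simple XML tree with one <message> element per MBOX message."""
    root = ET.Element("account")  # hypothetical element name
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            node = ET.SubElement(root, "message")
            for header in ("From", "To", "Date", "Subject"):
                ET.SubElement(node, header.lower()).text = str(msg.get(header, ""))
            payload = msg.get_payload()
            # Multipart payloads come back as a list; this sketch skips them.
            body = payload if isinstance(payload, str) else ""
            ET.SubElement(node, "body").text = body
    finally:
        box.close()
    return root


# Demo with a one-message MBOX file written to a temporary directory.
sample = (
    "From sender@example.org Thu Oct 29 10:00:00 2009\n"
    "From: sender@example.org\n"
    "To: archives@example.org\n"
    "Subject: Transfer\n"
    "\n"
    "Hello\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "account.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(sample)
    tree = mbox_to_xml(path)

print(ET.tostring(tree, encoding="unicode"))
```

A real preservation workflow also has to deal with attachments, character encodings, and fixity information, which is exactly why purpose-built tools exist.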

Mark Wolfe discussed how the M.E. Grenander Department of Special Collections and Archives is using Google Mini, a modestly priced "plug and play" search appliance that will index up to 300,000 documents, to improve access to its student newspapers. Prior to the installation of Google Mini, a paper card file was the only access mechanism for these publications, and Google Mini has made it possible for staff to find information about people who became prominent well after they left the university (e.g., gay rights activist Harvey Milk, '51), respond quickly to reference inquiries, and enhance access to the newspapers.

Mark also highlighted the shortcomings of Google Mini's indexing of digitized materials. When assigning titles, it looks for the most prominent text on a given page, which in a newspaper may be part of an ad, not a story. Dates are another problem. When sorting search results by date, it homes in on the date the digital file was created, not the date of the scanned original. The former problem can be corrected, albeit with considerable effort, by manually changing the author, title, etc. properties of the files, which are in text-based PDF format. However, the date properties, which help to safeguard the authenticity of born-digital files, cannot easily be changed and thus inhibit date-based access to scanned archival materials. There's been a lot of talk lately about how the management of born-digital and born-again digital materials will eventually converge, but Mark's talk is a good reminder that we're not quite there yet.

My presentation concerned our capture of New York State government sites and the redaction (i.e., removal of legally restricted information from records prior to making them accessible) of electronic records converted to PDF format. In lieu of giving an exhaustive recap, I'll just offer a few words of advice to people contemplating electronic redaction. At present, there are several good tools for redacting PDF files, including the built-in tool bundled with Adobe Acrobat 8 and 9, Redax, and Redact-It. If you are using an older version of Adobe Acrobat and can't or don't want to upgrade or purchase an add-on tool, the National Security Agency has produced a document that outlines a laborious but effective redaction procedure. If you commit to electronic redaction, you need to keep abreast of the relevant legal and digital forensics literature: people are trying to figure out how to crack these tools and techniques and recover redacted information, and one of them may eventually succeed.

There are also several really bad PDF redaction techniques. Never, ever use Adobe Acrobat's Draw or Annotate tool to place black, white, etc. boxes over information you wish to redact. Another spectacularly bad idea: "redacting" a word processing document by changing the font color to white or using a shading or highlighting feature to obscure the text and then converting the document to PDF format.

Want to know why these options are so bad? Read this. And this. And this. And this. And this. And this, too (thanks to John J. @ W&L for drawing my attention to this recent blunder.)