Epic PDF Redaction Fails, a Horror Story

Read this article in French.

Many tools are available to redact electronic documents, but not everyone is familiar with them. Today's post will show you the redaction mistakes that made the news in the last years. Don't reproduce this at home!

Table of Contents

What is redaction?

In its most neutral definition, redaction is the process of permanently removing visible elements from a document.

We’ve already written about redaction in PDF files in two previous articles:


In today’s post, we will show you the errors you should not reproduce when redacting a PDF. 
But before that, we’ll go through a bit of the history of this technique. 

Etymology

According to Merriam-Webster, the verb redact has been in use in English since the 15th century, deriving from the Latin redigere “to drive, lead, or bring back, get together, collect, arrange, reduce”.

In the early 19th century, redaction simply means to organize or edit. We still find this meaning in French, where “un rédacteur” is an editor, usually in a media. 

From redaction to censorship

Beyond its neutral definition based on etymology, the act of redaction often takes on a moral aspect with the notion of censorship. 

Since writing exists, we have tried to hide sensitive words.

Recently and thanks to new x-rays technologies, the letters of Marie-Antoinette to the Count of Fersen have revealed their secrets. While the content of their correspondence is mainly political, the sentences where Fersen writes about his feelings are redacted (by himself!) to avoid censorship. 

Redacted letter of Marie-Antoinette to the Fersen Count
The secrets of Marie-Antoinette's letters to the Fersen count revealed. Arch. nat., 440AP/1. © CRC

The line between redaction and censorship is thin.

In English, one definition of redaction is “confidential text and images in a document that have been censored, deleted, or obscured.”

In French, “biffer” and “caviarder” both mean to redact, but the latter also means to censor. “Caviarder” first meaning is, in fact, “to cover with a black coating (as black as caviar), to make illegible a passage of a text prohibited by the censor.”

Depending on the context in which the document is produced, one can consider redaction an act of censorship. Fortunately, nowadays, we mainly talk about redaction in a more neutral, less political and moral context, as we will see next. 

From the court...

If legal issues pushed the use of redaction in all countries, in the US the rise of its adoption is linked to two specific events. 

First was the need to protect witnesses from criminal organizations like the mafia when they were mentioned in documents produced to courts. 
Then, since the Freedom of Information Act (FOIA) in 1966, government agencies need to make a vast amount of documents available to the public. However, the release of this type of document is only possible if all elements referring to national security issues or personal and sensitive information are removed. This is possible with the “exemption section” of the FOIA. 

And this is why many documents that fall into the FOIA domain look like this:

Heavily redacted file
In the early 1970s, the US government conducted surveillance on ex-Beatle John Lennon. This is a letter from FBI director J. Edgar Hoover to the Attorney General. J. Edgar Hoover, Public domain, via Wikimedia Commons

We did not have electronic documents or PDF redaction in the seventies, so we still had to rely on a good black marker to erase information. 

But this doesn’t mean that we’re doing better today, even with the right tools widely available. 

The Mueller report

The case that made PDF redaction famous to the public. 

Trending: ‘redact,’ ‘redacted’
Lookups spiked 4,000% on March 29, 2019
Merriam-Webster

In 2019, the US Department of Justice released the Mueller report.
To set the context, Special Prosecutor Robert Mueller was in charge of investigating suspicions of collusion between Moscow and Donald Trump’s campaign team in 2016. In his report, Mueller excludes any collusion between the teams of Donald Trump and the Russian power. However, he highlights, without being able to decide, a series of incidents that tend to show that the president sought to obstruct the progress of the investigation and questions about possible obstruction of the course of justice.

Using the Freedom of Information Act, several media like BuzzFeed News, and later CNN, sued to gain access to the report. 

The Department of Justice granted their wish, however with slight modifications. 
Indeed, about one-eighth of the lines are redacted, falling into the Exemption section.

Redacted pages of the Mueller report
A sample of the Mueller report, as shown on the NPR article of April 18, 2019.

As a side note, there are many things to say about the version the Department of Justice provided to the public. Among others, the main problems were:

  • the report wasn’t searchable,
  • its size was very, very large.

 

Both aspects made the report hard to share and search, and it shouldn’t have been so, especially with a PDF file.

Read more about how the DoJ could have done better with this file in the PDF Association article A Technical and Cultural Assessment of the Mueller Report PDF.

...to the office

Beyond defense secrecy, all companies and organizations worldwide need to manage sensitive and personal data.

Some industries have a legal obligation not to make public some of their data, such as the legal industry with judicial secrecy and the health industry with medical and professional confidentiality.

But even for all other companies or organizations, regulations exist at several levels to protect employees and customers.

  • GDPR in Europe and, in particular, the CNIL in France;
  • The California Consumer Privacy Act in the US;
  • Brazil recently implemented a similar law, and other countries are following up.


Justice departments act relatively fast in this type of case. For example, eight months after implementing the GDPR, the CNIL in France claimed Google more than 50 million euros, and the fine was confirmed by the Council of State in 2020.
In terms of sanctions, breaches in the GDPR can cost up to 20 million euros or 4% of companies’ global turnover.

PDF redaction epic fails: 3 exemples

Unfortunately, there are many cases of redaction gone wrong.
However, we can thank the people and companies in the following three examples to show everyone practices to avoid.  

Facebook likes your personal data

Even if people no longer trust the famous social network, it’s actually a redaction fail that prevented it from selling personal data to a third party in 2012. 

In 2017, a poorly redacted PDF showed that Facebook considered charging major companies at least $250,000 to access users’ personal data.

The Ars Technica reporter Cyrus Farivar uncovered the issue thanks to a simple copy-paste of a 2017 court document into a text editor like Notepad or an office document. This is the most basic test if you want to know if a document is appropriately redacted. If you see the redacted content in the text editor, this means that that text was simply hidden and not removed. 

You will see in the following example that it’s also the most frequent error people make when they redact a document. It’s not because a text is covered with black (with a highlighter or annotation) that it’s not there anymore.

Paul Manafort and technology

Paul Manafort was chairman of Donald Trump’s 2016 presidential campaign.

His series of technology mishandling is so long that it makes him the textbook example for digital security campaigns (think about reusing old passwords and emails, failing to convert documents, and storing incriminating messages in the Cloud).

His team of lawyers didn’t know better when they attempted to redact a sensitive passage. Unfortunately, simply by copying and pasting the redacted paragraph, it was possible to read the blacked-out portions and discover new details about Manafort’s relationship with Konstantin Kilimnik, a former associate with ties to Russia. 

The AstraZeneca contract with EU

It seems that we can see a *slight* technology improvement in our next case.

At the beginning of last year, and for transparency purposes, the European Commission published the redacted contract it engaged with the vaccine company AstraZeneca

At first sight, the contract seems correctly redacted with proper tools. However, the PDF’s bookmarks referring to redacted content were overlooked.

What was the redacted part about? The total purchase price of this contract (€870m).

How to avoid PDF redaction fails

The technology of permanently deleting data in PDFs is not new, but it is (apparently) still not well known.
In fact, as early as 1998 the company Appligent files a patent on PDF editing (as a reminder, Adobe released the first version of PDF in 1993). In 2006 in version 1.7, Adobe added an addendum on redaction annotations.

Copy-paste and other tips

So we’ve seen what happens when you just put black rectangles or annotations on confidential information thanks to the copy-paste test; you don’t delete the text or image underneath, you just hide it.

We’ve also taken note to check our bookmarks thanks to AstraZeneca, but you also need to check other places, such as:

  • image captions
  • hyperlinks
  • embedded files
  • attachments

But the job is not over

As a rule with PDF, you should always ask yourself: “if I don’t see it, does it mean that it’s not there?”

Like any other electronic document, PDF is made of a visible part, what we see when we open it, and a so-called invisible one that is more or less hidden.  

The problem with sensitive information? You can find them in the less-visible parts of the PDF. So let’s look for them. 

Sanitization

If redaction is the process to remove visible information from an electronic document, sanitization is the same process, but for hidden (or I prefer the expression “less-visible”) information.

By hidden information, we mean:

  • file metadata (author, title, creation date, PDF version, etc.),
  • embedded files and images metadata,
  • annotations,
  • comments,
  • hidden text layers (OCR layer, underneath the annotations).


Often, sensitive data you have carefully spent time identifying on a PDF may appear elsewhere. For example, you have removed a person’s name from the text, but it may also be found in the metadata of images or in comments.

While sanitization, like redaction, can be automated, it is often necessary to do a manual verification, especially in the following cases:

  • images in different formats,
  • spreadsheets,
  • attached files,
  • indexes.
A word about metadata

Search engines can access metadata, so be very careful when releasing a redacted document online. 

Metadata can contain virtually any information and can be related either to the document in general or to separate objects contained within the document like images, fonts, etc.

Standard reader software does not usually access all of this information, but can still be extracted by advanced PDF processing solutions.
Something that developers must keep in mind when redacting documents.

To sum up

With PDF redaction, using the right tools is essential. But what is crucial is to check all the sources of information.
In the case of complex semantic redaction processes that involve more than simply removing phone or social security numbers (simple processes that can easily be automated), no tool can beat a trained person. 

Try our redaction tool online

If you’re looking for an advanced redaction SDK, check our toolkits for developers

Cheers!

Elodie

avepdf icon
Do you need help to manage your electronic documents?

Related articles