DLP Technology – Why Does it Fail?

Alex Haynes
CISO Cheshire Datasystems Ltd.

Long ago, before GDPR, there was a class of technology called DLP that claimed to solve all your data leakage and data protection issues. DLP, an acronym for “data loss prevention” (it can also be referred to as “data leakage protection” or “data loss protection,” depending on who you are talking to), was supposed to be able to label data automatically, apply rules and then decide whether to allow the data to pass through the system or to block it. It is most often found in web and email gateways, two vectors that are often used to transmit data and are a common cause of accidental or malicious data breaches. However, breaches continue unabated, many of them accidental. So why isn’t DLP Technology a panacea for these kinds of issues if it is so widely deployed? To understand the issues with DLP Technology, it is first important to understand how it works.

Challenges with DLP Today

DLP Technology relies heavily on pattern matching to identify certain types of data. For example, a U.S. Social Security number is a nine-digit number written in three groups separated by hyphens, as follows: NNN-NN-NNNN. If you have DLP Technology in force and you want to prevent Social Security numbers from being accessed, this technology will seek out and identify numbers in this format. However, there is a major side effect.

Any nine-digit number separated by hyphens in the same format, including one that is not a Social Security number, will also be blocked. If you think of invoice numbers, purchase order numbers, tracking numbers, phone numbers and any other string of digits that your company uses in day-to-day operations, you start to get an idea of the scale of the problem.
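The side effect is easy to reproduce. The following is a minimal sketch, not any vendor's actual rule: the pattern, the function name find_ssn_like and the sample messages are all made up for illustration, and the rule simply flags anything shaped like NNN-NN-NNNN.

```python
import re

# Hypothetical DLP rule: anything shaped like NNN-NN-NNNN is treated as an SSN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> list[str]:
    """Return every substring that has the SSN shape."""
    return SSN_PATTERN.findall(text)


# Invented sample messages; only the first one is meant to be an SSN.
messages = [
    "Employee SSN: 123-45-6789",
    "Please pay invoice 481-22-9034 by Friday",
    "Your parcel tracking reference is 305-71-0042",
]

for message in messages:
    # All three messages produce a match, so all three would be blocked.
    print(message, "->", find_ssn_like(message))
```

The rule has no way to tell the invoice or tracking reference apart from the real Social Security number, because all it sees is the shape of the digits.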

Because of this, DLP Technology can quickly become a blocker for business, as it will start to block legitimate data that is not a risk. The usual solution is to set the tools to “alert” mode, meaning the data is let through but someone is alerted that the tool has identified what it thinks is a Social Security number. Handling these alerts requires personnel (this often falls to security operations teams) to monitor, filter and tweak the rules until the false-positive rate is low.

When the type of data being monitored becomes less structured, the problems get even worse. Let’s say you wanted to stop anything labeled “confidential” from leaving the company. You could add a rule to look for that word, but you would very quickly be hit with a deluge of false positives. Any email with the word “confidential” in it will be flagged, even if the sentence is something benign, such as, “This is NOT confidential, so you are free to distribute.”

Again, false positives abound, and blocking this automatically will cause chaos within the business. There needs to be a manual validation of alerts to weed out false positives.
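A keyword rule of this kind can be sketched in a few lines. This is an illustrative assumption about how such a rule behaves, not a real product's logic: the function keyword_rule and the sample emails are hypothetical.

```python
def keyword_rule(text: str, keyword: str = "confidential") -> bool:
    """Hypothetical DLP rule: flag any text that contains the keyword."""
    return keyword in text.lower()


# Invented sample emails: one genuinely sensitive, one benign.
emails = [
    "Attached is the CONFIDENTIAL merger plan.",
    "This is NOT confidential, so you are free to distribute.",
]

# In "alert" mode nothing is blocked; flagged items go to a review queue.
review_queue = [email for email in emails if keyword_rule(email)]

for email in review_queue:
    print("needs manual review:", email)
```

Both emails land in the review queue, because the rule looks at the word and not at the meaning of the sentence around it. A person still has to read each alert to tell the two apart.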

Trivial to Bypass

Once you understand how pattern matching works, it is easy to bypass, even accidentally.  The aforementioned “confidential” rule will not pick up “C-O-N-F-I-D-E-N-T-I-A-L” because of the hyphens or even “C O N F I D E N T I A L” because of the spaces.
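As a rough illustration, assuming the rule is a plain case-insensitive search for the word (the RULE name and the variants list are made up for this sketch):

```python
import re

# Hypothetical rule: case-insensitive search for the literal word.
RULE = re.compile(r"confidential", re.IGNORECASE)

variants = [
    "confidential",
    "C-O-N-F-I-D-E-N-T-I-A-L",
    "C O N F I D E N T I A L",
]

# Map each spelling to whether the rule catches it.
results = {variant: bool(RULE.search(variant)) for variant in variants}

for variant, caught in results.items():
    print(f"{variant!r}: {'caught' if caught else 'missed'}")
```

Only the first spelling is caught. A rule could be widened to tolerate separators between letters, but each widening tends to bring more false positives with it.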

It gets even more troublesome if you start to use encoding, such as base64 (a way of encoding binary data into text). The word “confidential” encoded in base64 looks like this: Y29uZmlkZW50aWFs. It will look like random gibberish to most people, but it can be decoded easily with any online base64 decoder and, of course, it cuts through a pattern-matching DLP rule like butter. Any employee who does a bit of research can base64 encode entire reams of data, exfiltrate it and decode it on the other side, and a rule that only looks for the plain text will never pick it up.
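The encoding step can be checked with Python's standard base64 module. The RULE below is the same hypothetical keyword search used for illustration, not a real DLP configuration.

```python
import base64
import re

# Hypothetical rule: case-insensitive search for the literal word.
RULE = re.compile(r"confidential", re.IGNORECASE)

plain = "confidential"

# Encode the text before it leaves the company.
encoded = base64.b64encode(plain.encode("ascii")).decode("ascii")
print(encoded)                     # Y29uZmlkZW50aWFs
print(bool(RULE.search(plain)))    # True: the plain text is caught
print(bool(RULE.search(encoded)))  # False: the encoded text passes

# The recipient recovers the original with a single call.
decoded = base64.b64decode(encoded).decode("ascii")
print(decoded)                     # confidential
```

Nothing here is secret or difficult. The encoded form simply no longer contains the characters the rule is looking for.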

Why Use it at All?

DLP still has its uses for structured data, that is, data that follows a predefined format. A credit card number is a good example because it follows a very specific format and even has a “checksum” to make sure the number is valid. This means false positives are low (although they never completely disappear). Private encryption keys and credentials such as AWS keys also follow an extremely specific format, so rules can be written to detect and apply controls to these kinds of data. While false positives will still happen, such rules do a good job of detecting these key pieces of data that can cause trouble if they go places they shouldn’t.
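The checksum used on payment card numbers is the Luhn algorithm. The sketch below shows how a rule might use it to discard digit strings that merely look like card numbers. The function name luhn_valid and the 13-digit minimum length are assumptions made for this example, and 4111 1111 1111 1111 is a widely used test number, not a real card.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # assumption: ignore strings too short to be a card
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True: passes the checksum
print(luhn_valid("4111 1111 1111 1112"))  # False: last digit changed
```

A random 16-digit number has roughly a one in ten chance of passing this check, which is why false positives fall sharply but never vanish entirely.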

Pattern Matching Personal Data

The most common question with DLP Technology is how to detect personal data. Well, with pattern matching, it is near impossible. As an example, think of how a person’s name is structured. The only constant is that a name usually does not contain any numbers; beyond that, it can be any length, any number of words and can even contain hyphens. In a pattern-matching scenario, this is equivalent to “find me a word followed by another word”.
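To make this concrete, here is one possible attempt at a “name” pattern. It is a deliberately naive sketch: the pattern and the samples are invented for this example, and it encodes only what was said above (letters, optional hyphens or apostrophes, two or more words).

```python
import re

# Hypothetical "name" rule: two or more words made of letters,
# hyphens or apostrophes, separated by single spaces.
NAME_PATTERN = re.compile(r"[A-Za-z'-]+(?: [A-Za-z'-]+)+")

samples = [
    "Alex Haynes",
    "Jean-Luc de la Fontaine",
    "Please find attached",
    "Quarterly sales report",
]

# Map each sample to whether the whole string looks like a "name".
matches = {sample: bool(NAME_PATTERN.fullmatch(sample)) for sample in samples}

for sample, is_match in matches.items():
    print(f"{sample!r}: {is_match}")
```

Every sample matches, including the two that are plainly not names, so the rule would flag almost any sentence fragment in any email.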

Just One Tool Among Many

Despite its issues, DLP still has its uses for enforcing information security but should not be seen as a catch-all solution to data leakage and data breaches in general. If it forms part of a robust set of controls that encompass good access control, system updates and efficient monitoring, DLP Technology can become a valuable tool in the constant battle to keep our data safe.


