Legal battles over compulsory data decryption are making headlines. The publicity will likely continue as encryption technology proliferates in both consumer and enterprise markets. The arguments on both sides of this issue merit careful consideration and discourse before any comprehensive policy decision is made or legal precedent is set. One side argues that alternate decryption mechanisms (i.e., back doors) are critical to law enforcement and intelligence gathering activities. The other insists that such mechanisms effectively end the confidentiality assurances that our constitutional rights demand.
But where does that leave cases like that of former Philadelphia police sergeant Francis Rawls? The former cop has been in jail now for over two years, held in contempt of court for failing to unlock (and decrypt) two hard drives. Issues discussed in media coverage of his case range from whether his Fifth Amendment rights have been violated1, to limitations on jail time based on failure to comply with an order to testify in federal judicial proceedings2,3. Missing from this discussion, however, is whether decryption is needed at all and, if so, why.
In the Rawls case, forensic examination showed that the encrypted drives in question had been used in a computer that had visited child abuse sites4 and that Rawls “had downloaded thousands of files known by their hash values to be child pornography.” The files “…had been stored on the encrypted external hard drives.”5 Hash functions, and the data integrity and authenticity assurances they provide, appear briefly in the case’s coverage as the rationale for the prosecution’s assertions that illicit files reside on the encrypted drives.6 The question is, why aren’t the hash values and the probative evidence they provide center stage? More succinctly, why is there a case at all?
Hash functions are one-way7 mathematical functions that produce a result commonly known as a message digest, a hash value, or more simply, a hash. Hashes are often referred to as digital fingerprints because they are, arguably, unique representations of data (e.g., a message or a file).8 Hashes are used in myriad cryptographic applications to assure data integrity (i.e., that no changes have been made to a set of data) and authenticity (i.e., that the source of data can be verified). They do this by producing, regardless of the size of the data input to the hash function, an output string of a fixed size whose value is specific to the input data. The fixed size is determined by the hash function used. If a change is made to the input data, even something as small as capitalizing a single letter, adding a space or removing a punctuation mark, the output string will be different. Furthermore, hash functions are deterministic, meaning that no matter how many times the same data is input to the function, the output will be the same.
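The three properties described above (fixed output size, sensitivity to small changes, and determinism) can be illustrated with a minimal Python sketch. It assumes SHA-256 from Python's standard hashlib module; the sample strings and the helper name sha256_hex are illustrative only and are not drawn from the Rawls case or from any forensic tool.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative inputs (assumptions, not real evidence data).
short_msg = b"hash functions"
long_msg = b"hash functions " * 10_000   # a much larger input
changed_msg = b"Hash functions"          # one letter capitalized

digest_short = sha256_hex(short_msg)
digest_long = sha256_hex(long_msg)
digest_changed = sha256_hex(changed_msg)

# Fixed size: both digests have the same length regardless of input size.
print(len(digest_short), len(digest_long))

# Deterministic: hashing the same input again gives the same digest.
print(digest_short == sha256_hex(short_msg))

# Sensitive to change: capitalizing one letter yields a different digest.
print(digest_short == digest_changed)
```

A SHA-256 digest is 256 bits long, so its hexadecimal form is always 64 characters, whether the input is a few bytes or many megabytes.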
Owing to these properties, hash functions are foundational enablers of technology innovations that have evolved, and continue to evolve, in any number of areas, from authenticating Internet connections and ensuring the integrity of exchanged messages to blockchain, Bitcoin and other cryptocurrencies. Hash functions and the assurances they provide are not unknown to the law enforcement or legal communities either. Because two data sets with the same hash value are accepted as being the same data, hashes are currently used in identifying, collecting and analyzing digital evidence, establishing its chain of custody, and authenticating it in court.9
An argument against using file hash values to ‘prove’ that a file on a computer system is the same as a known illicit file (e.g., pornographic images involving minors) centers on the collision resistance, or lack thereof, of certain widely used hash algorithms. Two such hash algorithms, Message Digest 5 (MD5) and Secure Hash Algorithm-1 (SHA-1), have been shown to be susceptible to collision attacks.10,11 A collision attack against a hash algorithm succeeds when two different files (or data inputs, also referred to as preimages) can be found that result in the same hash value.
While the probability of finding files “in the wild” that produce the same hash value is low12, evidence that collisions can be generated14 is likely enough to raise doubt in legal cases. Fortunately, MD5 and SHA-1 are not the only options when it comes to hashing. Hash algorithms such as SHA-256 and SHA-3 have existed for some time14 and are collision resistant as well as 2nd preimage resistant. 2nd preimage resistance is the property that, given one input (preimage), it is computationally infeasible to find another input that results in the same hash value. Another option to overcome concerns over MD5 and SHA-1 collisions is to compare files using hash values from more than one hash algorithm. For example, because MD5 and SHA-1 are independent functions, it is extremely unlikely that two different files would collide under both functions at once.
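The idea of comparing files under more than one hash algorithm can be sketched as follows. This is a minimal illustration, not a forensic procedure: the algorithm list, the function names fingerprint and matches_known, and the sample byte strings are all assumptions made for the example. It treats a file as a match only when every digest agrees with the known value.

```python
import hashlib
from typing import Dict

# Algorithms to combine; the choice here is illustrative.
ALGORITHMS = ("md5", "sha1", "sha256")


def fingerprint(data: bytes) -> Dict[str, str]:
    """Compute one hex digest per algorithm for the same input."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


def matches_known(data: bytes, known: Dict[str, str]) -> bool:
    """Return True only if every digest of data equals the known digest."""
    return fingerprint(data) == known


# Hypothetical "known file" and two candidates to compare against it.
known_file = b"contents of a known file"
known_digests = fingerprint(known_file)

identical_copy = b"contents of a known file"
altered_copy = b"contents of a known file."  # one character added

print(matches_known(identical_copy, known_digests))
print(matches_known(altered_copy, known_digests))
```

Requiring agreement across all algorithms means that a crafted collision against a single algorithm, such as MD5, would not be enough on its own to produce a false match. In practice a tool would read files from disk in chunks rather than hold them in memory; that detail is omitted here for brevity.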
Given the power of hash functions, and the assurances of data integrity and uniqueness they provide, it’s surprising that they haven’t been used to greater effect in court, the Rawls case being one example. It’s not hard to imagine that the participants in such cases (e.g., judge, jury, and counsel) lack the specific technical knowledge of, and trust in, the science of hash functions necessary to factor them appropriately into their cases and decisions. Unless a trusted expert in the field is present and can articulate the assurances provided by hashing, hash values are likely to be disregarded as valid evidence.
However, these same people likely trust, knowingly or not, the power of hash algorithms in other personally relevant ways. To name a few, hashing is used to assure the authenticity of websites with which they may share personal and private information, to protect stored passwords (in personal applications or at entities they interact with online), and likely by the antivirus solution they trust to keep their devices free of malware.
In addition to these applications, hash functions and hashes are critical to blockchain technology. While the mechanics of blockchain are outside the scope of this article, blockchain is a distributed ledger of transactions that uses hash functions to support the immutability and integrity of the transactions stored on it. The hash values make tampering with the ledger evident: a change to any record, no matter how small, would change its hash and break every subsequent link in the chain, and recomputing the entire chain on a widely distributed ledger is not practically feasible.15
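How chained hashes expose tampering can be shown with a toy example. The sketch below is a deliberately simplified assumption of how a hash-linked ledger might look; it omits everything that makes a real blockchain distributed (consensus, proof of work, signatures), and the record strings and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, data: str, prev_hash: str) -> str:
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records so each block stores the hash of the one before it."""
    chain, prev_hash = [], GENESIS_HASH
    for index, data in enumerate(records):
        current = block_hash(index, data, prev_hash)
        chain.append(
            {"index": index, "data": data, "prev_hash": prev_hash, "hash": current}
        )
        prev_hash = current
    return chain


def is_valid(chain) -> bool:
    """Recompute every hash and check that each link still matches."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["data"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


ledger = build_chain(["Alice pays Bob 5", "Bob pays Carol 2", "Carol pays Dan 1"])

# Copy the ledger and alter one record without recomputing any hashes.
tampered = [dict(block) for block in ledger]
tampered[1]["data"] = "Bob pays Carol 200"

print(is_valid(ledger))
print(is_valid(tampered))
```

Because each block's hash covers the previous block's hash, altering one record would require recomputing that block and every block after it. In this toy example that is trivial; on a real distributed ledger, the other participants' copies and the consensus mechanism are what make it impractical.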
Blockchain is perhaps now best known for its use in cryptocurrency systems, of which Bitcoin is one example, but its implications are far more wide reaching. The ability of blockchain to authenticate digital information and maintain immutability has applications in many areas: contract execution and enforcement; government affairs, including assuring election integrity; guaranteeing, through supply chain auditing, the source and authenticity of purchased physical assets; and maintaining ownership records for assets such as land or intellectual property.16
These areas, among others, are common sources of legal dispute now, often centered on the existence and validity of transactions. Resolving such disputes can take large amounts of resources, both human and financial, and often depends on subjective or unobtainable information and records. The security properties of blockchain and, by extension, of hashes can reduce or eliminate the subjectivity involved in rendering legal decisions over these types of disputes.
To realize the power of these functions to help decide such disputes objectively, let alone their power to provide probative evidence of criminal wrongdoing (much the way that DNA evidence does), the science behind this technology and the assurances it provides need to be recognized, understood and consistently applied by the legal community. There does seem to be some recognition: hash algorithms already fit the examples included in Federal Rule of Evidence 901(b)(4), as they provide “distinctive characteristics” of the item of evidence17, and they are identified as methods to establish the authenticity of data in Managing Discovery of Electronic Information: A Pocket Guide for Judges.18
However, the use of hashes in legal cases, while increasing, is not yet universal or standardized. While there appears to be growing acceptance of hash values in forensics, policies must be developed that acknowledge this tool not merely as a justification for a warrant or for the admissibility of evidence, but as a valuable and mathematically significant piece of evidence to be factored in when rendering decisions. As technologies such as blockchain are implemented in the national economy, security and governance, to name a few areas, and provide the foundational capabilities that power them, it becomes more vital that these technologies are both understood and given their appropriate due.
In 2012, the American Bar Association (ABA) amended its Model Rule of Professional Conduct, which governs lawyer competence, to include the guidance that lawyers should keep abreast of changes in the law and its practice, including the “benefits and risks associated with relevant technology…”19 Enabling competence in technology starts, at a minimum, with legal education. While law schools may assume that their students are technologically competent owing to their use of technology in the classroom, there needs to be a recognition of the difference between familiarity and competence. Without this education, and without dispositive policy that both guides and directs the inclusion and due consideration of hash functionality, legal discussions and decisions in cases like the Rawls case, and in those that will inevitably come with expanded use of blockchain, will remain mired in emotion and opinion instead of fact.
- Both a federal judge and the 3rd US Circuit Court of Appeals did not agree with Rawls’ contention that forcing him to unlock the drives amounted to a violation of his Fifth Amendment right against being compelled to testify against oneself. https://arstechnica.com/tech-policy/2017/03/man-jailed-indefinitely-for-refusing-to-decrypt-hard-drives-loses-appeal/
- The filing pointed out that Rawls’ stay in prison had already exceeded the maximum 18-month sentence under the 28 USC § 1826 statute for failure to comply with an order to testify or provide other information in federal judicial proceedings. https://www.theregister.co.uk/2017/08/30/ex_cop_jailed_for_not_decrypting_data/
- The appeals court held that there is “no temporal limitation on the amount of time that a contemnor can be confined for civil contempt when it is undisputed that the contemnor has the ability to comply with the underlying order.” https://www.theregister.co.uk/2017/08/30/ex_cop_jailed_for_not_decrypting_data/
- https://www.theregister.co.uk/2017/08/30/ex_cop_jailed_for_not_decrypting_data/
- Forensic examination also disclosed that Doe [Rawls] had downloaded thousands of files known by their “hash” values to be child pornography. The files, however, were not on the Mac Pro, but instead had been stored on the encrypted external hard drives. Accordingly, the files themselves could not be accessed. https://arstechnica.com/tech-policy/2017/03/man-jailed-indefinitely-for-refusing-to-decrypt-hard-drives-loses-appeal/
- In computer science, a one-way function is a function that is easy to compute on every input, but hard to invert given the image of a random input. https://en.wikipedia.org/wiki/One-way_function
- Baker, Doris and Mel, H.X. Cryptography Decrypted. Addison Wesley. 2001.
- Steve Mead of NIST put together a nice paper and presentation on the Viability of MD5 and SHA-1 Hashes, which explains the limitation. In it, he describes the number of possible MD5 values as 2^128, which is 1,700,000,000,000,000,000,000,000,000,000,000,000,000 possible combinations. Statistically to have a 50 percent probability of a duplicate file hash, the number of unique files would need to reach 850,000,000,000,000,000,000,000,000,000,000,000,000. https://www.forensicmag.com/article/2008/12/hash-algorithm-dilemma%E2%80%93hash-value-collisions