Thursday, January 17, 2013

Cloud hashing

Hash functions are everywhere. Getting this webpage to your computer has required at least a handful of hash calculations. We use them for identification, authentication, or even data structures, such as hash tables or Bloom filters. A particular strain of hash functions, cryptographic hash functions, tends to have more narrow uses: authentication (e.g., as part of HMAC), pseudorandom number generation, password storage (e.g., as part of HMAC in PBKDF2), and so on.

Until a few years ago, everyone was happy: MD5 ruled the land, while SHA-1 sometimes made an appearance. Disaster struck in 2004, when Xiaoyun Wang successfully found collisions in MD5, and seriously damaged SHA-1's reputation. Today, the landscape is different: MD5 and SHA-1 are no longer recommended, and SHA-2 and SHA-3 are the appointed replacements.

Yet, SHA-2 and SHA-3 do not fulfill every hash function requirement adequately. In particular, they are fairly slow in software, clocking at between 10 and 20 clocks per byte on modern x86 processors. In comparison, MD5 and SHA-1 clock between 4 and 6 clocks per byte. Applications that need to hash massive amounts of data will notice this gap. Therefore, such applications will often make the conscious choice to use SHA-1 or MD5, because a potentially insecure product is often better than an unresponsive product.

There are other options to consider when fast hashing (and non-strict adherence to standards) is required. Several SHA-3 finalists were very fast in software, such as BLAKE or Skein. Both are in the same order of magnitude of speed as MD5, and have passed the exhaustive level of scrutiny needed to get to SHA-3 finalist.

The recently announced BLAKE2 function takes advantage of the analysis already done on BLAKE, and strips the design of any unnecessary computational overhead. This results on a function that is expected to be as secure as SHA-3 in every way, but actually faster than MD5. Furthermore, BLAKE2 adds support for tree (parallel) hashing, which is a godsend for big data applications where sequential hashing would be simply too slow regardless of how optimized the design got.

So what are the applications of fast (and/or parallel) hashing?

File blacklisting/whitelisting is an obvious one; matching files to known hash sets is only dependent on disk and hashing speed. Fast-enough hashing can also benefit advanced filesystems. A remarkable case is ZFS, where a strong hash is supported (SHA-256), but disabled by default for performance reasons. SHA-256 is also required for other advanced features, such as deduplication.

Ultimately, pretty much every current hash use case does benefit from a faster function: faster equals less power consumed, and less overhead for the actual application. One exception is password storage, but you should not be using a plain hash function for that anyway!