Saturday, March 21, 2015

Complexity is the bugdoor's friend

Backdoors are a fashionable topic these days, ever since the BULLRUN program was uncovered by the Snowden leaks. Bruce Schneier and others recently wrote a survey on the topic, available here, which covers much of what is known about backdoors. You should also check out the Underhanded Crypto Contest, which ended a couple of weeks ago.

Backdoors come in all shapes and sizes: some can be mathematical in nature, like the Dual EC generator, while others can be simply coding mistakes in a key routine in an application, also known as “bugdoors”. They may be used to leak (parts of) secret keys, derandomize random number generators, bypass authentication measures, or simply leak plaintext. Today I will show an example of the latter.

There is a now well-known method to introduce a “bugdoor” in RC4 implementations:

#include <string.h>

#define TOBYTE(x) ((x) & 255)
#define SWAP(x,y) do { x^=y; y^=x; x^=y; } while (0)

static unsigned char A[256];
static int i=0, j=0;

void init(char *passphrase) {
  int passlen = strlen(passphrase);
  for (i=0; i<256; i++)
    A[i] = i;
  for (i=0; i<256; i++) {
    j = TOBYTE(j + A[TOBYTE(i)] + passphrase[j % passlen]);
    SWAP(A[TOBYTE(i)], A[j]);
  }
  i = 0; j = 0;
}

unsigned char encrypt_one_byte(unsigned char c) {
    int k;
    i = TOBYTE(i+1);
    j = TOBYTE(j + A[i]);
    SWAP(A[i], A[j]);
    k = TOBYTE(A[i] + A[j]);
    return c ^ A[k];
}


This method relies on the clever (and old!) trick of swapping two registers without a temporary variable, using XOR (or, more generally, addition and subtraction in any abelian group). When i == j, however, this trick zeroes the state byte instead. Eventually the entire state will be zero, and the result of RC4 encryption will be the unaltered plaintext!
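The failure mode is easy to demonstrate in miniature. Below is a direct Python translation of the SWAP macro (a sketch for illustration only):

```python
def xor_swap(A, i, j):
    # Same three XORs as SWAP(A[i], A[j]).
    A[i] ^= A[j]
    A[j] ^= A[i]
    A[i] ^= A[j]

A = [10, 20, 30]
xor_swap(A, 0, 2)   # distinct indices: a normal swap
assert A == [30, 20, 10]
xor_swap(A, 1, 1)   # i == j: A[1] ^= A[1] zeroes the entry
assert A == [30, 0, 10]
```

In RC4's PRGA, i == j occurs roughly once every 256 steps, so zeroed bytes steadily accumulate until the whole state is gone.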

Now for a twist on this trick, this time in C++. Here it is:

#include <cstddef>
#include <cstdint>

#include <tuple>
#include <utility>

namespace security {
  namespace util {
    // type safety is great
    template <typename T> struct wrap {
      T x;
      wrap(const T &x = T()) : x{x} {}
      operator T &() { return x; }
    };

    using  uint8 = wrap< std::uint8_t>;
    using uint16 = wrap<std::uint16_t>;
    using uint32 = wrap<std::uint32_t>;
    using uint64 = wrap<std::uint64_t>;
  }

  namespace rc4 {
    template <typename uint8 = util::uint8>
    inline void stream(unsigned char * s, std::size_t slen,
                       const unsigned char * k, std::size_t klen) {
      using std::swap;
      uint8 S[256];

      for (int i = 0; i < 256; ++i) {
        S[i] = i;
      }

      for (int i = 0, j = 0; i < 256; ++i) {
        j = (j + S[i] + k[i % klen]) & 255;
        swap(S[i], S[j]);
      }

      for (uint8 i = 0, j = 0; slen--;) {
        i += 1;
        j += S[i];
        swap(S[i], S[j]);
        *s++ = S[(S[i] + S[j]) & 255];
      }
    }
  }
}

// elsewhere
namespace security {
  namespace util {
    template <template <typename...> class U, typename... V>
    void swap(U<V...>& x, U<V...>& y) {
      // Option #1
      // decltype(auto) t = x; x = y; y = t;
      // Option #2
      using std::tie;
      tie(x, y) = tie(y, x);
    }
  }
}

This one relies on a combo of C++ idiosyncrasies. Starting with the swap, I present two similar ways to subvert it:
  • Option #1 uses C++14's decltype(auto), which is essentially the auto keyword with decltype semantics. What this means is that the type of t will be wrap<uint8_t> &, instead of the expected wrap<uint8_t>; t is just another name for x. The result of the “swap” will then be (y, y) instead of (y, x).
  • Option #2 uses std::tie instead. The idea is the same: tie(x, y) creates an std::tuple<wrap<uint8_t>&, wrap<uint8_t>&>, that is, a tuple of references. The assignment then copies element by element through those references, so one of the two values necessarily overwrites the other instead of being swapped.
In either case, the result is that the RC4 internal state ends up with two equal values instead of two swapped values. This ensures that eventually the output will always (or almost always) be the same value. This is somewhat superior to Wagner and Biondi's method, in that it is not as easy to detect, simply by looking at the ciphertext, that something has gone horribly wrong.
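Modeled in Python (a sketch; the key "sekrit" is arbitrary), the difference between a real swap and the subverted one shows up immediately in the key schedule:

```python
def ksa(key, swap):
    # Standard RC4 key schedule, parameterized over the swap operation.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 255
        swap(S, i, j)
    return S

def good_swap(S, i, j):
    S[i], S[j] = S[j], S[i]

def bad_swap(S, i, j):
    # Net effect of the subverted swap: both slots end up holding
    # S[j]'s old value, so one value is duplicated and another lost.
    S[i] = S[j]

key = b"sekrit"
assert sorted(ksa(key, good_swap)) == list(range(256))  # still a permutation
S = ksa(key, bad_swap)
assert len(set(S)) < 256  # lost values never come back; not a permutation
```

Once a value is overwritten its last copy is gone for good, since the broken swap can only duplicate what is already in the state; the state keeps degenerating as the PRGA runs.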

But this is not enough; rc4::stream explicitly imports std::swap, so one would expect that (correct) version of swap to be used. To get around this, we use C++’s argument-dependent lookup (ADL). ADL says that, given an unqualified function call (i.e., without the explicit std::), C++ will also look for the function in the namespaces of its arguments. Therefore, the call swap(S[i], S[j]) finds both std::swap and security::util::swap, since wrap<uint8_t> also lives in security::util. The wrap<T> type may be justifiable as a type employed for added security checks, e.g., to detect integer overflow/range or to ensure two’s complement semantics. In this example it is a very simple bare-bones wrapper around T.

However, if we placed a generic swap(T&, T&) function in security::util, we would get a compiler error, since the call would then be ambiguous according to C++ rules. Since swap(wrap<uint8_t>&, wrap<uint8_t>&) would be too obvious, I went with swap(U<V...>&, U<V...>&). C++ partially orders overloaded function templates by specialization, and since U<V...>& is “more specialized” than T&, it gets priority and resolves the ambiguity.

This is still not enough. Since we defined swap after rc4::stream, std::swap would be picked anyway: name lookup would never see the later definition. To work around this, we make rc4::stream a template function, which forces the lookup of swap (the ADL portion of it, at least) to occur at instantiation time instead of when the function is defined.

What do we learn from all of this? Well, certainly that C++ is a complex language, but I doubt that was ever in question. C++11 is in many ways a nicer language than its predecessor, but it is not a simpler one. And it is easier to hide flaws in complexity; it would certainly take me much longer to spot the flaw in this version than in Wagner and Biondi’s.

Many of our current woes with cryptography-related implementations, such as Heartbleed, the Triple Handshake Attack, and OpenSSL’s regular CVEs, can be (partially) traced back to complexity; it is therefore imperative to look at such code not only as a black box, but also as a source of attack surface which can lead to privacy loss. In our audits, unnecessary complexity and over-engineering are often things we flag, despite not being flaws in and of themselves, because they often enable subtle security issues to creep in unannounced.

Thursday, January 8, 2015

OpenSSL's squaring bug, and opportunistic formal verification

OpenSSL's latest round of security advisories is out today, and besides the usual memory safety issues—seemingly endemic in its DTLS implementation—there is also an interesting bug in multiprecision arithmetic. In this post I will describe how this bug could have been avoided, in the security community's sacred tradition of simple post facto solutions.

The bug

We can look at this OpenSSL commit for the bug details. I will focus on one of the affected functions, mul_add_c2; what follows applies similarly to the other one. Here's the diff:

 #define mul_add_c2(a,b,c0,c1,c2) {     \
        BN_ULONG ta=(a),tb=(b),t0;      \
        BN_UMULT_LOHI(t0,t1,ta,tb);     \
-       t2 = t1+t1; c2 += (t2<t1)?1:0;  \
-       t1 = t0+t0; t2 += (t1<t0)?1:0;  \
-       c0 += t1; t2 += (c0<t1)?1:0;    \
+       c0 += t0; t2 = t1+((c0<t0)?1:0);\
        c1 += t2; c2 += (c1<t2)?1:0;    \
+       c0 += t0; t1 += (c0<t0)?1:0;    \
+       c1 += t1; c2 += (c1<t1)?1:0;    \
        }

The issue here is incorrect carry propagation. The result is a very hard-to-spot bug, unlikely to be caught by randomized testing, especially with 64-bit limbs.
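To make the dropped carry concrete, here is a Python model of both versions of the macro at a reduced word size of W = 8 (a sketch; the failing input below was constructed by hand to trigger the t2 wraparound):

```python
W, M = 8, 0xFF  # 8-bit limbs instead of 32/64, so the bug is easy to trigger

def mul_add_c2_buggy(a, b, c0, c1, c2):
    # Old macro: (c2, c1, c0) += 2*a*b, but a carry can be dropped.
    t0, t1 = (a * b) & M, (a * b) >> W
    t2 = (t1 + t1) & M; c2 = (c2 + (t2 < t1)) & M
    t1 = (t0 + t0) & M; t2 = (t2 + (t1 < t0)) & M
    c0 = (c0 + t1) & M; t2 = (t2 + (c0 < t1)) & M  # t2 may wrap: carry lost
    c1 = (c1 + t2) & M; c2 = (c2 + (c1 < t2)) & M
    return c0, c1, c2

def mul_add_c2_fixed(a, b, c0, c1, c2):
    # Corrected macro: add (t0, t1) into (c0, c1, c2) twice, with carries.
    t0, t1 = (a * b) & M, (a * b) >> W
    c0 = (c0 + t0) & M; t2 = (t1 + (c0 < t0)) & M
    c1 = (c1 + t2) & M; c2 = (c2 + (c1 < t2)) & M
    c0 = (c0 + t0) & M; t1 = (t1 + (c0 < t0)) & M
    c1 = (c1 + t1) & M; c2 = (c2 + (c1 < t1)) & M
    return c0, c1, c2

def reference(a, b, c0, c1, c2):
    # (c + 2ab) mod 2^(3W), split back into limbs.
    v = ((c2 << 2*W | c1 << W | c0) + 2*a*b) & ((1 << 3*W) - 1)
    return v & M, (v >> W) & M, v >> 2*W

x = (151, 217, 255, 0, 0)  # 151*217 = 0x7FFF: maximal low half, large high half
assert mul_add_c2_fixed(*x) == reference(*x) == (253, 0, 1)
assert mul_add_c2_buggy(*x) == (253, 0, 0)  # the carry into c2 vanished
```

At full 32- or 64-bit limb width the same pattern applies, but failing inputs are vastly rarer, which is exactly why randomized testing misses them.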

Luckily, this bug does not seem to be easily exploitable to perform bug attacks or invalid-point attacks, but next time we may not be so lucky. Additionally, this is the kind of bug against which the often-touted solution to OpenSSL's security problems (memory-safe languages like Rust, Go, OCaml, and so on) is powerless. What's the answer?

Avoiding the bug

The obvious answer to avoiding subtle bugs like the above is, of course, more auditing. But auditing is not cheap; people with the ability to both understand and spot flaws in delicate cryptographic code are rare and hard to come by.

By now, the open-source community has also been shown to be lackluster in its capability to find these mistakes; Eric Raymond's “given enough eyeballs, all bugs are shallow” aphorism has proved to be an overstatement, and catastrophic vulnerabilities linger for years, and sometimes decades, unidentified.

One answer is formal verification. Proving code correct has obvious advantages; this process often also highlights flaws in the programmer's thinking (e.g., what is actually being proved does not match what is intended). Formal verification tools have come a long way, but they are still very hard to use. The verification of a Curve25519 scalar multiplication implementation is still A Big Deal; Adam Langley has also written on the state of formal verification in the wild.

While Langley concerns himself with a slightly different issue (ensuring limbs do not overflow in unsaturated arithmetic), our predicament here suggests another, conceptually simpler approach: (dis)prove that mul_add_c2, which uses saturated arithmetic and overflows limbs by definition, is equivalent to \((c_2 2^{64} + c_1 2^{32} + c_0) + 2ab\). For this we'll use an SMT solver directly, in this case Z3, for convenience.

For bite-sized functions like mul_add_c2 the situation is not so bad. Suppose that the programmer, concurrently with the C code, had also written the following Python code, using z3py:

from z3 import *

W = 32 # word length

# t0 + t1*2**W = a * b
def BN_UMULT_LOHI(a, b):
  prod = ZeroExt(W, a) * ZeroExt(W, b)
  t0 = Extract(1*W-1,  0, prod)
  t1 = Extract(2*W-1,  W, prod)
  return t0, t1

# b ? 1 : 0
def COND(b):
  return If(b, BitVecVal(1, W), BitVecVal(0, W))

a, b, c0, c1, c2 = BitVecs("a b c0 c1 c2", W)
ref = Concat(c2, c1, c0) + 2 * ZeroExt(2*W, a) * ZeroExt(2*W, b)

t0, t1 = BN_UMULT_LOHI(a, b)
t2 = t1+t1; c2 += COND(ULT(t2,t1));
t1 = t0+t0; t2 += COND(ULT(t1,t0));
c0 += t1; t2 += COND(ULT(c0,t1));
c1 += t2; c2 += COND(ULT(c1,t2));

prove( Concat(c2, c1, c0) == ref )

Running this program not only tells us immediately that the function is incorrect, but also gives us an example input on which it fails! Better yet, this is verifiable by anyone with the appropriate software, instead of solely by domain experts.

$ python openssl_bug.py 
counterexample
[c2 = 3774871552,
 b = 2567394084,
 a = 3592503423,
 c0 = 4044407142,
 c1 = 3283914074]

Note that while we are using Z3 here for its convenient Python bindings, we can easily export the expression to the standard SMTLIB2 format instead, and use any solver we like. Plugging in the corrected code,

t0, t1 = BN_UMULT_LOHI(a, b)
c0 += t0; t2 = t1 + COND(ULT(c0, t0));
c1 += t2; c2 += COND(ULT(c1, t2));
c0 += t0; t1 += COND(ULT(c0, t0));
c1 += t1; c2 += COND(ULT(c1, t1));

we can now see that Z3 finds no problem with the code:

$ python openssl_corrected.py 
proved

Success! Now, obviously this requires more developer time; but the code is mostly shared between the real implementation and the verification, and the size of the functions to be proved correct is rather limited. Yet we believe this is still a net gain against easy-to-make mistakes, which can stay hidden for a long time, even in the best implementations. Considering OpenSSL's status as critical Internet infrastructure, it seems warranted to go the extra mile on routines like this.

Theorem provers are often useful in our software auditing approach at Kryptos Logic; they are faster than most humans at finding failure cases for tedious functions, and leave us free to reason about other parts of the system.

Update

Somebody pointed out that the proof is incorrect; the issue is that we are working in bitvector logic, which makes our proof implicitly modulo \(2^{96}\). The implicit assumption, which we also made, is that the result fits into the 3 output words. This is reasonable in the context in which the function is used, but without that context it does make the proof incorrect. An easy way to fix this is to add a few bits of slack to account for overflow:

ref = ZeroExt(W, Concat(c2, c1, c0)) + 2 * ZeroExt(3*W, a) * ZeroExt(3*W, b)
# ...
prove( ZeroExt(W, Concat(c2, c1, c0)) == ref )


This now results in a failure!

$ python openssl_corrected.py 
counterexample
[c2 = 4294967295,
 b = 12788566,
 a = 4294967288,
 c0 = 2786395306,
 c1 = 4282404864]

We can resolve this by adding a constraint on the size of \( c \) at input:

assumption = ULT(Concat(c2, c1, c0), 2**(3*W-1))
ref = ZeroExt(W, Concat(c2, c1, c0)) + 2 * ZeroExt(3*W, a) * ZeroExt(3*W, b)
# ...
solve( ZeroExt(W, Concat(c2, c1, c0)) != ref, assumption )

Now the proof works again under the assumption that \(c < 2^{95}\) (we use z3py's solve function here to use multiple constraints; the prove function is implemented in terms of solve).

$ python openssl_corrected2.py 
no solution
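The corrected assumption can also be sanity-checked at a reduced word size (W = 8, for illustration; this sketches the arithmetic, not the Z3 proof itself): if \(c < 2^{3W-1}\), then \(c + 2ab\) always fits in three limbs, while with the top bit of \(c\) set it can wrap.

```python
W = 8                           # reduced word size for illustration
LIMIT = 1 << (3*W - 1)          # the assumed input bound: c < 2^(3W-1)
MODULUS = 1 << (3*W)            # three W-bit limbs hold values below 2^(3W)
MAX2AB = 2 * ((1 << W) - 1)**2  # largest possible 2*a*b

# Under the assumption, c + 2ab always fits in three limbs:
assert (LIMIT - 1) + MAX2AB < MODULUS

# Without it, the sum can exceed three limbs, so the earlier "proof"
# only held modulo 2^(3W):
c, a, b = MODULUS - 1, (1 << W) - 1, (1 << W) - 1
assert c + 2*a*b >= MODULUS
```

The same one-bit-of-headroom argument scales back up to W = 32, where the bound becomes \(c < 2^{95}\) as in the constrained proof above.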