KLEE w/ CPT

The source code for the modified KLEE used in this line of research can be found here.

This includes both the Constraint Provenance Tracking (CPT) introduced in [Chau et al., NDSS 2019], as well as a port of the path constraint extraction and constraint cross-validation (special thanks to Prof. Endadul Hoque for contributing his code) used in [Chau et al., IEEE S&P 2017].

The scripts directory contains several useful scripts. For example, the run-klee.sh wrapper script makes it easier to invoke KLEE with the recommended settings, and it saves the output of KLEE into a text file that can later be parsed. parse-klee-dump.py can be used to split the KLEE output into individual paths. run-civic-simple.sh can be used to cross-validate implementations (see below). batch-replay.sh can be used to replay a collection of concrete counterexamples against the test harness.
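Chained together, a typical session might look like the following sketch (the arguments shown are illustrative only; check each script's usage for the exact invocation):

```
./run-klee.sh harness.bc                         # run KLEE with recommended settings; output saved to a text file
./parse-klee-dump.py klee-output.txt             # split the KLEE dump into individual paths
./run-civic-simple.sh implA-paths/ implB-paths/  # cross-validate two implementations
./batch-replay.sh counterexamples/               # replay concrete counterexamples against the harness
```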

How to build (native)

The easiest way is to use the klee-setup.sh script, which is the main bootstrap script that does everything from scratch. It has been tested on Ubuntu 18.04 and 20.04. Prior to building LLVM-3.4 and KLEE, it tries to install some dependencies from source, via apt-get, and via pip. Please adjust this portion to suit your needs.

There is also a rebuild-klee-after-everything-else.sh script, which is for rebuilding KLEE after all the other prerequisites (like Z3, STP and LLVM-3.4) are already taken care of. This is useful if you frequently change the code of KLEE and need to test your modifications.

How to build (docker)

Alternatively, you can build a docker image using this Dockerfile. Simply run

docker build -t klee-cpt https://szeyiuchau.github.io/files/klee-cpt.Dockerfile

After a successful build, the resulting docker image will be around 5.7 GB in size.

Example

Assuming you have successfully built KLEE, let's see an example of CPT in action.

Consider this simple program testmemcmp.c:

#include <stdint.h>
#include <string.h>

int ret_val = 2147483647;
void ret_val_ready(void){
    // intentionally empty, but needed as a marker that ret_val has been set
    return;
}

int ourEntryFunction(void)
{
    int ret = 0;

    uint8_t a, b;

    klee_make_symbolic( &a, sizeof(uint8_t), "a");
    klee_make_symbolic( &b, sizeof(uint8_t), "b");

    if (memcmp(&a, &b, sizeof(uint8_t)) == 0)
        ret = 0;
    else
        ret = 1;

    return ret;
}

int main( int argc, char *argv[] )
{

    ret_val = ourEntryFunction();
    ret_val_ready();

    return 0;
}

Then compile it to LLVM bitcode:

clang -c -g -emit-llvm testmemcmp.c

Notice that CPT requires source-level debug information, hence the -g switch.

Now prepare an empty file called filter.txt:

echo > filter.txt

Then run KLEE on the bitcode:

klee -search=dfs -libc=uclibc -function-filter-file=filter.txt testmemcmp.bc

There are only two possible paths, corresponding to whether a equals b.
The text between >>>>> and <<<<< shows the path constraints that KLEE maintains (subject to optimizations).
The text between the ++++++ markers shows the vanilla path constraints, with the source-level origin of each clause attached (after the @ sign).

>>>>>
PATH: 1
RETURN VALUE: 1
CONSTRAINT:
array a[1] : w32 -> w8 = symbolic
array b[1] : w32 -> w8 = symbolic
(query [(Eq false
             (Eq 0
                 (Sub w32 (ZExt w32 (Read w8 0 a))
                          (ZExt w32 (Read w8 0 b)))))]
        false)

<<<<<<
++++++
(Eq false
     (Eq 0
         (Sub w32 (ZExt w32 (Read w8 0 a))
                  (ZExt w32 (Read w8 0 b)))))    @klee-uclibc/libc/string/memcmp.c:33

++++++


>>>>>
PATH: 2
RETURN VALUE: 0
CONSTRAINT:
array a[1] : w32 -> w8 = symbolic
array b[1] : w32 -> w8 = symbolic
(query [(Eq 0
             (Sub w32 (ZExt w32 (Read w8 0 a))
                      (ZExt w32 (Read w8 0 b))))]
        false)

<<<<<<
++++++
(Eq 0
     (Sub w32 (ZExt w32 (Read w8 0 a))
              (ZExt w32 (Read w8 0 b))))     @klee-uclibc/libc/string/memcmp.c:33

++++++

Notice that the origin of the clause checking whether a equals b lies inside the implementation of memcmp() in the instrumented uClibc.

While this is technically accurate, in many cases we are not interested in the internals of libc or other shim wrappers (e.g. OPENSSL_memcmp()), but rather in the dynamics of higher layers (say, implementations of protocols and other algorithms). Hence CPT provides a way to filter functions by name: if a function contributing to a clause is filtered, its caller is used as the origin instead.
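The filter file is plain text. To filter several functions at once (say, both memcmp() and the OPENSSL_memcmp() wrapper mentioned above), the file would presumably list one function name per line:

```
memcmp
OPENSSL_memcmp
```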

Now let's see how filtering would help.

First add memcmp() to the list:

echo "memcmp" > filter.txt

Now run KLEE again on the same bitcode:

klee -search=dfs -libc=uclibc -function-filter-file=filter.txt testmemcmp.bc

And this time the line that calls memcmp() is shown as the origin of the clause:

>>>>>
PATH: 1
RETURN VALUE: 1
CONSTRAINT:
array a[1] : w32 -> w8 = symbolic
array b[1] : w32 -> w8 = symbolic
(query [(Eq false
             (Eq 0
                 (Sub w32 (ZExt w32 (Read w8 0 a))
                          (ZExt w32 (Read w8 0 b)))))]
        false)

<<<<<<
++++++
(Eq false
     (Eq 0
         (Sub w32 (ZExt w32 (Read w8 0 a))
                  (ZExt w32 (Read w8 0 b)))))    @testmemcmp.c:19

++++++


>>>>>
PATH: 2
RETURN VALUE: 0
CONSTRAINT:
array a[1] : w32 -> w8 = symbolic
array b[1] : w32 -> w8 = symbolic
(query [(Eq 0
             (Sub w32 (ZExt w32 (Read w8 0 a))
                      (ZExt w32 (Read w8 0 b))))]
        false)

<<<<<<
++++++
(Eq 0
     (Sub w32 (ZExt w32 (Read w8 0 a))
              (ZExt w32 (Read w8 0 b))))     @testmemcmp.c:19

++++++

Instrumented Test Subjects

Here are the instrumented source trees we used to test various implementations of PKCS#1 v1.5 signature verification for [Chau et al., NDSS 2019].

Each implementation is in a separate directory. The zip file shared above is a little large because we include the original source archives to make it self-contained. The build system of each implementation is modified to use clang (through wllvm) so that in the end we can get bitcode amenable to KLEE. Other libraries that an implementation depends on (e.g. Nettle and GMP in the case of GnuTLS) are built and linked similarly. Other modifications include injecting our test buffer into the signature verification code and occasionally changing the visibility of certain internal functions (e.g. removing the static keyword).

To play with the instrumented source, the general flow is: first invoke the reset-<name>.sh script, which extracts the original source; then invoke the build-<name>.sh script, which applies the required modifications and builds the implementation for KLEE; then invoke copy-and-build-harness.sh, which copies and builds a specified test harness. One can then run KLEE on the bitcode of the test harness.
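Concretely, for a hypothetical implementation named foo-1.0, the flow would look like the sketch below (any script arguments are omitted; the final klee invocation mirrors the one shown earlier, with harness.bc standing in for the harness bitcode):

```
./reset-foo-1.0.sh              # extract the original source tree
./build-foo-1.0.sh              # apply modifications and build the bitcode for KLEE
./copy-and-build-harness.sh     # copy and build the chosen test harness
klee -search=dfs -libc=uclibc -function-filter-file=filter.txt harness.bc
```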

OpenSSL and GnuTLS each have two versions; the old ones (openssl-0.9.7h and gnutls-1.4.2) were used for the feasibility study (see TABLE III of the paper).

Prior to v4.0.1, MatrixSSL used two different signature verification functions for checking signatures of CRLs and of certificates, hence the two directories matrixssl-3-9-1-cert and matrixssl-3-9-1-crl.

axtls-2.1.3-fix was used to test the patch we developed for axTLS.

The symcert-test-len directories for each implementation correspond to TH3 discussed in the paper. While most implementations can simply use the same instrumented source tree for {TH1,TH2,TH3}, some needed workarounds to avoid concretization errors for TH3, and hence the additional <name-version>-symAsn1Len directories.

For each implementation, TH1 and TH2 share the same test harness code in the symcert directories, differing only in the header file (macros in symcert.h).

The configuration used for TH1:

    #define PRINT_PKCS1_BUFFER      0
    #define TEST_PADDING_LEN        1
    #define TEST_ALGO_OID_LEN       0
    #define TEST_ALGO_PARAM_LEN     1
    #define TEST_HASH_LEN           1
    #define USE_SYMBOLIC_PREFIX     1
    #define USE_SYMBOLIC_PADDING    1
    #define USE_SYMBOLIC_ALGOID     0

The configuration used for TH2:

    #define PRINT_PKCS1_BUFFER      0
    #define TEST_PADDING_LEN        1
    #define TEST_ALGO_OID_LEN       1
    #define TEST_ALGO_PARAM_LEN     0
    #define TEST_HASH_LEN           0
    #define USE_SYMBOLIC_PREFIX     0
    #define USE_SYMBOLIC_PADDING    0
    #define USE_SYMBOLIC_ALGOID     1

Extracted Path Constraints

Here are the parsed path constraints extracted from implementations of PKCS#1 v1.5 signature verification.

Note: Though the zip file shared above is only about 230MB, unzipping all of the path constraints and CPT information requires around 2.3GB of storage space in total.

The meanings of {TH1,TH2,TH3} are the same as above. In the context of PKCS#1 v1.5 signatures, when e, n, and H(m) are fixed, there should be only one accepting path for each of {TH1,TH2,TH3}. Multiple accepting paths in any of the three test harnesses (e.g. strongSwan) indicate leniency in signature verification.

Some of the path constraints extracted from implementations of X.509 certificate chain validation can be found here.

Cross-validation

Here are the results of cross-validating implementations. In [Chau et al., NDSS 2019], instead of pairwise comparisons, we used the new version of GnuTLS as the anchor and checked whether each of the other implementations agrees with it. Differential analysis like this can help find subtle problems, because sometimes, even when there is only one accepting path, the implementation can still be overly lenient (e.g. TH3 of mbedTLS).

Cross-validation can easily be done in parallel by running the run-civic-simple.sh script on the path constraints parsed by the parse-klee-dump.py script. Both scripts are part of the source release discussed earlier.

GUI for making ASN.1 objects concolic

Additionally, here is the GUI we made for making X.509 certificates concolic, as used in [Chau et al., IEEE S&P 2017].

To use the GUI, open asn1js/index.html in a modern browser and decode an ASN.1 object (say, an X.509 certificate). You can then click on a leaf node that you want to make symbolic, input a human-readable prefix (which will be used to name the corresponding symbolic variables), and adjust its length if desired. The corresponding concolic byte array in C code is generated automatically. The GUI should also be able to make other non-X.509 ASN.1 objects concolic.

This is based on the open-source ASN.1 JavaScript decoder made by Lapo Luchini.