Palisade Magazine

 
September 2005
sonali-gupta

Code Obfuscation - Part 2: Obfuscating Data Structures

by Sonali Gupta, SANS, GCIH |  Discuss this article »» (6)
Code Obfuscation

In the August issue of Palisade we introduced code obfuscation to protect software from reverse engineering. Obfuscation is the process of transforming the source code to confuse both a person or a program trying to reverse engineer the code. Source code is obfuscated using obfuscation software like Dotfuscator .

In this issue we look at data obfuscation, a class of obfuscation techniques that targets the data structures in a program. We present the different methods of data obfuscation with examples and also analyze their quality based on the parameters we defined earlier. These parameters were potency — the ability to confuse a human reverse engineer, resilience — the ability to fool a deobfuscator, and cost — the increase in execution time and memory required in the obfuscated code.

Before we go into the nitty gritty details, here’s an outline of the various techniques we will discuss:

  • Obfuscating Arrays
    • Split , Merge, Flatten, Fold, Reorder arrays
  • Obfuscating Classes
    • Split a class, Insert new class, Reorder members
  • Obfuscating Variables
    • Substitute code for static data, Merge, Promote, Ruffle values, Split variables

Obfuscating Arrays

To start with, let’s look at the different ways to make arrays difficult to read and make sense of. The objective is to confuse a human being or a deobfuscator from understanding the true nature and purpose of the array.

First, an array could be split into multiple sub-arrays to confuse the reader about its structure. For example, consider an array which holds a list of names: if we split it into two arrays, it will be interpreted as two different lists of names.

Array 01

A related idea is to merge multiple arrays into one, like this:

Array 02

We could even fold an array, i.e. increase the number of dimensions…

Array 03

…Or flatten an array, i.e. reduce the number of dimensions.

Array 04

Each of the above makes the array more difficult to interpret. Observe that array splitting and folding increase the complexity of the code. They thus confuse the reverse engineer, i.e. increase the potency of the transform. On the other hand, array merging and flattening decrease the complexity. But they increase the obscurity of a program because they introduce bogus structure or remove structure from the original program.

Another method for obfuscating arrays is to reorder elements within the array. A mapping function f can be defined such that the ith element in the original array is relocated to the jth position where j = f(i) . This function can be used for the mapping of elements in the original and reordered array. Reverse engineers can figure out what’s going on here, so this has low potency and resilience; but these are also free since there is no change in memory or execution time associated with shuffling the positions of array elements.

Obfuscating Classes

Now that we have seen how arrays can be obfuscated, let’s look at obfuscating more complex data structures: classes. Our objective is to confuse the reverse engineer from making sense of the exact working of the class.

The complexity of a class increases with its depth, i.e. distance from the root of the inheritance tree. Intuitively, you’d have guessed that increasing the depth and the number of descendants for a class are the means to obfuscate a class.

Thus to start with, a class can be split into multiple classes, each of which is derived from the previous class. In the example below we split class C into two classes, a base class C1 which has some of the members of C and a class C2 which is inherited from C1 and contains the rest of the members of C. The objects of class C in the original code would be substituted by objects of class C2 in the obfuscated code.

Code 01

The complexity of a class also increases with the number of its direct descendants. So we could confuse a reverse engineer by inserting new bogus classes . In the example below we start with a class C1 and its child C2. For obfuscation, we introduce a bogus class C3 which is inherited from C1 and then inherit C2 from this new class. The members that C3 introduces, i.e. variable v3 and method M3() are bogus methods introduced for obfuscation.

Code 02

These methods have medium potency; but the resilience is low because a deobfuscator can easily merge the split classes or remove the bogus classes. To increase the resilience, splitting and insertion are often combined in practice. Also, objects of the new class are instantiated and their methods invoked to fool the deobfuscator from recognizing these as bogus classes. The cost associated with these transformations is low.

Yet another method to obfuscate classes is to play with the ordering of variables. Since source code is mostly organized to maximize its locality, logically related items occur near each other. Looking at the order in which variables, methods, etc. are declared can prove useful for reverse engineering. Hence, changing the ordering of methods and instance variables also makes a class more obscure.

Obfuscating Variables

Let’s now look at a class of techniques to obfuscate the humble variable.

The simplest technique is to substitute a variable with code . Static data such as character strings can be substituted with a piece of code that generates it dynamically. To confuse the reverse engineer further, this code could also generate other extraneous, nonsense strings. The potency and resilience increases a lot if we use multiple small functions to compute the static strings, and if these functions are then scattered in the normal control flow of the program.

Here’s a simple example of how the literal string “abc” can be obfuscated:

Code 03

A second method is to merge multiple variables into a single variable assuming the combined range of these variables fit into the target variable. For example, two 32-bit integers X and Y can be merged into a 64-bit long variable V using the formula V = (232 *Y + X). Y occupies the first 32 bits and X the next 32 bits of the 64-bit variable. The resilience in this case is low because a deobfuscator can also guess that the variable V consists of two merged variables by examining the arithmetic operations. A variation on this theme is to merge variables v1, v2, …, vk into an array V[k] of an appropriate type (a type that can hold each of these different variables). For example, if the variables are of type short integer, integer and long integer, they can be merged into an array of type long integer.

Another interesting technique is to change the scope of a variable. Transform a local variable to a global variable, for example.

Code 04

Suppose we have functions F1() and F2() both using a local variable i and F1 and F2 do not execute at the same time, the variable i can be promoted to a global variable and shared by the two. It takes the reverse engineer longer to figure out the exact meaning of that global variable, that it is actually two independent variables hidden behind one.

Variables can also be promoted from a specialized storage class to a more generic one to increase complexity. For example, an integer variable can be transformed into an integer object.

Code 05

While each of the above transforms are not very effective on there own, they can be useful when combined with other transforms.

A more powerful technique is to ruffle the values directly. An example is replacing a variable i by a derived value c1*i +c2 . Suppose we choose c1 = 4, c2 = 1. In this transformation again the potency, resilience as well as cost increases with increasing complexity of the transformation chosen.

Code 06

A very effective obfuscation technique for variables which take a fixed range of values (like Boolean variables) is to split them into multiple variables of different types. The idea here is to confuse the reverse engineer by disguising the actual number of variables being used as well as their data type - the split variables can be different in type form the original variable. This is the last and also most complex technique we discuss today, so let’s go step by step.

To split a variable V into two variables p and q, we need to define:

  • A function that maps p and q to V, V = f(p,q)
  • A reverse function g(V) that maps V to p and q
  • New operations defined on p and q that correspond to the operations on V

For instance, let’s split a Boolean variable V into two short integer variables p and q. We design that p and q take only integer values 0 or 1. Further, we define that p and q having the same value i.e. (p = 0, q = 0) or (p = 1, q = 1) corresponds to V = false and p and q having different values i.e. (p = 0, q = 1) or (p = 1, q = 0) corresponds to V = true.

We can then define the functions f(p,q) and g(V) as follows:

g(V)

f(p,q)

V

p q

0

0

False

0

1

True

1

0

True

1

1

False

Next we define counterparts for operations on the boolean variable V in terms of p and q. These can be specified in the form of a lookup table for each operator. So all logical operations on the Boolean V are masked as arithmetic operations on the integers p and q. That’s a good way to confuse the reverse engineer. The potency and resilience of the transformation increases with increase in the number of variables we split the original variable into, but so does the cost.

Summary

We have looked at different data obfuscation techniques in this article. Let’s quickly recap how effective each of these techniques are in terms of Potency, Resilience and Cost.

Transform

Potency

Resilience

Cost

Obfuscating Arrays

Split array

Varies

Weak

Free

Merge array

Varies

Weak

Free

Flatten array

Varies

Weak

Free

Fold array

Varies

Weak

Free

Reorder array

Low

Weak

Cheap

Obfuscating Classes

Split class

Medium

Low

Free

Insert class

Medium

Low

Free

Reorder class members

Low

One-way

Free

Obfuscating Variables

Substitute static data with code

Varies with complexity of function

Merge variables

Low

Weak

Free

Promote variables

Low

Strong

Free

Ruffle the value

Varies with complexity of transform

Split variables

Varies with number of variables

What’s next in Code Obfuscation?

Data obfuscation does not make your programs “fool-proof” against reverse engineering; but it adds a second level of defense. The job of the reverse engineer or deobfuscator becomes so much more difficult now. Next month, we’ll look at Control Obfuscation, another group of interesting techniques to making your code secure from reverse engineering.

Discussion is open for this article — there are 6 reader comments. Add yours.