
6  Performance

We have measured the time and memory requirements of these implementations on the three examples. Execution times were collected with the Unix.times function, while memory usage was measured as the top_heap_words reported by the quick_stat function of the OCaml Gc module (which provides an interface to the garbage collector).
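A minimal sketch of such a measurement harness follows; the functions Unix.times and Gc.quick_stat are the real OCaml APIs mentioned above, while run_benchmark is a placeholder for one of our benchmark programs.

```ocaml
(* Run a benchmark and report (CPU time in seconds, peak heap size in
   words).  [run_benchmark] is a hypothetical placeholder for the
   program under test. *)
let time_and_memory run_benchmark =
  let t0 = Unix.times () in
  run_benchmark ();
  let t1 = Unix.times () in
  let st = Gc.quick_stat () in
  (* User + system CPU time consumed by the run. *)
  let cpu =
    (t1.Unix.tms_utime -. t0.Unix.tms_utime)
    +. (t1.Unix.tms_stime -. t0.Unix.tms_stime)
  in
  (cpu, st.Gc.top_heap_words)
```

Gc.quick_stat is preferred here over Gc.stat because it does not trigger a major collection, so the measurement itself perturbs the program less.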

All the programs were run on a PC with an Intel Core 2 Duo CPU clocked at 2.33 GHz and 2 GB of memory, running Linux kernel 2.6.26 on the amd64 architecture. Software versions were OCaml 3.11.2, caml-shift (August 2010 release), and equeue 2.2.9.

All our examples use very simple threads that cooperate heavily. kpn has a fixed number of threads; sieve is interesting because it constantly creates new threads. sorter has both a number of threads (all spawned initially) and a number of operations that depend on the problem size. Moreover, its number of threads can easily be made huge: for sorting a list of 3000 numbers, there are about 4.5 million threads. To measure thread creation time alone, we also run it with a -d option that terminates the program as soon as the sorting network has been set up.

In the following text and graphs, sys denotes the implementation using system threads, vm the VM threads, callcc and dlcont the two continuation-capture implementations, tramp the trampolined style, cont the continuation monad, promise the promise monad, and equeue the event-based one.


(a) Bytecode version
(b) Native-code version
Figure 10: kpn, execution time


(a) Bytecode version
(b) Native-code version
Figure 11: sieve, execution time


(a) Bytecode version
(b) Native-code version
Figure 12: sorter, execution time


(a) Bytecode version
(b) Native-code version
Figure 13: sorter -d, setup time only

Execution time

Figure 10 shows the execution time for kpn on a log-log graph. callcc is notably slow: even the heavyweight sys is much faster, and vm is ten times faster still. The other implementations perform rather similarly, tramp being slightly better in the bytecode version.

We note that dlcont performance is on par with promise and equeue for bytecode but slower in native code. VM threads are much better than system threads and are only slightly slower than the lightweight implementations.

Figure 11 shows the execution times for the sieve. equeue performance is terrible (much worse than sys). The problem lies in the implementation of the Equeue module. As we said, events are presented to each handler in turn until one accepts them. In effect, the threads are performing active wait on the MVars. Thus, equeue does not scale with the number of threads.

As a side note, it seems that a simple change to the Equeue implementation would dramatically improve performance for the sieve: events should be presented only to (or starting with) handlers set up after the handler that generated them, in the order they were set up. This way each event would be presented immediately to the handler that is waiting for it. Of course, this would exploit the very particular communication pattern of the concurrent sieve and is not generally applicable.
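The two dispatch strategies can be illustrated with a toy model; this is not Equeue's actual interface, just a sketch in which a handler returns true when it accepts an event.

```ocaml
(* Toy model of event dispatch, NOT the real Equeue API.  A handler
   returns [true] if it accepts the event. *)
type 'a handler = 'a -> bool

(* Current strategy: scan all handlers from the beginning until one
   accepts the event -- a linear scan per event, hence the active
   wait described above. *)
let dispatch_scan (handlers : 'a handler list) ev =
  List.exists (fun h -> h ev) handlers

(* Suggested strategy: start scanning just after the producing
   handler [i] (wrapping around), so a pipeline successor sees the
   event first. *)
let dispatch_after (handlers : 'a handler array) i ev =
  let n = Array.length handlers in
  let rec loop k =
    k < n && (handlers.((i + 1 + k) mod n) ev || loop (k + 1))
  in
  loop 0
```

In a sieve-like pipeline, where each stage only ever sends to its immediate successor, dispatch_after finds the consumer on the first probe instead of after O(n) probes.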

vm and sys are both much slower than the lightweight implementations, among which tramp is the fastest (with cont only slightly behind). Times double for promise and double again for callcc and dlcont in bytecode. dlcont performs very poorly in native code.

For sorter (Figure 12), callcc makes the system thrash; equeue is not shown, while sys is shown only for lists of sizes 100 and 200. The number of threads used by sorter is about n×(n−1)/2, with n the list size, which means 19900 threads for n=200 and 44850 for n=300. vm performs very badly. Here again, tramp and cont are notably better. promise is always significantly slower than tramp and cont. This is certainly due to the much more complex implementation of its bind operator.
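The thread counts quoted above follow directly from the formula; a one-liner makes them easy to check:

```ocaml
(* Number of comparator threads in the sorting network for a list of
   size n: n*(n-1)/2. *)
let sorter_threads n = n * (n - 1) / 2
```

For n=200 this gives 19900, for n=300 it gives 44850, and for n=3000 it gives 4498500, the "about 4.5 million threads" mentioned earlier.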

Figure 13 shows the time to set up the sorter network without running it. The main difference is that dlcont and callcc are rather good this time. Indeed, since the threads are not run, no continuation captures are performed!

This is the only figure where some bytecode results are (slightly) better than the native-code ones. According to OCaml’s documentation, native-code executables are faster but also bigger, which can translate into larger startup times, but this alone would not explain what we see here. Memory allocation may be slower, since sorter -d essentially allocates (a large number of) threads and MVars.

Memory usage

We don’t show the graphs for kpn: they are as expected, with all implementations having the same constant memory requirement, with the notable exception of callcc, whose memory consumption increases roughly linearly with time. There is clearly a memory leak, but the authors had warned us of its experimental status.

The graphs for the sieve are shown in Figure 14. tramp and cont are clearly the best. equeue is good too, but values are shown only for the first few points since the program is so slow. We don’t show sys since we would also have to measure the memory used by the operating system itself to manage threads. vm is hidden behind dlcont.

dlcont is much above the other implementations for sorter in Figure 15, but is quite good (as is callcc) with sorter -d since, again, no continuation captures occur. We don’t include the graphs for native code; they mainly show that promise is better in native code than in bytecode.


(a) Bytecode version
(b) Native-code version
Figure 14: sieve, memory usage


(a) sorter
(b) sorter -d
Figure 15: sorter, memory usage for bytecode

Finally, Figures 16, 17, and 18 present the same data from a different view: they show the memory requirements per thread. Apart from dlcont, which is very good in native code, tramp and cont are always below the other implementations. It is interesting to see in Figures 17 and 18 that their advantage is much larger when the threads are running. The advantage over promise is probably caused by the relative complexity introduced by the handling of promises.


(a) Bytecode version
(b) Native-code version
Figure 16: sieve, memory usage per thread


(a) Bytecode version
(b) Native-code version
Figure 17: sorter, memory usage per thread


(a) Bytecode version
(b) Native-code version
Figure 18: sorter -d, memory usage per thread

Other comments

It has been noted [2] that continuations used for implementing threads are one-shot continuations. Specific optimized implementations may be designed for them, but no such implementation currently exists for OCaml. Likewise, the Equeue module is clearly designed with long-standing handlers in mind. Since we use one-shot handlers, we pay the cost of throwing an exception (to remove the current handler) at each cooperation point.

