2009-04-28

Testing concurrent code doesn't work

Here is an example that I used in my concurrent programming course this winter.

I'd originally intended to illustrate how easy it is to create race conditions. The example I ended up with illustrates a more important point: how futile it can be to try to discover race conditions through testing.

Here is the code in Java:

public class Race {
 volatile int x = 0 ;

 public static void main( String[] args ) throws InterruptedException {
  (new Race()).go() ;
 }

 public void go() throws InterruptedException {
  final int ITERATIONS = 1000 ;

  // One thread increments x a thousand times ...
  Thread p = new Thread() {
   public void run() {
    for( int i = 0 ; i < ITERATIONS ; i++ ) {
     ++x ;
    }
   }
  } ;

  // ... while another decrements it a thousand times.
  Thread q = new Thread() {
   public void run() {
    for( int i = 0 ; i < ITERATIONS ; i++ ) {
     --x ;
    }
   }
  } ;

  System.out.println( "The initial value of x is: " + x ) ;

  // The two new threads start to run.
  p.start() ;  q.start() ;

  // Now both new threads are running or done.
  // The main thread waits until both are done.
  p.join() ;  q.join() ;

  System.out.println( "After " + ITERATIONS + " increments and " + ITERATIONS + " decrements" ) ;
  System.out.println( "the final value of x is: " + x ) ;
 }
}

This was intended to show the race between incrementing and decrementing. There are one thousand increments and one thousand decrements of x. You might expect that the final value of x would be the same as its initial value, which is 0. But increment and decrement are not indivisible actions. Each increment consists of a load of x, an add, and a store of x; decrements are similar. If an increment and a decrement both load before either stores, the first action to store will have its result overwritten by the second. Given a thousand chances, it seemed likely that at least one increment or decrement would be lost. Unless the number of lost decrements and the number of lost increments just happen to be equal, we should see a nonzero final value for x.
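
To make the lost update concrete, here is a minimal sketch (not part of the original program) that plays out one such interleaving sequentially. The names tempP and tempQ are hypothetical; they stand for the per-thread registers that hold the loaded value.

public class LostUpdate {
 static int x = 0 ;

 public static void main( String[] args ) {
  // One possible interleaving of an increment (from thread p)
  // and a decrement (from thread q), written out step by step.
  int tempP = x ;    // p loads x           (reads 0)
  int tempQ = x ;    // q loads x           (also reads 0)
  x = tempP + 1 ;    // p stores its result (x is now 1)
  x = tempQ - 1 ;    // q stores its result (x is now -1; p's increment is lost)
  System.out.println( "x = " + x ) ;  // prints -1, not 0
 }
}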

I gave it a try:

The initial value of x is: 0
After 1000 increments and 1000 decrements
the final value of x is: 0

What gives? Perhaps not enough chances for the race to manifest. I tried ten thousand increments and decrements:

The initial value of x is: 0
After 10000 increments and 10000 decrements
the final value of x is: 0

OK, how about one million:

The initial value of x is: 0
After 1000000 increments and 1000000 decrements
the final value of x is: 0

At this point, I was starting to wonder whether my understanding of Java was correct or whether my virtual machine had translated the conceptual load-increment-store sequence to a single atomic instruction.
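
For contrast, if the virtual machine really were making each update atomic, the program would behave like the following sketch, which is not part of the original example: java.util.concurrent.atomic.AtomicInteger performs each read-modify-write as a single indivisible operation, so the final value is always 0.

import java.util.concurrent.atomic.AtomicInteger ;

public class AtomicRace {
 final AtomicInteger x = new AtomicInteger( 0 ) ;

 public static void main( String[] args ) throws InterruptedException {
  (new AtomicRace()).go() ;
 }

 public void go() throws InterruptedException {
  final int ITERATIONS = 1000000 ;

  Thread p = new Thread() {
   public void run() {
    for( int i = 0 ; i < ITERATIONS ; i++ ) {
     x.incrementAndGet() ;   // one indivisible read-modify-write
    }
   }
  } ;

  Thread q = new Thread() {
   public void run() {
    for( int i = 0 ; i < ITERATIONS ; i++ ) {
     x.decrementAndGet() ;
    }
   }
  } ;

  p.start() ;  q.start() ;
  p.join() ;  q.join() ;
  System.out.println( "The final value of x is: " + x.get() ) ;  // always 0
 }
}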

Ten million:

The initial value of x is: 0
After 10000000 increments and 10000000 decrements
the final value of x is: 4073375

Finally, the error manifested itself.

In retrospect, it was obvious why my testing strategy was so ineffective: the start-up time for the q thread was so long that the p thread had finished before the q thread even started. The two loops were effectively running one after the other, not concurrently.
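
One way to eliminate that start-up skew (a sketch, not part of the original experiment) is to hold both threads at a CountDownLatch "starting gate" so they begin their loops at roughly the same moment; this should make the race far more likely to appear even at small iteration counts.

import java.util.concurrent.CountDownLatch ;

public class RaceWithGate {
 volatile int x = 0 ;

 public static void main( String[] args ) throws InterruptedException {
  (new RaceWithGate()).go() ;
 }

 public void go() throws InterruptedException {
  final int ITERATIONS = 1000 ;
  final CountDownLatch gate = new CountDownLatch( 1 ) ;

  Thread p = new Thread() {
   public void run() {
    try { gate.await() ; } catch( InterruptedException e ) { return ; }
    for( int i = 0 ; i < ITERATIONS ; i++ ) {
     ++x ;
    }
   }
  } ;

  Thread q = new Thread() {
   public void run() {
    try { gate.await() ; } catch( InterruptedException e ) { return ; }
    for( int i = 0 ; i < ITERATIONS ; i++ ) {
     --x ;
    }
   }
  } ;

  p.start() ;  q.start() ;
  gate.countDown() ;   // open the gate: both threads begin their loops together
  p.join() ;  q.join() ;
  System.out.println( "The final value of x is: " + x ) ;
 }
}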

The moral: when concurrency is in play, if even obvious errors that we know are there aren't easily revealed by testing, what chance do we have against subtle errors that we don't know are there?