7. Helgrind: a thread error detector

Table of Contents

7.1. Overview
7.2. Detected errors: Misuses of the POSIX pthreads API
7.3. Detected errors: Inconsistent Lock Orderings
7.4. Detected errors: Data Races
7.4.1. A Simple Data Race
7.4.2. Helgrind's Memory State Machine
7.4.3. Transfers of Exclusive Ownership Between Threads
7.4.4. Restoration of Exclusive Ownership
7.4.5. A Summary of the Race Detection Algorithm
7.4.6. Interpreting Race Error Messages
7.5. Hints and Tips for Effective Use of Helgrind
7.6. Helgrind Options
7.7. A To-Do List for Helgrind

To use this tool, you must specify --tool=helgrind on the Valgrind command line.

7.1. Overview

Helgrind is a Valgrind tool for detecting synchronisation errors in C, C++ and Fortran programs that use the POSIX pthreads threading primitives.

The main abstractions in POSIX pthreads are: a set of threads sharing a common address space, thread creation, thread joinage, thread exit, mutexes (locks), condition variables (inter-thread event notifications), reader-writer locks, and semaphores.

Helgrind is aware of all these abstractions and tracks their effects as accurately as it can. Currently it does not correctly handle pthread barriers and pthread spinlocks, although it will not object if you use them. On x86 and amd64 platforms, it understands and partially handles implicit locking arising from the use of the LOCK instruction prefix.

Helgrind can detect three classes of errors, which are discussed in detail in the next three sections:

Following those is a section containing hints and tips on how to get the best out of Helgrind.

Then there is a summary of command-line options.

Finally, there is a brief summary of areas in which Helgrind could be improved.

7.2. Detected errors: Misuses of the POSIX pthreads API

Helgrind intercepts calls to many POSIX pthreads functions, and is therefore able to report on various common problems. Although these are unglamourous errors, their presence can lead to undefined program behaviour and hard-to-find bugs later in execution. The detected errors are:

  • unlocking an invalid mutex

  • unlocking a not-locked mutex

  • unlocking a mutex held by a different thread

  • destroying an invalid or a locked mutex

  • recursively locking a non-recursive mutex

  • deallocation of memory that contains a locked mutex

  • passing mutex arguments to functions expecting reader-writer lock arguments, and vice versa

  • when a POSIX pthread function fails with an error code that must be handled

  • when a thread exits whilst still holding locked locks

  • calling pthread_cond_wait with a not-locked mutex, or one locked by a different thread

Checks pertaining to the validity of mutexes are generally also performed for reader-writer locks.

Various kinds of this-can't-possibly-happen events are also reported. These usually indicate bugs in the system threading library.

Reported errors always contain a primary stack trace indicating where the error was detected. They may also contain auxiliary stack traces giving additional information. In particular, most errors relating to mutexes will also tell you where that mutex first came to Helgrind's attention (the "was first observed at" part), so you have a chance of figuring out which mutex it is referring to. For example:

Thread #1 unlocked a not-locked lock at 0x7FEFFFA90
   at 0x4C2408D: pthread_mutex_unlock (hg_intercepts.c:492)
   by 0x40073A: nearly_main (tc09_bad_unlock.c:27)
   by 0x40079B: main (tc09_bad_unlock.c:50)
  Lock at 0x7FEFFFA90 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x40071F: nearly_main (tc09_bad_unlock.c:23)
   by 0x40079B: main (tc09_bad_unlock.c:50)

Helgrind has a way of summarising thread identities, as evidenced here by the text "Thread #1". This is so that it can speak about threads and sets of threads without overwhelming you with details. See below for more information on interpreting error messages.

7.3. Detected errors: Inconsistent Lock Orderings

In this section, and in general, to "acquire" a lock simply means to lock that lock, and to "release" a lock means to unlock it.

Helgrind monitors the order in which threads acquire locks. This allows it to detect potential deadlocks which could arise from the formation of cycles of locks. Detecting such inconsistencies is useful because, whilst actual deadlocks are fairly obvious, potential deadlocks may never be discovered during testing and could later lead to hard-to-diagnose in-service failures.

The simplest example of such a problem is as follows.

  • Imagine some shared resource R, which, for whatever reason, is guarded by two locks, L1 and L2, which must both be held when R is accessed.

  • Suppose a thread acquires L1, then L2, and proceeds to access R. The implication of this is that all threads in the program must acquire the two locks in the order first L1 then L2. Not doing so risks deadlock.

  • The deadlock could happen if two threads -- call them T1 and T2 -- both want to access R. Suppose T1 acquires L1 first, and T2 acquires L2 first. Then T1 tries to acquire L2, and T2 tries to acquire L1, but those locks are both already held. So T1 and T2 become deadlocked.

Helgrind builds a directed graph indicating the order in which locks have been acquired in the past. When a thread acquires a new lock, the graph is updated, and then checked to see if it now contains a cycle. The presence of a cycle indicates a potential deadlock involving the locks in the cycle.

In simple situations, where the cycle only contains two locks, Helgrind will show where the required order was established:

Thread #1: lock order "0x7FEFFFAB0 before 0x7FEFFFA80" violated
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x40081F: main (tc13_laog1.c:24)
  Required order was established by acquisition of lock at 0x7FEFFFAB0
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x400748: main (tc13_laog1.c:17)
  followed by a later acquisition of lock at 0x7FEFFFA80
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x400773: main (tc13_laog1.c:18)

When there are more than two locks in the cycle, the error is equally serious. However, at present Helgrind does not show the locks involved, so as to avoid flooding you with information. That could be fixed in future. For example, here is a an example involving a cycle of five locks from a naive implementation the famous Dining Philosophers problem (see helgrind/tests/tc14_laog_dinphils.c). In this case Helgrind has detected that all 5 philosophers could simultaneously pick up their left fork and then deadlock whilst waiting to pick up their right forks.

Thread #6: lock order "0x6010C0 before 0x601160" violated
   at 0x4C23C91: pthread_mutex_lock (hg_intercepts.c:388)
   by 0x4007C0: dine (tc14_laog_dinphils.c:19)
   by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178)
   by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so)
   by 0x51054CC: clone (in /lib64/libc-2.5.so)

7.4. Detected errors: Data Races

A data race happens, or could happen, when two threads access a shared memory location without using suitable locks to ensure single-threaded access. Such missing locking can cause obscure timing dependent bugs. Ensuring programs are race-free is one of the central difficulties of threaded programming.

Reliably detecting races is a difficult problem, and most of Helgrind's internals are devoted to do dealing with it. As a consequence this section is somewhat long and involved. We begin with a simple example.

7.4.1. A Simple Data Race

About the simplest possible example of a race is as follows. In this program, it is impossible to know what the value of var is at the end of the program. Is it 2 ? Or 1 ?

#include <pthread.h>

int var = 0;

void* child_fn ( void* arg ) {
   var++; /* Unprotected relative to parent */ /* this is line 6 */
   return NULL;
}

int main ( void ) {
   pthread_t child;
   pthread_create(&child, NULL, child_fn, NULL);
   var++; /* Unprotected relative to child */ /* this is line 13 */
   pthread_join(child, NULL);
   return 0;
}

The problem is there is nothing to stop var being updated simultaneously by both threads. A correct program would protect var with a lock of type pthread_mutex_t, which is acquired before each access and released afterwards. Helgrind's output for this program is:

Thread #1 is the program's root thread

Thread #2 was created
   at 0x510548E: clone (in /lib64/libc-2.5.so)
   by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so)
   by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
   by 0x4C23870: pthread_create@* (hg_intercepts.c:198)
   by 0x4005F1: main (simple_race.c:12)

Possible data race during write of size 4 at 0x601034
   at 0x4005F2: main (simple_race.c:13)
  Old state: shared-readonly by threads #1, #2
  New state: shared-modified by threads #1, #2
  Reason:    this thread, #1, holds no consistent locks
  Location 0x601034 has never been protected by any lock

This is quite a lot of detail for an apparently simple error. The last clause is the main error message. It says there is a race as a result of a write of size 4 (bytes), at 0x601034, which is presumably the address of var, happening in function main at line 13 in the program.

Note that it is purely by chance that the race is reported for the parent thread's access. It could equally have been reported instead for the child's access, at line 6. The error will only be reported for one of the locations, since neither the parent nor child is, by itself, incorrect. It is only when both access var without a lock that an error exists.

The error message shows some other interesting details. The sections below explain them. Here we merely note their presence:

  • Helgrind maintains some kind of state machine for the memory location in question, hence the "Old state:" and "New state:" lines.

  • Helgrind keeps track of which threads have accessed the location: "threads #1, #2". Before printing the main error message, it prints the creation points of these two threads, so you can see which threads it is referring to.

  • Helgrind tries to provide an explanation of why the race exists: "Location 0x601034 has never been protected by any lock".

Understanding the memory state machine is central to understanding Helgrind's race-detection algorithm. The next three subsections explain this.

7.4.2. Helgrind's Memory State Machine

Helgrind tracks the state of every byte of memory used by your program. There are a number of states, but only three are interesting:

  • Exclusive: memory in this state is regarded as owned exclusively by one particular thread. That thread may read and write it without a lock. Even in highly threaded programs, the majority of locations never leave the Exclusive state, since most data is thread-private.

  • Shared-Readonly: memory in this state is regarded as shared by multiple threads. In this state, any thread may read the memory without a lock, reflecting the fact that readonly data may safely be shared between threads without locking.

  • Shared-Modified: memory in this state is regarded as shared by multiple threads, at least one of which has written to it. All participating threads must hold at least one lock in common when accessing the memory. If no such lock exists, Helgrind reports a race error.

Let's review the simple example above with this in mind. When the program starts, var is not in any of these states. Either the parent or child thread gets to its var++ first, and thereby thereby gets Exclusive ownership of the location.

The later-running thread now arrives at its var++ statement. It first reads the existing value from memory. Because var is currently marked as owned exclusively by the other thread, its state is changed to shared-readonly by both threads.

This same thread adds one to the value it has and stores it back in var. This causes another state change, this time to the shared-modified state. Because Helgrind has also been tracking which threads hold which locks, it can see that var is in shared-modified state but no lock has been used to consistently protect it. Hence a race is reported exactly at the transition from shared-readonly to shared-modified.

The essence of the algorithm is this. Helgrind keeps track of each memory location that has been accessed by more than one thread. For each such location it incrementally infers the set of locks which have consistently been used to protect that location. If the location's lockset becomes empty, and at some point one of the threads attempts to write to it, a race is then reported.

This technique is known as "lockset inference" and was introduced in: "Eraser: A Dynamic Data Race Detector for Multithreaded Programs" (Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro and Thomas Anderson, ACM Transactions on Computer Systems, 15(4):391-411, November 1997).

Lockset inference has since been widely implemented, studied and extended. Helgrind incorporates several refinements aimed at avoiding the high false error rate that naive versions of the algorithm suffer from. A summary of the complete algorithm used by Helgrind is presented below. First, however, it is important to understand details of transitions pertaining to the Exclusive-ownership state.

7.4.3. Transfers of Exclusive Ownership Between Threads

As presented, the algorithm is far too strict. It reports many errors in perfectly correct, widely used parallel programming constructions, for example, using child worker threads and worker thread pools.

To avoid these false errors, we must refine the algorithm so that it keeps memory in an Exclusive ownership state in cases where it would otherwise decay into a shared-readonly or shared-modified state. Recall that Exclusive ownership is special in that it grants the owning thread the right to access memory without use of any locks. In order to support worker-thread and worker-thread-pool idioms, we will allow threads to steal exclusive ownership of memory from other threads under certain circumstances.

Here's an example. Imagine a parent thread creates child threads to do units of work. For each unit of work, the parent allocates a work buffer, fills it in, and creates the child thread, handing it a pointer to the buffer. The child reads/writes the buffer and eventually exits, and the waiting parent then extracts the results from the buffer:

typedef ... Buffer;

pthread_t child;
Buffer    buf;

/* ---- Parent ---- */                          /* ---- Child ---- */

/* parent writes workload into buf */
pthread_create( &child, child_fn, &buf );

/* parent does not read */                      void child_fn ( Buffer* buf ) {
/* or write buf */                                 /* read/write buf */
                                                }

pthread_join ( child );
/* parent reads results from buf */

Although buf is accessed by both threads, neither uses locks, yet the program is race-free. The essential observation is that the child's creation and exit create synchronisation events between it and the parent. These force the child's accesses to buf to happen after the parent initialises buf, and before the parent reads the results from buf.

To model this, Helgrind allows the child to steal, from the parent, exclusive ownership of any memory exclusively owned by the parent before the pthread_create call. Similarly, once the parent's pthread_join call returns, it can steal back ownership of memory exclusively owned by the child. In this way ownership of buf is transferred from parent to child and back, so the basic algorithm does not report any races despite the absence of any locking.

Note that the child may only steal memory owned by the parent prior to the pthread_create call. If the child attempts to read or write memory which is also accessed by the parent in between the pthread_create and pthread_join calls, an error is still reported.

This technique was introduced with the name "thread lifetime segments" in "Runtime Checking of Multithreaded Applications with Visual Threads" (Jerry J. Harrow, Jr, Proceedings of the 7th International SPIN Workshop on Model Checking of Software Stanford, California, USA, August 2000, LNCS 1885, pp331--342). Helgrind implements an extended version of it. Specifically, Helgrind allows transfer of exclusive ownership in the following situations:

  • At thread creation: a child can acquire ownership of memory held exclusively by the parent prior to the child's creation.

  • At thread joining: the joiner (thread not exiting) can acquire ownership of memory held exclusively by the joinee (thread that is exiting) at the point it exited.

  • At condition variable signallings and broadcasts. A thread Tw which completes a pthread_cond_wait call as a result of a signal or broadcast on the same condition variable by some other thread Ts, may acquire ownership of memory held exclusively by Ts prior to the pthread_cond_signal/broadcast call.

  • At semaphore posts (sem_post) calls. A thread Tw which completes a sem_wait call call as a result of a sem_post call on the same semaphore by some other thread Tp, may acquire ownership of memory held exclusively by Tp prior to the sem_post call.

7.4.4. Restoration of Exclusive Ownership

Another common idiom is to partition the lifetime of the program as a whole into several distinct phases. In some of those phases, a memory location may be accessed by multiple threads and so require locking. In other phases only one thread exists and so can access the memory without locking. For example:

int             var = 0;                         /* shared variable */
pthread_mutex_t mx  = PTHREAD_MUTEX_INITIALIZER; /* guard for var */
pthread_t       child;

/* ---- Parent ---- */                          /* ---- Child ---- */

var += 1; /* no lock used */

pthread_create( &child, child_fn, NULL );

                                                void child_fn ( void* uu ) {
pthread_mutex_lock(&mx);                           pthread_mutex_lock(&mx);         
var += 2;                                          var += 3;
pthread_mutex_unlock(&mx);                         pthread_mutex_unlock(&mx);
                                                }

pthread_join ( child );

var += 4; /* no lock used */

This program is correct, but using only the mechanisms described so far, Helgrind would report an error at var += 4. This is because, by that point, var is marked as being in the state "shared-modified and protected by the lock mx", but is being accessed without locking. Really, what we want is for var to return to the parent thread's exclusive ownership after the child thread has exited.

To make this possible, for every memory location Helgrind also keeps track of all the threads that have accessed that location -- its threadset. When a thread Tquitter joins back to Tstayer, Helgrind examines the locksets of all memory in shared-modified or shared-readable state. In each such lockset, if Tquitter is mentioned, it is removed and replaced by Tstayer. If, as a result, a lockset becomes a singleton set containing Tstayer, then the location's state is changed to belongs-exclusively-to-Tstayer.

In our example, the result is exactly as we desire: var is reacquired exclusively by the parent after the child exits.

More generally, when a group of threads merges back to a single thread via a cascade of pthread_join calls, any memory shared by the group (or a subset of it) ends up being owned exclusively by the sole surviving thread. This significantly enhances Helgrind's flexibility, since it means that each memory location may make arbitrarily many transitions between exclusive and shared ownership. Furthermore, a different lock may protect the location during each period of shared ownership.

7.4.5. A Summary of the Race Detection Algorithm

Helgrind looks for memory locations which are accessed by more than one thread. For each such location, Helgrind records which of the program's locks were held by the accessing thread at the time of each access. The hope is to discover that there is indeed at least one lock which is consistently used by all threads to protect that location. If no such lock can be found, then there is apparently no consistent locking strategy being applied for that location, and so a possible data race might result. Helgrind accordingly reports an error.

In practice this discipline is far too simplistic, and is unusable since it reports many races in some widely used and known-correct programming disciplines. Helgrind's checking therefore incorporates many refinements to this basic idea, and can be summarised as follows:

The following thread events are intercepted and monitored:

  • thread creation and exiting (pthread_create, pthread_join, pthread_exit)

  • lock acquisition and release (pthread_mutex_lock, pthread_mutex_unlock, pthread_rwlock_rdlock, pthread_rwlock_wrlock, pthread_rwlock_unlock)

  • inter-thread event notifications (pthread_cond_wait, pthread_cond_signal, pthread_cond_broadcast, sem_wait, sem_post)

Memory allocation and deallocation events are intercepted and monitored:

  • malloc/new/free/delete and variants

  • stack allocation and deallocation

All memory accesses are intercepted and monitored.

By observing the above events, Helgrind can infer certain aspects of the program's locking discipline. Programs which adhere to the following rules are considered to be acceptable:

  • A thread may allocate memory, and write initial values into it, without locking. That thread is regarded as owning the memory exclusively.

  • A thread may read and write memory which it owns exclusively, without locking.

  • Memory which is owned exclusively by one thread may be read by that thread and others without locking. However, in this situation no thread may do unlocked writes to the memory (except for the owner thread's initializing write).

  • Memory which is shared between multiple threads, one or more of which writes to it, must be protected by a lock which is correctly acquired and released by all threads accessing the memory.

Any violation of this discipline will cause an error to be reported. However, two exemptions apply:

  • A thread Y can acquire exclusive ownership of memory previously owned exclusively by a different thread X providing X's last access and Y's first access are separated by one of the following synchronization events:

    • X creates thread Y

    • X joins back to Y

    • X uses a condition-variable to signal at Y, and Y is waiting for that event

    • Y completes a semaphore wait as a result of X signalling on that same semaphore

    This refinement allows Helgrind to correctly track the ownership state of inter-thread buffers used in the worker-thread and worker-thread-pool concurrent programming idioms (styles).

  • Similarly, if thread Y joins back to thread X, memory exclusively owned by Y becomes exclusively owned by X instead. Also, memory that has been shared only by X and Y becomes exclusively owned by X. More generally, memory that has been shared by X, Y and some arbitrary other set S of threads is re-marked as shared by X and S. Hence, under the right circumstances, memory shared amongst multiple threads, all of which join into just one, can revert to the exclusive ownership state.

    In effect, each memory location may make arbitrarily many transitions between exclusive and shared ownership. Furthermore, a different lock may protect the location during each period of shared ownership. This significantly enhances the flexibility of the algorithm.

The ownership state, accessing thread-set and related lock-set for each memory location are tracked at 8-bit granularity. This means the algorithm is precise even for 16- and 8-bit memory accesses.

Helgrind correctly handles reader-writer locks in this framework. Locations shared between multiple threads can be protected during reads by locks held in either read-mode or write-mode, but can only be protected during writes by locks held in write-mode. Normal POSIX mutexes are treated as if they are reader-writer locks which are only ever held in write-mode.

Helgrind correctly handles POSIX mutexes for which recursive locking is allowed.

Helgrind partially correctly handles x86 and amd64 memory access instructions preceded by a LOCK prefix. Writes are correctly handled, by pretending that the LOCK prefix implies acquisition and release of a magic "bus hardware lock" mutex before and after the instruction. This unfortunately requires subsequent reads from such locations to also use a LOCK prefix, which is not required by the real hardware. Helgrind does not offer any equivalent handling for atomic sequences on PowerPC/POWER platforms created by the use of lwarx/stwcx instructions.

7.4.6. Interpreting Race Error Messages

Helgrind's race detection algorithm collects a lot of information, and tries to present it in a helpful way when a race is detected. Here's an example:

Thread #2 was created
   at 0x510548E: clone (in /lib64/libc-2.5.so)
   by 0x4E2F305: do_clone (in /lib64/libpthread-2.5.so)
   by 0x4E2F7C5: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
   by 0x4C23870: pthread_create@* (hg_intercepts.c:198)
   by 0x400CEF: main (tc17_sembar.c:195)

// And the same for threads #3, #4 and #5 -- omitted for conciseness

Possible data race during read of size 4 at 0x602174
   at 0x400BE5: gomp_barrier_wait (tc17_sembar.c:122)
   by 0x400C44: child (tc17_sembar.c:161)
   by 0x4C25DF7: mythread_wrapper (hg_intercepts.c:178)
   by 0x4E2F09D: start_thread (in /lib64/libpthread-2.5.so)
   by 0x51054CC: clone (in /lib64/libc-2.5.so)
  Old state: shared-modified by threads #2, #3, #4, #5
  New state: shared-modified by threads #2, #3, #4, #5
  Reason:    this thread, #2, holds no consistent locks
  Last consistently used lock for 0x602174 was first observed
   at 0x4C25D01: pthread_mutex_init (hg_intercepts.c:326)
   by 0x4009E4: gomp_barrier_init (tc17_sembar.c:46)
   by 0x400CBC: main (tc17_sembar.c:192)

Helgrind first announces the creation points of any threads referenced in the error message. This is so it can speak concisely about threads and sets of threads without repeatedly printing their creation point call stacks. Each thread is only ever announced once, the first time it appears in any Helgrind error message.

The main error message begins at the text "Possible data race during read". At the start is information you would expect to see -- address and size of the racing access, whether a read or a write, and the call stack at the point it was detected.

More interesting is the state transition caused by this access. This memory is already in the shared-modified state, and up to now has been consistently protected by at least one lock. However, the thread making the access in question (thread #2, here) does not hold any locks in common with those held during all previous accesses to the location -- "no consistent locks", in other words.

Finally, Helgrind shows the lock which has protected this location in all previous accesses. (If there is more than one, only one is shown). This can be a useful hint, because it typically shows the lock that the programmers intended to use to protect the location, but in this case forgot.

Here are some more examples of race reports. This not an exhaustive list of combinations, but should give you some insight into how to interpret the output.

Possible data race during write ...
  Old state: shared-readonly by threads #1, #2, #3
  New state: shared-modified by threads #1, #2, #3
  Reason:    this thread, #3, holds no consistent locks
  Location ... has never been protected by any lock

The location is shared by 3 threads, all of which have been reading it without locking ("has never been protected by any lock"). Now one of them is writing it. Regardless of whether the writer has a lock or not, this is still an error, because the write races against the previously observed reads.

Possible data race during read ...
  Old state: shared-modified by threads #1, #2, #3
  New state: shared-modified by threads #1, #2, #3
  Reason:    this thread, #3, holds no consistent locks
  Last consistently used lock for ... was first observed ...

The location is shared by 3 threads, all of which have been reading and writing it while (as required) holding at least one lock in common. Now it is being read without that lock being held. In the "Last consistently used lock" part, Helgrind offers its best guess as to the identity of the lock that should have been used.

Possible data race during write ...
  Old state: owned exclusively by thread #4
  New state: shared-modified by threads #4, #5
  Reason:    this thread, #5, holds no locks at all

A location that has so far been accessed exclusively by thread #4 has now been written by thread #5, without use of any lock. This can be a sign that the programmer did not consider the possibility of the location being shared between threads, or, alternatively, forgot to use the appropriate lock.

Note that thread #4 exclusively owns the location, and so has the right to access it without holding a lock. However, this message does not say that thread #4 is not using a lock for this location. Indeed, it could be using a lock for the location because it intends to make it available to other threads, one of which is thread #5 -- and thread #5 has forgotten to use the lock.

Also, this message implies that Helgrind did not see any synchronisation event between threads #4 and #5 that would have allowed #5 to acquire exclusive ownership from #4. See above for a discussion of transfers of exclusive ownership states between threads.

7.5. Hints and Tips for Effective Use of Helgrind

Helgrind can be very helpful in finding and resolving threading-related problems. Like all sophisticated tools, it is most effective when you understand how to play to its strengths.

Helgrind will be less effective when you merely throw an existing threaded program at it and try to make sense of any reported errors. It will be more effective if you design threaded programs from the start in a way that helps Helgrind verify correctness. The same is true for finding memory errors with Memcheck, but applies more here, because thread checking is a harder problem. Consequently it is much easier to write a correct program for which Helgrind falsely reports (threading) errors than it is to write a correct program for which Memcheck falsely reports (memory) errors.

With that in mind, here are some tips, listed most important first, for getting reliable results and avoiding false errors. The first two are critical. Any violations of them will swamp you with huge numbers of false data-race errors.

  1. Make sure your application, and all the libraries it uses, use the POSIX threading primitives. Helgrind needs to be able to see all events pertaining to thread creation, exit, locking and other synchronisation events. To do so it intercepts many POSIX pthread_ functions.

    Do not roll your own threading primitives (mutexes, etc) from combinations of the Linux futex syscall, counters and wotnot. These throw Helgrind's internal what's-going-on models way off course and will give bogus results.

    Also, do not reimplement existing POSIX abstractions using other POSIX abstractions. For example, don't build your own semaphore routines or reader-writer locks from POSIX mutexes and condition variables. Instead use POSIX reader-writer locks and semaphores directly, since Helgrind supports them directly.

    Helgrind directly supports the following POSIX threading abstractions: mutexes, reader-writer locks, condition variables (but see below), and semaphores. Currently spinlocks and barriers are not supported, although they could be in future. A prototype "safe" implementation of barriers, based on semaphores, is available: please contact the Valgrind authors for details.

    At the time of writing, the following popular Linux packages are known to implement their own threading primitives:

    • Qt version 4.X. Qt 3.X is fine, but not 4.X. Helgrind contains partial direct support for Qt 4.X threading, but this is not yet in a usable state. Assistance from folks knowledgeable in Qt 4 threading internals would be appreciated.

    • Runtime support library for GNU OpenMP (part of GCC), at least GCC versions 4.2 and 4.3. With some minor effort of modifying the GNU OpenMP runtime support sources, it is possible to use Helgrind on GNU OpenMP compiled codes. Please contact the Valgrind authors for details.

  2. Avoid memory recycling. If you can't avoid it, you must use tell Helgrind what is going on via the VALGRIND_HG_CLEAN_MEMORY client request (in helgrind.h).

    Helgrind is aware of standard memory allocation and deallocation that occurs via malloc/free/new/delete and from entry and exit of stack frames. In particular, when memory is deallocated via free, delete, or function exit, Helgrind considers that memory clean, so when it is eventually reallocated, its history is irrelevant.

    However, it is common practice to implement memory recycling schemes. In these, memory to be freed is not handed to malloc/delete, but instead put into a pool of free buffers to be handed out again as required. The problem is that Helgrind has no way to know that such memory is logically no longer in use, and its history is irrelevant. Hence you must make that explicit, using the VALGRIND_HG_CLEAN_MEMORY client request to specify the relevant address ranges. It's easiest to put these requests into the pool manager code, and use them either when memory is returned to the pool, or is allocated from it.

  3. Avoid POSIX condition variables. If you can, use POSIX semaphores (sem_t, sem_post, sem_wait) to do inter-thread event signalling. Semaphores with an initial value of zero are particularly useful for this.

    Helgrind only partially correctly handles POSIX condition variables. This is because Helgrind can see inter-thread dependencies between a pthread_cond_wait call and a pthread_cond_signal/broadcast call only if the waiting thread actually gets to the rendezvous first (so that it actually calls pthread_cond_wait). It can't see dependencies between the threads if the signaller arrives first. In the latter case, POSIX guidelines imply that the associated boolean condition still provides an inter-thread synchronisation event, but one which is invisible to Helgrind.

    The result of Helgrind missing some inter-thread synchronisation events is to cause it to report false positives. That's because missing such events reduces the extent to which it can transfer exclusive memory ownership between threads. So memory may end up in a shared-modified state when that was not intended by the application programmers.

    The root cause of this synchronisation lossage is particularly hard to understand, so an example is helpful. It was discussed at length by Arndt Muehlenfeld ("Runtime Race Detection in Multi-Threaded Programs", Dissertation, TU Graz, Austria). The canonical POSIX-recommended usage scheme for condition variables is as follows:

    b   is a Boolean condition, which is False most of the time
    cv  is a condition variable
    mx  is its associated mutex
    
    Signaller:                             Waiter:
    
    lock(mx)                               lock(mx)
    b = True                               while (b == False)
    signal(cv)                                wait(cv,mx)
    unlock(mx)                             unlock(mx)
    

    Assume b is False most of the time. If the waiter arrives at the rendezvous first, it enters its while-loop, waits for the signaller to signal, and eventually proceeds. Helgrind sees the signal, notes the dependency, and all is well.

    If the signaller arrives first, b is set to true, and the signal disappears into nowhere. When the waiter later arrives, it does not enter its while-loop and simply carries on. But even in this case, the waiter code following the while-loop cannot execute until the signaller sets b to True. Hence there is still the same inter-thread dependency, but this time it is through an arbitrary in-memory condition, and Helgrind cannot see it.

    By comparison, Helgrind's detection of inter-thread dependencies caused by semaphore operations is believed to be exactly correct.

    As far as I know, a solution to this problem that does not require source-level annotation of condition-variable wait loops is beyond the current state of the art.

  4. Make sure you are using a supported Linux distribution. At present, Helgrind only properly supports x86-linux and amd64-linux with glibc-2.3 or later. The latter restriction means we only support glibc's NPTL threading implementation. The old LinuxThreads implementation is not supported.

    Unsupported targets may work to varying degrees. In particular ppc32-linux and ppc64-linux running NTPL should work, but you will get false race errors because Helgrind does not know how to properly handle atomic instruction sequences created using the lwarx/stwcx instructions.

  5. Round up all finished threads using pthread_join. Avoid detaching threads: don't create threads in the detached state, and don't call pthread_detach on existing threads.

    Using pthread_join to round up finished threads provides a clear synchronisation point that both Helgrind and programmers can see. This synchronisation point allows Helgrind to adjust its memory ownership models as described extensively above, which helps Helgrind produce more accurate error reports.

    If you don't call pthread_join on a thread, Helgrind has no way to know when it finishes, relative to any significant synchronisation points for other threads in the program. So it assumes that the thread lingers indefinitely and can potentially interfere indefinitely with the memory state of the program. It has every right to assume that -- after all, it might really be the case that, for scheduling reasons, the exiting thread did run very slowly in the last stages of its life.

  6. Perform thread debugging (with Helgrind) and memory debugging (with Memcheck) together.

    Helgrind tracks the state of memory in detail, and memory management bugs in the application are liable to cause confusion. In extreme cases, applications which do many invalid reads and writes (particularly to freed memory) have been known to crash Helgrind. So, ideally, you should make your application Memcheck-clean before using Helgrind.

    It may be impossible to make your application Memcheck-clean unless you first remove threading bugs. In particular, it may be difficult to remove all reads and writes to freed memory in multithreaded C++ destructor sequences at program termination. So, ideally, you should make your application Helgrind-clean before using Memcheck.

    Since this circularity is obviously unresolvable, at least bear in mind that Memcheck and Helgrind are to some extent complementary, and you may need to use them together.

  7. POSIX requires that implementations of standard I/O (printf, fprintf, fwrite, fread, etc) are thread safe. Unfortunately GNU libc implements this by using internal locking primitives that Helgrind is unable to intercept. Consequently Helgrind generates many false race reports when you use these functions.

    Helgrind attempts to hide these errors using the standard Valgrind error-suppression mechanism. So, at least for simple test cases, you don't see any. Nevertheless, some may slip through. Just something to be aware of.

  8. Helgrind's error checks do not work properly inside the system threading library itself (libpthread.so), and it usually observes large numbers of (false) errors in there. Valgrind's suppression system then filters these out, so you should not see them.

    If you see any race errors reported where libpthread.so or ld.so is the object associated with the innermost stack frame, please file a bug report at http://www.valgrind.org.

7.6. Helgrind Options

The following end-user options are available:

--happens-before=none|threads|all [default: all]

Helgrind always regards locks as the basis for inter-thread synchronisation. However, by default, before reporting a race error, Helgrind will also check whether certain other kinds of inter-thread synchronisation events happened. It may be that if such events took place, then no race really occurred, and so no error needs to be reported. See above for a discussion of transfers of exclusive ownership states between threads.

With --happens-before=all, the following events are regarded as sources of synchronisation: thread creation/joinage, condition variable signal/broadcast/waits, and semaphore posts/waits.

With --happens-before=threads, only thread creation/joinage events are regarded as sources of synchronisation.

With --happens-before=none, no events (apart, of course, from locking) are regarded as sources of synchronisation.

Changing this setting from the default will increase your false-error rate but give little or no gain. The only advantage is that --happens-before=threads and --happens-before=none should make Helgrind less and less sensitive to the scheduling of threads, and hence the output more and more repeatable across runs.

--trace-addr=0xXXYYZZ and --trace-level=0|1|2 [default: 1]

Requests that Helgrind produces a log of all state changes to location 0xXXYYZZ. This can be helpful in tracking down tricky races. --trace-level controls the verbosity of the log. At the default setting (1), a one-line summary of is printed for each state change. At level 2 a complete stack trace is printed for each state change.

In addition, the following debugging options are available for Helgrind:

--trace-malloc=no|yes [no]

Show all client malloc (etc) and free (etc) requests.

--gen-vcg=no|yes|yes-w-vts [no]

At exit, write to stderr a dump of the happens-before graph computed by Helgrind, in a format suitable for the VCG graph visualisation tool. A suitable command line is:

valgrind --tool=helgrind --gen-vcg=yes my_app 2>&1 | grep xxxxxx | sed "s/xxxxxx//g" | xvcg -

With --gen-vcg=yes, the basic happens-before graph is shown. With --gen-vcg=yes-w-vts, the vector timestamp for each node is also shown.

--cmp-race-err-addrs=no|yes [no]

Controls whether or not race (data) addresses should be taken into account when removing duplicates of race errors. With --cmp-race-err-addrs=no, two otherwise identical race errors will be considered to be the same if their race addresses differ. With With --cmp-race-err-addrs=yes they will be considered different. This is provided to help make certain regression tests work reliably.

--hg-sanity-flags=<XXXXXX> (X = 0|1) [000000]

Run extensive sanity checks on Helgrind's internal data structures at events defined by the bitstring, as follows:

100000 at every query to the happens-before graph

010000 after changes to the lock order acquisition graph

001000 after every client memory access (NB: not currently used)

000100 after every client memory range permission setting of 256 bytes or greater

000010 after every client lock or unlock event

000001 after every client thread creation or joinage event

Note these will make Helgrind run very slowly, often to the point of being completely unusable.

7.7. A To-Do List for Helgrind

The following is a list of loose ends which should be tidied up some time.

  • Track which mutexes are associated with which condition variables, and emit a warning if this becomes inconsistent.

  • For lock order errors, print the complete lock cycle, rather than only doing for size-2 cycles as at present.

  • Document the VALGRIND_HG_CLEAN_MEMORY client request.

  • Possibly a client request to forcibly transfer ownership of memory from one thread to another. Requires further consideration.

  • Add a new client request that marks an address range as being "shared-modified with empty lockset" (the error state), and describe how to use it.

  • Document races caused by gcc's thread-unsafe code generation for speculative stores. In the interim see http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html and http://lkml.org/lkml/2007/10/24/673.

  • Don't update the lock-order graph, and don't check for errors, when a "try"-style lock operation happens (eg pthread_mutex_trylock). Such calls do not add any real restrictions to the locking order, since they can always fail to acquire the lock, resulting in the caller going off and doing Plan B (presumably it will have a Plan B). Doing such checks could generate false lock-order errors and confuse users.

  • Performance can be very poor. Slowdowns on the order of 100:1 are not unusual. There is quite some scope for performance improvements, though.