The Secret Life of Concurrent Data Structures

A concurrent data structure is “a particular way of storing and organizing data for access by multiple computing threads (or processes) on a computer.” In this blog entry, we’ll cover one of the hidden sides of concurrent data structures, a side that is not well documented in the literature. We’ll look at the insertion and deletion operations, and compare the relative complexity of implementing them.

Let’s focus our attention on concurrent data structures in a shared-memory environment where multiple threads are concurrently reading and/or writing. For a thorough background in concurrent data structures, I recommend reading The Art of Multiprocessor Programming by Herlihy and Shavit (Morgan Kaufmann 2008, revised printing 2012). An assumption for this discussion is that all concurrent data structures under consideration are linearizable.

First question: how do we construct concurrent data structures? There are at least three common techniques in use today, plus a fourth technique that I’ll briefly go over.

Coarse-Grained Locking

The first method is coarse-grained locking. Take your precious data, and wrap a giant lock around it.
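As a minimal sketch in Java (the class and method names here are my own invention, not from any particular library), coarse-grained locking can be as simple as guarding every operation with the same monitor:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of coarse-grained locking: one lock guards the
// entire structure, so at most one thread operates on it at a time.
public class CoarseGrainedList<T> {
    private final List<T> data = new ArrayList<>();

    public synchronized boolean contains(T value) {
        return data.contains(value);
    }

    public synchronized void insert(T value) {
        data.add(value);
    }

    public synchronized boolean delete(T value) {
        return data.remove(value);
    }
}
```

The appeal is obvious: the structure behaves exactly like its sequential counterpart. The drawback is equally obvious: every thread serializes on the single lock.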

Several large open source projects have relied on coarse-grained locking. The Linux kernel used a big kernel lock when it first introduced support for symmetric multiprocessing (SMP) in Linux 2.0. Work began in 2008 to break up the big kernel lock, carefully replacing it with fine-grained locks, until it was finally removed in Linux 2.6.39. The default Python interpreter (CPython) uses a global interpreter lock to prevent multiple native threads from executing Python bytecode concurrently.

Fine-Grained Locking

The second method is fine-grained locking. Smash that coarse-grained lock into multiple locks! Each fine-grained lock is responsible for protecting a region of the data. Operations on the data are responsible for acquiring one or more of these locks in order to read, modify, and/or write.
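One classic fine-grained scheme is lock coupling (hand-over-hand locking) on a linked list, described in Herlihy and Shavit. Here is a sketch of insertion into a sorted list, with each node carrying its own lock; the sentinel keys and names are my own, and keys are assumed to lie strictly between the sentinels:

```java
import java.util.concurrent.locks.ReentrantLock;

// A sketch of fine-grained (hand-over-hand) locking on a sorted linked
// list. A traversal holds at most two locks at a time, releasing the
// trailing lock as it advances down the list.
public class FineGrainedList {
    private static final class Node {
        final int key;
        Node next;
        final ReentrantLock lock = new ReentrantLock();
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    // Head and tail sentinels simplify the edge cases.
    private final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    public boolean insert(int key) {
        Node pred = head;
        pred.lock.lock();
        try {
            Node curr = pred.next;
            curr.lock.lock();
            try {
                while (curr.key < key) {
                    pred.lock.unlock();   // release the trailing lock
                    pred = curr;
                    curr = curr.next;     // read next under pred's lock
                    curr.lock.lock();
                }
                if (curr.key == key) {
                    return false;         // key already present
                }
                pred.next = new Node(key, curr);  // splice in the new node
                return true;
            } finally {
                curr.lock.unlock();
            }
        } finally {
            pred.lock.unlock();
        }
    }
}
```

Because each operation holds at most two locks at a time, the rest of the list remains available to other threads.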

Fine-grained locking can improve the overall throughput of a concurrent system. However, one must be careful to avoid the perils of concurrency including deadlock, livelock, starvation, preemption, priority inversion, convoying, etc. You must also have a strategy for releasing shared resources when a thread unexpectedly dies.

Lock-Free Programming

The third method extends the concept of fine-grained locking to its logical extreme: lock-free programming.

Lock-free programming is programming without locks. Or, more formally, a lock-free implementation of a concurrent data structure is one that guarantees that some thread can complete an operation in a finite number of steps, regardless of the execution of the other threads. Lock-free programming uses atomic operations, such as atomic swap, test-and-set, fetch-and-add, compare-and-swap, load-link/store-conditional, etc., to maintain consistent state. Lock-free programming can improve system throughput and has desirable liveness properties. It is also tricky to design and implement correctly. The term wait-free was introduced in (Herlihy, 1991). One of the earliest formal definitions of lock-free is found in (Massalin and Pu, 1991). Jeff Preshing has a clever introduction to lock-free programming.
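As a concrete illustration, here is a sketch of the classic Treiber stack in Java, one of the simplest lock-free data structures. Every update is a CAS retry loop: a failed CAS means some other thread made progress, which is exactly the lock-free guarantee described above. (The sketch assumes a garbage-collected runtime, which sidesteps the ABA problem mentioned later in this post.)

```java
import java.util.concurrent.atomic.AtomicReference;

// A sketch of lock-free programming: the Treiber stack. All mutation
// funnels through CAS on the top-of-stack reference.
public class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    public void push(T value) {
        Node<T> oldTop;
        Node<T> newTop;
        do {
            oldTop = top.get();
            newTop = new Node<>(value, oldTop);
        } while (!top.compareAndSet(oldTop, newTop)); // retry on contention
    }

    public T pop() {
        Node<T> oldTop;
        do {
            oldTop = top.get();
            if (oldTop == null) return null;          // empty stack
        } while (!top.compareAndSet(oldTop, oldTop.next));
        return oldTop.value;
    }
}
```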

Transactional Memory

The fourth method is to use transactional memory. Transactional memory allows several memory operations to be grouped together and executed atomically: either the entire sequence of operations appears to occur atomically, or the system state is left unchanged.
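Mainstream Java has no built-in transactional memory, so the following is only a toy sketch of the all-or-nothing property, not a real STM: all state lives in one immutable snapshot, a “transaction” is a pure function from old snapshot to new, and a single CAS either commits the whole group of updates or leaves the state unchanged and retries. (All names here are my own; requires Java 16+ for records.)

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// A toy sketch of transaction-like semantics in plain Java (not a real
// STM). Both fields are updated together, or not at all.
public class ToyTransactionalPair {
    public record Snapshot(long checking, long savings) {}

    private final AtomicReference<Snapshot> state =
        new AtomicReference<>(new Snapshot(100, 0));

    public Snapshot atomically(UnaryOperator<Snapshot> transaction) {
        while (true) {
            Snapshot before = state.get();
            Snapshot after = transaction.apply(before); // build new state
            if (state.compareAndSet(before, after)) {
                return after;                           // commit
            }
            // Conflict: another thread committed first; retry.
        }
    }

    public static void main(String[] args) {
        ToyTransactionalPair accounts = new ToyTransactionalPair();
        // Transfer 25 from checking to savings; both updates commit together.
        Snapshot result = accounts.atomically(
            s -> new Snapshot(s.checking() - 25, s.savings() + 25));
        System.out.println(result); // Snapshot[checking=75, savings=25]
    }
}
```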

Transactions are a very familiar concept in database management systems. Transactional memory is still somewhat of an exotic concept. It can be implemented in software, in hardware, or as a hybrid of software with hardware support. The Intel Haswell microarchitecture will have support for transactional synchronization extensions (TSX). Scala, Haskell, and Clojure have support for transactional memory semantics (as do other languages). There are active discussions both for and against the benefits of transactional memory.

More is Easy, Less is Hard

What about the secret life of concurrent data structures? Imagine you are building your own concurrent data structure. Let’s imagine this data structure as an abstract blob. There are three basic operations that we will want to support: get(), insert(), and delete(). Make a prediction about the relative effort necessary to design and implement each of the three operations.

My ballpark estimate is that the relative effort for these operations amounts to 10% get(), 30% insert(), and 60% delete(). Common sense dictates that the modify operations are more complex than the read operation. Common sense might also lead us to think that the insert() and delete() operations are approximately equal in complexity. This is almost never the case!

It is almost always easier to correctly move from a state of less information to a state of more information than it is to move from a state of more information to a state of less information. And this ignores the extra complexity of ensuring that you have hygienic access to objects in memory. In other words, I’m assuming a runtime which cleans up memory references and (mostly) eliminates the ABA problem.

The relative difficulty of the delete() operation increases as one moves from coarse-grained locking to fine-grained locking to lock-free programming. To illustrate the delete() complexity phenomenon, let’s take a look at lock-free linked lists. If only get() and insert() operations are necessary, then the lock-free linked list is identical in structure to the sequential linked list.

Lock-Free Linked Lists

A linked list is a collection of nodes. Each node is composed of a value and a possibly null link reference to another node. For simplicity, let’s assume that values in the linked list are immutable. The get() operation performs an atomic read on each link reference as it traverses the list in search of the target element. The insert() operation locates the predecessor node after which the new node is to be inserted, and attempts a compare-and-swap (CAS) operation on the predecessor node’s link reference. If the CAS operation fails, then rinse and repeat until successful.
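Here is a sketch of that insert-only design in Java (the sentinel keys and names are my own; keys are assumed to lie strictly between the sentinels). Note how close it is to a sequential sorted list: the only concurrency machinery is the atomic reads during traversal and the single CAS on the predecessor’s link.

```java
import java.util.concurrent.atomic.AtomicReference;

// A sketch of a lock-free sorted linked list supporting only get() and
// insert(). With no deletion, nodes never leave the list, so a failed
// CAS can only mean a concurrent insert; we simply retry from the head.
public class InsertOnlyList {
    private static final class Node {
        final int key;
        final AtomicReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicReference<>(next);
        }
    }

    // Head and tail sentinels; real keys lie strictly between them.
    private final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    public boolean get(int key) {
        Node curr = head.next.get();     // atomic reads while traversing
        while (curr.key < key) {
            curr = curr.next.get();
        }
        return curr.key == key;
    }

    public boolean insert(int key) {
        while (true) {                   // rinse and repeat until success
            Node pred = head;
            Node curr = pred.next.get();
            while (curr.key < key) {
                pred = curr;
                curr = curr.next.get();
            }
            if (curr.key == key) return false;          // already present
            Node node = new Node(key, curr);
            if (pred.next.compareAndSet(curr, node)) {  // splice in
                return true;
            }
            // CAS failed: a concurrent insert changed pred.next; retry.
        }
    }
}
```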

What happens when we add support for the delete() operation? It is illustrative to look at the literature and see how the design of the lock-free linked list has evolved over time.

Valois (1995) introduced the first lock-free linked list algorithm. Consistency is maintained by using auxiliary nodes, which are defined as nodes that do not store values. Every normal node in the list is required to have an auxiliary node as its predecessor and its successor. With this design, consistency is maintained at the cost of roughly 2x storage overhead.

Greenwald (1999) constructed a lock-free linked list that does not rely on auxiliary nodes. The design relies on the double compare-and-swap (DCAS) operation (the ability to compare-and-swap two disjoint memory locations), which is not available on modern hardware.

The modern lock-free linked list design was introduced by Harris (2001) and Michael (2002). In this design, two CAS operations are used to perform deletion. The first CAS operation marks the node as logically deleted. The second CAS operation physically removes the node. The linearization point is the first CAS operation. This separation of concerns between logical deletion and physical deletion is a powerful technique, and it forms the basis of many modern lock-free data structures. The get() and insert() implementations need to be revised in light of this technique: every operation must either ignore or help eliminate logically deleted nodes.
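To make the two-CAS deletion concrete, here is a simplified sketch in the Harris/Michael style using Java’s AtomicMarkableReference, which lets the “deleted” mark and the next pointer be updated together. This is deliberately incomplete: a full implementation must also make get() and insert() mark-aware and have traversals help unlink marked nodes, which is precisely the extra complexity this post is about.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// A sketch of two-phase deletion: CAS #1 sets the mark on the node's
// next pointer (logical deletion, the linearization point); CAS #2
// unlinks the node (physical deletion).
public class TwoPhaseDelete {
    static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    // Head and tail sentinels; real keys lie strictly between them.
    final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    public boolean delete(int key) {
        while (true) {
            // Simplified search: a full implementation would also skip
            // and help unlink any marked nodes it passes.
            Node pred = head;
            Node curr = pred.next.getReference();
            while (curr.key < key) {
                pred = curr;
                curr = curr.next.getReference();
            }
            if (curr.key != key) return false;          // not present

            Node succ = curr.next.getReference();
            // CAS #1: logical deletion. Marking curr's next pointer
            // prevents a concurrent insert after curr from racing with
            // the unlink below.
            if (!curr.next.compareAndSet(succ, succ, false, true)) {
                continue;                               // contention; retry
            }
            // CAS #2: physical deletion (best effort). If it fails,
            // a later traversal must clean up the marked node.
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }
}
```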

The secret messiness of deletion is indirectly exposed throughout the literature. Okasaki’s seminal book Purely Functional Data Structures introduces immutable red-black trees that support get() and insert() but not delete() operations.

The recent paper on the design and implementation of Bw-trees includes the following innocent sentence: “[Node] merges are decidedly more complicated than splits, and we need more atomic actions to accomplish them.”

The Bw-tree paper is definitely worth a read. I have seen other papers in the literature on concurrent data structures that simply leave out the delete() operation. Perhaps these papers make sufficient research contributions to stand on their own merit without this operation, or perhaps the data structure is designed for contexts in which information is never deleted.

Conclusions

When you are designing and/or implementing a concurrent data structure, the deletion operation is typically the final task in your workflow, yet it is the most complex of the operations you will be supporting. You must plan ahead in your design with an eye on the deletion problem. Use the technique of separation of concerns to distinguish between logical deletion and physical deletion. To my knowledge, this is the most common approach used today for supporting lock-free deletion.

Caveat emptor

My personal experience with concurrent data structures has been primarily with ordered data structures. I have a hunch that the relative complexity of deletion for unordered concurrent data structures may not be as high, but I don’t have the evidence to back up that claim. A good place to start reading up on concurrent hash tables is “Hopscotch Hashing.”