原子操作的危险
Introduction 介绍
Most engineers reach for atomic operations in an attempt to produce some lock-free mechanism. Furthermore, programmers enjoy the intellectual puzzle of using atomic operations. Both of these lead to clever implementations which are almost always ill-advised and often incorrect. Algorithms involving atomic operations are extremely subtle. For example, discovering a general-purpose, efficient, lock-free, singly-linked list algorithm took significant research and required care to implement. Almost all programmers make mistakes when they attempt direct use of atomic operations. Even when they don’t make mistakes, the resulting code is hard for others to maintain. 大多数工程师尝试原子操作,试图制造一些无锁机制。此外,程序员喜欢使用原子操作的智力难题。这两者都导致了聪明的实现,而这些实现几乎总是不明智的,而且往往是不正确的。涉及原子运算的算法极其微妙。例如,发现一个通用的、高效的、无锁的、单链表算法需要大量的研究和实现。几乎所有的程序员在尝试直接使用原子操作时都会犯错误。即使他们没有犯错误,生成的代码对其他人来说也很难维护。
Atomic operations should be used only in a handful of low-level data structures which are written by a few experts and then reviewed and tested thoroughly. Unfortunately, many people attempt to write lock-free code, and this is almost always a mistake. Please do not fall into this trap: do not use atomic operations. If you do, you will make mistakes, and those will cost the owners of that code time and money. 原子操作只能用于少数低级数据结构,这些数据结构由少数专家编写,然后进行彻底审查和测试。不幸的是,许多人试图编写无锁代码,而这几乎总是一个错误。请不要落入这个陷阱:不要使用原子操作。如果你这样做,你就会犯错误,而这些错误将花费代码所有者的时间和金钱。
There are a number of existing higher-level components that are already carefully crafted, reviewed, and tested. Use them if they do what you need. Otherwise, use mutexes. 有许多现有的高级组件已经经过精心制作、审查和测试。如果他们做了你需要的事情,就使用他们。否则,请使用互斥。
Note: the document is centered around C++, but similar arguments apply to other languages as well. See research!rsc for a more detailed discussion of hardware, programming language, and Go memory models. 注意:文档以C++为中心,但类似的论点也适用于其他语言。查看研究!rsc对硬件、编程语言和Go内存模型进行了更详细的讨论。
Existing Components 现有组件
Reach for commonly-available concurrency components before inventing your own solution using atomics. The list below serves as a guide, but is not exhaustive. Libraries such as the C++ Standard Library, Abseil, and Folly all contain relevant components. 在使用原子技术发明自己的解决方案之前,先了解常见的并发组件。下面的列表是一个指南,但并非详尽无遗。C++标准库、Abseil和Folly等库都包含相关组件。
- std::shared_ptr and folly’s hazard pointer for reference counting
- std::call_once and absl::call_once for one-time initialization - boost::asio::thread_pool for thread pooling
- absl::Notification for one-time notifications
- std::latch, std::barrier, absl::Barrier, and absl::BlockingCounter for barrier synchronization
- std::future for creating value promises
- Userspace RCU for read-copy-update algorithms and lock-free containers
- thread_local for better locality
- folly’s concurrency library for concurrent storage and queues
- folly::TokenBucket for rate limiting
Atomic Trickiness 原子狡猾
Atomic operations introduce two separate kinds of hazards: 原子能操作带来两种不同的危险:
First, unless you exclusively use atomic operations that maintain ordering semantics for all shared memory accesses (notably memory_order_seq_cst
operations), both compilers and processors can and will visibly reorder memory accesses per the C++ standard. Programming rules in these cases become far more complicated, and experts often still have trouble pinning them down precisely. Many people find it particularly surprising that such reordering doesn’t always stop at traditional synchronization operations, like a mutex acquisition. 首先,除非您专门使用原子操作来维护所有共享内存访问的排序语义(尤其是 memory_order_seq_cst
操作),否则编译器和处理器都可以而且将明显地根据C++标准对内存访问进行重新排序。在这种情况下,编程规则变得更加复杂,专家们通常仍难以准确地确定它们。许多人发现,这种重新排序并不总是停留在传统的同步操作(如互斥获取)上,这尤其令人惊讶。
If you do restrict yourself to sequentially-consistent operations, you avoid this issue, but may well find that your code now runs slower on ARM and POWER than if you had used mutexes. ARM and POWER are weakly-ordered systems, so special CPU load instructions or memory fences are required to achieve sequential consistency. This is not required on strongly-ordered platforms like x86. 如果您确实将自己限制为顺序一致的操作,则可以避免此问题,但很可能会发现您的代码现在在ARM和POWER上的运行速度比使用互斥对象时慢。ARM和POWER是弱序系统,因此需要特殊的CPU加载指令或内存围栏来实现顺序一致性。这在像x86这样的强有序平台上是不需要的。
Second, it’s extremely difficult to write code in a world in which a) only individual memory accesses are atomic, and b) no way to achieve mutual exclusion over larger code sections exists. Object lifetime management is difficult in a concurrent setting. CAS-based algorithms are subject to the ABA problem. Unexpected and unreproducible thread interleavings occur. Sequences of atomic operations are then not atomic as a whole. Before approaching atomic operations, you must be ready for all of these problems and understand the language memory model with respect to ordering, atomicity, visibility, and data races. 其次,在这样一个世界里编写代码是极其困难的:a)只有单独的内存访问是原子的,b)没有办法在更大的代码段上实现互斥。对象生存期管理在并发环境中很困难。基于CAS的算法受到ABA问题的制约。出现意外且无法产生的线程穿插。原子操作序列就不是一个整体的原子。在进行原子操作之前,您必须为所有这些问题做好准备,并了解有关排序、原子性、可见性和数据竞争的语言内存模型。
Don’t assume x86 semantics. Hardware platform guarantees matter only if you are programming in assembly. Higher-level language (C++/Java/Go) compilers can break your code. Furthermore, ARM and POWER provide notably different and more complex memory models; these can also break your code if you run on a variety of hardware. 不要假定x86语义。硬件平台保证只有在程序集中编程时才重要。高级语言(C++/Java/Go)编译器可能会破坏您的代码。此外,ARM和POWER提供了明显不同且更复杂的内存模型;如果在各种硬件上运行,这些也可能破坏代码。
Let’s consider two examples based on real code that demonstrate these two kinds of subtleties related to atomic operations. First example: 让我们考虑两个基于真实代码的示例,它们展示了与原子操作相关的这两种微妙之处。第一个例子:
1
2
3
4
5
6
7
8
9
10
11
12
std::atomic<bool> data_ready = false;
double data = 0.0;
void Thread1() {
data = 1.23;
data_ready.store(true, std::memory_order_relaxed);
}
void Thread2() {
if (data_ready.load(std::memory_order_relaxed))
CHECK(data == 1.23);
}
The code is seemingly correct: Thread1 initializes the data first and then sets the flag, Thread2 ensures that the flag is set and only then reads the data. What can possibly go wrong? 代码似乎是正确的:Thread1首先初始化数据,然后设置标志,Thread2确保设置标志,然后读取数据。什么可能出错?
With optimizations enabled, gcc compiles this code to: 启用优化后,gcc将此代码编译为:
% g++ -O2 test.cc -S && cat test.s
Thread1:
movabsq $4608218246714312622, %rax # 1. Load the constant into RAX
movl $1, data_ready(%rip) # 2. Store 1 into data_ready
movq %rax, data(%rip) # 3. Store RAX register into data
ret
If Thread2 is executed between instructions 2 and 3 of Thread1, the CHECK
in Thread2 will fail. Note that the compiler does exactly what we asked it to do. The operations on data_ready
are indeed atomic; they are just reordered with other memory accesses. 如果在Thread1的指令2和3之间执行Thread2,那么Thread2中的 CHECK
将失败。请注意,编译器完全按照我们的要求执行。 data_ready
上的操作实际上是原子操作;它们只是与其他存储器访问一起被重新排序。
Another example, this time with implicit memory_order_seq_cst
. Here, we have a concurrent object pool based on a lock-free stack, whose algorithm tries to work around the ABA problem in a non-traditional way: 另一个例子,这次使用隐式 memory_order_seq_cst
。在这里,我们有一个基于无锁堆栈的并发对象池,其算法试图以非传统的方式解决ABA问题:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
template<typename T>
class ConcurrentPool {
public:
ConcurrentPool(size_t size)
: head_(0),
size_(size),
array_(new Node[size]) {
for (size_t i = 0; i < size; i++)
array_[i].next.store(i + 1);
array_[size - 1].next.store(kEnd);
}
T* Get() {
while (size_.load() > 1) {
size_t head1 = head_.load();
size_t next1 = array_[head1].next.exchange(kEnd);
if (next1 != kEnd) {
if (head_.compare_exchange_strong(head1, next1)) {
size_.fetch_sub(1);
return &array_[head1].v;
} else {
array_[head1].next.exchange(next1);
}
}
}
return nullptr;
}
void Put(T* v) {
Node *n = reinterpret_cast<Node*>(v);
size_t i = n - &array_[0];
size_t head1;
do {
head1 = head_.load();
n->next.store(head1);
} while (!head_.compare_exchange_strong(head1, i));
size_.fetch_add(1);
}
private:
struct Node {
T v;
atomic<size_t> next;
};
atomic<size_t> head_;
atomic<size_t> size_;
unique_ptr<Node[]> array_;
static const size_t kEnd = -1;
};
Before reading further try to spot the bug in this code. 在进一步阅读之前,请尝试找出这段代码中的错误。
The bug is basically impossible to discover by testing and manual code inspection. It was found by an automatic checker of synchronization algorithms. The particular execution that leads to the bug: 这个bug基本上是不可能通过测试和手动代码检查发现的。它是由同步算法的自动检查器发现的。导致错误的特定执行:
- Thread 1 reads
head_ = 0
inGet()
. 线程1读取Get()
中的head_ = 0
。 - Thread 0 reads
head_ = 0
inGet()
. 线程0读取Get()
中的head_ = 0
。 - Thread 0 removes element 0 from the stack,
now head_ = 1
. 线程0从堆栈中删除元素0,now head_ = 1
。 - Thread 0 starts putting the element 0. 线程0开始放置元素0。
- Thread 0 reads
head_ = 1
, and sets the next field of the element 0 to 1. 线程0读取head_ = 1
,并将元素0的下一个字段设置为1。 - Thread 1 executes
exchange
on the next field of the element 0. It reads 1 and writes -1. 线程1对元素0的下一个字段执行exchange
。它读1,写-1。 - Thread 2 gets the element 1 from the stack, now
head_ = 2
. 线程2从堆栈中获取元素1,现在是head_ = 2
。 - Thread 0 fails with
compare_exchange
inPut()
, re-readshead_ = 2
, and writes 2 to the next field of the element 0. 线程0因Put()
中的compare_exchange
而失败,重新读取head_ = 2
,并将2写入元素0的下一个字段。 - Thread 0 succeeds with
compare_exchange
inPut()
. Nowhead_ = 0
. 线程0与Put()
中的compare_exchange
一起成功。现在#2。 - Thread 1 succeeds with
compare_exchange
inGet()
. Nowhead_ = 1
(howeverhead_
must be equal to 2!). 线程1在Get()
中以compare_exchange
成功。现在#2(但是#3必须等于2!)。 - Thread 0 pops element 1 from the stack. 线程0从堆栈中弹出元素1。
Now both threads 0 and 2 work with the element 1. Bang! 现在,线程0和2都使用元素1。猛敲!
Performance Considerations 性能注意事项
Programmers assume that mutexes are expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a mutex is cheaper than a cache miss; attention to cache behavior is usually a more fruitful way to improve performance. Furthermore, lock-free data structures are often more expensive than using mutexes. A mutex allows arbitrary changes to be made to a complex data structure; if the same changes must be made without a mutex, the result is likely to take more atomic read-modify-write and memory fence instructions, not fewer. 程序员认为互斥是昂贵的,并且使用原子操作会更有效率。但在现实中,获取和释放互斥对象比缓存未命中更便宜;关注缓存行为通常是提高性能的一种更有成效的方法。此外,无锁数据结构通常比使用互斥锁更昂贵。互斥锁允许对复杂的数据结构进行任意更改;如果必须在没有互斥的情况下进行相同的更改,那么结果可能需要更多的原子读修改写和内存围栏指令,而不是更少。
People wish to avoid mutex contention when concurrency is high. Reducing concurrency is best solved by partitioning locked data structures to avoid mutex contention. For example, it is easier, more efficient, and more useful to build a high-concurrency hash table from many normal hash tables, each with its own reader-writer mutex, than to build one lock-free hash table using atomic operations. 当并发性很高时,人们希望避免互斥争用。减少并发最好通过对锁定的数据结构进行分区来避免互斥争用。例如,与使用原子操作构建一个无锁哈希表相比,从许多普通哈希表构建一个高并发哈希表(每个哈希表都有自己的读写器互斥)更容易、更高效、更有用。
Thread-local caching and batching of updates of centralized state is another technique that usually vastly outperforms centralized lock-free algorithms. For example, tcmalloc uses it to achieve outstanding scaling while relying only on mutexes for synchronization. 线程本地缓存和集中状态更新的批处理是另一种通常大大优于集中无锁算法的技术。例如,tcmalloc使用它来实现出色的伸缩性,同时只依赖于互斥进行同步。
Reference-counting can help to significantly reduce the size of critical sections in some scenarios. Namely, read-lock a container, find the necessary object, increment reference counter, unlock, and return: 在某些情况下,引用计数有助于显著减少关键部分的大小。即读取锁定一个容器,找到必要的对象,增加引用计数器,解锁并返回:
1
2
3
4
5
6
7
8
9
V *find(T key) {
lock_guard l(mutex);
V *v = container.find(key);
if (v != nullptr)
v->refcount.Acquire();
return v;
// Work with the object v happens outside of the mutex.
// Caller calls v->refcount.Release() when done with the object.
}
The Read-Copy-Update/Multiversion-Concurrency-Control technique allows one to achieve linear scaling for read-mostly data structures. 读取-复制更新/多版本并发控制技术允许对读取为主的数据结构实现线性缩放。
Testing Considerations 测试注意事项
Unit tests do not provide good enough coverage for lock-free algorithms; they explore a negligible part of all possible thread interleavings. For a small synchronization algorithm with N=10 atomic operations and T=4 threads, the total number of possible thread interleavings is O(T^(TN)) ~= 10^24. Memory order relaxations result in an even larger number of potential executions. Unit tests will cover a thousand executions at best. 单元测试没有为无锁算法提供足够好的覆盖范围;他们探索了所有可能的线索穿插中微不足道的一部分。对于N=10个原子操作和T=4个线程的小型同步算法,可能的线程交错总数为O(T^(TN))~=10^24。记忆顺序的放松会导致更多的潜在执行。单元测试最多可以覆盖一千次执行。
Moreover, x86 hardware can’t yield all executions possible on POWER and ARM platforms. Code compiled with a particular version of compiler and flags may not be able to yield executions possible with a different compiler or flags. Future compilers are likely to more aggressively reorder memory accesses than current compilers. 此外,x86硬件不能在POWER和ARM平台上产生所有可能的执行。使用特定版本的编译器和标志编译的代码可能无法使用不同的编译器或标志执行。未来的编译器可能会比当前的编译器更积极地重新排序内存访问。
The human brain is poor at reasoning about concurrent algorithms that are not sequentially consistent. Any non-trivial lock-free algorithm requires careful review by several experts, verification with formal checkers, and exhaustive stress testing on different hardware at a minimum. 人类大脑不善于对顺序不一致的并发算法进行推理。任何非平凡的无锁算法都需要几位专家的仔细审查,使用正式的检查器进行验证,并至少在不同的硬件上进行详尽的压力测试。
Note that even mutex-based algorithms can be complex (or a lock can be simply forgotten). Use ThreadSanitizer to test for data races and certain kinds of deadlocks. 请注意,即使是基于互斥的算法也可能很复杂(或者锁可能被简单地遗忘)。使用ThreadManitizer测试数据争用和某些类型的死锁。
Bug Examples Bug示例
Here are examples of several bugs in algorithms based on atomic operations. The bugs are harmful, tricky, and were lurking in our codebases for years. 以下是基于原子操作的算法中的几个错误的示例。这些错误是有害的、棘手的,并且潜伏在我们的代码库中多年。
Linux kernel lock-free fd lookup Linux内核无锁fd查找
The bug was introduced on Sep 9, 2005 as part of a migration from a spinlock to RCU refcounting. The change introduced a bug in how the code needs to react on a narrow window of semi-inconsistent state exposed by concurrent updates. It was fixed ten years later, on Jul 1, 2015. 该错误于2005年9月9日引入,是从spinlock迁移到RCU refcounting的一部分。这一变化引入了一个错误,即代码需要如何对并发更新暴露的半不一致状态的狭窄窗口做出反应。十年后的2015年7月1日,它被修复。
Data Plane Development Kit’s RTE Ring 数据平面开发工具包的RTE环
The bug existed in the first public release of DPDK, which was on Mar 11, 2013. There was a bug with issuing a zero objects dequeue with multiple consumers. It was possible to get more than one thread to succeed the compare-and-set operation and observe starvation or even deadlock in the while loop that checks for preceding dequeues. The same was possible on the enqueue path. The bug was fixed on Mar 22, 2016. 该漏洞存在于2013年3月11日DPDK的首次公开发布中。在向多个使用者发出零对象出列时出现错误。可以让多个线程成功执行比较和设置操作,并在检查前面出队列的while循环中观察到饥饿甚至死锁。排队路径上也可能出现同样的情况。该错误已于2016年3月22日修复。
sync.WaitGroup
The bug was introduced on Jul 18, 2011 as part of a WaitGroup rewrite that was intended to improve scalability. The change indeed improved performance and scalability, but it also replaced a simple mutex-based algorithm with a trickier one based on atomic operations. The bug occurred in very rare circumstances but led to arbitrary memory corruptions. It was discovered and fixed only on Apr 10, 2014. The bug was caused by an unexpected thread interleaving. 该漏洞于2011年7月18日引入,作为WaitGroup重写的一部分,旨在提高可扩展性。这一变化确实提高了性能和可扩展性,但它也用基于原子操作的更棘手的算法取代了简单的基于互斥的算法。该错误发生在极少数情况下,但会导致任意内存损坏。它于2014年4月10日才被发现并修复。该错误是由意外的线程交错引起的。
Parallel GC 并行GC
The bug was introduced on Sep 30, 2011 and fixed only on Jan 15, 2014. The bug led to arbitrary memory corruptions on overloaded machines. The bug was due to unexpected thread interleaving. 该漏洞于2011年9月30日引入,直到2014年1月15日才修复。该错误导致过载机器上的任意内存损坏。该错误是由于意外的线程交错造成的。
org.jctools.maps.NonBlockingHashMap org.ctools.maps.NonBlockingHashMap
The bug was introduced sometime before Feb 2009. The bug allowed the remove operation to return the same item more than once. The root cause was an inconsistency between a failed CAS and a subsequent atomic read of the same field. It was identified on Jan 15, 2018 and fixed on Jan 21, 2018 after much discussion. 这个bug是在2009年2月之前引入的。该错误允许删除操作多次返回同一项。根本原因是失败的CAS与同一字段的后续原子读取之间不一致。它于2018年1月15日被确认,经过多次讨论后于2018年2月21日被修复。
Comments powered by Disqus.