This article belongs to the concurrent programming series. In the previous articles we covered the concept of CPU cache lines. A brief review: the cache line is the smallest unit of the CPU's cache reads and writes, generally 64 bytes. In addition, current CPUs have three levels of cache, called L1 Cache, L2 Cache and L3 Cache, ordered from closest to the CPU core outward. With that background, today I will continue with another major cache-related topic: False Sharing!
False Sharing is commonly translated as "pseudo sharing", and sometimes as "fake sharing". First, let's make one thing clear: on "today's" CPUs, false sharing occurs in the L3 Cache, because the L1 Cache and L2 Cache are on-chip, per-core caches (each core has its own), while the L3 Cache is shared among cores. The word "today" in that sentence matters, because on older CPUs the L2 Cache was shared among cores, so false sharing could also occur in the L2 Cache. For example, this article:
http://cpp-today.blogspot.com/2008/05/false-sharing-hits-again.html
For example consider two threads writing in not overlapping memory locations, they do not need any synchronization so happily you are driven to think that you are able to split your data in order to implement a lock less algorithm. Unfortunately you hold a sparkling "centrino duo" processor in where both cores do share the L2 cache and then the data you partition in memory can be mapped on the same cache line.
The L2 Cache mentioned there is indeed shared among multiple cores, but take a look at when that article was published:
Saturday, May 31, 2008
That was 13 years ago. So when we read material online, we need to be discerning; being aware of how dated a technical article is matters a lot. Of course, the rest of that article's explanation is generally fine.
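Rather than trusting dated articles, you can check on your own machine which cores share which cache level. Here is a small Linux-only sketch I am adding for illustration (not from the original article); it reads the usual sysfs cache description for cpu0 and prints the level, type and the list of CPUs sharing it:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (int idx = 0; ; ++idx) {
        // Standard sysfs layout on Linux; each index directory describes one cache.
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(idx) + "/";
        std::ifstream level(base + "level"), type(base + "type"), shared(base + "shared_cpu_list");
        if (!level) break;  // no more cache levels
        std::string l, t, s;
        std::getline(level, l);
        std::getline(type, t);
        std::getline(shared, s);
        std::cout << "L" << l << " " << t << " shared by CPUs: " << s << "\n";
    }
}

On a typical modern machine the L1/L2 entries list only cpu0 (plus its SMT sibling), while the L3 entry lists many cores, which matches the description above.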
In addition, it is worth emphasizing that much of the material about false sharing on the Internet describes the default behavior at optimization level O0. That makes false sharing easier to demonstrate, but it has little guiding value for real work, because many of the conclusions no longer hold once O2 is enabled. This article therefore discusses everything with O2 enabled, which is also why its conclusions look somewhat "messy". Some articles start from a conclusion and then try to construct code examples to support it, changing the examples whenever they fail to prove the point... I find that meaningless.
Well, that was another long preamble. Let's get straight to the point, starting with the following code:
#include <thread>
#include <vector>
#include <iostream>
#include <chrono>

using namespace std;

const int N = 10000;  // vector v size
const int M = 2;      // vector sum size

void foo(const vector<int>& v, vector<long>& sum, int id) {
    for (int i = 0; i < v.size(); ++i) {
        if (i % M == id) {
            sum[id] += i;
        }
    }
    // With O2 enabled, the final result must be output,
    // otherwise the computation is eliminated as an unused variable!
    cout << "sum " << id << " = " << sum[id] << endl;
}

int main() {
    vector<int> v;
    for (int i = 0; i < N; ++i) {
        v.push_back(i);
    }
    {
        // Code block 1
        vector<long> sum(M, 0);
        auto start = chrono::steady_clock::now();
        vector<thread> td;
        for (int i = 0; i < M; ++i) {
            td.emplace_back(foo, std::cref(v), std::ref(sum), i);
        }
        for (int i = 0; i < M; ++i) {
            td[i].join();
        }
        auto end = chrono::steady_clock::now();
        cout << "block 1 cost:" << chrono::duration_cast<chrono::microseconds>(end - start).count() << endl;
    }
    cout << "----------" << endl;
    {
        // Code block 2
        vector<long> sum(M, 0);
        auto start = chrono::steady_clock::now();
        for (int i = 0; i < M; ++i) {
            foo(v, sum, i);
        }
        auto end = chrono::steady_clock::now();
        cout << "block 2 cost:" << chrono::duration_cast<chrono::microseconds>(end - start).count() << endl;
    }
}
There is an initial vector v, which stores 10,000 numbers counting up from 0. The foo() function then sums either the odd-indexed or the even-indexed elements of the vector, depending on its id parameter. Code block 1 runs the work concurrently on multiple threads, and code block 2 runs the same logic serially.
I'll test it directly on my MacBook. Although it is not a Linux environment, the conclusions are basically the same:
clang++ -std=c++11 -pthread -O2 false_sharing.cpp
./a.out
Which code block do you think takes less time?
In theory, code block 1 is faster on multi-core machines. But the result is:
sum 0 = 24995000
sum 1 = 25000000
block 1 cost:318
----------
sum 0 = 24995000
sum 1 = 25000000
block 2 cost:88
Run it a few more times:
sum sum 0 = 1 = 24995000 25000000
block 1 cost:286
----------
sum 0 = 24995000
sum 1 = 25000000
block 2 cost:125
(With multithreaded cout, the lines for sum 0 and sum 1 sometimes print interleaved, but the computed results are correct.)
The result is surprising: the serial version takes less time. The reason is False Sharing. If you have read the previous articles, you know the concept of cache lines: a cache line is the smallest unit the CPU operates on in cache, and it is 64 bytes. sum in the code above is a vector<long>, and its two consecutive long elements sit on the same cache line. When two threads read and write sum[0] and sum[1] respectively, they appear not to touch the same variable, yet they still interfere with each other. When one thread writes sum[0], it modifies the copy of sum[0] cached by CPU 0, which also invalidates the cache line holding sum[1] in CPU 1's cache, forcing a re-read; that is an expensive operation.
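To convince yourself that the two counters really do share a cache line, you can print their addresses. This is a minimal sketch I am adding for illustration (assuming a 64-byte cache line, as in the rest of this article); it simply checks whether sum[0] and sum[1] fall into the same 64-byte-aligned block:

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<long> sum(2, 0);
    auto line = [](const void* p) {
        // Index of the 64-byte block this address falls into (assumed cache line size).
        return reinterpret_cast<std::uintptr_t>(p) / 64;
    };
    std::cout << "&sum[0] = " << &sum[0] << ", &sum[1] = " << &sum[1] << "\n";
    std::cout << "same cache line: " << std::boolalpha
              << (line(&sum[0]) == line(&sum[1])) << "\n";
}

With the usual allocator alignment, two adjacent 8-byte longs will land in the same 64-byte block, so this prints true.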
Later I tried changing M to 8, that is, sum has 8 elements and foo uses i % 8. Code block 2 still took less time.
So what should we do to make full use of multiple cores? In this example the threads do not operate on the same variable, but each thread performs a large number of reads and writes to a variable that shares a cache line with another thread's variable. In this case, a local variable can be introduced.
Add a foo2 function:
void foo2(const vector<int>& v, vector<long>& sum, int id) {
    long s = 0;
    for (int i = 0; i < v.size(); ++i) {
        if (i % M == id) {
            s += i;
        }
    }
    sum[id] = s;
    cout << "sum " << id << " = " << sum[id] << endl;
}
In the main function, write a code block 3:
{
    // Code block 3
    vector<long> sum(M, 0);
    auto start = chrono::steady_clock::now();
    vector<thread> td;
    for (int i = 0; i < M; ++i) {
        td.emplace_back(foo2, std::cref(v), std::ref(sum), i);
    }
    for (int i = 0; i < M; ++i) {
        td[i].join();
    }
    auto end = chrono::steady_clock::now();
    cout << "block 3 cost:" << chrono::duration_cast<chrono::microseconds>(end - start).count() << endl;
}
Final output:
sum sum 0 = 1 = 24995000 25000000
block 1 cost:286
----------
sum 0 = 24995000
sum 1 = 25000000
block 2 cost:125
----------
sum 0 = 24995000
sum 1 = 25000000
block 3 cost:121

sum sum 1 = 25000000 0 = 24995000
block 1 cost:402
----------
sum 0 = 24995000
sum 1 = 25000000
block 2 cost:105
----------
sum 0 = 24995000
sum 1 = 25000000
block 3 cost:178

sum sum 1 = 25000000 0 = 24995000
block 1 cost:240
----------
sum 0 = 24995000
sum 1 = 25000000
block 2 cost:95
----------
sum 0 = 24995000
sum 1 = 25000000
block 3 cost:93

sum sum 0 = 24995000 1 = 25000000
block 1 cost:307
----------
sum 0 = 24995000
sum 1 = 25000000
block 2 cost:96
----------
sum 0 = 24995000
sum 1 = 25000000
block 3 cost:117
You can see that code block 3 is now faster than code block 1 and roughly on par with code block 2. Code block 2 may still take slightly less time, because code block 3 carries the extra overhead of thread creation and scheduling.
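To get a feel for that fixed overhead, here is a small sketch I am adding (not from the original article) that times creating and joining a thread with an empty body; you can compare the number it prints against the block costs above:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    auto start = std::chrono::steady_clock::now();
    std::thread t([] {});  // thread that does no work at all
    t.join();
    auto end = std::chrono::steady_clock::now();
    std::cout << "create+join cost: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;
}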
We can enlarge vector v to 1,000,000 elements (that is, N = 1000000) and run again:
sum 1 = 250000000000
sum 0 = 249999500000
block 1 cost:6318
----------
sum 0 = 249999500000
sum 1 = 250000000000
block 2 cost:9596
----------
sum 0 = 249999500000
sum 1 = 250000000000
block 3 cost:1178

sum 0 = 249999500000
sum 1 = 250000000000
block 1 cost:9474
----------
sum 0 = 249999500000
sum 1 = 250000000000
block 2 cost:9579
----------
sum 0 = 249999500000
sum 1 = 250000000000
block 3 cost:1613
Now the advantage of code block 3 is obvious. An interesting phenomenon is that once v reaches one million elements, code block 1 is even more efficient than code block 2... So we cannot generalize about false sharing: it is not true that whenever false sharing occurs the code must be slower than the serial version. In any case, the third approach is the best solution here.
So far we have been dealing with basic data types. Let's look at another example using a struct:
#include <thread>
#include <vector>
#include <iostream>
#include <chrono>

using namespace std;

const int N = 2;        // vector v size
const int T = 1000000;  // number of iterations

struct Data {
    int a;
    int b;
};

void bar1(Data& d) {
    for (int i = 0; i < T; ++i) {
        d.a++;
    }
    cout << "d.a:" << d.a << endl;
}

void bar2(Data& d) {
    for (int i = 0; i < T; ++i) {
        d.b++;
    }
    cout << "d.b:" << d.b << endl;
}

int main() {
    Data d = {0, 0};
    auto start = chrono::steady_clock::now();
    thread t1(bar1, std::ref(d));
    thread t2(bar2, std::ref(d));
    t1.join();
    t2.join();
    auto end = chrono::steady_clock::now();
    cout << "cost:" << chrono::duration_cast<chrono::microseconds>(end - start).count() << endl;
}
The Data type has two members, a and b, each incremented one million times in its own thread. This time I found a multi-core Linux machine to compile on (O2 is not enabled here yet):
g++ -std=c++11 -pthread false_sharing2.cpp
Run output:
d.a:1000000
d.b:1000000
cost:6326
If we drop the multithreading and run the same logic serially, the result is similar.
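For reference, the serial variant is simply the same program with the two threads replaced by direct calls; a sketch of that change (my wording, not verbatim from the original):

int main() {
    Data d = {0, 0};
    auto start = chrono::steady_clock::now();
    bar1(d);  // run both loops on the same thread, one after the other
    bar2(d);
    auto end = chrono::steady_clock::now();
    cout << "cost:" << chrono::duration_cast<chrono::microseconds>(end - start).count() << endl;
}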
Next, let's look at the commonly recommended way to really exploit a multi-core CPU. The local-variable trick from the first example is rather limited; after all, not every thread callback can be rewritten around such a local variable. So instead, let's place the two variables on two separate cache lines!
Struct fields are byte-aligned by default, but only to their natural size (at most the machine word, 8 bytes), so a and b still end up in the same cache line. To fix this, we can use a gcc extension to align them to 64 bytes:
__attribute__((aligned(64)))
This is gcc's __attribute__ syntax.
You can define a macro:
#define ALIG __attribute__((aligned(64)))
Then redefine a Data:
struct Data {
    int ALIG a;
    int ALIG b;
};
That is, __attribute__((aligned(64))) is added after each int, and then both a and b occupy 64 bytes. If you don't believe it, take sizeof(Data): it outputs 128. Of course, sizeof of a and of b is still 4, because the padding bytes are not counted as part of the field itself.
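As a side note (my addition, not part of the original article): since C++11 the same effect can be achieved portably with the standard alignas specifier, without gcc-specific syntax. A sketch using a hypothetical DataAligned struct:

struct DataAligned {
    alignas(64) int a;  // a starts on its own 64-byte boundary
    alignas(64) int b;  // b starts on the next 64-byte boundary
};

static_assert(sizeof(DataAligned) == 128, "each member occupies a full 64-byte line");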
By the way, when doing this kind of alignment, remember that not every machine has a 64-byte cache line. And if the false sharing on your CPU happens in the L3 Cache, it is the L3 cache line size that matters, not L1's; if there is no L3, look at L2.
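On Linux you can query the line size of each cache level at runtime instead of hard-coding 64. A minimal sketch I am adding for illustration (these _SC_LEVEL*_ parameters are provided by glibc; a value of 0 means the information is unavailable):

#include <unistd.h>
#include <iostream>

int main() {
    // Cache line sizes as reported by the OS.
    std::cout << "L1d line: " << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";
    std::cout << "L2  line: " << sysconf(_SC_LEVEL2_CACHE_LINESIZE) << " bytes\n";
    std::cout << "L3  line: " << sysconf(_SC_LEVEL3_CACHE_LINESIZE) << " bytes\n";
}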
Okay, let's recompile and run it:
d.a:1000000
d.b:1000000
cost:3341
The time consumption drops significantly, and repeated runs are fairly stable. Before the alignment, the time kept fluctuating and was consistently higher.
The typical tutorial on false sharing basically ends once it has thrown in the byte alignment, but I have more to say. We did not enable O2 just now; let's turn it on:
g++ -std=c++11 -pthread -O2 false_sharing2.cpp
With O2, the unaligned code and the aligned code report the same cost. On my machine both are a bit over 200, with no obvious difference between them.
Even more surprisingly, if I remove the multithreading and simply call bar1() and bar2() serially, the cost drops to 51!
Do you see the point now? The claim that false sharing ruins performance cannot be generalized. With O2 enabled, serial code is sometimes the most efficient, because serial code is easier for the compiler and the CPU to optimize, while multithreaded code gives them much less room; in that situation padding to the cache line size does not help. On the other hand, the first example in this article showed that as v grows, the serial code clearly slows down. The difference between the two cases is, of course, tied to the specific logic being executed.
So don't treat false sharing as a scourge, and don't treat 64-byte alignment as a panacea. Everything has to be measured to be known. And don't take conclusions obtained under O0 as the standard; I don't think that is meaningful.