Why is everyone focusing on the GIL? The code in question is a single-threaded DFS that works on bitmasks; there is almost no room for the GIL to matter here.
For this N-queens DFS, my C++ code takes about 18 s for n = 15 without any heuristic pruning (about 8.6 s with -O2, which roughly matches what the original poster describes), and the solution clearly relies on loops and recursion. What needs to be analyzed next is exactly why Python is slow here and whether it can be improved.
Here are two pieces of code for testing. The first is Python:
class="highlight">
#!/usr/bin/env python3
import time


def calc(n, i=0, cols=0, diags=0, trans=0):
    # cols / diags / trans are bitmasks of the occupied columns and the
    # two diagonal directions; queens are placed row by row (row i).
    if i == n:
        return 1
    else:
        rt = 0
        for j in range(n):
            col = 1 << j
            diag = 1 << (i - j + n - 1)
            tran = 1 << (i + j)
            if (col & cols) == 0 and (diag & diags) == 0 and (tran & trans) == 0:
                rt += calc(n, i + 1, cols | col, diags | diag, trans | tran)
        return rt


if __name__ == '__main__':
    t = time.time()
    print(calc(13))
    print(time.time() - t)
And the C++ code:
#include <chrono>
#include <iostream>
using namespace std;

long calc(int n, int i = 0, long cols = 0, long diags = 0, long trans = 0) {
    if (i == n) {
        return 1;
    } else {
        long rt = 0;
        for (int j = 0; j < n; j++) {
            long col = (1 << j);
            long diag = (1 << (i - j + n - 1));
            long tran = (1 << (i + j));
            if (!(col & cols) && !(diag & diags) && !(tran & trans)) {
                rt += calc(n, i + 1, col | cols, diag | diags, tran | trans);
            }
        }
        return rt;
    }
}

int main() {
    auto t = chrono::system_clock::now();
    cout << calc(13) << endl;
    cout << (chrono::system_clock::now() - t).count() * 1e-6 << endl;
    return 0;
}
The C++ code here isn't written in an OOP style; it's kept as simple as possible.
The test machine has a Core i7-4870HQ; the compiler is clang++ 8.1.0 and the Python interpreter is CPython 3.6.0. I didn't test n = 15, only n = 13, because 15 takes too long.
Since no multithreading is involved here at all, this has essentially nothing to do with the GIL.
For n = 13, the C++ code ran in 0.48 seconds. To make sure the compiler wasn't quietly doing the work for me, I deliberately compiled with -O0 (with -O2 it gets down to about 0.2 seconds). The Python code took 24 seconds.
For this example, the most direct factor is that Python is interpreted and executed statement by statement, while C++ is compiled to native code first: types are checked at compile time, there is no dynamic typing or runtime type checking, and the compiler is free to optimize.
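To see that statement-by-statement execution concretely, you can dump the bytecode CPython actually interprets; a minimal sketch using the standard dis module, assuming the calc function from the Python script above is in scope:

import dis

# Every opcode printed here is dispatched by the interpreter loop at run
# time, with dynamic checks on the operand types of each operation.
dis.dis(calc)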
With that established, can we improve the efficiency a bit?
A common rule of thumb is that Python's for loops are slow, so we can try rewriting the loop as a list comprehension:
def calc(n, i=0, cols=0, diags=0, trans=0):
    if i == n:
        return 1
    else:
        return sum(
            [
                calc(n, i + 1,
                     cols | (1 << j),
                     diags | (1 << (i - j + n - 1)),
                     trans | (1 << (i + j)))
                for j in range(n)
                if (cols & (1 << j)) == 0
                and (diags & (1 << (i - j + n - 1))) == 0
                and (trans & (1 << (i + j))) == 0
            ]
        )
It ought to be faster, and testing confirms it: this version runs in about 18 seconds. There is still an order-of-magnitude gap, so it doesn't solve the fundamental problem, but it does show that CPython's for loop really isn't fast.
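As a quick sanity check, the same one-shot timing can also be done with the standard timeit module instead of hand-rolled time.time() calls; a minimal sketch, assuming the calc version above is defined in the same script:

import timeit

# One measured run (number=1), evaluated in the current global namespace
# so the statement can see the calc defined above.
print(timeit.timeit('calc(13)', globals=globals(), number=1))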
Now consider a different interpreter, especially one with a JIT: it compiles the code to native machine code at run time and can apply some extra optimizations along the way. Would that be much better?
So let's just try PyPy3 (5.8.0-beta, Python 3.5.3) and see how fast the code can get.
In fact, simply swapping the interpreter for PyPy brings the original 24 s Python source down to about 1 s. A JIT alone buys an order of magnitude, which says a lot about how poor the official CPython interpreter's performance is.
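When comparing timings across interpreters like this, it can be worth printing which implementation actually ran the script, so results don't get attributed to the wrong one; a small sketch using the standard platform module:

import platform

# Prints e.g. "CPython 3.6.0" or "PyPy 3.5.3" depending on the interpreter.
print(platform.python_implementation(), platform.python_version())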
PyPy's JIT is fairly simple and conservative rather than aggressive, but the same code can perform very differently with a better JIT and higher-performance libraries, for example a JIT built on LLVM combined with mature math libraries. We know that C extensions such as numpy can greatly improve Python's performance in numerical computation, and we could likewise write Python extensions directly in C (or use tools such as Cython) to strengthen the computing power. But people are lazy, and rewriting code is a real hassle. For a gadget with an ecosystem as rich as Python's, if your computation code only uses simple numpy structures and Python's own standard structures, numba may be the simplest and fastest option.
#!/usr/bin/env python3
import time
from numba import jit


@jit
def calc(n, i=0, cols=0, diags=0, trans=0):
    if i == n:
        return 1
    else:
        rt = 0
        for j in range(n):
            col = 1 << j
            diag = 1 << (i - j + n - 1)
            tran = 1 << (i + j)
            if (col & cols) == 0 and (diag & diags) == 0 and (tran & trans) == 0:
                rt += calc(n, i + 1, cols | col, diags | diag, trans | tran)
        return rt


if __name__ == '__main__':
    t = time.time()
    print(calc(13))
    print(time.time() - t)
Here we add just two lines of code: import jit from numba and decorate the calculation function with @jit. The running time drops straight to 0.4 s, almost the same as the -O0 C++ build, and that already includes the JIT warm-up. For n = 15 this code takes only about 6.5 s, which even beats the C++ version compiled with -O2.
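Since numba compiles the function lazily on its first call, you can also exclude the compilation cost from the measurement by warming the function up first; a minimal sketch, appended to the @jit script above:

# Warm up: the first call triggers numba's compilation.
calc(1)

# Time the real computation with compilation already done.
t = time.time()
print(calc(13))
print(time.time() - t)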
The reason is that the JIT not only translates the code into native machine code at run time, it also tries to optimize it. If you analyze a run with something like cProfile, you can clearly see the effect.
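One way to observe this from the Python side is to profile the pure-Python version and the @jit version and compare: the millions of recursive Python-level calls collapse into a single native call once numba has compiled the function. A minimal sketch, assuming the relevant calc is defined in the running script:

import cProfile

# For the pure-Python version this records every recursive call and the
# per-call interpreter overhead; for the @jit version almost nothing
# remains at the Python level, since the work runs as native code.
cProfile.run('calc(13)')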
That's all for this post.