
Is Python Code Sensitive to CPU Caching? (2024)

PhilipRoman

IME CPU caches can be observed in almost all languages, regardless of how detached they are from the machine. What I'm really wondering is whether branch prediction can be observed from interpreted code.

exyi

It is! Although my test case is probably an unrealistically bad scenario:

It's the classic question: why is processing a sorted array faster than an unsorted one?

    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:   # data-dependent branch
                r = not r
        return r

    arr = [random.randint(0, 255) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)
    %timeit f(arr_sorted)

Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms for sorted. For comparison, in a compiled language, the difference is 4x.

ammar2

Edit: Analyzed the wrong thing earlier.

This depends on the Python version, but if it has the specializing interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting a specialized `_COMPARE_OP_INT` [1].

This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?

Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]

[1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

[2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...

dzaima

This isn't actually timing the sorting, but just the (dumb) function f.

LPisGood

This is a really good example. It is more like branch prediction than standard data/instruction caching.

I wonder if you could do Spectre-type vulnerabilities in Python. You would need some way to leak micro-architectural state, so without being particularly clever, Python code could maybe only be used as a gadget or something.

cma

The Python speed-up is probably from small integer caching: a sorted array will have runs of adjacent pointers to the same integer objects. The compiled-language one is probably branch prediction, right?
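
For reference, CPython interns small integers (roughly -5 through 256) as an implementation detail, so every 200 in that array is literally the same object. A quick sketch to see it (CPython-specific behaviour, not a language guarantee):

    x = int("200")
    y = int("200")
    print(x is y)   # True on CPython: both names point at the cached 200 object

    x = int("1000")
    y = int("1000")
    print(x is y)   # False: each call allocates a fresh int object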

exyi

I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.

I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it gets stuck on a memory access, which forces it to speculate much further, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, so feel free to correct me.

The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)
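
For context, a sketch of what that modified benchmark might look like (the threshold here is an assumption, scaled to the middle of the new range; %timeit is IPython):

    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 500_000_000:   # assumed threshold for the wider range
                r = not r
        return r

    arr = [random.randint(0, 1_000_000_000) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)      # sorting scrambles the allocation order of the boxed ints

    %timeit f(arr)          # boxed ints visited roughly in allocation order
    %timeit f(arr_sorted)   # pointer chasing jumps all over the heap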

bgwalter

That seems very likely. The benchmark should probably use a range that is guaranteed to be outside of the cached smallint range.

vlovich123

Why would you think that branch prediction wouldn't be observable? Do you think the interpreter is adding too much false sharing to the branch predictor to render it worthless?

almostgotcaught

You do realize that the Python

  if x > 128:
      r = not r

doesn't necessarily correspond to one branch at the ASM instruction level, right? In fact, a priori, it need not correspond to any branches at the ASM instruction level at all.
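
For the curious, the dis module shows roughly what that one line expands into at the bytecode level (exact opcode names differ between CPython versions):

    import dis

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    dis.dis(f)   # e.g. COMPARE_OP and POP_JUMP_IF_FALSE, plus FOR_ITER loop machinery;
                 # each opcode is in turn dispatched through branches inside the interpreter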

vlovich123

Yes I do, and that's neither here nor there. There are almost certainly ASM-level branches implementing the conditional, and the branch predictor isn't tied to a single instruction - the rough high-level mental model is a cache keyed on a few of the least significant bits of the branch instruction's address, mapping to a prediction, although in practice it's far more complicated. Since the predictor is right something like 80% of the time, even when there's false sharing among a lot of branches the CPU does a good job predicting. Its performance is primarily impacted when execution takes both branches closer to 50/50 than to 100/0.

That's why I asked about false sharing specifically, as that's the main way I can think of that Python code wouldn't be able to observe the branch predictor's performance - because the interpreter has so many branches internally that it dominates any branches that might be caused by the Python code itself.

exyi

It corresponds to way more than one branch at the instruction level. AFAIK branch prediction does not care what you are branching on; it just assumes branches will go in similar sequences as they did last time. If the Python 'if' is never taken, the instruction-level predictor will learn that after the comparison operation there is an 'if' operation and then another array access operation. If the Python 'if' is unpredictable, the CPU predictor can't be sure which opcode we are processing after the 'if', so there will be a penalty.

saagarjha

Yep, this is why Spectre mitigations are applied to browsers.

stingraycharles

It should be, as most interpreted code is run through a native JIT compiler at this point, which implies the branch prediction operates on native code as well.

I would be happy to stand corrected if someone knows more about this, though!

pansa2

> most interpreted code uses a JIT native compiler at this point

I suspect most JavaScript code is JIT-compiled - but most Python, Ruby, etc. is interpreted (via bytecode).

tarruda

AFAIK CPython doesn't JIT compile.

pansa2

Recent versions of CPython do have an experimental JIT compiler. I'm not sure how widely-used it is, though.

https://peps.python.org/pep-0744/

eulgro

My guess would be that branch misprediction does have an impact on interpreted languages, but much less. If bytecode instructions take on average 20 CPU cycles to execute, and the branch misprediction penalty is 50 CPU cycles, the relative cost of a misprediction is much smaller than in compiled code.
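
Back-of-the-envelope with those (assumed) numbers, plus an assumed ~2 cycles per iteration for a tight compiled loop body:

    # illustrative only: figures from the comment above plus one assumption
    mispredict_penalty = 50         # cycles (assumed above)
    compiled_iteration = 2          # cycles for a tight compiled loop body (assumption)
    interpreted_iteration = 4 * 20  # ~4 bytecodes per iteration at ~20 cycles each (assumed above)

    print(mispredict_penalty / compiled_iteration)     # 25.0  -> one miss costs ~25 iterations
    print(mispredict_penalty / interpreted_iteration)  # 0.625 -> one miss costs well under one iteration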

adgjlsfhk1

The counterpoint is that a lot of the machine instructions executed during those ~20 cycles are themselves branches.

Delk

I really don't see why you wouldn't expect to find cache-sensitivity in Python, or in some other similarly high-level language.

Python sequences are defined as "support[ing] efficient element access using integer indices" [1]. Python lists are sequences and thus must support random access. In practice that means the implementation is a (dynamically resizing) array allocated contiguously. That means spatial locality is relevant.

If the list type were defined simply as an iterable collection, with no requirement of constant-time random access, the definition would be abstract enough that an implementation might end up being something else than a contiguous array. But if you define the type so that it supports constant-time random access [2], you pretty much end up with cache-friendly sequential access as well.

If you don't define the list type as supporting random access, you also sacrifice asymptotic algorithmic efficiency for lookups by index. Any language that cares about efficiency at all separates collections that support efficient random access from those that don't. (For example, Python has lists and dicts/sets for different access patterns. The Java standard library has separate types for contiguous arrays/lists, hashtable-backed collections and tree-backed collections because the three have different algorithmic efficiencies for different access patterns. In practice it leads to different properties in terms of cache-friendliness as well.)

[1] https://docs.python.org/3/glossary.html#term-sequence

[2] As usual, the Python documentation is a bit vague on the details. It doesn't really say random access has to be constant-time, only that it has to be "efficient". So you might be able to have a non-constant time implementation such as an indexable skiplist while arguing that it's efficient, but you'd have to go out of your way to do that.
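
As a rough way to see the access-pattern effect described above (a sketch only; timings are machine-dependent and include interpreter overhead):

    import random, timeit

    data = list(range(2_000_000))        # contiguous array of pointers to int objects
    seq_idx = list(range(len(data)))
    rand_idx = seq_idx[:]
    random.shuffle(rand_idx)

    def read_all(indices):
        s = 0
        for i in indices:
            s += data[i]
        return s

    print(timeit.timeit(lambda: read_all(seq_idx), number=1))   # walks the pointer array in order
    print(timeit.timeit(lambda: read_all(rand_idx), number=1))  # same work, cache-unfriendly order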

zahlman

>I really don't see why you wouldn't expect to find cache-sensitivity in Python, or in some other similarly high-level language... Python lists are sequences and thus must support random access. In practice that means the implementation is a (dynamically resizing) array allocated contiguously.

The contents of the dynamically allocated array, in the C implementation, are pointers (PyObject *), since lists are heterogeneous and must be able to store any Python object. That entails indirection, which defeats spatial locality. Even if you iterate over cached data, you have to jump around to retrieve the actual objects to do *anything* with them.
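
You can get a feel for that indirection in CPython, where id() happens to return an object's address (an implementation detail, not a language guarantee):

    import random

    xs = [random.randint(1000, 10**9) for _ in range(5)]  # values above the small-int cache
    print([id(x) for x in xs])   # addresses of the boxed ints; the list stores only these pointers

    xs.sort()
    print([id(x) for x in xs])   # same objects, now visited in value order, not allocation order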

Delk

Sure. But that's one less level of indirection than if you also had to jump around to get the reference to the object in the first place.

PaulHoule

"Mechanical sympathy" effects are so strong on modern CPUs you feel them even in languages like Python that are far from the metal.

SteelByte

Great overview of CPU caching and its impact on Python performance. The benchmark really drives home the point. It would be interesting to see how other dynamic languages like JavaScript compare, and if JIT compilers are able to optimize away some of these caching inefficiencies.

saagarjha

No, they can’t. The “caching inefficiencies” (?!) are part of the hardware and can’t be optimized away by a JIT. I think you have a very confused understanding of the post.

davvid

The article didn't mention Python's atomic refcounts. They're very bad for cache usage, since even just reading an object writes to its refcount and keeps invalidating cache lines.

senderista

Non-atomic refcounts invalidate cache as well; they're just not as expensive.

quotemstr

The only version of Python that uses atomic reference counting is the very new free-threaded build.