Inside a function, the bytecode is:
2 0 SETUP_LOOP 20 (to 23)
3 LOAD_GLOBAL 0 (xrange)
6 LOAD_CONST 3 (100000000)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 6 (to 22)
16 STORE_FAST 0 (i)
3 19 JUMP_ABSOLUTE 13
>> 22 POP_BLOCK
>> 23 LOAD_CONST 0 (None)
26 RETURN_VALUE
At the top level, the bytecode is:
1 0 SETUP_LOOP 20 (to 23)
3 LOAD_NAME 0 (xrange)
6 LOAD_CONST 3 (100000000)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 6 (to 22)
16 STORE_NAME 1 (i)
2 19 JUMP_ABSOLUTE 13
>> 22 POP_BLOCK
>> 23 LOAD_CONST 2 (None)
26 RETURN_VALUE
The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local but at toplevel it is a global.

To examine bytecode, use the dis module. I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the compile builtin.
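For example (my own minimal reproduction of the setup; the function name f is just for illustration):

import dis

def f():
    for i in xrange(100000000):
        pass

dis.dis(f)  # function bytecode: STORE_FAST for the local i

code = compile("for i in xrange(100000000): pass", "<string>", "exec")
dis.dis(code)  # toplevel bytecode: STORE_NAME for i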
By default compilers optimize for an "average" processor. Since different processors favor different instruction sequences, compiler optimizations enabled by -O2 might benefit the average processor but decrease performance on your particular processor (and the same applies to -Os). If you try the same example on different processors, you will find that some of them benefit from -O2 while others do better with -Os.
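The benchmark source itself isn't reproduced here; judging from the mangled names (_ZL3addRKiS0_, _ZL4workii) and the 0xbebc200 (200,000,000) loop count in the disassembly further down, it is a small C++ program along these lines (a reconstruction, not the exact code):

#include <cstdlib>

// add() shows up as _ZL3addRKiS0_.isra.0 below; the separate call
// instruction implies inlining was prevented.
__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

// work() shows up as _ZL4workii; 0xbebc200 = 200,000,000 iterations.
__attribute__((noinline))
static int work(int xval, int yval) {
    int sum = 0;
    for (int i = 0; i < 200000000; ++i)
        sum += add(xval + sum, yval + sum);
    return sum;
}

int main(int argc, char* argv[]) {
    // invoked as: time ./test 0 0
    return work(std::atoi(argv[1]), std::atoi(argv[2]));
}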
Here are the results for time ./test 0 0 on several processors (user time reported):
Processor (System-on-Chip) Compiler Time (-O2) Time (-Os) Fastest
AMD Opteron 8350 gcc-4.8.1 0.704s 0.896s -O2
AMD FX-6300 gcc-4.8.1 0.392s 0.340s -Os
AMD E2-1800 gcc-4.7.2 0.740s 0.832s -O2
Intel Xeon E5405 gcc-4.8.1 0.603s 0.804s -O2
Intel Xeon E5-2603 gcc-4.4.7 1.121s 1.122s -
Intel Core i3-3217U gcc-4.6.4 0.709s 0.709s -
Intel Core i3-3217U gcc-4.7.3 0.708s 0.822s -O2
Intel Core i3-3217U gcc-4.8.1 0.708s 0.944s -O2
Intel Core i7-4770K gcc-4.8.1 0.296s 0.288s -Os
Intel Atom 330 gcc-4.8.1 2.003s 2.007s -O2
ARM 1176JZF-S (Broadcom BCM2835) gcc-4.6.3 3.470s 3.480s -O2
ARM Cortex-A8 (TI OMAP DM3730) gcc-4.6.3 2.727s 2.727s -
ARM Cortex-A9 (TI OMAP 4460) gcc-4.6.3 1.648s 1.648s -
ARM Cortex-A9 (Samsung Exynos 4412) gcc-4.6.3 1.250s 1.250s -
ARM Cortex-A15 (Samsung Exynos 5250) gcc-4.7.2 0.700s 0.700s -
Qualcomm Snapdragon APQ8060A gcc-4.8 1.53s 1.52s -Os
In some cases you can alleviate the effect of disadvantageous optimizations by asking gcc to optimize for your particular processor (using the options -mtune=native or -march=native):
Processor Compiler Time (-O2 -mtune=native) Time (-Os -mtune=native)
AMD FX-6300 gcc-4.8.1 0.340s 0.340s
AMD E2-1800 gcc-4.7.2 0.740s 0.832s
Intel Xeon E5405 gcc-4.8.1 0.603s 0.803s
Intel Core i7-4770K gcc-4.8.1 0.296s 0.288s
Update: on Ivy Bridge-based Core i3, three versions of gcc (4.6.4, 4.7.3, and 4.8.1) produce binaries with significantly different performance, but the assembly code has only subtle variations. So far, I have no explanation of this fact.
Assembly from gcc-4.6.4 -Os (executes in 0.709 secs):
00000000004004d2 <_ZL3addRKiS0_.isra.0>:
4004d2: 8d 04 37 lea eax,[rdi+rsi*1]
4004d5: c3 ret
00000000004004d6 <_ZL4workii>:
4004d6: 41 55 push r13
4004d8: 41 89 fd mov r13d,edi
4004db: 41 54 push r12
4004dd: 41 89 f4 mov r12d,esi
4004e0: 55 push rbp
4004e1: bd 00 c2 eb 0b mov ebp,0xbebc200
4004e6: 53 push rbx
4004e7: 31 db xor ebx,ebx
4004e9: 41 8d 34 1c lea esi,[r12+rbx*1]
4004ed: 41 8d 7c 1d 00 lea edi,[r13+rbx*1+0x0]
4004f2: e8 db ff ff ff call 4004d2 <_ZL3addRKiS0_.isra.0>
4004f7: 01 c3 add ebx,eax
4004f9: ff cd dec ebp
4004fb: 75 ec jne 4004e9 <_ZL4workii+0x13>
4004fd: 89 d8 mov eax,ebx
4004ff: 5b pop rbx
400500: 5d pop rbp
400501: 41 5c pop r12
400503: 41 5d pop r13
400505: c3 ret
Assembly from gcc-4.7.3 -Os (executes in 0.822 secs):
00000000004004fa <_ZL3addRKiS0_.isra.0>:
4004fa: 8d 04 37 lea eax,[rdi+rsi*1]
4004fd: c3 ret
00000000004004fe <_ZL4workii>:
4004fe: 41 55 push r13
400500: 41 89 f5 mov r13d,esi
400503: 41 54 push r12
400505: 41 89 fc mov r12d,edi
400508: 55 push rbp
400509: bd 00 c2 eb 0b mov ebp,0xbebc200
40050e: 53 push rbx
40050f: 31 db xor ebx,ebx
400511: 41 8d 74 1d 00 lea esi,[r13+rbx*1+0x0]
400516: 41 8d 3c 1c lea edi,[r12+rbx*1]
40051a: e8 db ff ff ff call 4004fa <_ZL3addRKiS0_.isra.0>
40051f: 01 c3 add ebx,eax
400521: ff cd dec ebp
400523: 75 ec jne 400511 <_ZL4workii+0x13>
400525: 89 d8 mov eax,ebx
400527: 5b pop rbx
400528: 5d pop rbp
400529: 41 5c pop r12
40052b: 41 5d pop r13
40052d: c3 ret
Assembly from gcc-4.8.1 -Os (executes in 0.994 secs):
00000000004004fd <_ZL3addRKiS0_.isra.0>:
4004fd: 8d 04 37 lea eax,[rdi+rsi*1]
400500: c3 ret
0000000000400501 <_ZL4workii>:
400501: 41 55 push r13
400503: 41 89 f5 mov r13d,esi
400506: 41 54 push r12
400508: 41 89 fc mov r12d,edi
40050b: 55 push rbp
40050c: bd 00 c2 eb 0b mov ebp,0xbebc200
400511: 53 push rbx
400512: 31 db xor ebx,ebx
400514: 41 8d 74 1d 00 lea esi,[r13+rbx*1+0x0]
400519: 41 8d 3c 1c lea edi,[r12+rbx*1]
40051d: e8 db ff ff ff call 4004fd <_ZL3addRKiS0_.isra.0>
400522: 01 c3 add ebx,eax
400524: ff cd dec ebp
400526: 75 ec jne 400514 <_ZL4workii+0x13>
400528: 89 d8 mov eax,ebx
40052a: 5b pop rbx
40052b: 5d pop rbp
40052c: 41 5c pop r12
40052e: 41 5d pop r13
400530: c3 ret
Best Answer
If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early.) High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: it's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.

As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
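As an illustration of that division-by-constant transform, a div_by_13 presumably looks something like this (my own minimal version, not necessarily the one in the godbolt link):

#include <cstdint>

// gcc -O2 compiles this to a multiply-high plus shifts (a fixed-point
// multiplicative inverse of 13) instead of a div instruction.
uint32_t div_by_13(uint32_t x) {
    return x / 13;
}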
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.

Helping the compiler
(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output (-O3). -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place, writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using a new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs; see the SAR insn set ref manual entry.) (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).

BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
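In C terms the loop body is essentially the following (my annotation, not the answer's original listing; the comments name the instruction pattern discussed below):

#include <cstdint>

// Sketch of the loop gcc makes branchless: one LEA for 3*n + 1, MOV+SHR for
// n/2, TEST of the low bit, CMOV to pick the next n, ADD for the counter,
// CMP/JNE for the loop condition.
static unsigned collatz_steps(uint64_t n) {
    unsigned count = 0;
    while (n != 1) {                         // cmp / jne
        uint64_t odd_next  = 3*n + 1;        // lea rcx, [rax + rax*2 + 1]
        uint64_t even_next = n >> 1;         // mov + shr
        n = (n & 1) ? odd_next : even_next;  // test + cmov
        ++count;                             // add edx, 1
    }
    return count;
}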
The inner loop is branchless, and the critical path of the loop-carried dependency chain is the 3-component LEA (3 cycles) feeding the CMOV (2 cycles on Haswell). Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop and a slot in the pipeline, so it's not free, just zero latency.) The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions and the TEST: SHR sets CF = the bit shifted out, so we can use cmovc instead of test / cmovz.

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict.) It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs: 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) even do a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much.)

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple of registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
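A rough illustration of that interleaving (my sketch, not code from the answer), advancing two chains in one loop so two independent dependency chains are in flight:

#include <cstdint>
#include <utility>

// Two Collatz chains in lockstep. For simplicity this waits until both have
// reached 1; a finished chain just stops counting (the conditional-increment
// option weighed below).
static std::pair<unsigned, unsigned> collatz_pair(uint64_t a, uint64_t b) {
    unsigned ca = 0, cb = 0;
    while (a != 1 || b != 1) {
        if (a != 1) { a = (a & 1) ? 3*a + 1 : a >> 1; ++ca; }
        if (b != 1) { b = (b & 1) ? 3*b + 1 : b >> 1; ++cb; }
    }
    return {ca, cb};
}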
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.

You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with a 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization
You can and should implement this with intrinsics instead of hand-written asm.
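A minimal intrinsics sketch of that sticky-mask idea (my code, untested, assuming AVX2 and 4x uint64_t lanes; compile with -mavx2):

#include <cstdint>
#include <immintrin.h>

// Four Collatz chains per __m256i. Lanes that have reached 1 get a zeroed
// increment vector, so their counters stop changing ("sticky" detection).
static void collatz4(const uint64_t n_in[4], uint64_t count_out[4]) {
    __m256i n     = _mm256_loadu_si256((const __m256i*)n_in);
    __m256i count = _mm256_setzero_si256();
    const __m256i one = _mm256_set1_epi64x(1);

    for (;;) {
        __m256i done = _mm256_cmpeq_epi64(n, one);         // lanes where n == 1
        if (_mm256_movemask_epi8(done) == -1) break;       // all lanes finished

        __m256i half = _mm256_srli_epi64(n, 1);            // n >> 1
        __m256i trip = _mm256_add_epi64(                   // 3*n + 1
                           _mm256_add_epi64(n, _mm256_add_epi64(n, n)), one);
        __m256i odd  = _mm256_cmpeq_epi64(_mm256_and_si256(n, one), one);
        __m256i next = _mm256_blendv_epi8(half, trip, odd); // odd ? 3n+1 : n/2

        n     = _mm256_blendv_epi8(next, n, done);          // finished lanes stay at 1
        count = _mm256_add_epi64(count, _mm256_andnot_si256(done, one)); // += 0 or 1
    }
    _mm256_storeu_si256((__m256i*)count_out, count);
}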
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer).
@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar n values in parallel in different integer registers, though.

So the loop might look like this:
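For instance (my reconstruction of that shape in C, using the __builtin_ctzll mentioned below, which compiles to tzcnt or rep bsf):

#include <cstdint>

// Collapse every run of n/=2 steps into one shift whose count comes from
// the trailing-zero count. n must be non-zero.
static unsigned collatz_steps_ctz(uint64_t n) {
    unsigned count = 0;
    while (n != 1) {
        unsigned tz = __builtin_ctzll(n);  // how many n/=2 steps are pending
        n >>= tz;
        count += tz;
        if (n == 1) break;
        n = 3*n + 1;                       // n is odd here, so the result is even
        ++count;
    }
    return count;
}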
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF.)

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.

@Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium).