Program execution is faster and more efficient when microprocessors can continuously fetch instructions from the cache. This is particularly important in High-Performance Computing (HPC) and Real-Time Systems (RTS), where execution times have to be predictable and not fluctuate. In the event of a cache miss, new instructions have to be fetched from RAM to repopulate the cache, causing performance to suffer.
The more predictable the execution path of a program, the better for performance. Branchless programming is a coding technique to minimize the number of possible execution branches. Think of if statements executing one part of the program or another; when the predicted code is in one branch, but the other branch needs to run, this causes inefficiencies and delays.
Here we take a look at two versions of a function to determine which of two numbers is the largest. The first version uses execution branches with an if statement, and the other version uses branchless programming. The code was tested on a MacBook Pro running macOS Sequoia 15.5 with an M1 Max and on a PC running Ubuntu 24.04 with an Intel Core i7-7567U.
You can reproduce the experiment yourself and share your observations. We will see how to do that in the Run it yourself section. The following table shows the execution times (in milliseconds) with different levels of compiler optimization.
| Function | Architecture | No Optimization | -O1 | -O2 | -O3 | -Os |
|---|---|---|---|---|---|---|
| branch | M1 Max | 41794.074 | 0.005 | 0.003 | 0.003 | 0.003 |
| branchless | M1 Max | 37546.447 | 0.001 | 0.001 | 0.001 | 0.001 |
| branch | Corei7-7567U | 35539.531 | 258.027 | 0.005 | 0.005 | 0.005 |
| branchless | Corei7-7567U | 54218.671 | 507.746 | 0.004 | 0.004 | 0.004 |
On macOS with Apple Silicon, the branchless code consistently outperforms the branched code across all levels of compiler optimization. The same is not observed on the Ubuntu PC with an Intel processor. The branchless code is only faster when compiled with -O2 or higher levels of optimization.
Clone or download this repository.
Create a directory for the binary files:
mkdir binCompile the code:
gcc -std=c2x branches.c -o bin/branchesRun the program:
./bin/branchesRepeat compilation and running with different levels of compiler optimization:
gcc -std=c2x -O1 branches.c -o bin/branches
./bin/branches
gcc -std=c2x -O2 branches.c -o bin/branches
./bin/branches
gcc -std=c2x -O3 branches.c -o bin/branches
./bin/branches
gcc -std=c2x -Os branches.c -o bin/branches
./bin/branchesYou can also take a look at the assembly code for each function for different optimization levels. Although the branchless code may contain more instructions, it avoids jumps. Fetching and running instructions from the cache is faster than a cache miss requiring fetching new instructions from RAM.
gcc -std=c2x -S branches.c -o branches.asm
gcc -std=c2x -S -O1 branches.c -o branches_O1.asm
gcc -std=c2x -S -O2 branches.c -o branches_O2.asm
gcc -std=c2x -S -O3 branches.c -o branches_O3.asm
gcc -std=c2x -S -Os branches.c -o branches_Os.asmApache License, Version 2.0. See the LICENSE file for more info.