Skip to content

dcirne/branches

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

Branch vs. Branchless

Program execution is faster and more efficient when microprocessors can continuously fetch instructions from the cache. This is particularly important in High-Performance Computing (HPC) and Real-Time Systems (RTS), where execution times have to be predictable and not fluctuate. In the event of a cache miss, new instructions have to be fetched from RAM to repopulate the cache, causing performance to suffer.

The more predictable the execution path of a program, the better for performance. Branchless programming is a coding technique to minimize the number of possible execution branches. Think of if statements executing one part of the program or another; when the predicted code is in one branch, but the other branch needs to run, this causes inefficiencies and delays.

Here we take a look at two versions of a function to determine which of two numbers is the largest. The first version uses execution branches with an if statement, and the other version uses branchless programming. The code was tested on a MacBook Pro running macOS Sequoia 15.5 with an M1 Max and on a PC running Ubuntu 24.04 with an Intel Core i7-7567U.

You can reproduce the experiment yourself and share your observations. We will see how to do that in the Run it yourself section. The following table shows the execution times (in milliseconds) with different levels of compiler optimization.

Function Architecture No Optimization -O1 -O2 -O3 -Os
branch M1 Max 41794.074 0.005 0.003 0.003 0.003
branchless M1 Max 37546.447 0.001 0.001 0.001 0.001
branch Corei7-7567U 35539.531 258.027 0.005 0.005 0.005
branchless Corei7-7567U 54218.671 507.746 0.004 0.004 0.004

On macOS with Apple Silicon, the branchless code consistently outperforms the branched code across all levels of compiler optimization. The same is not observed on the Ubuntu PC with an Intel processor. The branchless code is only faster when compiled with -O2 or higher levels of optimization.

Run it yourself

Clone or download this repository.

Create a directory for the binary files:

mkdir bin

Compile the code:

gcc -std=c2x branches.c -o bin/branches

Run the program:

./bin/branches

Repeat compilation and running with different levels of compiler optimization:

gcc -std=c2x -O1 branches.c -o bin/branches
./bin/branches

gcc -std=c2x -O2 branches.c -o bin/branches
./bin/branches

gcc -std=c2x -O3 branches.c -o bin/branches
./bin/branches

gcc -std=c2x -Os branches.c -o bin/branches
./bin/branches

You can also take a look at the assembly code for each function for different optimization levels. Although the branchless code may contain more instructions, it avoids jumps. Fetching and running instructions from the cache is faster than a cache miss requiring fetching new instructions from RAM.

gcc -std=c2x -S branches.c -o branches.asm
gcc -std=c2x -S -O1 branches.c -o branches_O1.asm
gcc -std=c2x -S -O2 branches.c -o branches_O2.asm
gcc -std=c2x -S -O3 branches.c -o branches_O3.asm
gcc -std=c2x -S -Os branches.c -o branches_Os.asm

License

Apache License, Version 2.0. See the LICENSE file for more info.

About

Branch vs. branchless code.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages