Open
110 commits
472d64c
basic infrastructure, dpas version is working
bashbaug Jan 4, 2024
16c343c
improved address arithmetic
bashbaug Jan 4, 2024
1da0c2c
added vnni versions
bashbaug Jan 5, 2024
11e0eef
cleanup
bashbaug Jan 5, 2024
d637ee6
host code cleanup
bashbaug Jan 5, 2024
52b9550
add SIMD16 versions and emulation
bashbaug Jan 6, 2024
074e0a5
add support for PVC, which does not support SIMD8
bashbaug Jan 6, 2024
1ca8f73
fix warning
bashbaug Jan 8, 2024
f2b00f3
add 2D block read variants
bashbaug Jan 9, 2024
0bb5529
reenable all variants
bashbaug Jan 9, 2024
7b89cfe
add vnni block read variants
bashbaug Jan 9, 2024
ca4b3cd
fix typo in emulation path
bashbaug Jan 9, 2024
3ffcf3e
start to add block tiled versions
bashbaug Jan 12, 2024
b469713
improve block tiled versions
bashbaug Jan 12, 2024
098f339
more improvements
bashbaug Jan 12, 2024
ce7866f
add more block tiled variants
bashbaug Jan 12, 2024
48c3bc2
refactor device code into a helper file
bashbaug Jan 13, 2024
feb1064
switch to timing using event profiling
bashbaug Jan 13, 2024
4e89026
more refactorization and simplification
bashbaug Jan 15, 2024
c7edcd6
add tiled block read kernels for PVC
bashbaug Jan 15, 2024
a433769
fix block read tiled kernels and execute them
bashbaug Jan 15, 2024
d4eb405
fix typo affecting one of the SIMD16 kernels
bashbaug Jan 17, 2024
b6be2d4
fix a few more bugs and improve validation testing
bashbaug Jan 17, 2024
d76df7e
add support for a larger A matrix block read
bashbaug Jan 17, 2024
756d2e9
switch the tiled dpas order
bashbaug Jan 17, 2024
4caea7b
temporarily disable the large a matrix block load
bashbaug Jan 19, 2024
0fb3d66
add support for launching more than one subgroup per work group
bashbaug Jan 19, 2024
d09b982
add support for split barriers
bashbaug Jan 19, 2024
031e076
add support for larger K values for some tiled kernels
bashbaug Jan 19, 2024
16b7cda
rename tester host functions to match kernel names more closely
bashbaug Jan 19, 2024
83185fd
add support for more K tiles for the blockread kernels
bashbaug Jan 19, 2024
a24a5b0
start to add support for loading two K tiles at once
bashbaug Jan 22, 2024
38d03c0
fix type for emulated v2 block reads
bashbaug Jan 22, 2024
ab84bbe
performance improvements and bugfixes for DG2
bashbaug Jan 23, 2024
4ae2d95
add support for wide K block reads
bashbaug Jan 24, 2024
aa95c5e
minor improvements to tester program
bashbaug Jan 24, 2024
40d0d7a
try a larger B matrix block read for the VNNI kernel
bashbaug Jan 25, 2024
07311ec
add a mask argument to only run a subset of tests
bashbaug Feb 8, 2024
b19fe5e
initial support for prefetching
bashbaug Feb 8, 2024
a1efab2
add support for more prefetching
bashbaug Feb 8, 2024
92c90d4
add driver version output to tester
bashbaug Feb 8, 2024
5f29f59
add 8x2 tiled versions
bashbaug Feb 13, 2024
8a0ed40
add tf32 tester
bashbaug Feb 22, 2024
4712db3
add more tf32 variants and enable prefetching
bashbaug Feb 22, 2024
29d8bf7
add a few more non-blockread tiled tests
bashbaug Feb 22, 2024
c88dc58
add basic activation function support
bashbaug Feb 22, 2024
083a946
add support for prefetching multiple iterations ahead
bashbaug Feb 22, 2024
d5c3d6d
add support for even bigger block reads
bashbaug Feb 28, 2024
18096ee
add a way to generate tf32 dpas currently (disabled by default)
bashbaug Mar 1, 2024
459c109
increase prefetch distance
bashbaug Mar 2, 2024
b034c1d
fix DG2 prefetches
bashbaug Mar 2, 2024
697754d
use more helper functions for DG2 tiled kernels
bashbaug Mar 2, 2024
cbbdcb6
switch back to a smaller prefetch
bashbaug Mar 2, 2024
2d23f76
add support for 2D block prefetches
bashbaug Mar 3, 2024
14bc83e
add support for bigger transformed block reads
bashbaug Mar 4, 2024
02a6045
swap the B matrix NN and KK tiling dimensions
bashbaug Mar 4, 2024
a0c2e53
remove the 8x2 tiled kernels
bashbaug Mar 5, 2024
873c6ab
try a different order for prefetches and loads
bashbaug Mar 5, 2024
ef205e3
add support for more block prefetches
bashbaug Mar 5, 2024
2ea4d56
switch the prefetch order back for now
bashbaug Mar 6, 2024
f60930a
add support for larger work-groups in both dimensions
bashbaug Mar 6, 2024
54b9366
add support for initializing matrices with zero data
bashbaug Mar 7, 2024
8d23a8c
add support for setting round robin scheduling (disabled by default)
bashbaug Mar 7, 2024
133fb58
tell the compiler K is always greater than zero
bashbaug Mar 7, 2024
3989cae
switch the sum dimensions for consistency
bashbaug Mar 8, 2024
7c0c358
try a cooperative prefetch for the B matrix tile
bashbaug Mar 8, 2024
9a0317a
fix the cooperative prefetching indexing calculation
bashbaug Mar 8, 2024
afca843
a few more naming changes for consistency
bashbaug Mar 13, 2024
7176d53
try a smaller cooperative prefetch for the B matrix for the rowmajor …
bashbaug Mar 14, 2024
90b23b0
re-enable all tiled matrix scenarios
bashbaug Mar 14, 2024
ab01142
try a cooperative prefetch for the A matrix tile
bashbaug Mar 14, 2024
d6858fa
try a slightly smaller A tile prefetch
bashbaug Mar 14, 2024
8cbc8db
sync tf32 samples with bf16 samples
bashbaug Apr 9, 2024
bb11953
initial int8 matrixperf sample
bashbaug Apr 10, 2024
bb48fd4
enable more int8 samples
bashbaug Apr 12, 2024
1b8fda8
slight tf32 diversion
bashbaug Apr 12, 2024
5c3a6d0
update function names to align closer with final proposal
bashbaug May 20, 2024
72d7f3b
add 32x1 prefetch variants in addition to 16x2 variants
bashbaug Jul 30, 2024
5a2ac29
Merge branch 'main' into matrixperf-test
bashbaug Jul 30, 2024
d3d42f0
add a define for 32x1 prefetch variants
bashbaug Jul 30, 2024
eff9d19
enable support for the native tf32 dpas
bashbaug Jul 30, 2024
af8b5e1
update tf32 function names to be closer to the final versions
bashbaug Jul 30, 2024
af02676
Merge branch 'main' into matrixperf
bashbaug Jul 31, 2024
d927552
Merge branch 'main' into matrixperf
bashbaug Oct 4, 2024
6a72682
Merge branch 'matrixperf' into matrixperf-i8
bashbaug Feb 26, 2025
83a0690
revert change to tf32 kernel
bashbaug Feb 26, 2025
7e95831
fix typo
bashbaug Feb 26, 2025
dddfdf3
Merge remote-tracking branch 'origin/main' into matrixperf-i8
bashbaug Feb 26, 2025
41159a8
switch block read functions to the production names
bashbaug Feb 27, 2025
ddc93ff
add transpose block read variant
bashbaug Feb 27, 2025
fbb652f
Merge remote-tracking branch 'origin/main' into matrixperf-i8
bashbaug Feb 27, 2025
0734097
switch to a more efficient sequence with conditional movs
bashbaug Feb 28, 2025
3324a53
cleanup
bashbaug Feb 28, 2025
ace054e
Merge branch 'main' into matrixperf
bashbaug May 4, 2025
62a1fd8
switch to production 2d block io functions
bashbaug May 4, 2025
badf4c2
switch more block reads to the production versions
bashbaug May 4, 2025
a58314a
Merge branch 'matrixperf-i8' into matrixperf-final
bashbaug May 4, 2025
9988e6d
integrate i8 matrix multiplication
bashbaug May 4, 2025
d680ff1
switch to final directories and sample names
bashbaug May 4, 2025
5fe7e09
Merge branch 'main' into matrixperf-final
bashbaug Jan 8, 2026
a348dea
Merge branch 'main' into matrixperf-final
bashbaug Jan 8, 2026
af28503
update copyright, add README
bashbaug Feb 23, 2026
7f5764c
remove warning when split barriers are unsupported
bashbaug Feb 23, 2026
4e31c01
fixes for CPU and more
bashbaug Mar 3, 2026
ada8a4a
switch to kernels that use integer dot products
bashbaug Mar 4, 2026
2d734c0
Merge branch 'main' into matrixperf-final
bashbaug May 6, 2026
2ae7f9b
minor fixes
bashbaug Jun 2, 2026
cdee85b
fix error calculation
bashbaug Jun 2, 2026
e261d9c
a few more fixes
bashbaug Jun 3, 2026
f0c05d0
final cleanup
bashbaug Jun 3, 2026
130 changes: 130 additions & 0 deletions include/bfloat16.hpp
@@ -0,0 +1,130 @@
/*
// Copyright (c) 2024-2026 Ben Ashbaugh
//
// SPDX-License-Identifier: MIT
*/
#pragma once

#include <cmath>
#include <cstdint>

class bfloat16;

class bfloat16 {
  using StorageType = uint16_t;
  StorageType value;

  static StorageType from_float(const float &a) {
    if (std::isnan(a))
      return 0xffc1;
    union {
      uint32_t intStorage;
      float floatValue;
    };
    floatValue = a;
    // Do RNE and truncate
    uint32_t roundingBias = ((intStorage >> 16) & 0x1) + 0x00007FFF;
    return static_cast<StorageType>((intStorage + roundingBias) >> 16);
  }

  static float to_float(const StorageType &a) {
    union {
      uint32_t intStorage;
      float floatValue;
    };
    // Widen to 32 bits before shifting so the shift is done on an unsigned
    // value rather than on a promoted (signed) int.
    intStorage = static_cast<uint32_t>(a) << 16;
    return floatValue;
  }

public:
  bfloat16() = default;
  bfloat16(const bfloat16 &) = default;
  ~bfloat16() = default;

  // Implicit conversion from float to bfloat16
  bfloat16(const float &a) { value = from_float(a); }

  bfloat16 &operator=(const float &rhs) {
    value = from_float(rhs);
    return *this;
  }

  // Implicit conversion from bfloat16 to float
  operator float() const { return to_float(value); }

  // Logical operators (!,||,&&) are covered if we can cast to bool
  explicit operator bool() const { return to_float(value) != 0.0f; }

  // Unary minus operator overloading
  friend bfloat16 operator-(const bfloat16 &lhs) {
    return -to_float(lhs.value);
  }

// Increment and decrement operators overloading
#define OP(op) \
  friend bfloat16 &operator op(bfloat16 &lhs) { \
    float f = to_float(lhs.value); \
    lhs.value = from_float(op f); \
    return lhs; \
  } \
  friend bfloat16 operator op(bfloat16 &lhs, int) { \
    bfloat16 old = lhs; \
    operator op(lhs); \
    return old; \
  }
  OP(++)
  OP(--)
#undef OP

// Assignment operators overloading
#define OP(op) \
  friend bfloat16 &operator op(bfloat16 &lhs, const bfloat16 &rhs) { \
    float f = static_cast<float>(lhs); \
    f op static_cast<float>(rhs); \
    return lhs = f; \
  } \
  template <typename T> \
  friend bfloat16 &operator op(bfloat16 &lhs, const T &rhs) { \
    float f = static_cast<float>(lhs); \
    f op static_cast<float>(rhs); \
    return lhs = f; \
  } \
  template <typename T> friend T &operator op(T &lhs, const bfloat16 &rhs) { \
    float f = static_cast<float>(lhs); \
    f op static_cast<float>(rhs); \
    return lhs = f; \
  }
  OP(+=)
  OP(-=)
  OP(*=)
  OP(/=)
#undef OP

// Binary operators overloading
#define OP(type, op) \
  friend type operator op(const bfloat16 &lhs, const bfloat16 &rhs) { \
    return type{static_cast<float>(lhs) op static_cast<float>(rhs)}; \
  } \
  template <typename T> \
  friend type operator op(const bfloat16 &lhs, const T &rhs) { \
    return type{static_cast<float>(lhs) op static_cast<float>(rhs)}; \
  } \
  template <typename T> \
  friend type operator op(const T &lhs, const bfloat16 &rhs) { \
    return type{static_cast<float>(lhs) op static_cast<float>(rhs)}; \
  }
  OP(bfloat16, +)
  OP(bfloat16, -)
  OP(bfloat16, *)
  OP(bfloat16, /)
  OP(bool, ==)
  OP(bool, !=)
  OP(bool, <)
  OP(bool, >)
  OP(bool, <=)
  OP(bool, >=)
#undef OP

  // Bitwise(|,&,~,^), modulo(%) and shift(<<,>>) operations are not supported
  // for floating-point types.
};
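
The rounding in `from_float` is easy to misread, so here is a minimal standalone sketch of the same round-to-nearest-even conversion. The helper names `floatToBf16Bits` and `bf16BitsToFloat` are hypothetical and not part of the header, and the sketch uses `std::memcpy` rather than a union to reinterpret the bits; treat it as an illustration under those assumptions, not as the header's implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: convert a float to bfloat16 bits using
// round-to-nearest-even, mirroring the logic in from_float above.
static uint16_t floatToBf16Bits(float f) {
    if (std::isnan(f))
        return 0xffc1; // same NaN encoding the header returns
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Adding 0x7FFF rounds halfway cases down; adding the lowest kept bit
    // on top of that rounds them up when the kept value is odd.
    uint32_t roundingBias = ((bits >> 16) & 0x1) + 0x00007FFF;
    return static_cast<uint16_t>((bits + roundingBias) >> 16);
}

// Hypothetical helper: widen bfloat16 bits back to a float by placing them
// in the upper half of a 32-bit word.
static float bf16BitsToFloat(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

With these helpers, `1.00390625f` (bit pattern `0x3F808000`, exactly halfway between two bfloat16 values) rounds down to `0x3F80`, while `1.01171875f` (`0x3F818000`) rounds up to `0x3F82`; in both cases the result has an even low bit.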
22 changes: 22 additions & 0 deletions include/util.hpp
@@ -6,6 +6,12 @@
#pragma once

#include <CL/opencl.hpp>

#include <cctype>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>

static cl_version getDeviceOpenCLVersion(
@@ -68,6 +74,22 @@ static bool checkDeviceForExtension(
    return supported;
}

static std::string readStringFromFile(
    const std::string& filename )
{
    std::ifstream is(filename, std::ios::binary);
    if (!is.good()) {
        printf("Couldn't open file '%s'!\n", filename.c_str());
        return "";
    }

    std::string source{
        std::istreambuf_iterator<char>(is),
        std::istreambuf_iterator<char>() };

    return source;
}

static bool checkPlatformIndex(
    const std::vector<cl::Platform>& platforms,
    int platformIndex)
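
The consolidated `readStringFromFile` drops the `seekg`/`tellg` size query that the per-sample copies carried, since the iterator-range constructor reads to end-of-file on its own. A minimal sketch of that technique in isolation follows; the name `slurpFile` is hypothetical, and unlike the helper in `util.hpp` it stays silent when the file cannot be opened.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical stand-in for readStringFromFile: returns the whole file as a
// string, or an empty string if the file cannot be opened.
static std::string slurpFile(const std::string& filename) {
    std::ifstream is(filename, std::ios::binary);
    if (!is.good()) {
        return "";
    }
    // A default-constructed istreambuf_iterator is the end-of-stream
    // sentinel, so no file size needs to be computed up front.
    return std::string(
        std::istreambuf_iterator<char>(is),
        std::istreambuf_iterator<char>());
}
```

Opening the stream in binary mode keeps the kernel source byte-for-byte identical to the file on disk, which avoids newline translation on platforms that perform it.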
21 changes: 0 additions & 21 deletions samples/05_kernelfromfile/main.cpp
@@ -13,27 +13,6 @@

#include "util.hpp"

static std::string readStringFromFile(
const std::string& filename )
{
std::ifstream is(filename, std::ios::binary);
if (!is.good()) {
printf("Couldn't open file '%s'!\n", filename.c_str());
return "";
}

size_t filesize = 0;
is.seekg(0, std::ios::end);
filesize = (size_t)is.tellg();
is.seekg(0, std::ios::beg);

std::string source{
std::istreambuf_iterator<char>(is),
std::istreambuf_iterator<char>() };

return source;
}

int main(
int argc,
char** argv )
21 changes: 0 additions & 21 deletions samples/06_ndrangekernelfromfile/main.cpp
@@ -13,27 +13,6 @@

#include "util.hpp"

static std::string readStringFromFile(
const std::string& filename )
{
std::ifstream is(filename, std::ios::binary);
if (!is.good()) {
printf("Couldn't open file '%s'!\n", filename.c_str());
return "";
}

size_t filesize = 0;
is.seekg(0, std::ios::end);
filesize = (size_t)is.tellg();
is.seekg(0, std::ios::beg);

std::string source{
std::istreambuf_iterator<char>(is),
std::istreambuf_iterator<char>() };

return source;
}

int main(
int argc,
char** argv )
11 changes: 11 additions & 0 deletions samples/20_matrixexperiments-bf16/CMakeLists.txt
@@ -0,0 +1,11 @@
# Copyright (c) 2024-2026 Ben Ashbaugh
#
# SPDX-License-Identifier: MIT

add_opencl_sample(
    TEST
    NUMBER 20
    TARGET matrixexperiments-bf16
    VERSION 200 # for clSetKernelExecInfo
    SOURCES main.cpp
    KERNELS matrix_helpers_bf16.cl matrix_kernels_bf16.cl matrix_kernel_tiled_bf16.cl)
60 changes: 60 additions & 0 deletions samples/20_matrixexperiments-bf16/README.md
@@ -0,0 +1,60 @@
# matrixexperiments-bf16

## Sample Purpose

This sample demonstrates several techniques for multiplying large matrices whose elements are 16-bit `bfloat16` values.
The sample includes many different implementations:

1. The "naive" implementation is intentionally simple.
It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices.
2. The "dpas" kernels use sub-group extensions to improve performance.
On some devices, they will also use specialized matrix multiplication extensions to further improve performance.
Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices.
3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance.
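
For orientation, the computation that every variant performs can be sketched on the host as a plain triple loop. This is a minimal reference sketch, not the sample's kernel code: the function name `naiveMatMul` is hypothetical, it assumes row-major storage, and it uses `float` elements instead of `bfloat16` to stay self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: C (M x N) = A (M x K) * B (K x N),
// with all matrices stored row-major.
static std::vector<float> naiveMatMul(
    const std::vector<float>& A,
    const std::vector<float>& B,
    std::size_t M, std::size_t N, std::size_t K)
{
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; m++) {
        for (std::size_t n = 0; n < N; n++) {
            // Each output element is the dot product of row m of A and
            // column n of B.
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; k++) {
                sum += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = sum;
        }
    }
    return C;
}
```

A loop like this is also a convenient source of reference values when checking the output of an optimized kernel.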

Most of the optimized kernels operate on fixed size tiles of matrix data.
For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options.
Experiment with different options to see what performs the best!

A good place to start for some devices is:

```sh
./matrixexperiments-bf16 -m4096 --options="-DSGS_PER_WG_X=4 -DSGS_PER_WG_Y=8 -DKK=2 -cl-intel-256-GRF-per-thread" --zero
```

## Key APIs and Concepts

This sample will optionally use the following OpenCL extensions:

* cl_intel_bfloat16_conversions
* cl_intel_required_subgroup_size
* cl_intel_split_work_group_barrier
* cl_intel_subgroup_2d_block_io
* cl_intel_subgroup_matrix_multiply_accumulate
* cl_intel_subgroups
* cl_intel_subgroups_short

## Command Line Options

| Option | Default Value | Description |
|:--|:-:|:--|
| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
| `--file <string>` | `matrix_kernels_bf16.cl` | Specify the name of the file with the OpenCL kernel source.
| `--options <string>` | None | Specify optional program build options.
| `--matrixsize <int>` | 512 | Specify the dimensions of the matrix.
| `--iterations <int>` | 16 | Specify the number of iterations for performance testing.
| `--validate` | n/a | Validate results for correctness.
| `--zero` | n/a | Initialize all matrices to zero.
| `--identity` | n/a | Initialize all matrices to one.
| `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column.
| `--emulate` | n/a | Do not use specialized matrix multiplication extensions.
| `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling.
| `--skipinit` | n/a | Skip initialization of source matrices.
| `--roundrobin` | n/a | Use round robin thread scheduling.
| `--threshold <float>` | 0.01 | Set the threshold used when validating results.
| `--mask <int>` | ~0 | Set a mask to only run a subset of tests.
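
The table does not say how `--mask` bits map to tests. One common convention, shown here purely as an assumption, is that test `i` runs when bit `i` of the mask is set, which is consistent with the default of `~0` running everything. The helper name `testEnabled` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: assumes test number testIndex is enabled when the
// corresponding bit of the mask is set. The sample's actual mapping from
// mask bits to tests may differ.
static bool testEnabled(uint32_t mask, unsigned testIndex) {
    // Guard the shift: shifting a 32-bit value by 32 or more is undefined.
    return testIndex < 32 && ((mask >> testIndex) & 1u) != 0;
}
```

Under this convention a mask of `0x5` would select tests 0 and 2 only.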

By default, the source matrices are populated with random data.
When validating results, it is recommended to use either "fixed" or "identity" data.
For best performance, use "zero" data.
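
How the `--threshold` value is applied is not described above. As a sketch of one plausible approach, the check below compares each result against a reference using a relative error and falls back to an absolute error for references smaller than one. The function name `withinThreshold` and the exact error formula are assumptions, not the sample's validation code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical validation check: returns true when every element of result
// is within threshold of the matching reference element.
static bool withinThreshold(
    const std::vector<float>& result,
    const std::vector<float>& reference,
    float threshold)
{
    assert(result.size() == reference.size());
    for (std::size_t i = 0; i < result.size(); i++) {
        // Relative error for large references, absolute error near zero.
        float denom = std::fmax(std::fabs(reference[i]), 1.0f);
        float err = std::fabs(result[i] - reference[i]) / denom;
        // Written as a negated <= so that a NaN error also fails the check.
        if (!(err <= threshold)) {
            return false;
        }
    }
    return true;
}
```

A threshold on relative error suits `bfloat16` inputs, which keep only 8 bits of significand precision, so some rounding difference between an optimized kernel and a reference is expected.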