matrix experiments #167
Open
bashbaug wants to merge 110 commits into main from matrixperf-final
Commits
472d64c basic infrastructure, dpas version is working (bashbaug)
16c343c improved address arithmetic (bashbaug)
1da0c2c added vnni versions (bashbaug)
11e0eef cleanup (bashbaug)
d637ee6 host code cleanup (bashbaug)
52b9550 add SIMD16 versions and emulation (bashbaug)
074e0a5 add support for PVC, which does not support SIMD8 (bashbaug)
1ca8f73 fix warning (bashbaug)
f2b00f3 add 2D block read variants (bashbaug)
0bb5529 reenable all variants (bashbaug)
7b89cfe add vnni block read variants (bashbaug)
ca4b3cd fix typo in emulation path (bashbaug)
3ffcf3e start to add block tiled versions (bashbaug)
b469713 improve block tiled versions (bashbaug)
098f339 more improvements (bashbaug)
ce7866f add more block tiled variants (bashbaug)
48c3bc2 refactor device code into a helper file (bashbaug)
feb1064 switch to timing using event profiling (bashbaug)
4e89026 more refactorization and simplification (bashbaug)
c7edcd6 add tiled block read kernels for PVC (bashbaug)
a433769 fix block read tiled kernels and execute them (bashbaug)
d4eb405 fix typo affecting one of the SIMD16 kernels (bashbaug)
b6be2d4 fix a few more bugs and improve validation testing (bashbaug)
d76df7e add support for a larger A matrix block read (bashbaug)
756d2e9 switch the tiled dpas order (bashbaug)
4caea7b temporarily disable the large a matrix block load (bashbaug)
0fb3d66 add support for launching more than one subgroup per work group (bashbaug)
d09b982 add support for split barriers (bashbaug)
031e076 add support for larger K values for some tiled kernels (bashbaug)
16b7cda rename tester host functions to match kernel names more closely (bashbaug)
83185fd add support for more K tiles for the blockread kernels (bashbaug)
a24a5b0 start to add support for loading two K tiles at once (bashbaug)
38d03c0 fix type for emulated v2 block reads (bashbaug)
ab84bbe performance improvements and bugfixes for DG2 (bashbaug)
4ae2d95 add support for wide K block reads (bashbaug)
aa95c5e minor improvements to tester program (bashbaug)
40d0d7a try a larger B matrix block read for the VNNI kernel (bashbaug)
07311ec add a mask argument to only run a subset of tests (bashbaug)
b19fe5e initial support for prefetching (bashbaug)
a1efab2 add support for more prefetching (bashbaug)
92c90d4 add driver version output to tester (bashbaug)
5f29f59 add 8x2 tiled versions (bashbaug)
8a0ed40 add tf32 tester (bashbaug)
4712db3 add more tf32 variants and enable prefetching (bashbaug)
29d8bf7 add a few more non-blockread tiled tests (bashbaug)
c88dc58 add basic activation function support (bashbaug)
083a946 add support for prefetching multiple iterations ahead (bashbaug)
d5c3d6d add support for even bigger block reads (bashbaug)
18096ee add a way to generate tf32 dpas currently (disabled by default) (bashbaug)
459c109 increase prefetch distance (bashbaug)
b034c1d fix DG2 prefetches (bashbaug)
697754d use more helper functions for DG2 tiled kernels (bashbaug)
cbbdcb6 switch back to a smaller prefetch (bashbaug)
2d23f76 add support for 2D block prefetches (bashbaug)
14bc83e add support for bigger transformed block reads (bashbaug)
02a6045 swap the B matrix NN and KK tiling dimensions (bashbaug)
a0c2e53 remove the 8x2 tiled kernels (bashbaug)
873c6ab try a different order for prefetches and loads (bashbaug)
ef205e3 add support for more block prefetches (bashbaug)
2ea4d56 switch the prefetch order back for now (bashbaug)
f60930a add support for larger work-groups in both dimensions (bashbaug)
54b9366 add support for initializing matrices with zero data (bashbaug)
8d23a8c add support for setting round robin scheduling (disabled by default) (bashbaug)
133fb58 tell the compiler K is always greater than zero (bashbaug)
3989cae switch the sum dimensions for consistency (bashbaug)
7c0c358 try a cooperative prefetch for the B matrix tile (bashbaug)
9a0317a fix the cooperative prefetching indexing calculation (bashbaug)
afca843 a few more naming changes for consistency (bashbaug)
7176d53 try a smaller cooperative prefetch for the B matrix for the rowmajor … (bashbaug)
90b23b0 re-enable all tiled matrix scenarios (bashbaug)
ab01142 try a cooperative prefetch for the A matrix tile (bashbaug)
d6858fa try a slightly smaller A tile prefetch (bashbaug)
8cbc8db sync tf32 samples with bf16 samples (bashbaug)
bb11953 initial int8 matrixperf sample (bashbaug)
bb48fd4 enable more int8 samples (bashbaug)
1b8fda8 slight tf32 diversion (bashbaug)
5c3a6d0 update function names to align closer with final proposal (bashbaug)
72d7f3b add 32x1 prefetch variants in addition to 16x2 variants (bashbaug)
5a2ac29 Merge branch 'main' into matrixperf-test (bashbaug)
d3d42f0 add a define for 32x1 prefetch variants (bashbaug)
eff9d19 enable support for the native tf32 dpas (bashbaug)
af8b5e1 update tf32 function names to be closer to the final versions (bashbaug)
af02676 Merge branch 'main' into matrixperf (bashbaug)
d927552 Merge branch 'main' into matrixperf (bashbaug)
6a72682 Merge branch 'matrixperf' into matrixperf-i8 (bashbaug)
83a0690 revert change to tf32 kernel (bashbaug)
7e95831 fix typo (bashbaug)
dddfdf3 Merge remote-tracking branch 'origin/main' into matrixperf-i8 (bashbaug)
41159a8 switch block read functions to the production names (bashbaug)
ddc93ff add transpose block read variant (bashbaug)
fbb652f Merge remote-tracking branch 'origin/main' into matrixperf-i8 (bashbaug)
0734097 switch to a more efficient sequence with conditional movs (bashbaug)
3324a53 cleanup (bashbaug)
ace054e Merge branch 'main' into matrixperf (bashbaug)
62a1fd8 switch to production 2d block io functions (bashbaug)
badf4c2 switch more block reads to the production versions (bashbaug)
a58314a Merge branch 'matrixperf-i8' into matrixperf-final (bashbaug)
9988e6d integrate i8 matrix multiplication (bashbaug)
d680ff1 switch to final directories and sample names (bashbaug)
5fe7e09 Merge branch 'main' into matrixperf-final (bashbaug)
a348dea Merge branch 'main' into matrixperf-final (bashbaug)
af28503 update copyright, add README (bashbaug)
7f5764c remove warning when split barriers are unsupported (bashbaug)
4e31c01 fixes for CPU and more (bashbaug)
ada8a4a switch to kernels that use integer dot products (bashbaug)
2d734c0 Merge branch 'main' into matrixperf-final (bashbaug)
2ae7f9b minor fixes (bashbaug)
cdee85b fix error calculation (bashbaug)
e261d9c a few more fixes (bashbaug)
f0c05d0 final cleanup (bashbaug)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| /* | ||
| // Copyright (c) 2024-2026 Ben Ashbaugh | ||
| // | ||
| // SPDX-License-Identifier: MIT | ||
| */ | ||
| #pragma once | ||
|
|
||
| #include <cmath> | ||
| #include <cstdint> | ||
|
|
||
| class bfloat16; | ||
|
|
||
| class bfloat16 { | ||
| using StorageType = uint16_t; | ||
| StorageType value; | ||
|
|
||
| static StorageType from_float(const float &a) { | ||
| if (std::isnan(a)) | ||
| return 0xffc1; | ||
| union { | ||
| uint32_t intStorage; | ||
| float floatValue; | ||
| }; | ||
| floatValue = a; | ||
| // Do RNE and truncate | ||
| uint32_t roundingBias = ((intStorage >> 16) & 0x1) + 0x00007FFF; | ||
| return static_cast<StorageType>((intStorage + roundingBias) >> 16); | ||
| } | ||
|
|
||
| static float to_float(const StorageType &a) { | ||
| union { | ||
| uint32_t intStorage; | ||
| float floatValue; | ||
| }; | ||
| intStorage = a << 16; | ||
| return floatValue; | ||
| } | ||
|
|
||
| public: | ||
| bfloat16() = default; | ||
| bfloat16(const bfloat16 &) = default; | ||
| ~bfloat16() = default; | ||
|
|
||
| // Implicit conversion from float to bfloat16 | ||
| bfloat16(const float &a) { value = from_float(a); } | ||
|
|
||
| bfloat16 &operator=(const float &rhs) { | ||
| value = from_float(rhs); | ||
| return *this; | ||
| } | ||
|
|
||
| // Implicit conversion from bfloat16 to float | ||
| operator float() const { return to_float(value); } | ||
|
|
||
| // Logical operators (!,||,&&) are covered if we can cast to bool | ||
| explicit operator bool() const { return to_float(value) != 0.0f; } | ||
|
|
||
| // Unary minus operator overloading | ||
| friend bfloat16 operator-(const bfloat16 &lhs) { | ||
| return -to_float(lhs.value); | ||
| } | ||
|
|
||
| // Increment and decrement operators overloading | ||
| #define OP(op) \ | ||
| friend bfloat16 &operator op(bfloat16 &lhs) { \ | ||
| float f = to_float(lhs.value); \ | ||
| lhs.value = from_float(op f); \ | ||
| return lhs; \ | ||
| } \ | ||
| friend bfloat16 operator op(bfloat16 &lhs, int) { \ | ||
| bfloat16 old = lhs; \ | ||
| operator op(lhs); \ | ||
| return old; \ | ||
| } | ||
| OP(++) | ||
| OP(--) | ||
| #undef OP | ||
|
|
||
| // Assignment operators overloading | ||
| #define OP(op) \ | ||
| friend bfloat16 &operator op(bfloat16 &lhs, const bfloat16 &rhs) { \ | ||
| float f = static_cast<float>(lhs); \ | ||
| f op static_cast<float>(rhs); \ | ||
| return lhs = f; \ | ||
| } \ | ||
| template <typename T> \ | ||
| friend bfloat16 &operator op(bfloat16 &lhs, const T &rhs) { \ | ||
| float f = static_cast<float>(lhs); \ | ||
| f op static_cast<float>(rhs); \ | ||
| return lhs = f; \ | ||
| } \ | ||
| template <typename T> friend T &operator op(T &lhs, const bfloat16 &rhs) { \ | ||
| float f = static_cast<float>(lhs); \ | ||
| f op static_cast<float>(rhs); \ | ||
| return lhs = f; \ | ||
| } | ||
| OP(+=) | ||
| OP(-=) | ||
| OP(*=) | ||
| OP(/=) | ||
| #undef OP | ||
|
|
||
| // Binary operators overloading | ||
| #define OP(type, op) \ | ||
| friend type operator op(const bfloat16 &lhs, const bfloat16 &rhs) { \ | ||
| return type{static_cast<float>(lhs) op static_cast<float>(rhs)}; \ | ||
| } \ | ||
| template <typename T> \ | ||
| friend type operator op(const bfloat16 &lhs, const T &rhs) { \ | ||
| return type{static_cast<float>(lhs) op static_cast<float>(rhs)}; \ | ||
| } \ | ||
| template <typename T> \ | ||
| friend type operator op(const T &lhs, const bfloat16 &rhs) { \ | ||
| return type{static_cast<float>(lhs) op static_cast<float>(rhs)}; \ | ||
| } | ||
| OP(bfloat16, +) | ||
| OP(bfloat16, -) | ||
| OP(bfloat16, *) | ||
| OP(bfloat16, /) | ||
| OP(bool, ==) | ||
| OP(bool, !=) | ||
| OP(bool, <) | ||
| OP(bool, >) | ||
| OP(bool, <=) | ||
| OP(bool, >=) | ||
| #undef OP | ||
|
|
||
| // Bitwise(|,&,~,^), modulo(%) and shift(<<,>>) operations are not supported | ||
| // for floating-point types. | ||
| }; | ||
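
The `from_float` rounding in the class above can be exercised in isolation. The following is a minimal sketch, not code from this pull request: the free functions `float_to_bf16_bits` and `bf16_bits_to_float` are hypothetical names, and `std::memcpy` is swapped in for the anonymous union as one portable way to reinterpret the bits. The rounding bias and the NaN encoding mirror the diff.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-alone mirror of bfloat16::from_float from the diff above.
// Assumption: memcpy is used instead of the union to read the float's bits.
inline uint16_t float_to_bf16_bits(float f) {
    if (std::isnan(f))
        return 0xffc1; // same NaN encoding as in the diff
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Round to nearest even: add 0x7FFF, plus one more when the lowest kept
    // bit is already 1, so exact ties round toward an even result.
    uint32_t roundingBias = ((bits >> 16) & 0x1) + 0x00007FFF;
    return static_cast<uint16_t>((bits + roundingBias) >> 16);
}

// Hypothetical mirror of bfloat16::to_float: widen the 16 stored bits into
// the upper half of a 32-bit float.
inline float bf16_bits_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

With these helpers, `1.00390625f` (bit pattern `0x3F808000`, an exact tie) rounds down to `0x3F80`, while `1.01171875f` (bit pattern `0x3F818000`, also a tie) rounds up to `0x3F82`; in both cases the result has an even low bit.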
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| # Copyright (c) 2024-2026 Ben Ashbaugh | ||
| # | ||
| # SPDX-License-Identifier: MIT | ||
|
|
||
| add_opencl_sample( | ||
| TEST | ||
| NUMBER 20 | ||
| TARGET matrixexperiments-bf16 | ||
| VERSION 200 # for clSetKernelExecInfo | ||
| SOURCES main.cpp | ||
| KERNELS matrix_helpers_bf16.cl matrix_kernels_bf16.cl matrix_kernel_tiled_bf16.cl) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| # matrixexperiments-bf16 | ||
|
|
||
| ## Sample Purpose | ||
|
|
||
| This sample demonstrates various techniques to perform a large matrix multiplication where the matrix elements contain 16-bit `bfloat16` data. | ||
| The sample includes many different implementations: | ||
|
|
||
| 1. The "naive" implementation is a very simple implementation. | ||
| It is not very fast, but it is easy to understand, and it has no extension dependencies so it will run on many devices. | ||
| 2. The "dpas" kernels use sub-group extensions to improve performance. | ||
| On some devices, they will also use specialized matrix multiplication extensions to further improve performance. | ||
| Because these kernels require certain extensions or a specific sub-group size, they may not run on all devices. | ||
| 3. The "dpas blockread" kernels use additional sub-group extensions to further improve performance. | ||
|
|
||
| Most of the optimized kernels operate on fixed size tiles of matrix data. | ||
| For some of these kernels, parameters such as the number of matrix tiles per sub-group or the number of sub-groups per work-group may be modified via program build options. | ||
| Experiment with different options to see what performs the best! | ||
|
|
||
| A good place to start for some devices is: | ||
|
|
||
| ```sh | ||
| ./matrixexperiments-bf16 -m4096 --options="-DSGS_PER_WG_X=4 -DSGS_PER_WG_Y=8 -DKK=2 -cl-intel-256-GRF-per-thread" --zero | ||
| ``` | ||
|
|
||
| ## Key APIs and Concepts | ||
|
|
||
| This sample will optionally use the following OpenCL extensions: | ||
|
|
||
| * cl_intel_bfloat16_conversions | ||
| * cl_intel_required_subgroup_size | ||
| * cl_intel_split_work_group_barrier | ||
| * cl_intel_subgroup_2d_block_io | ||
| * cl_intel_subgroup_matrix_multiply_accumulate | ||
| * cl_intel_subgroups | ||
| * cl_intel_subgroups_short | ||
|
|
||
| ## Command Line Options | ||
|
|
||
| | Option | Default Value | Description | | ||
| |:--|:-:|:--| | ||
| | `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on. | ||
| | `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on. | ||
| | `--file <string>` | `matrix_kernels_bf16.cl` | Specify the name of the file with the OpenCL kernel source. | ||
| | `--options <string>` | None | Specify optional program build options. | ||
| | `--matrixsize <int>` | 512 | Specify the dimensions of the matrix. | ||
| | `--iterations <int>` | 16 | Specify the number of iterations for performance testing. | ||
| | `--validate` | n/a | Validate results for correctness. | ||
| | `--zero` | n/a | Initialize all matrices to zero. | ||
| | `--identity` | n/a | Initialize all matrices to one. | ||
| | `--fixed` | n/a | Initialize all matrices to values computed from the matrix row and column. | ||
| | `--emulate` | n/a | Do not use specialized matrix multiplication extensions. | ||
| | `--wallclock` | n/a | Measure performance using wallclock time instead of event profiling. | ||
| | `--skipinit` | n/a | Skip initialization of source matrices. | ||
| | `--roundrobin` | n/a | Use round robin thread scheduling. | ||
| | `--threshold <float>` | 0.01 | Set the threshold used when validating results. | ||
| | `--mask <int>` | ~0 | Set a mask to only run a subset of tests. | ||
|
|
||
| By default, the source matrices are populated with random data. | ||
| When validating results, it is recommended to use either "fixed" or "identity" data. | ||
| For best performance, use "zero" data. | ||
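
To make the "naive" implementation and the `--validate` / `--threshold` options described in the README above more concrete, here is a rough host-side sketch. It is not taken from this pull request: the function names `naive_matmul` and `validate`, the row-major layout, the use of plain `float` storage, and the error metric (absolute error scaled by `max(|reference|, 1)`) are all assumptions made for illustration. The sample's actual kernels and validation may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: C (M x N) = A (M x K) * B (K x N), all row-major.
// One output element per (m, n) pair, accumulated in float.
inline std::vector<float> naive_matmul(const std::vector<float> &A,
                                       const std::vector<float> &B,
                                       std::size_t M, std::size_t N,
                                       std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; m++) {
        for (std::size_t n = 0; n < N; n++) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; k++)
                sum += A[m * K + k] * B[k * N + n];
            C[m * N + n] = sum;
        }
    }
    return C;
}

// Hypothetical check: every element must be within `threshold` of the
// reference. Assumption: the error is scaled by max(|reference|, 1), and the
// 0.01 default simply echoes the README's default threshold value.
inline bool validate(const std::vector<float> &result,
                     const std::vector<float> &reference,
                     float threshold = 0.01f) {
    if (result.size() != reference.size())
        return false;
    for (std::size_t i = 0; i < result.size(); i++) {
        float scale = std::max(std::fabs(reference[i]), 1.0f);
        float err = std::fabs(result[i] - reference[i]) / scale;
        if (!(err <= threshold)) // written this way so NaN also fails
            return false;
    }
    return true;
}
```

A typical use would compute the reference once on the host with `naive_matmul`, then call `validate` on each kernel's output. Multiplying by an identity matrix is a convenient sanity check because the expected output equals the input.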