matrix experiments#167

Open: bashbaug wants to merge 107 commits into main from matrixperf-final

bashbaug (Owner) commented Jun 2, 2026

This PR adds several samples that demonstrate various methods of computing a large matrix multiplication. There are currently three samples: one that computes the product of two bfloat16 matrices, another that computes the product of two 8-bit integer matrices, and a third that computes the product of two tf32 ("TensorFloat-32") matrices.

Each sample includes a naive version for correctness that runs (usually slowly) on any OpenCL implementation, plus many other variants that demonstrate different extensions and tiling strategies. The samples are flexible and can accommodate other implementations as needed.
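
As a rough illustration of what a naive correctness reference computes, here is a minimal host-side sketch in C++. The function name `naive_matmul`, the row-major layout, and the `float` element type are assumptions for illustration; they are not taken from the samples in this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference, not the PR's code:
// computes C (M x N) = A (M x K) * B (K x N), with all matrices row-major.
std::vector<float> naive_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t M, std::size_t N, std::size_t K)
{
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            // One output element is the dot product of a row of A
            // with a column of B.
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                sum += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = sum;
        }
    }
    return C;
}
```

A triple loop like this does no tiling or data reuse, which is why it is slow but easy to trust as a baseline for validating optimized variants.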

Now have tiled implementations for SIMD16 as well.
We want to prioritize reuse of the A matrix to make best use of read suppression buffers.
This is not working (silently failing) with some recent drivers, so disable it for now. Ideally we will be able to reenable it shortly.
This should enable better cache reuse across subgroups.
This may also be helpful to keep subgroups running approximately together, which could also improve cache utilization.
Also, remove tK from all host function output, since it is only used internally within the kernels.

Copilot AI left a comment

Pull request overview

This PR adds a new “matrix experiments” sample suite under samples/20_* that demonstrates large matrix-multiplication implementations for bf16, int8, and tf32 in OpenCL, including naive reference kernels and multiple optimized subgroup/tiled variants.

Changes:

  • Adds three new sample executables (matrixexperiments-bf16, matrixexperiments-i8, matrixexperiments-tf32) with corresponding OpenCL kernel sources and README usage docs.
  • Introduces shared helper utility readStringFromFile() in include/util.hpp and removes duplicate per-sample implementations.
  • Adds a new include/bfloat16.hpp type helper.
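
For readers unfamiliar with the format, bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an IEEE 754 binary32 float. A minimal conversion sketch follows; the helper names are hypothetical and the round-to-nearest-even choice is an assumption, so this may differ from what include/bfloat16.hpp actually does.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not the contents of bfloat16.hpp.

// Convert a float to the 16-bit bfloat16 bit pattern, rounding to nearest even.
inline std::uint16_t float_to_bf16_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));  // well-defined way to read the float's bits

    // NaN input: truncate, and set a mantissa bit so the result stays a NaN.
    if ((u & 0x7F800000u) == 0x7F800000u && (u & 0x007FFFFFu) != 0u) {
        return static_cast<std::uint16_t>((u >> 16) | 0x0040u);
    }

    // Add half of the discarded range, plus the lowest kept bit,
    // so that exact ties round to the even result.
    u += 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<std::uint16_t>(u >> 16);
}

// Widen a bfloat16 bit pattern back to float; this direction is exact.
inline float bf16_bits_to_float(std::uint16_t b)
{
    const std::uint32_t u = static_cast<std::uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Because bfloat16 shares the float exponent range, the widening conversion is a plain 16-bit shift, which is part of what makes the format convenient for host-side validation code.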

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 29 comments.

| File | Description |
| --- | --- |
| samples/CMakeLists.txt | Registers the three new matrix experiment sample subdirectories. |
| samples/20_matrixexperiments-tf32/README.md | Documents tf32 sample purpose, extensions, and CLI flags. |
| samples/20_matrixexperiments-tf32/matrix_kernels_tf32.cl | Adds tf32 naive + subgroup + tiled kernel variants. |
| samples/20_matrixexperiments-tf32/matrix_kernel_tiled_tf32.cl | Adds templated tf32 tiled kernel implementation. |
| samples/20_matrixexperiments-tf32/matrix_helpers_tf32.cl | Adds tf32 activation + subgroup load/store helpers. |
| samples/20_matrixexperiments-tf32/main.cpp | Adds tf32 host harness: argument parsing, build, run, validate, benchmark. |
| samples/20_matrixexperiments-tf32/CMakeLists.txt | Adds build rules for tf32 sample. |
| samples/20_matrixexperiments-i8/README.md | Documents int8 sample purpose, extensions, and CLI flags. |
| samples/20_matrixexperiments-i8/matrix_kernels_i8.cl | Adds int8 naive + subgroup + blockread kernel variants. |
| samples/20_matrixexperiments-i8/matrix_helpers_i8.cl | Adds int8 activation + dp4/dpas emulation + IO helpers. |
| samples/20_matrixexperiments-i8/main.cpp | Adds int8 host harness: argument parsing, build, run, validate, benchmark. |
| samples/20_matrixexperiments-i8/CMakeLists.txt | Adds build rules for int8 sample. |
| samples/20_matrixexperiments-bf16/README.md | Documents bf16 sample purpose, extensions, and CLI flags. |
| samples/20_matrixexperiments-bf16/matrix_kernels_bf16.cl | Adds bf16 naive + subgroup + tiled kernel variants. |
| samples/20_matrixexperiments-bf16/matrix_kernel_tiled_bf16.cl | Adds templated bf16 tiled + blockread tiled kernel implementation. |
| samples/20_matrixexperiments-bf16/matrix_helpers_bf16.cl | Adds bf16 conversion, activation, and subgroup load/store helpers. |
| samples/20_matrixexperiments-bf16/CMakeLists.txt | Adds build rules for bf16 sample. |
| samples/06_ndrangekernelfromfile/main.cpp | Removes local readStringFromFile implementation (now centralized). |
| samples/05_kernelfromfile/main.cpp | Removes local readStringFromFile implementation (now centralized). |
| include/util.hpp | Adds shared readStringFromFile() helper and <fstream> include. |
| include/bfloat16.hpp | Adds a C++ bfloat16 helper type. |
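
The summary mentions a shared `readStringFromFile()` helper built on `<fstream>`. A helper of that kind might look roughly like the sketch below; the name `readStringFromFileSketch`, its signature, and the choice to return an empty string on failure are all assumptions, not the code in include/util.hpp.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical sketch of a "read a whole file into a string" helper.
// Returns an empty string if the file cannot be opened (an assumed policy).
std::string readStringFromFileSketch(const std::string& filename)
{
    // Binary mode avoids newline translation, so the kernel source
    // reaches the OpenCL compiler byte-for-byte.
    std::ifstream is(filename, std::ios::binary);
    if (!is.good()) {
        return std::string();
    }

    // Stream the entire file buffer into a string stream in one step.
    std::ostringstream ss;
    ss << is.rdbuf();
    return ss.str();
}
```

Centralizing a helper like this means each sample's host code only has to pass the returned string on when creating its program, rather than carrying its own copy of the file-reading logic.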

Comment thread samples/20_matrixexperiments-tf32/main.cpp
Comment on lines +161 to +163
    auto localErr = std::fabs(C[index] - C_ref[index]) /
                    std::max(std::fabs(C[index]),
                             std::fabs(C_ref[index]));
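
The text of the review comment on these lines is not shown here. One property of the expression is worth noting regardless: when both `C[index]` and `C_ref[index]` are zero, the denominator is zero and the result is NaN. A guarded variant could be sketched as follows, where the helper name `relative_error` and the choice to report zero error for two zero values are assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: relative error between a computed value and a
// reference value, guarded against a zero denominator.
inline float relative_error(float value, float reference)
{
    const float denom = std::max(std::fabs(value), std::fabs(reference));
    if (denom == 0.0f) {
        // Both inputs are zero, so they match exactly.
        return 0.0f;
    }
    return std::fabs(value - reference) / denom;
}
```

With this shape, a validation loop can compare `relative_error(C[index], C_ref[index])` against a threshold without a NaN silently passing or failing the comparison.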
Comment thread samples/20_matrixexperiments-tf32/main.cpp
Comment thread samples/20_matrixexperiments-tf32/matrix_helpers_tf32.cl
Comment thread samples/20_matrixexperiments-tf32/matrix_kernel_tiled_tf32.cl Outdated
Comment thread include/bfloat16.hpp Outdated
Comment thread include/bfloat16.hpp Outdated
Comment thread samples/20_matrixexperiments-tf32/CMakeLists.txt
Comment thread samples/20_matrixexperiments-i8/CMakeLists.txt
Comment thread samples/20_matrixexperiments-bf16/CMakeLists.txt