20 Mar 06:31

kvaragan

ad6a372

AOCL 5.2.2 Release Latest

AOCL-BLAS 5.2.2 Release Notes

Overview

AOCL-BLAS 5.2.2 is an incremental release building on the 5.2 GA release, delivering performance optimizations, bug fixes, improved threading stability, and expanded test coverage.

Performance Optimizations

Optimized SGEMM rd kernels on Zen3
Improved SGEMM rd kernel on Zen4/Zen5
SGEMM tiny path tuning for Zen4 and Zen5
Added tiny path for SGEMM
Added fast path for single-threaded AVX512 DGEMV kernel
Replaced intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
Improved fringe case handling for AXPYV kernel
Disabled small_gemm for Zen4/Zen5 and added single-thread check for tiny path

Bug Fixes

Fixed memory leak in DGEMV kernel
Fixed extreme values handling in GEMV
Fixed integer division in GEMV that was supposed to be a double operation
Fixed integer overflow issue in TPSV
Fixed out-of-bounds access in F32 matrix add/mul ops
Fixed BF16 to F32 conversion in the AVX2 F32 code path
Fixed a bug in the BF16 AVX2 conversion path
Fixed F32 to BF16 conversion and AVX512 ISA support checks
Fixed cblas_ctrmm invalid diag handling
Fixed Coverity issue in ZTRSM
Fixed Coverity static analysis issue in DTRSM
Fixed high priority Coverity issues in LPGEMM
Resolved operator precedence warning in Zen5 DCOMPLEX threshold logic
Modified AXPY kernel to ensure consistency of numerical results
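
The integer-division fix listed above describes a common C pitfall: when both operands are integers, the division truncates before the result is ever converted to double. The following is a minimal sketch of that pattern, not code from AOCL-BLAS; the function names `share_truncated` and `share_exact` are hypothetical and chosen only for illustration.

```c
/* Hypothetical helpers, not taken from the AOCL-BLAS source. */

/* Both operands are int, so the quotient is truncated to an integer
   first and only then converted to double on return. */
double share_truncated(int n, int nt) {
    return n / nt;
}

/* Casting one operand to double first makes this a floating-point
   division, which keeps the fractional part. */
double share_exact(int n, int nt) {
    return (double)n / nt;
}
```

With `n = 7` and `nt = 2`, the first function returns 3.0 and the second returns 3.5.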

Threading & Stability

Fixed data race in native code-path
Added an OpenMP barrier before releasing threadinfo and the global communicator to avoid a race
Replaced OMP barrier with bli_thread_barrier and added similar fixes
Global communicator is now freed outside the parallel region
Threading: global communicator is freed after the parallel region completes
Initialized mem_t structures safely and added handling for a NULL communicator in threading
Fixed DTL dynamic thread logging in BLAS operations
Added dynamic threads and actual threads in the DTL log of SAXPY
Enabled disable-sba-pools feature in AOCL-BLAS
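
Several of the threading items above share one idea: a resource shared by all threads must not be released until every thread has finished using it. The sketch below illustrates that ordering in generic OpenMP-style C. It is an assumption-laden illustration, not the AOCL-BLAS implementation; `shared_comm_t` and `sum_with_shared_comm` are hypothetical names, and the real communicator and barrier (`bli_thread_barrier`) are not modelled here.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a structure shared by all threads.
   This is not the BLIS communicator type. */
typedef struct {
    int     n;
    double *data;
} shared_comm_t;

double sum_with_shared_comm(int n) {
    /* Allocate the shared structure before any thread starts. */
    shared_comm_t *comm = malloc(sizeof *comm);
    if (comm == NULL) return 0.0;
    comm->n = n;
    comm->data = malloc((size_t)n * sizeof *comm->data);
    if (comm->data == NULL) { free(comm); return 0.0; }
    for (int i = 0; i < n; ++i) comm->data[i] = 1.0;

    double total = 0.0;
    /* Without OpenMP the pragma is skipped and the loop runs serially. */
#ifdef _OPENMP
#pragma omp parallel for reduction(+:total)
#endif
    for (int i = 0; i < comm->n; ++i)
        total += comm->data[i];

    /* The parallel region ends with an implicit barrier, so no thread
       can still be reading comm when it is freed here. */
    free(comm->data);
    free(comm);
    return total;
}
```

Freeing inside the parallel region would need an explicit barrier first; moving the free after the region makes the ordering hold by construction.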

Build System & Infrastructure

Updates to the build systems (CMake and Make) for LPGEMM compilation
CMake: Added targets and aliases so that BLIS works with FetchContent
Security flags are now enabled by default
Added getpid support for DTL on Windows
Added compiler information to make showconfig and bench_getlibraryInfo
Made all bench applications consistent
Standardized Zen kernel names

Test Suite (GTestSuite)

Added Banded API tests: gbmv, hbmv, sbmv, tbmv, tbsv
Added Packed API tests: hpmv, spmv, tpmv, tpsv, hpr, hpr2, spr, spr2
Added conjugate dot and ger IIT_ERS tests
Added data pool support
Moved data generator definitions to a cpp file
Computediff improvements
Fix in swap
Broke up tests for better organization
Multiple miscellaneous test fixes
Code tidying
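
For readers unfamiliar with the banded routines now covered by the tests, the sketch below shows the standard BLAS column-major band storage layout and a plain reference for the no-transpose, unit-stride case of gbmv. It is written for illustration under those restrictions and is not the GTestSuite reference code; `band_get` and `ref_dgbmv_n` are hypothetical names.

```c
#include <stddef.h>

/* Standard BLAS band storage (0-based): A(i,j) lives at
   ab[(ku + i - j) + j*ldab], with ldab >= kl + ku + 1.
   Elements outside the band are implicitly zero. */
static double band_get(const double *ab, int ldab, int kl, int ku,
                       int i, int j) {
    if (i - j > kl || j - i > ku) return 0.0;
    return ab[(size_t)j * ldab + (size_t)(ku + i - j)];
}

/* Reference y := alpha*A*x + beta*y for an m x n band matrix with
   kl sub-diagonals and ku super-diagonals (no transpose, unit strides). */
void ref_dgbmv_n(int m, int n, int kl, int ku, double alpha,
                 const double *ab, int ldab, const double *x,
                 double beta, double *y) {
    for (int i = 0; i < m; ++i) {
        int jlo = (i - kl > 0) ? i - kl : 0;          /* first column in band */
        int jhi = (i + ku < n - 1) ? i + ku : n - 1;  /* last column in band  */
        double acc = 0.0;
        for (int j = jlo; j <= jhi; ++j)
            acc += band_get(ab, ldab, kl, ku, i, j) * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}
```

A 3x3 tridiagonal matrix with 2 on the diagonal and 1 on both off-diagonals is stored as `{0,2,1, 1,2,1, 1,2,0}` with `kl = ku = 1` and `ldab = 3`; the two corner slots are padding that is never read.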

03 Jan 06:46

kvaragan

5.2

9734fc1

AOCL 5.2 GA Release

AOCL-BLAS 5.2 Release Notes

Overview

This release includes significant performance improvements, new features, and critical bug fixes for the AOCL-BLAS linear algebra library, with optimizations specifically targeting AMD Zen4 and Zen5 architectures.

Performance Improvements

GEMM Improvements

Tuned ZGEMM thresholds for Zen4 and Zen5 architectures
Optimized AVX512 ZGEMM kernel and edge-case handling
Improved ZGEMM packing kernel for M-dimension edge cases
Developed optimal thread-selection logic for ZGEMM on Zen5

GEMV Enhancements

Added DGEMV no-transpose multithreaded implementations
Exported AVX512 DGEMV kernels
DGEMV bug fixes and code cleanup
Added ability to handle non-unit incx in GEMV transpose kernel
Improved numerical precision in ZGEMV API
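
To make the non-unit incx item concrete, here is a plain reference for the transpose case of GEMV in which x is read with a stride. It is a sketch for illustration only, not the optimized kernel: it assumes column-major storage and positive increments, and the name `ref_dgemv_t` is hypothetical.

```c
#include <stddef.h>

/* Reference y := alpha*A^T*x + beta*y for a column-major m x n matrix A.
   x has m logical elements spaced incx apart; y has n elements spaced
   incy apart. Only positive increments are handled in this sketch. */
void ref_dgemv_t(int m, int n, double alpha, const double *a, int lda,
                 const double *x, int incx, double beta,
                 double *y, int incy) {
    for (int j = 0; j < n; ++j) {
        double acc = 0.0;
        /* Column j of A is contiguous; x is gathered with stride incx. */
        for (int i = 0; i < m; ++i)
            acc += a[(size_t)j * lda + i] * x[(size_t)i * incx];
        y[(size_t)j * incy] = alpha * acc + beta * y[(size_t)j * incy];
    }
}
```

With `incx = 2`, every other element of the x buffer is skipped, which is the access pattern a kernel limited to unit stride cannot handle without first copying x into a contiguous buffer.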

DCOPY Optimization

Tuned DCOPY aocl_dynamic logic for Zen4/Zen5 architectures

New Features

Additional build options to disable optimized code paths for smaller matrices in GEMM and TRSM
- Useful for testing and benchmarking
- Reduces numerical rounding differences when repeating calculations with different core counts
Complete set of GEMMTR APIs implemented
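
The rounding note on the small-matrix build options above comes down to floating-point addition not being associative: splitting the same work into a different number of pieces changes the order of additions and can change the result. The sketch below shows this with a plain chunked sum. It is a generic illustration under IEEE 754 double arithmetic, not AOCL-BLAS code, and `chunked_sum` is a hypothetical name.

```c
/* Sum x[0..n-1] by splitting it into nchunks contiguous pieces, summing
   each piece separately, then adding the partial sums in order. This
   mimics how work might be divided among a varying number of threads. */
double chunked_sum(const double *x, int n, int nchunks) {
    double total = 0.0;
    for (int c = 0; c < nchunks; ++c) {
        int lo = (int)((long long)n * c / nchunks);
        int hi = (int)((long long)n * (c + 1) / nchunks);
        double partial = 0.0;
        for (int i = lo; i < hi; ++i)
            partial += x[i];
        total += partial;
    }
    return total;
}
```

For the input `{1e16, 1.0, -1e16, 1.0}`, one chunk gives 1.0 while two chunks give 0.0, because each 1.0 is absorbed when it is added to a value of magnitude 1e16. Fixing the code path regardless of thread count removes one source of such differences.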

Bug Fixes

Critical Fixes

Fixed probable integer overflow in TPSV
Fixed ZTRSM accuracy for conjugate transpose
Fixed DTRSM small threshold for extremely skinny sizes on Zen5
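
The TPSV overflow fix above is not described in detail in these notes. One plausible way such an overflow can arise in packed-storage routines is index arithmetic: the offset of a column in packed triangular storage grows quadratically with the column number and exceeds the 32-bit range well before the dimension does. The sketch below computes the standard upper-packed, column-major index in 64-bit arithmetic; it is an illustration under that assumption, and `packed_upper_index` is a hypothetical name.

```c
#include <stdint.h>

/* Offset of A(i,j), i <= j, in upper-packed column-major storage
   (0-based): i + j*(j+1)/2. Widening to 64 bits before multiplying
   avoids overflow: in 32-bit int arithmetic, j*(j+1) already exceeds
   INT32_MAX for j >= 46341. */
int64_t packed_upper_index(int32_t i, int32_t j) {
    return (int64_t)i + (int64_t)j * ((int64_t)j + 1) / 2;
}
```

The cast has to happen before the multiplication; casting the finished 32-bit product would widen a value that has already overflowed.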

Acknowledgments

This release is the result of contributions from the AOCL team at AMD and the broader BLIS community.

Release Date: January 2026
Version: 5.2 GA

28 May 06:57

kvaragan

5.1

16f852a

AOCL-BLAS 5.1 GA

Performance Optimizations

DGEMM, DTRSM, DGEMV, ZGEMM, DTRSV, DCOPYV on Zen4/5
DSCALV, DDOTV on Zen3
Benchmark support for ASUMV
Minor bug fixes

aocl_gemm Add-on Module Updates

AOCL_ENABLE_INSTRUCTIONS support
batch_gemm support for all data types
New Output Datatype for Integer APIs
BF16 Support on AVX2 Platforms
WOQ with/without Group Quantization
Threading Framework Optimizations
Reference Kernels for all Reorder APIs
Performance Optimizations for all APIs
Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs

11 Oct 03:28

sireeshasanga

5.0

34d4bba

AOCL-BLAS 5.0

AOCL-BLAS 5.0 Release Highlights

Added zen5 support
Turin-specific tuning for the APIs: D/ZGEMM, DTRSM and DNRM2
AVX512 improvements for the APIs: ZGEMV, D/ZAXPYF, D/ZDOTXF, ZDOTV, C/ZSCALV, DNRM2, S/D/ZCOPY, S/D/C/ZAXPBYV, DTRSV, DGEMMT, D/ZTRSM, and D/ZGEMM
Improvements to the AOCL_ENABLE_INSTRUCTIONS functionality
Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on

28 Feb 06:32

sireeshasanga

4.2

7c564c7

AOCL-BLAS 4.2

AOCL-BLAS 4.2 Release Highlights

Added uint8 output and zero-point support in int8 APIs in aocl_gemm addon (Low Precision GEMM / LPGEMM)
Improved performance for all downscaled versions of all APIs in aocl_gemm addon (Low Precision GEMM / LPGEMM)
Improved multithreaded performance across APIs in aocl_gemm addon (Low Precision GEMM / LPGEMM)
Introduced AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative to BLIS_ARCH_TYPE, but with slightly different semantics.
Improved functionality of XERBLA error handling routine in AOCL-BLAS.
Performance optimizations for the following APIs:
- DGEMM for tiny sizes
- S/ZGEMM, D/ZTRSM, ZAXPBYV, Z/ZDSCALV, S/D/ZGEMV, and D/DZNRM2
The following BLAS extension APIs have been added for AMD “Zen” code paths only:
- sgemm_pack_get_size(), sgemm_pack(), and sgemm_compute()
- dgemm_pack_get_size(), dgemm_pack(), and dgemm_compute()

07 Aug 15:39

sireeshasanga

4.1

a5a3c8b

AOCL-BLAS 4.1

AOCL-BLAS 4.1 Release Highlights

Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on
Dynamic dispatch and amdzen configuration support added to aocl_gemm add-on
Dynamic dispatch feature enhancements.
AVX512-based optimizations for AMD “Zen4” platform:
- SGEMM, DGEMM, and ZGEMM
- DTRSM, D/ZAXPY, ZGEMV, DDOTV, and D/ZSCALV
Improved support for OpenMP nested parallelism.

13 Nov 07:06

pradeeptrgit

4.0

e3fc540

AOCL-BLIS 4.0

Highlights of AOCL-BLIS 4.0

The following LPGEMM (Low Precision GEMM) variants are added along with post-ops support:
- aocl_gemm_u8s8s32os32 and aocl_gemm_u8s8s32os8 routines are added and optimized using AVX-512-VNNI
- aocl_gemm_u8s8s16os16 and aocl_gemm_u8s8s16os8 routines are added and optimized using AVX2
- aocl_gemm_bf16bf16f32of32 and aocl_gemm_bf16bf16f32obf16 routines are added and optimized using AVX-512
SGEMM with packed/reorder buffer support (aocl_gemm_f32f32f32f32)
AMD “Zen4” support for BLIS
Dynamic dispatch supports AMD “Zen4” configuration
Optimizations and performance improvements for DGEMM, SGEMM, ZGEMM, DGEMMT, and DTRSM
Framework design changes

09 Jul 03:01

dzambare

3.2

77c8f06

AOCL-BLIS 3.2

New features:

Extended BLAS function - DZGEMM
Progress feature for xGEMM and xTRSM APIs: the time taken to complete these operations grows steeply with problem size, so this feature gives users periodic updates on the progress of an operation.
Runtime Threading control using OpenMP APIs
Dynamic Dispatch covers APUs
Improved detection of standard x86-64 feature support
Minor bug fixes

Performance improvements in the following single-threaded and multi-threaded functions:

DGEMM, SGEMM, ZGEMM, and CGEMM
DTRSM, DGEMMT, ZTRSM, CTRSM, and DTRMM
SGEMV, DHER2, ZTRSV, and DSYMV
?AXPBYV, SSCALV, DSCALV, ?DOTXV, and ZAXPY2V

13 Dec 07:03

dzambare

3.1

3aa0044

AMD Optimized BLIS Version 3.1

Highlights of improvements on AMD EPYC™ processor family CPUs

Supports Dynamic Dispatch and AOCL Dynamic feature
Improvements in DGEMM, ZGEMM, DTRSM, DSYRK, xGEMV, and DOTV

06 Jul 15:43

pradeeptrgit

3.0.1

d3a65bd

AMD Optimized BLIS Version 3.0.1

Highlights of improvements on AMD EPYC™ processor family CPUs

Improved performance of DGEMM for skinny matrix shapes.
Improvements in SGEMM and ZGEMM
Improved performance of Level-1 and Level-2 BLAS routines, including GEMV, DOT, and AXPY
Improvements in DTRSM for small matrix sizes

Releases: amd/blis