Releases: amd/blis

AOCL 5.2.2 Release

20 Mar 06:31

AOCL-BLAS 5.2.2 Release Notes

Overview

AOCL-BLAS 5.2.2 is an incremental release building on the 5.2 GA release, delivering performance optimizations, bug fixes, improved threading stability, and expanded test coverage.

Performance Optimizations

  • Optimized SGEMM rd kernels on Zen3
  • Improved SGEMM rd kernel on Zen4/Zen5
  • SGEMM tiny path tuning for Zen4 and Zen5
  • Added tiny path for SGEMM
  • Added fast path for single-threaded AVX512 DGEMV kernel
  • Replaced intrinsics with inline assembly for bli_saxpyv_zen4_int and bli_saxpyf_zen_int_5
  • Improved fringe case handling for AXPYV kernel
  • Disabled small_gemm for Zen4/Zen5 and added single-thread check for tiny path
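
The fringe-case handling mentioned for the AXPYV kernel can be pictured with a minimal scalar sketch. It is not the AOCL kernel: the function name saxpyv_sketch and the block size of 8 (standing in for one 256-bit vector of floats) are assumptions made for illustration. The main loop handles full blocks, and a separate fringe loop handles the leftover elements.

```c
#include <assert.h>
#include <stddef.h>

/* y := y + alpha * x for contiguous (unit-stride) vectors.
   Hypothetical sketch: the inner block of 8 stands in for one SIMD
   register; a real kernel would use vector instructions here. */
void saxpyv_sketch(size_t n, float alpha, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    const size_t block = 8;
    size_t i = 0;

    /* Main loop: full blocks only. */
    for (; i + block <= n; i += block) {
        for (size_t j = 0; j < block; ++j)
            y[i + j] += alpha * x[i + j];
    }

    /* Fringe loop: the remaining n % 8 elements (0 to 7 of them). */
    for (; i < n; ++i)
        y[i] += alpha * x[i];
}
```

With n = 11, the main loop covers elements 0 to 7 and the fringe loop covers elements 8 to 10.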

Bug Fixes

  • Fixed memory leak in DGEMV kernel
  • Fixed extreme values handling in GEMV
  • Fixed a division in GEMV that was performed in integer arithmetic but was meant to be a double-precision operation
  • Fixed integer overflow issue in TPSV
  • Fixed out-of-bounds access in F32 matrix add/mul ops
  • Fixed BF16 to F32 conversion in AVX2 F32 code path
  • Fixed bug in BF16 AVX2 conversion path
  • Fixed F32 to BF16 conversion and AVX512 ISA support checks
  • Fixed cblas_ctrmm invalid diag handling
  • Fixed Coverity issue in ZTRSM
  • Fixed Coverity static analysis issue in DTRSM
  • Fixed high priority Coverity issues in LPGEMM
  • Resolved operator precedence warning in Zen5 DCOMPLEX threshold logic
  • Modified AXPY kernel to ensure consistency of numerical results
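
The GEMV integer-division fix listed above belongs to a common class of C bugs. The sketch below is not the GEMV code; the function names are hypothetical and only show how dividing two integers truncates before the result is ever converted to double.

```c
#include <assert.h>

/* Buggy pattern: both operands are integers, so the division truncates
   and only the truncated quotient is converted to double. */
double ratio_truncated(long m, long n)
{
    assert(n != 0);
    return (double)(m / n);
}

/* Intended pattern: convert first, then divide in double precision. */
double ratio_exact(long m, long n)
{
    assert(n != 0);
    return (double)m / (double)n;
}
```

For m = 7 and n = 2 the first function returns 3.0 and the second returns 3.5.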

Threading & Stability

  • Fixed data race in native code-path
  • Added OpenMP barrier before releasing threadinfo and global communicator to avoid a race
  • Replaced OMP barrier with bli_thread_barrier and added similar fixes
  • Global communicator is now freed outside the parallel region, after the region completes
  • mem_t structures are now initialized safely, and a NULL communicator is handled in threading
  • Fixed DTL dynamic thread logging in BLAS operations
  • Added dynamic and actual thread counts to the DTL log of SAXPY
  • Enabled disable-sba-pools feature in AOCL-BLAS
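
The communicator fixes above follow a general rule: a shared object must not be released while any thread may still use it. The sketch below illustrates that rule with plain OpenMP pragmas. The type shared_comm_t and the function sum_with_shared_comm are hypothetical stand-ins, not BLIS internals. The shared object is allocated before the parallel region and freed only after the region has ended, which OpenMP guarantees happens after an implicit barrier.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared object standing in for a global communicator. */
typedef struct { int *slots; int n; } shared_comm_t;

/* Each thread writes its share of comm->slots inside the parallel region.
   The free happens strictly after the region, so no thread can still be
   touching the object. Compiles with or without OpenMP enabled. */
long sum_with_shared_comm(int n)
{
    assert(n > 0);
    shared_comm_t *comm = malloc(sizeof *comm);
    assert(comm != NULL);
    comm->n = n;
    comm->slots = malloc((size_t)n * sizeof *comm->slots);
    assert(comm->slots != NULL);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        comm->slots[i] = i + 1;
    /* Implicit barrier: all threads have finished at this point. */

    long total = 0;
    for (int i = 0; i < n; ++i)
        total += comm->slots[i];

    /* Safe: released outside the parallel region. */
    free(comm->slots);
    free(comm);
    return total;
}
```

Freeing the object inside the region, for example from whichever thread finishes first, would reintroduce the kind of race these fixes address.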

Build System & Infrastructure

  • Updates to the build systems (CMake and Make) for LPGEMM compilation
  • CMake: Added targets and aliases so that BLIS works with FetchContent
  • Enabled security flags by default
  • Added getpid support on Windows for DTL
  • Added compiler information to make showconfig and bench_getlibraryInfo
  • Made all bench applications consistent
  • Standardized Zen kernel names

Test Suite (GTestSuite)

  • Added Banded API tests: gbmv, hbmv, sbmv, tbmv, tbsv
  • Added Packed API tests: hpmv, spmv, tpmv, tpsv, hpr, hpr2, spr, spr2
  • Added conjugate dot and ger IIT_ERS tests
  • Added data pool support
  • Moved data generator definitions to a cpp file
  • Computediff improvements
  • Fix in swap
  • Break up tests for better organization
  • Multiple miscellaneous test fixes
  • Code tidying
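
For readers unfamiliar with the banded APIs covered by the new tests, the sketch below shows the standard BLAS column-major band storage that gbmv operates on. With 0-based indices, element A(i,j) lives at row ku + i - j of column j in the band array. The function names band_get and gbmv_ref are hypothetical, and the routine computes only y := A*x (no alpha, beta, or transpose), so it is a reference model rather than the tested API.

```c
#include <assert.h>

/* Returns A(i,j) of a band matrix with kl sub-diagonals and ku
   super-diagonals, stored in BLAS column-major band format.
   Entries outside the band are implicitly zero. */
double band_get(const double *ab, int lda, int kl, int ku, int i, int j)
{
    assert(lda >= kl + ku + 1);
    if (i - j > kl || j - i > ku)
        return 0.0;
    return ab[(ku + i - j) + j * lda];
}

/* y := A * x for an m-by-n band matrix, touching only stored entries. */
void gbmv_ref(int m, int n, int kl, int ku,
              const double *ab, int lda, const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        int jlo = (i - kl > 0) ? i - kl : 0;
        int jhi = (i + ku < n - 1) ? i + ku : n - 1;
        double acc = 0.0;
        for (int j = jlo; j <= jhi; ++j)
            acc += band_get(ab, lda, kl, ku, i, j) * x[j];
        y[i] = acc;
    }
}
```

For the tridiagonal matrix with rows (1 2 0), (3 4 5), (0 6 7), the band array is {0,1,3, 2,4,6, 5,7,0} with kl = ku = 1 and lda = 3; the two zeros are padding slots that are never read.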

AOCL 5.2 GA Release

03 Jan 06:46

AOCL-BLAS 5.2 Release Notes

Overview

This release includes significant performance improvements, new features, and critical bug fixes for the AOCL-BLAS linear algebra library, with optimizations specifically targeting AMD Zen4 and Zen5 architectures.


Performance Improvements

GEMM Improvements

  • Tuned ZGEMM thresholds for Zen4 and Zen5 architectures
  • Optimized AVX512 ZGEMM kernel and edge-case handling
  • Improved ZGEMM packing kernel for M-dimension edge cases
  • Developed optimal thread-selection logic for ZGEMM on Zen5

GEMV Enhancements

  • Added DGEMV no-transpose multithreaded implementations
  • Exported AVX512 DGEMV kernels
  • DGEMV bug fixes and code cleanup
  • Added ability to handle non-unit incx in GEMV transpose kernel
  • Improved numerical precision in ZGEMV API
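
To make the non-unit incx item concrete, here is a reference sketch of a transposed GEMV that reads x with an arbitrary stride. It follows the usual BLAS convention that a negative stride walks the vector backwards from the far end of the buffer. The name dgemv_t_ref is hypothetical, alpha and beta are omitted, and the loop structure says nothing about how the optimized kernel is written.

```c
#include <assert.h>
#include <stddef.h>

/* y := A^T * x, with A an m-by-n column-major matrix (leading dimension
   lda), x a length-m vector with stride incx, and y a contiguous
   length-n vector. */
void dgemv_t_ref(int m, int n, const double *a, int lda,
                 const double *x, int incx, double *y)
{
    assert(incx != 0 && lda >= m);

    /* BLAS convention: for incx < 0 the first logical element sits at
       offset (1 - m) * incx, i.e. at the end of the buffer. */
    const double *x0 = (incx > 0) ? x : x + (ptrdiff_t)(1 - m) * incx;

    for (int j = 0; j < n; ++j) {
        double acc = 0.0;
        for (int i = 0; i < m; ++i)
            acc += a[i + (ptrdiff_t)j * lda] * x0[(ptrdiff_t)i * incx];
        y[j] = acc;
    }
}
```

With the buffer {10, -1, 20} and incx = 2, the logical vector is (10, 20); with incx = -2 it is (20, 10).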

DCOPY Optimization

  • Tuned DCOPY aocl_dynamic logic for Zen4/Zen5 architectures

New Features

  • Additional build options to disable optimized code paths for smaller matrices in GEMM and TRSM
    • Useful for testing and benchmarking
    • Reduces numerical rounding differences when repeating calculations with different core counts
  • Complete set of GEMMTR APIs implemented
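
As background on GEMMTR: the operation is a GEMM that updates only one triangle of a square result, C := alpha*A*B + beta*C restricted to the upper or lower part of C. The sketch below is a plain reference model under simplifying assumptions (column-major storage, no transposes, real double precision). The name gemmtr_ref and its argument order are illustrative and are not the library's signature.

```c
#include <assert.h>

/* Updates only the 'L'ower or 'U'pper triangle (diagonal included) of the
   n-by-n matrix C with alpha*A*B + beta*C, where A is n-by-k and B is
   k-by-n. Elements in the other triangle are left untouched.
   Simplification: C is always read, even when beta is zero. */
void gemmtr_ref(char uplo, int n, int k, double alpha,
                const double *a, int lda, const double *b, int ldb,
                double beta, double *c, int ldc)
{
    assert(uplo == 'L' || uplo == 'U');
    for (int j = 0; j < n; ++j) {
        int ilo = (uplo == 'L') ? j : 0;
        int ihi = (uplo == 'L') ? n - 1 : j;
        for (int i = ilo; i <= ihi; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

Skipping the unused triangle roughly halves the arithmetic when the caller only needs a symmetric or triangular result.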


Bug Fixes

Critical Fixes

  • Fixed potential integer overflow in TPSV
  • Fixed ZTRSM accuracy for conjugate transpose
  • Fixed DTRSM small threshold for extremely skinny sizes on Zen5
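
The notes do not say where the TPSV overflow occurred. One place where packed routines typically risk it is the index arithmetic: a packed triangle of order n holds n(n+1)/2 elements, and that product exceeds the 32-bit range once n passes roughly 65,000. The sketch below, with hypothetical helper names, shows the usual remedy of widening to 64 bits before multiplying.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stored elements in a packed triangular matrix of order n. */
int64_t packed_size(int32_t n)
{
    assert(n >= 0);
    int64_t n64 = n;                 /* widen before multiplying */
    return n64 * (n64 + 1) / 2;
}

/* 0-based offset of A(i,j), i <= j, in upper-packed column-major
   storage: column j starts after j*(j+1)/2 earlier elements. */
int64_t packed_upper_index(int32_t i, int32_t j)
{
    assert(0 <= i && i <= j);
    int64_t j64 = j;
    return (int64_t)i + j64 * (j64 + 1) / 2;
}
```

For n = 70000 the packed size is 2,450,035,000, which does not fit in a signed 32-bit integer.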

Acknowledgments

This release is the result of contributions from the AOCL team at AMD and the broader BLIS community.


Release Date: January 2026
Version: 5.2 GA

AOCL-BLAS 5.1 GA

28 May 06:57

Performance Optimizations

  • DGEMM, DTRSM, DGEMV, ZGEMM, DTRSV, DCOPYV on Zen4/5
  • DSCALV, DDOTV on Zen3
  • Benchmark support for ASUMV
  • Minor bug fixes

aocl_gemm Add-on Module Updates

  • AOCL_ENABLE_INSTRUCTIONS support
  • batch_gemm support for all data types
  • New Output Datatype for Integer APIs
  • BF16 Support on AVX2 Platforms
  • WOQ with/without Group Quantization
  • Threading Framework Optimizations
  • Reference Kernels for all Reorder APIs
  • Performance Optimizations for all APIs
  • Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs
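
As background for the BF16 items: bfloat16 is the upper 16 bits of an IEEE-754 single-precision value, so widening to F32 is a shift and narrowing needs a rounding step. The scalar sketch below uses round-to-nearest-even for narrowing. The helper names are hypothetical, and the rounding mode and NaN handling are assumptions, not a statement of what the aocl_gemm kernels do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;   /* raw bfloat16 bit pattern */

/* BF16 -> F32: place the 16 bits in the upper half of a 32-bit word. */
float bf16_to_f32(bf16_t h)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* F32 -> BF16 with round-to-nearest-even on the discarded 16 bits. */
bf16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* NaN: truncate and force a mantissa bit so it stays a NaN. */
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (bf16_t)((bits >> 16) | 0x0040u);

    /* Add 0x7fff, plus 1 if the kept LSB is odd, then truncate. */
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return (bf16_t)((bits + rounding) >> 16);
}
```

Every BF16 value converts to F32 exactly, while the reverse direction loses the low 16 mantissa bits.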

AOCL-BLAS 5.0

11 Oct 03:28

AOCL-BLAS 5.0 Release Highlights

  • Added Zen5 support
  • Turin-specific tuning for the APIs: D/ZGEMM, DTRSM, and DNRM2
  • AVX512 improvements for the APIs: ZGEMV, D/ZAXPYF, D/ZDOTXF, ZDOTV, C/ZSCALV, DNRM2, S/D/ZCOPY, S/D/C/ZAXPBYV, DTRSV, DGEMMT, D/ZTRSM, and D/ZGEMM
  • Improvements to the AOCL_ENABLE_INSTRUCTIONS functionality
  • Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on

AOCL-BLAS 4.2

28 Feb 06:32

AOCL-BLAS 4.2 Release Highlights

  • Added uint8 output and zero-point support in int8 APIs in aocl_gemm addon (Low Precision GEMM / LPGEMM)
  • Improved performance for all downscaled versions of all APIs in aocl_gemm addon (Low Precision GEMM / LPGEMM)
  • Improved multithreaded performance across APIs in aocl_gemm addon (Low Precision GEMM / LPGEMM)
  • Introduced AOCL_ENABLE_INSTRUCTIONS environment variable as an alternative to BLIS_ARCH_TYPE, but with slightly different semantics.
  • Improved functionality of XERBLA error handling routine in AOCL-BLAS.
  • Performance optimizations for the following APIs:
    - DGEMM for tiny sizes
    - S/ZGEMM, D/ZTRSM, ZAXPBYV, Z/ZDSCALV, S/D/ZGEMV, and D/DZNRM2
  • The following BLAS extension APIs have been added for AMD “Zen” code paths only:
    - sgemm_pack_get_size(), sgemm_pack(), and sgemm_compute()
    - dgemm_pack_get_size(), dgemm_pack(), and dgemm_compute()
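
The uint8 output and zero-point support listed above refers to the usual quantized-GEMM downscale step, where a 32-bit accumulator is scaled, shifted by a zero point, and saturated to 8 bits. The sketch below shows that arithmetic for a single element. The function name, the argument order, and the round-half-away-from-zero rule are assumptions for illustration, since the notes do not give the exact formula used by aocl_gemm.

```c
#include <assert.h>
#include <stdint.h>

/* Converts one int32 accumulator to uint8:
   q = round(acc * scale) + zero_point, saturated to [0, 255]. */
uint8_t downscale_s32_to_u8(int32_t acc, float scale, int32_t zero_point)
{
    assert(scale > 0.0f);
    float v = (float)acc * scale;

    /* Round half away from zero (assumed rounding rule). */
    long q = (long)(v >= 0.0f ? v + 0.5f : v - 0.5f) + zero_point;

    /* Saturate to the uint8 range. */
    if (q < 0)
        q = 0;
    if (q > 255)
        q = 255;
    return (uint8_t)q;
}
```

The zero point lets an unsigned 8-bit output represent values centred on a nonzero offset, for example 128 for data that is roughly symmetric around zero.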

AOCL-BLAS 4.1

07 Aug 15:39

AOCL-BLAS 4.1 Release Highlights

  • Additional APIs and Post-Ops support in addition to the improved performance for the existing APIs in aocl_gemm add-on
  • Dynamic dispatch and amdzen configuration support added to aocl_gemm add-on
  • Dynamic dispatch feature enhancements.
  • AVX-512-based optimizations for AMD “Zen4” platform:
    - SGEMM, DGEMM, and ZGEMM
    - DTRSM, D/ZAXPY, ZGEMV, DDOTV, and D/ZSCALV
  • Improved support for OpenMP nested parallelism.

AOCL-BLIS 4.0

13 Nov 07:06

Highlights of AOCL-BLIS 4.0

  • The following LPGEMM (Low Precision GEMM) variants are added along with post-ops support:
    • aocl_gemm_u8s8s32os32 and aocl_gemm_u8s8s32os8 routines are added and optimized using AVX-512-VNNI
    • aocl_gemm_u8s8s16os16 and aocl_gemm_u8s8s16os8 routines are added and optimized using AVX2
    • aocl_gemm_bf16bf16f32of32 and aocl_gemm_bf16bf16f32obf16 routines are added and optimized using AVX-512
  • SGEMM with packed/reorder buffer support (aocl_gemm_f32f32f32f32)
  • AMD “Zen4” support for BLIS
  • Dynamic dispatch supports AMD “Zen4” configuration
  • Optimizations and performance improvements for DGEMM, SGEMM, ZGEMM, DGEMMT, and DTRSM
  • Framework design changes

AOCL-BLIS 3.2

09 Jul 03:01

New features:

  • Extended BLAS function - DZGEMM
  • Progress feature for xGEMM and xTRSM APIs: because the time taken to complete these operations grows rapidly with problem size, this feature gives users periodic updates on the progress of the operation.
  • Runtime Threading control using OpenMP APIs
  • Dynamic Dispatch covers APUs
  • Improved detection of standard x86-64 feature support
  • Minor bug fixes

Performance improvements in the following single-threaded and multi-threaded functions:

  • DGEMM, SGEMM, ZGEMM, and CGEMM
  • DTRSM, DGEMMT, ZTRSM, CTRSM, and DTRMM
  • SGEMV, DHER2, ZTRSV, and DSYMV
  • ?AXPBYV, SSCALV, DSCALV, ?DOTXV, and ZAXPY2V

AMD Optimized BLIS Version 3.1

13 Dec 07:03

Highlights of improvements on AMD EPYC™ processor family CPUs

  • Supports the Dynamic Dispatch and AOCL Dynamic features
  • Improvements in DGEMM, ZGEMM, DTRSM, DSYRK, xGEMV, and DOTV

AMD Optimized BLIS Version 3.0.1

06 Jul 15:43

Highlights of improvements on AMD EPYC™ processor family CPUs

  • Improved performance of DGEMM for skinny matrix shapes
  • Improvements in SGEMM and ZGEMM
  • Improved performance of Level-1 and Level-2 BLAS routines, including GEMV, DOT, and AXPY
  • Improvements in DTRSM for small matrix sizes