Overflow when reading unsigned 32-bit datasets with HDF5Array #66

@LTLA

Closely related to the most recent comments at #56, but this time for dense arrays:

library(rhdf5)
tmp <- tempfile(fileext=".h5")
h5createFile(tmp)
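# Values up to 3e9 exceed the signed 32-bit maximum (2^31 - 1) but still fit in uint32.
mat <- matrix(runif(1000, 0, 3e9), 40, 25)

# Create an unsigned 32-bit dataset and write the matrix into it.
fhandle <- H5Fopen(tmp, "H5F_ACC_RDWR")
shandle <- H5Screate_simple(dim(mat), dim(mat))
dhandle <- H5Dcreate(fhandle, "data", "H5T_NATIVE_UINT32", shandle)
H5Dwrite(dhandle, mat)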
H5Dclose(dhandle)
H5Sclose(shandle)
H5Fclose(fhandle)

library(HDF5Array)
HDF5Array(tmp, "data", type="double")
## <40 x 25> HDF5Matrix object of type "double":
##             [,1]       [,2]       [,3] ...      [,24]      [,25]
##  [1,] 2147483647  110424163 2147483647   .  329319355  177911917
##  [2,] 2147483647 2147483647 2092481469   . 2094453026  383178065
##  [3,]  532230655  648840575    8089084   . 1094738881 1819504375
##  [4,] 2147483647 1908067407 1866505078   . 1627282446 2147483647
##  [5,] 2045841619 2080169773 1830548583   . 2147483647  985403627
##   ...          .          .          .   .          .          .
## [36,] 1124441459  759962091 1555012640   . 2147483647  444232447
## [37,] 1305096420  433641150  192658039   . 2112986240 2147483647
## [38,] 1880927165   14314579 1561237376   . 1179038970 1066884682
## [39,] 2047077935 2147483647 1839530913   . 2147483647 2147483647
## [40,]  261359391  882414743 2147483647   .  731742666 1230179743

DelayedArray(mat) # for comparison
##             [,1]       [,2]       [,3] ...      [,24]      [,25]
##  [1,] 2490095666  110424164 2887895900   .  329319356  177911917
##  [2,] 2359096200 2913293150 2092481469   . 2094453026  383178065
##  [3,]  532230655  648840576    8089084   . 1094738881 1819504375
##  [4,] 2622424704 1908067408 1866505078   . 1627282447 2486156645
##  [5,] 2045841619 2080169773 1830548584   . 2849109240  985403628
##   ...          .          .          .   .          .          .
## [36,] 1124441459  759962092 1555012641   . 2733656557  444232447
## [37,] 1305096421  433641150  192658040   . 2112986240 2777658576
## [38,] 1880927165   14314579 1561237376   . 1179038970 1066884683
## [39,] 2047077936 2295735834 1839530914   . 2489400128 2580024283
## [40,]  261359391  882414744 2703304506   .  731742666 1230179743

The values in the HDF5Array are capped at 2147483647, the maximum for 32-bit signed integers, even though I requested type="double" to avoid this. I presume that the data is being read in as integer and then cast to the specified type=, rather than being read directly as double with bit64conversion="double" (assuming that rhdf5 is used under the hood).
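
To make the symptom concrete, here is a minimal base R sketch of the capping, using a few values copied from the DelayedArray(mat) output above. It does not touch HDF5 at all; the names int_max, truth and capped are made up for illustration, and pmin() is only a stand-in for whatever saturation happens inside the reader, which I have not confirmed.

```r
# Illustration only: base R, no HDF5 involved.
int_max <- .Machine$integer.max   # largest 32-bit signed integer, 2147483647

# A few values taken from the DelayedArray(mat) output (first row).
truth <- c(2490095666, 110424164, 2887895900, 329319356)

# Saturate at the signed 32-bit maximum, mimicking the pattern in the HDF5Array output.
capped <- pmin(truth, int_max)
print(format(capped, scientific = FALSE))

# Every uint32 value is exactly representable as a double (2^32 - 1 < 2^53),
# so a direct read into double would not need any cap.
uint32_max <- 2^32 - 1
print(uint32_max < 2^53)
```

Values at or below the signed maximum pass through unchanged, while anything above it collapses to 2147483647, which matches the pattern in the first output.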

This is causing issues with ArtifactDB/alabaster.matrix#15 where I occasionally process uint32 HDF5 datasets.

Session information
R version 4.5.0 Patched (2025-04-24 r88177)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-5-branch/lib/libRblas.so 
LAPACK: /home/luna/Software/R/R-4-5-branch/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] HDF5Array_1.37.0      h5mread_1.1.1         DelayedArray_0.35.2  
 [4] SparseArray_1.9.0     S4Arrays_1.9.1        IRanges_2.43.0       
 [7] abind_1.4-8           S4Vectors_0.47.0      MatrixGenerics_1.21.0
[10] matrixStats_1.5.0     BiocGenerics_0.55.0   generics_0.1.4       
[13] Matrix_1.7-3          rhdf5_2.53.1         

loaded via a namespace (and not attached):
[1] lattice_0.22-7      XVector_0.49.0      rhdf5filters_1.21.0
[4] Rhdf5lib_1.31.0     grid_4.5.0          compiler_4.5.0     
[7] tools_4.5.0         crayon_1.5.3       
