Commit 3a787f6

Merge remote-tracking branch 'origin/main' into wma/levels
2 parents c9a40c5 + 0067ca8 commit 3a787f6

2 files changed

Lines changed: 43 additions & 38 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # Binary Sparse Format Specification
 This is part of a new effort to create a binary storage format for storing sparse matrices and other sparse data to disk.
 
-[Minutes from our meetings](minutes) are available.
+Minutes from our meetings are available [here](https://hackmd.io/0qzK4fJlQp-78t067yiYsA?view) (see also: [previous minutes](minutes)).
 
 ## Specification
 
@@ -13,6 +13,6 @@ The spec is written in [bikeshed](https://github.com/tabatkins/bikeshed) – a v
 To render the spec locally:
 
 * Install bikeshed (ideally in an isolated environment): `pipx install bikeshed`
-* Call `bikeshed spec spec/latest/index.md`
+* Call `bikeshed spec spec/latest/index.bs`
 
 Rendered versions will generated for pull requests.

spec/latest/index.bs

Lines changed: 41 additions & 36 deletions
@@ -40,16 +40,18 @@ with a value as a *stored value*. Stored values have associated with them a
 *scalar value*, which is the value stored in that location in the array, and one
 or more *indices*, which describe the location where the stored value is located
 in the array. Some or all of these indices may be stored explicitly, or they may
-be implicitly derived, depending on the storage format. When stored explicitly,
+be implicitly derived, depending on storage format. When stored explicitly,
 indices are 0-based positive integers.
 
 
 Binsparse JSON Descriptors {#descriptor}
 ========================================
 
-Binsparse descriptors are JSON blobs that describe the binary format of sparse
-data. The JSON blob includes several required keys that describe the structure of
-the binary storage. Optional attributes may be defined to hold additional metadata.
+Binsparse descriptors are key-value metadata that describe the binary format of sparse
+data. The key-value data is namespaced as "binsparse" to avoid any conflict with other
+metadata in the container. The required entries in the "binsparse" entry are listed
+below. Optional attributes may be defined to hold additional metadata and must be stored
+outside of the "binsparse" namespace.
 
 <div class=example>
 
@@ -59,17 +61,17 @@ attributes.
 
 ```json
 {
-  "format": "CSC",
-  "shape": [10, 12],
-  "data_types": {
-    "pointers_0": "uint64",
-    "indices_1": "uint64",
-    "values": "float32"
+  "binsparse": {
+    "format": "CSC",
+    "shape": [10, 12],
+    "data_types": {
+      "pointers_0": "uint64",
+      "indices_1": "uint64",
+      "values": "float32"
+    }
   },
-  "attrs": {
-    "original_source": "https://url/of/original/file.mtx",
-    "author": "John Doe"
-  }
+  "original_source": "https://url/of/original/file.mtx",
+  "author": "John Doe"
 }
 ```
 
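The namespacing introduced in the two hunks above can be illustrated with a minimal Python sketch that separates the `"binsparse"` entry from user attributes. The function name `split_descriptor` is hypothetical, and treating exactly `format`, `shape`, and `data_types` as the required keys is an assumption based on the keys this diff discusses.

```python
import json

# The descriptor from the example hunk above, using the new "binsparse" namespace.
DESCRIPTOR_TEXT = """
{
  "binsparse": {
    "format": "CSC",
    "shape": [10, 12],
    "data_types": {
      "pointers_0": "uint64",
      "indices_1": "uint64",
      "values": "float32"
    }
  },
  "original_source": "https://url/of/original/file.mtx",
  "author": "John Doe"
}
"""

# Assumption for this sketch: these three keys are the required entries.
REQUIRED_KEYS = ("format", "shape", "data_types")


def split_descriptor(text):
    """Return (binsparse_entry, user_attributes) parsed from a JSON string."""
    metadata = json.loads(text)
    if "binsparse" not in metadata:
        raise ValueError('missing "binsparse" namespace')
    entry = metadata["binsparse"]
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    # Everything outside the namespace is optional user metadata.
    attrs = {k: v for k, v in metadata.items() if k != "binsparse"}
    return entry, attrs


entry, attrs = split_descriptor(DESCRIPTOR_TEXT)
```

Here `entry` holds the format description, while `attrs` holds `original_source` and `author`, which now sit next to the namespace rather than under an `attrs` key.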
@@ -82,7 +84,7 @@ The `shape` key must be present and shall define the shape of the sparse tensor.
 It shall contain a JSON array of integers, with index `i` containing the size of
 the `i`'th dimension. For matrices, index `0` shall contain the number of rows,
 and index `1` shall contain the number of columns. For vectors, index `0` shall
-contain the number of indices of the vector if it were dense.
+contain the vector's dimension.
 
 Note: a matrix has shape [`number_of_rows`, `number_of_columns`] regardless of whether
 the format orientation is row-wise or column-wise.
@@ -99,7 +101,9 @@ in the binary storage container.
 ### Pre-defined Formats ### {#predefined_formats}
 
 The following is a list of all pre-defined formats and the arrays that shall
-be present in the binary container.
+be present in the binary container. `number_of_elements` refers to the number
+of stored values, `number_of_rows` refers to the number of rows, and `number_of_columns`
+refers to the number of columns.
 
 #### VEC #### {#vec_format}
 
@@ -110,7 +114,8 @@ Vector format
 : values
 :: Array of size `number_of_elements` containing stored values.
 
-Indices shall be sorted and must not be duplicated.
+The element of the vector located at index `indices_0[i]` has scalar value
+`values[i]`. Elements shall be sorted by index and must not be duplicated.
 
 #### CSR #### {#csr_format}
 
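The VEC rule added in the hunk above (element `indices_0[i]` has value `values[i]`, sorted and without duplicates) can be checked with a short sketch. The function name `check_vec`, the sample data, and filling unstored positions with `0` are assumptions made for illustration only.

```python
def check_vec(indices_0, values, length):
    """Validate the VEC ordering rule and return the vector as a dense list."""
    if len(indices_0) != len(values):
        raise ValueError("indices_0 and values must have the same size")
    # Sorted and not duplicated is the same as strictly increasing.
    for prev, cur in zip(indices_0, indices_0[1:]):
        if cur <= prev:
            raise ValueError("indices must be sorted and must not be duplicated")
    if indices_0 and (indices_0[0] < 0 or indices_0[-1] >= length):
        raise ValueError("index out of range for the vector's dimension")
    # Assumption: positions without a stored value are shown as 0.
    dense = [0] * length
    for index, value in zip(indices_0, values):
        dense[index] = value
    return dense


dense = check_vec([1, 4, 5], [2.5, 1.0, 3.0], 7)
```

Because the indices are already sorted, the strict-increase check needs only one pass over adjacent pairs.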
@@ -127,7 +132,7 @@ The column indices of the stored values located in row `i` are located in the ra
 `[pointers_0[i], pointers_0[i+1])` in the `indices_1` array. The scalar values for
 each of those stored values is stored in the corresponding index in the `values` array.
 
-Within a row, column indices shall be sorted and must not be duplicated.
+Within a row, elements shall be sorted by column index and must not be duplicated.
 
 #### CSC #### {#csc_format}
 
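As a rough illustration of the CSR hunk above, the following sketch reads the stored values of one row from the half-open range `[pointers_0[i], pointers_0[i+1])` and checks the per-row ordering rule. The helper names and the sample arrays are hypothetical.

```python
def csr_row(pointers_0, indices_1, values, i):
    """Return (column_index, value) pairs for the stored values in row i."""
    start, end = pointers_0[i], pointers_0[i + 1]
    return list(zip(indices_1[start:end], values[start:end]))


def rows_are_sorted(pointers_0, indices_1):
    """True if column indices strictly increase within every row."""
    for i in range(len(pointers_0) - 1):
        cols = indices_1[pointers_0[i]:pointers_0[i + 1]]
        if any(b <= a for a, b in zip(cols, cols[1:])):
            return False
    return True


# Hypothetical 3x3 matrix with stored values at (0,0), (0,2), and (2,1).
pointers_0 = [0, 2, 2, 3]
indices_1 = [0, 2, 1]
values = [1.0, 2.0, 3.0]

row0 = csr_row(pointers_0, indices_1, values, 0)
row1 = csr_row(pointers_0, indices_1, values, 1)
```

The same code describes CSC if rows and columns are swapped, since `pointers_0` then delimits columns and `indices_1` holds row indices.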
@@ -144,7 +149,7 @@ The rows indices of the stored values located in column `j` are located in the r
 `[pointers_0[j], pointers_0[j+1])` in the `indices_1` array. The scalar values for
 each of those stored values is stored in the corresponding index in the `values` array.
 
-Within a column, row indices shall be sorted and must not be duplicated.
+Within a column, elements shall be sorted by row index and must not be duplicated.
 
 #### DCSR #### {#dcsr_format}
 
@@ -164,8 +169,8 @@ DCSR is similar to CSR, except that rows which are entirely empty are not stored
 contains no repeated values. Because the position within `pointers_0` no longer dictates the
 corresponding row index, `indices_0` provides the row index.
 
-Within a row, column indices shall be sorted and must not be duplicated. Row indices shall be
-sorted and must not be duplicated.
+Rows shall be sorted and must not be duplicated.
+Within each row, elements shall be sorted by column index and must not be duplicated.
 
 #### DCSC #### {#dcsc_format}
 
@@ -185,8 +190,8 @@ DCSC is similar to CSC, except that columns which are entirely empty are not sto
 contains no repeated values. Because the position within `pointers_0` no longer dictates the
 corresponding column index, `indices_0` provides the column index.
 
-Within a column, row indices shall be sorted and must not be duplicated. Column indices shall be
-sorted and must not be duplicated.
+Columns shall be sorted and not duplicated.
+Within each column, elements shall be sorted by row index and must not be duplicated.
 
 #### COOR #### {#coor_format}
 
@@ -464,12 +469,12 @@ Data Types {#key_data_types}
 ----------------------------
 
 The `data_types` key must be present and shall define the data types of all required
-arrays based on the [[#key_format]]. The data type declares the type of the
-in-memory arrays. While these are often identical to the types used when storing
-the arrays on disk in the container, the container may choose to store the arrays
-in another format. For example, `uint64` type may be stored as `int8` if all the
-numbers in the array are small enough to fit, but `data_types` would still list the
-array as having type `uint64`.
+arrays based on the [[#key_format]]. The data type declares the type of both the
+on-disk array as well as the in-memory array. When these are identical, a simple string
+defines the type. When the on-disk and in-memory types differ due to limitations in the
+storage container (ex. HDF5 lacks a BOOL type), the type is shown as "on_disk->in_memory".
+For the example of storing BOOL type as INT8, this would be "int8->bool" to indicate that
+after reading the values array into memory, it should be interpreted as boolean data.
 
 For a given [[#key_format]], all named binary arrays for that format shall have a
 corresponding name in `data_types`.
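The `"on_disk->in_memory"` notation introduced in the hunk above could be parsed along the following lines. This is a sketch only: the function name `parse_data_type` and the decision to reject more than one `->` are assumptions, not something the diff specifies.

```python
def parse_data_type(spec):
    """Split a data_types string into (on_disk_type, in_memory_type)."""
    on_disk, arrow, in_memory = spec.partition("->")
    if not arrow:
        # A plain string means both types are identical.
        return spec, spec
    # Assumption: exactly one arrow with a non-empty type on each side.
    if not on_disk or not in_memory or "->" in in_memory:
        raise ValueError(f"malformed data type: {spec!r}")
    return on_disk, in_memory


plain = parse_data_type("uint64")
converted = parse_data_type("int8->bool")
```

A reader would then load the array using the first element of the pair and reinterpret it as the second, for example treating stored `int8` values as booleans.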
@@ -496,16 +501,16 @@ Example of a CSR Matrix whose values are all 1.
 <td>.</td>
 <td>.</td>
 <td>.</td>
-<td>1</td>
+<td>7</td>
 <td>.</td>
 </tr>
 <tr>
 <th>1</th>
 <td>.</td>
-<td>1</td>
+<td>7</td>
 <td>.</td>
 <td>.</td>
-<td>1</td>
+<td>7</td>
 </tr>
 <tr>
 <th>2</th>
@@ -518,8 +523,8 @@ Example of a CSR Matrix whose values are all 1.
 <tr>
 <th>3</th>
 <td>.</td>
-<td>1</td>
-<td>1</td>
+<td>7</td>
+<td>7</td>
 <td>.</td>
 <td>.</td>
 </tr>
@@ -528,7 +533,7 @@ Example of a CSR Matrix whose values are all 1.
 <td>.</td>
 <td>.</td>
 <td>.</td>
-<td>1</td>
+<td>7</td>
 <td>.</td>
 </tr>
 </tbody>
@@ -558,7 +563,7 @@ Example of a CSR Matrix whose values are all 1.
 
 - `pointers_0` = [0, 1, 3, 3, 5, 6]
 - `indices_1` = [3, 1, 4, 1, 2, 3]
-- `values` = [7]
+- `values` = [7]
 
 </div>
 
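To cross-check the updated example, the arrays from the last hunk can be expanded back into a grid and compared with the table rows changed above. This sketch assumes that a `values` array of length 1 is shared by every stored value, and it takes the column count of 5 from the table; the function name `expand_csr` is hypothetical.

```python
pointers_0 = [0, 1, 3, 3, 5, 6]
indices_1 = [3, 1, 4, 1, 2, 3]
values = [7]


def expand_csr(pointers_0, indices_1, values, ncols):
    """Expand CSR arrays into rows, using "." for positions with no stored value."""
    rows = []
    for i in range(len(pointers_0) - 1):
        row = ["."] * ncols
        for k in range(pointers_0[i], pointers_0[i + 1]):
            # Assumption: a single-element values array applies to all stored values.
            row[indices_1[k]] = values[k] if len(values) > 1 else values[0]
        rows.append(row)
    return rows


grid = expand_csr(pointers_0, indices_1, values, 5)
for row in grid:
    print(" ".join(str(cell) for cell in row))
```

Row 2 comes out empty because `pointers_0[2]` equals `pointers_0[3]`, and the other rows place `7` in the same cells the table hunks change.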