Commit fc91f67

tighten twobit spec & add tests
1 parent 993e597 commit fc91f67

3 files changed

Lines changed: 38 additions & 19 deletions

docs/guide.md

Lines changed: 6 additions & 9 deletions
@@ -1026,9 +1026,9 @@ But this plan strongly depends on the contiguity assumption.

 #### Two-bit encoding for nucleotide sequences

-The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.)
+The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.) Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.

-Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
+The encoding is case-insensitive and considers `T` and `U` equivalent.

 **↪ Two-bit encoding**

@@ -1037,9 +1037,9 @@ Storing a large database of sequences using such BLOBs instead of TEXT can impro
 SELECT nucleotides_twobit('TCAG')
 ```

-Given a TEXT value consisting only of characters from `ACGTUacgtu`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. Given any other ASCII TEXT value, pass it through unchanged. The encoding is case-insensitive and considers `T` and `U` equivalent.
+Given a TEXT value consisting of characters from the set `ACGTUacgtu`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. Given any other ASCII TEXT value (including empty), pass it through unchanged as TEXT. Given NULL, return NULL. Any other input is an error.

-Given a BLOB, first attempt to coerce it to ASCII TEXT. Given NULL, return NULL. Any other input is an error.
+Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))`. This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions.

 **↪ Two-bit decoding**

@@ -1051,14 +1051,11 @@ Given a BLOB, first attempt to coerce it to ASCII TEXT. Given NULL, return NULL.
 SELECT twobit_rna(nucleotides_twobit('UCAG'),Y,Z)
 ```

-Given a two-bit-encoded BLOB value, decode the nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Take care to only use BLOBs originally produced by `nucleotides_twobit()`, as other BLOBs may decode to garbage nucleotide sequences.
-
-Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.
+Given a two-bit-encoded BLOB value, decode the nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.

 The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.

-Notice that the encoder passes through TEXT values if they contain any non-nucleotide character, and the decoder always passes through TEXT values. Therefore, if
-a BLOB column `C` is filled with `nucleotides_twobit(...)`, and you `SELECT twobit_dna(C) FROM ...`, the original TEXT value is stored & returned automatically for any cell containing a non-nucleotide character, while the two-bit-encoded BLOBs are used exactly where possible. However, the original TEXT values would have their case and T/U letters preserved, unlike decoded BLOBs.
+Take care to only use BLOBs originally produced by `nucleotides_twobit()`, as other BLOBs may decode to garbage nucleotide sequences. If you `SELECT twobit_dna(C) FROM some_table` on a column with mixed BLOB and TEXT values as suggested above, note that the results actually stored as TEXT preserve their case and T/U letters, unlike decoded BLOBs.

 **↪ Two-bit sequence length**
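
Reviewer note: the input rules documented for `nucleotides_twobit()` in this hunk (NULL in, NULL out; nucleotide-only TEXT becomes a BLOB; other TEXT, including empty, passes through; anything else is an error) can be tried out without the extension. The sketch below registers a hypothetical stand-in, `twobit_stand_in`, with Python's standard `sqlite3` module. The function name, the table `seqs`, and the placeholder BLOB payload are assumptions for illustration only; the payload is not the extension's real two-bit layout.

```python
import sqlite3

NUCLEOTIDES = set("ACGTUacgtu")


def twobit_stand_in(value):
    """Hypothetical stand-in that mirrors only the documented input rules."""
    if value is None:
        return None  # NULL in, NULL out
    if not isinstance(value, str):
        raise TypeError("expected TEXT")  # any other input is an error
    if value and all(ch in NUCLEOTIDES for ch in value):
        # Placeholder payload (assumption): NOT the real two-bit encoding.
        return value.upper().replace("U", "T").encode("ascii")
    return value  # other TEXT, including empty, passes through as TEXT


con = sqlite3.connect(":memory:")
con.create_function("twobit_stand_in", 1, twobit_stand_in)
# Declaring the column as BLOB keeps each stored value's type as inserted.
con.execute("CREATE TABLE seqs(seq BLOB)")
for s in ["TCAG", "gATuaCa", "ACGTN", "", None]:
    con.execute("INSERT INTO seqs(seq) VALUES(twobit_stand_in(?))", (s,))

stored_types = [
    row[0] for row in con.execute("SELECT typeof(seq) FROM seqs ORDER BY rowid")
]
print(stored_types)
```

The point of the sketch is the mixed column: rows that qualify are stored as BLOBs, while `ACGTN` and the empty string stay TEXT in the same column.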

src/genomicsqlite.cc

Lines changed: 4 additions & 9 deletions
@@ -1176,14 +1176,10 @@ extern "C" int nucleotides_twobit(const char *seq, size_t len, void *out) {
 static void sqlfn_nucleotides_twobit(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
     assert(argc == 1);
     auto arg0ty = sqlite3_value_type(argv[0]);
-    switch (arg0ty) {
-    case SQLITE_TEXT:
-    case SQLITE_BLOB:
-        break;
-    case SQLITE_NULL:
+    if (arg0ty == SQLITE_NULL) {
         return sqlite3_result_null(ctx);
-    default:
-        return sqlite3_result_error(ctx, "nucleotides_twobit() expected BLOB or TEXT", -1);
+    } else if (arg0ty != SQLITE_TEXT) {
+        return sqlite3_result_error(ctx, "nucleotides_twobit() expected TEXT", -1);
     }

     auto seqlen = sqlite3_value_bytes(argv[0]);
@@ -1192,8 +1188,7 @@ static void sqlfn_nucleotides_twobit(sqlite3_context *ctx, int argc, sqlite3_val
         return sqlite3_result_value(ctx, argv[0]);
     }

-    auto seq = (const char *)(arg0ty == SQLITE_TEXT ? sqlite3_value_text(argv[0])
-                                                    : sqlite3_value_blob(argv[0]));
+    auto seq = (const char *)sqlite3_value_text(argv[0]);
     if (!seq) {
         return sqlite3_result_error_nomem(ctx);
     }

test/test_twobit.py

Lines changed: 28 additions & 1 deletion
@@ -3,7 +3,7 @@
 import genomicsqlite


-def test_twobit():
+def test_twobit_random():
     con = genomicsqlite.connect(":memory:")

     random.seed(42)
@@ -51,6 +51,10 @@ def test_twobit():
         )[0]
         assert decoded == control

+
+def test_twobit_corner_cases():
+    con = genomicsqlite.connect(":memory:")
+
     for nuc in "AGCTagct":
         assert next(con.execute("SELECT length(nucleotides_twobit(?))", (nuc,)))[0] == 1
         assert (
@@ -71,3 +75,26 @@ def test_twobit():
             )[0]
             control = next(con.execute("SELECT substr('GAUUACA',?,?)", (xtest, ytest)))[0]
             assert decoded == control, str((xtest, ytest))
+
+
+def test_twobit_column():
+    # test populating a column with mixed BLOB and TEXT values
+    con = genomicsqlite.connect(":memory:")
+
+    con.executescript("CREATE TABLE test(test_twobit BLOB)")
+    for elt in list("Tu") + ["foo", "bar", "gATuaCa"]:
+        con.execute("INSERT INTO test(test_twobit) VALUES(nucleotides_twobit(?))", (elt,))
+
+    column = list(con.execute("SELECT test_twobit FROM test"))
+    assert isinstance(column[0][0], bytes), str([type(x[0]) for x in column])
+    assert isinstance(column[-1][0], bytes)
+    assert isinstance(column[-2][0], str)
+    assert column[-2][0] == "bar"
+
+    assert list(con.execute("SELECT twobit_dna(test_twobit) FROM test")) == [
+        ("T",),
+        ("T",),
+        ("foo",),
+        ("bar",),
+        ("GATTACA",),
+    ]
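
Reviewer note: for readers unfamiliar with the idea these tests exercise, here is a minimal pure-Python sketch of packing four nucleotides per byte and decoding them as DNA or RNA. The code table, the bit order, and the choice to pass the sequence length separately are all assumptions made for illustration; the extension's actual BLOB layout is not shown in this commit and may differ.

```python
# Assumed code table: T and U share a code, matching the documented T/U equivalence.
CODES = {"T": 0, "U": 0, "C": 1, "A": 2, "G": 3}
DNA_ALPHABET = "TCAG"
RNA_ALPHABET = "UCAG"


def pack_twobit(seq):
    """Pack a nucleotide string into bytes, four nucleotides per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq.upper()):  # case-insensitive
        out[i // 4] |= CODES[ch] << (2 * (i % 4))
    return bytes(out)


def unpack_twobit(blob, length, alphabet=DNA_ALPHABET):
    """Decode `length` nucleotides from `blob` as uppercase text."""
    return "".join(
        alphabet[(blob[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack_twobit("gATuaCa")
print(unpack_twobit(packed, 7))                # decoded with T's
print(unpack_twobit(packed, 7, RNA_ALPHABET))  # decoded with U's
print(len(pack_twobit("A" * 1000)))            # a quarter of the text length
```

Because case and the T/U distinction are discarded at packing time, a decoded sequence comes back uppercased with whichever letter the decoder chooses, which is why the column test above expects `GATTACA` for the input `gATuaCa`.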
