tighten twobit spec & add tests

nucleotides_twobit() now accepts only TEXT or NULL: the BLOB-to-TEXT coercion path is removed, and any other input type fails with "nucleotides_twobit() expected TEXT". The guide is updated to match and now describes the intended use, which is populating a BLOB column that may hold a mix of encoded BLOBs and passed-through TEXT values. The existing test is split into test_twobit_random and test_twobit_corner_cases, and test_twobit_column is added to cover the mixed-column case.

mlin · mlin · commit fc91f670a233 · 2020-11-26T14:09:43.000-10:00
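
Reviewer note: the contract this commit documents can be tried without building the extension. The sketch below is a toy model, not the extension's code. `toy_twobit`, the `CODES` table, and the `seqs` table are hypothetical names, and the bit layout is an assumption (the real layout is not shown in this diff). The packing is deliberately simplistic and does not record sequence length, so it cannot be decoded; it only illustrates the input/output type rules and how a column declared `BLOB` keeps TEXT and BLOB values as inserted.

```python
import sqlite3

# Hypothetical 2-bit codes; the extension's real bit layout is not shown in this diff.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def toy_twobit(seq):
    """Toy stand-in mirroring the documented type rules of nucleotides_twobit()."""
    if seq is None:
        return None  # NULL in, NULL out
    if not isinstance(seq, str):
        raise TypeError("toy_twobit() expected TEXT")  # e.g. BLOB input
    if seq == "" or any(ch.upper() not in CODES for ch in seq):
        return seq  # empty or non-nucleotide TEXT passes through as TEXT
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        # Pack up to four nucleotides into one byte, two bits each.
        for j, ch in enumerate(seq[i : i + 4]):
            byte |= CODES[ch.upper()] << (2 * j)
        out.append(byte)
    return bytes(out)  # Python bytes become a SQLite BLOB


con = sqlite3.connect(":memory:")
con.create_function("toy_twobit", 1, toy_twobit)
con.execute("CREATE TABLE seqs(C BLOB)")  # BLOB affinity: values stored as given
for s in ["TCAG", "gATuaCa", "ACGTN", "", None]:
    con.execute("INSERT INTO seqs(C) VALUES(toy_twobit(?))", (s,))

stored_types = [r[0] for r in con.execute("SELECT typeof(C) FROM seqs ORDER BY rowid")]
print(stored_types)  # ['blob', 'blob', 'text', 'text', 'null']
```

Only the two pure-nucleotide inputs are stored as BLOBs. The sequence containing `N` and the empty string stay TEXT, and NULL stays NULL, which is the mixed-column situation that `test_twobit_column` exercises against the real functions.
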
diff --git a/docs/guide.md b/docs/guide.md
@@ -1026,9 +1026,9 @@ But this plan strongly depends on the contiguity assumption.
 
 #### Two-bit encoding for nucleotide sequences
 
-The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.)
+The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.) Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
 
-Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
+The encoding is case-insensitive and considers `T` and `U` equivalent.
 
 **↪ Two-bit encoding**
 
@@ -1037,9 +1037,9 @@ Storing a large database of sequences using such BLOBs instead of TEXT can impro
     SELECT nucleotides_twobit('TCAG')
     ```
 
-Given a TEXT value consisting only of characters from `ACGTUacgtu`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. Given any other ASCII TEXT value, pass it through unchanged. The encoding is case-insensitive and considers `T` and `U` equivalent.
+Given a TEXT value consisting of characters from the set `ACGTUacgtu`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. Given any other ASCII TEXT value (including empty), pass it through unchanged as TEXT. Given NULL, return NULL. Any other input is an error.
 
-Given a BLOB, first attempt to coerce it to ASCII TEXT. Given NULL, return NULL. Any other input is an error.
+Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))`. This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions.
 
 **↪ Two-bit decoding**
 
@@ -1051,14 +1051,11 @@ Given a BLOB, first attempt to coerce it to ASCII TEXT. Given NULL, return NULL.
     SELECT twobit_rna(nucleotides_twobit('UCAG'),Y,Z)
     ```
 
-Given a two-bit-encoded BLOB value, decode the nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Take care to only use BLOBs originally produced by `nucleotides_twobit()`, as other BLOBs may decode to garbage nucleotide sequences.
-
-Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.
+Given a two-bit-encoded BLOB value, decode the nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.
 
 The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.
 
-Notice that the encoder passes through TEXT values if they contain any non-nucleotide character, and the decoder always passes through TEXT values. Therefore, if 
-a BLOB column `C` is filled with `nucleotides_twobit(...)`, and you `SELECT twobit_dna(C) FROM ...`, the original TEXT value is stored & returned automatically for any cell containing a non-nucleotide character, while the two-bit-encoded BLOBs are used exactly where possible. However, the original TEXT values would have their case and T/U letters preserved, unlike decoded BLOBs.
+Take care to only use BLOBs originally produced by `nucleotides_twobit()`, as other BLOBs may decode to garbage nucleotide sequences. If you `SELECT twobit_dna(C) FROM some_table` on a column with mixed BLOB and TEXT values as suggested above, note that the results actually stored as TEXT preserve their case and T/U letters, unlike decoded BLOBs.
 
 **↪ Two-bit sequence length**
 
diff --git a/src/genomicsqlite.cc b/src/genomicsqlite.cc
@@ -1176,14 +1176,10 @@ extern "C" int nucleotides_twobit(const char *seq, size_t len, void *out) {
 static void sqlfn_nucleotides_twobit(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
     assert(argc == 1);
     auto arg0ty = sqlite3_value_type(argv[0]);
-    switch (arg0ty) {
-    case SQLITE_TEXT:
-    case SQLITE_BLOB:
-        break;
-    case SQLITE_NULL:
+    if (arg0ty == SQLITE_NULL) {
         return sqlite3_result_null(ctx);
-    default:
-        return sqlite3_result_error(ctx, "nucleotides_twobit() expected BLOB or TEXT", -1);
+    } else if (arg0ty != SQLITE_TEXT) {
+        return sqlite3_result_error(ctx, "nucleotides_twobit() expected TEXT", -1);
     }
 
     auto seqlen = sqlite3_value_bytes(argv[0]);
@@ -1192,8 +1188,7 @@ static void sqlfn_nucleotides_twobit(sqlite3_context *ctx, int argc, sqlite3_val
         return sqlite3_result_value(ctx, argv[0]);
     }
 
-    auto seq = (const char *)(arg0ty == SQLITE_TEXT ? sqlite3_value_text(argv[0])
-                                                    : sqlite3_value_blob(argv[0]));
+    auto seq = (const char *)sqlite3_value_text(argv[0]);
     if (!seq) {
         return sqlite3_result_error_nomem(ctx);
     }
diff --git a/test/test_twobit.py b/test/test_twobit.py
@@ -3,7 +3,7 @@
 import genomicsqlite
 
 
-def test_twobit():
+def test_twobit_random():
     con = genomicsqlite.connect(":memory:")
 
     random.seed(42)
@@ -51,6 +51,10 @@ def test_twobit():
         )[0]
         assert decoded == control
 
+
+def test_twobit_corner_cases():
+    con = genomicsqlite.connect(":memory:")
+
     for nuc in "AGCTagct":
         assert next(con.execute("SELECT length(nucleotides_twobit(?))", (nuc,)))[0] == 1
         assert (
@@ -71,3 +75,26 @@ def test_twobit():
             )[0]
             control = next(con.execute("SELECT substr('GAUUACA',?,?)", (xtest, ytest)))[0]
             assert decoded == control, str((xtest, ytest))
+
+
+def test_twobit_column():
+    # test populating a column with mixed BLOB and TEXT values
+    con = genomicsqlite.connect(":memory:")
+
+    con.executescript("CREATE TABLE test(test_twobit BLOB)")
+    for elt in list("Tu") + ["foo", "bar", "gATuaCa"]:
+        con.execute("INSERT INTO test(test_twobit) VALUES(nucleotides_twobit(?))", (elt,))
+
+    column = list(con.execute("SELECT test_twobit FROM test"))
+    assert isinstance(column[0][0], bytes), str([type(x[0]) for x in column])
+    assert isinstance(column[-1][0], bytes)
+    assert isinstance(column[-2][0], str)
+    assert column[-2][0] == "bar"
+
+    assert list(con.execute("SELECT twobit_dna(test_twobit) FROM test")) == [
+        ("T",),
+        ("T",),
+        ("foo",),
+        ("bar",),
+        ("GATTACA",),
+    ]