Commit 63d2060

twobit codec SQL functions (#8)
nucleotides_twobit() : TEXT -> BLOB and twobit_{dna,rna}() : BLOB -> TEXT for compactly storing/caching nucleotide sequences
1 parent 8fb547f commit 63d2060

4 files changed

Lines changed: 491 additions & 10 deletions

docs/guide.md

Lines changed: 56 additions & 9 deletions
@@ -903,6 +903,62 @@ But this plan strongly depends on the contiguity assumption.
 /* genomicsqlite_open("compressed.db", ...); */
 ```
 
+#### Two-bit encoding for nucleotide sequences
+
+The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.)
+
+Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
+
+**↪ Two-bit encoding**
+
+=== "SQL"
+``` sql
+SELECT nucleotides_twobit('TCAG')
+```
+
+Given any TEXT value matching `[AaCcGgTtUu]+`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. The two-bit encoding is case-insensitive and considers `T` and `U` equivalent.
+
+Given any other ASCII TEXT value, including the empty string, pass it through unchanged. Given a BLOB, first attempt to coerce it to ASCII TEXT. Given NULL, return NULL. Any other input is an error.
+
+**↪ Two-bit decoding**
+
+=== "SQL"
+``` sql
+SELECT twobit_dna(nucleotides_twobit('TCAG'))
+SELECT twobit_rna(nucleotides_twobit('UCAG'))
+SELECT twobit_dna(nucleotides_twobit('TCAG'),Y,Z)
+SELECT twobit_rna(nucleotides_twobit('UCAG'),Y,Z)
+```
+
+Given a BLOB value, perform two-bit decoding to produce a nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Take care to only use BLOBs originally produced by the two-bit encoder, as any BLOB *will* decode to some nucleotide sequence.
+
+Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.
+
+The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.
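
As a rough illustration of how the `Y` and `Z` arguments relate to the half-open range `[ofs, ofs+len)` taken by the C routines in `include/genomicsqlite.h`, the sketch below maps 1-based `substr()` arguments onto a clamped 0-based range. It is not taken from the extension's source: `subseq_range` and `clamp_substr_range` are hypothetical names, and only the simple case `Y >= 1` with non-negative `Z` is handled (SQLite's rules for zero or negative arguments are out of scope here).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: a 0-based, half-open range [ofs, ofs+len). */
struct subseq_range {
    size_t ofs;
    size_t len;
};

/*
 * Map substr(X, y, z) with y >= 1 onto a range that satisfies the header's
 * stated precondition ofs+len <= total, where total is the decoded length.
 */
struct subseq_range clamp_substr_range(size_t y, size_t z, size_t total) {
    struct subseq_range r = {0, 0};
    if (y == 0 || y > total) {
        /* y == 0 is not covered by this sketch; y > total selects nothing. */
        return r;
    }
    r.ofs = y - 1;                       /* 1-based position to 0-based offset */
    size_t avail = total - r.ofs;        /* nucleotides left from ofs onward */
    r.len = z < avail ? z : avail;       /* never read past the end */
    assert(r.ofs + r.len <= total);
    return r;
}
```

For a 4-nucleotide sequence, `Y=2, Z=2` maps to `ofs=1, len=2`, and an oversized `Z` is truncated to the nucleotides that remain.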
+
+Notice that the encoder passes through TEXT values if they contain any non-nucleotide character, and the decoder always passes through TEXT values. Therefore, if
+a BLOB column `C` is filled with `nucleotides_twobit(...)`, and you `SELECT twobit_dna(C) FROM ...`, the original TEXT value is stored & returned automatically for any cell containing a non-nucleotide character, while the two-bit-encoded BLOBs are used exactly where possible. However, the original TEXT values would have their case and T/U letters preserved, unlike decoded BLOBs.
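
The pass-through rules described for TEXT input can be sketched as a small classifier. This is only an illustration of the documented behavior, not the extension's implementation: `twobit_input_class`, `is_nucleotide` and `classify_text` are hypothetical names, and the numeric values are chosen merely to echo the 0 / -1 / -2 return codes documented for the C routine `nucleotides_twobit()`. Treating an embedded NUL as an error is likewise borrowed from that header comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the three documented outcomes on TEXT input. */
enum twobit_input_class {
    TWOBIT_ENCODE = 0,        /* matches [AaCcGgTtUu]+ : store as two-bit BLOB */
    TWOBIT_PASS_THROUGH = -1, /* other ASCII text or empty string: keep TEXT */
    TWOBIT_ERROR = -2         /* non-ASCII byte or NUL */
};

static int is_nucleotide(unsigned char c) {
    switch (c) {
    case 'A': case 'a': case 'C': case 'c': case 'G': case 'g':
    case 'T': case 't': case 'U': case 'u':
        return 1;
    default:
        return 0;
    }
}

enum twobit_input_class classify_text(const char *seq, size_t len) {
    assert(seq != NULL || len == 0);
    int all_nucleotides = 1;
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)seq[i];
        if (c == 0 || c >= 0x80) {
            return TWOBIT_ERROR;          /* not ASCII text */
        }
        if (!is_nucleotide(c)) {
            all_nucleotides = 0;          /* keep scanning for non-ASCII bytes */
        }
    }
    if (len == 0 || !all_nucleotides) {
        return TWOBIT_PASS_THROUGH;
    }
    return TWOBIT_ENCODE;
}
```

Under these rules a sequence such as `ACGTN` stays TEXT because of the `N`, which is why a mixed column can hold both encoded BLOBs and original TEXT values.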
+
+**↪ Two-bit sequence length**
+
+=== "SQL"
+``` sql
+SELECT twobit_length(nucleotides_twobit('TCAG'))
+```
+
+Given a two-bit-encoded BLOB value, return the length of the *decoded* sequence (without actually decoding it). This is *not* equal to `4*length(BLOB)` due to padding.
+
+Given a TEXT value, return its byte length. Given NULL, return NULL. Any other input is an error.
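
To make the padding remark concrete, the arithmetic below uses the buffer size `(len+7)/4` that the comment in `include/genomicsqlite.h` gives for the C encoder. It is a sketch under the assumption that the SQL function's BLOB has the same size as the C routine's output, which the guide does not state explicitly; `twobit_encoded_size` and `naive_length_estimate` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes written for len nucleotides, per the header comment: (len+7)/4. */
size_t twobit_encoded_size(size_t len) {
    assert(len <= SIZE_MAX - 7);  /* guard against overflow in len + 7 */
    return (len + 7) / 4;
}

/* The estimate the guide warns against: four nucleotides per BLOB byte. */
size_t naive_length_estimate(size_t blob_size) {
    return 4 * blob_size;
}
```

Under this formula a 4-nucleotide sequence occupies 2 bytes, so `4*length(BLOB)` would suggest 8 rather than 4; `twobit_length()` is the reliable way to get the decoded length.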
+
+#### JSON functions
+
+The Genomics Extension bundles the SQLite developers' [JSON1 extension](https://www.sqlite.org/json1.html) and enables it automatically. The following conventions are recommended,
+
+1. JSON object columns should be named *_json with type `TEXT DEFAULT '{}'`.
+2. JSON array columns should be named *_jsarray with type `TEXT DEFAULT '[]'`.
+
+The JSON1 functions can be used with [generated columns](https://sqlite.org/gencol.html) to effectively enable indices on JSON-embedded fields.
+
 #### Genomics Extension version
 
 **↪ GenomicSQLite Version**
@@ -933,15 +989,6 @@ But this plan strongly depends on the contiguity assumption.
 /* result to be sqlite3_free() */
 ```
 
-#### JSON functions
-
-The Genomics Extension bundles the SQLite developers' [JSON1 extension](https://www.sqlite.org/json1.html) and enables it automatically. The following conventions are recommended,
-
-1. JSON object columns should be named *_json with type `TEXT DEFAULT '{}'`.
-2. JSON array columns should be named *_jsarray with type `TEXT DEFAULT '[]'`.
-
-The JSON1 functions can be used with [generated columns](https://sqlite.org/gencol.html) to effectively enable indices on JSON-embedded fields.
-
 ## genomicsqlite interactive shell
 
 The Python package includes a `genomicsqlite` script that starts the [`sqlite3` interactive shell](https://sqlite.org/cli.html) with the Genomics Extension enabled. Simply invoke,

include/genomicsqlite.h

Lines changed: 29 additions & 0 deletions
@@ -1,3 +1,4 @@
+#include <stddef.h>
 #ifndef SQLITE3EXT_H
 #include <sqlite3.h>
 #endif
@@ -86,6 +87,34 @@ char *put_genomic_reference_sequence_sql(const char *name, sqlite3_int64 length,
 const char *meta_json, sqlite3_int64 rid,
 const char *attached_schema);
 
+/*
+ * Low-level routines for two-bit nucleotide encoding (normally used via SQL functions, but
+ * available to C FFI callers here)
+ */
+
+/*
+ * Two-bit encode the nucleotide character sequence of specified length. The output buffer must be
+ * preallocated (len+7)/4 bytes. Return codes:
+ * 0 = success; wrote (len+7)/4 bytes
+ * -1 = encountered non-nucleotide ASCII character
+ * -2 = encountered non-ASCII character (e.g. UTF-8) or NUL
+ */
+int nucleotides_twobit(const char *seq, size_t len, void *out);
+
+/*
+ * Given two-bit-encoded blob pointer & size, compute original nucleotide sequence length
+ */
+size_t twobit_length(const void *data, size_t sz);
+
+/*
+ * Given blob pointer, two-bit-decode the nucleotide subsequence [ofs, ofs+len). To get the whole
+ * sequence, set ofs=0 & len=twobit_length(data, datasize). Caller must ensure that:
+ * 1. ofs+len <= twobit_length(data, datasize)
+ * 2. out is preallocated len+1 bytes (a null terminator will be affixed)
+ */
+void twobit_dna(const void *data, size_t ofs, size_t len, char *out);
+void twobit_rna(const void *data, size_t ofs, size_t len, char *out);
+
 /*
  * C++ bindings: are liable to throw exceptions except where marked noexcept
  */
