Commit f8b8e71

split parse_genomic_range
1 parent 1fa5d98 commit f8b8e71

3 files changed

Lines changed: 62 additions & 35 deletions

docs/guide.md

Lines changed: 20 additions & 8 deletions
@@ -1028,31 +1028,43 @@ But this plan strongly depends on the contiguity assumption.
10281028

10291029
#### Parse genomic range string
10301030

1031-
The SQL function `parse_genomic_range(txt, part)` processes a string such as "chr1:2,345-6,789" into any of its three parts (chromosome name, begin position, and end position).
1031+
These SQL functions process a string like "chr1:2,345-6,789" into its three parts (sequence/chromosome name, begin position, and end position).
10321032

10331033
=== "SQL"
10341034
``` sql
1035-
SELECT parse_genomic_range('chr1:2,345-6,789', 1) -- 'chr1'
1036-
SELECT parse_genomic_range('chr1:2,345-6,789', 2) -- 2344
1037-
SELECT parse_genomic_range('chr1:2,345-6,789', 3) -- 6789
1035+
SELECT parse_genomic_range_sequence('chr1:2,345-6,789') -- 'chr1'
1036+
SELECT parse_genomic_range_begin('chr1:2,345-6,789') -- 2344 (!)
1037+
SELECT parse_genomic_range_end('chr1:2,345-6,789') -- 6789
10381038
```
10391039

1040-
Important: [the begin position returned is one less than the text number](https://genome.ucsc.edu/FAQ/FAQtracks#tracks1), while the end position is equal to the text number.
1040+
<small>
1041+
[The begin position returned is one less than the text number](https://genome.ucsc.edu/FAQ/FAQtracks#tracks1), while the end position is equal to the text number.
1042+
</small>
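
The coordinate convention described above can be sketched in plain Python. This is an illustration only, not the extension's implementation: the helper names, the choice to split on the last colon, and the `begin <= end` check are assumptions made for the example.

```python
def _parse_pos(txt: str) -> int:
    """Read a position such as '2,345' (commas are ignored)."""
    digits = txt.replace(",", "")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"can't read position `{txt}`")
    return int(digits)


def parse_genomic_range(txt: str) -> tuple:
    """Split 'chr1:2,345-6,789' into (sequence name, begin, end).

    The begin position is returned as one less than the text number
    (zero-based, half-open), while the end position equals the text number.
    """
    # Assumption: the sequence name is everything before the LAST colon.
    chrom, colon, positions = txt.rpartition(":")
    begin_txt, dash, end_txt = positions.partition("-")
    if not (chrom and colon and dash):
        raise ValueError(f"can't read `{txt}`")
    begin_pos = _parse_pos(begin_txt)
    end_pos = _parse_pos(end_txt)
    # Assumption: a one-based begin of 0, or begin > end, is rejected.
    if begin_pos < 1 or begin_pos > end_pos:
        raise ValueError(f"invalid positions in `{txt}`")
    return chrom, begin_pos - 1, end_pos


print(parse_genomic_range("chr1:2,345-6,789"))
```

The tuple mirrors the three single-argument SQL functions: element 0 corresponds to the sequence name, element 1 to the begin position, and element 2 to the end position.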
10411043

10421044
#### Two-bit encoding for nucleotide sequences
10431045

10441046
The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.) Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
10451047

10461048
The encoding is case-insensitive and considers `T` and `U` equivalent.
10471049

1050+
*Encoding:*
1051+
10481052
=== "SQL"
10491053
``` sql
10501054
SELECT nucleotides_twobit('TCAG')
10511055
```
10521056

10531057
Given a TEXT value consisting of characters from the set `ACGTUacgtu`, compute a two-bit-encoded BLOB value that can later be decoded using `twobit_dna()` or `twobit_rna()`. Given any other ASCII TEXT value (including empty), pass it through unchanged as TEXT. Given NULL, return NULL. Any other input is an error.
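
To make the pass-through rules concrete, here is a rough Python sketch of two-bit packing. The byte layout (a one-byte header holding `len % 4`, then four nucleotides per byte), the code assignment, and the names `pack_twobit` and `unpack_twobit` are all hypothetical; the extension's actual BLOB format may differ.

```python
# Assumed code assignment; T and U share a code, so the encoding is lossy for T/U.
CODES = {"T": 0, "U": 0, "C": 1, "A": 2, "G": 3}


def pack_twobit(seq):
    """Pack ACGTU text into bytes; pass other ASCII text through unchanged."""
    if seq is None:
        return None
    if not isinstance(seq, str) or not seq.isascii():
        raise ValueError("expected ASCII text or None")
    if not seq or any(c.upper() not in CODES for c in seq):
        return seq  # e.g. sequences containing N stay as the original text
    packed = bytearray([len(seq) % 4])  # hypothetical header byte
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for c in chunk:
            byte = (byte << 2) | CODES[c.upper()]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack_twobit(blob, rna=False):
    """Decode bytes from pack_twobit() to uppercase text; pass text through."""
    if blob is None or isinstance(blob, str):
        return blob
    letters = "UCAG" if rna else "TCAG"
    n = (len(blob) - 1) * 4 - (4 - blob[0]) % 4  # nucleotide count
    return "".join(
        letters[(blob[1 + i // 4] >> (2 * (3 - i % 4))) & 3] for i in range(n)
    )


print(pack_twobit("TCAG"), unpack_twobit(pack_twobit("acgu"), rna=True))
```

Under this layout a sequence of n nucleotides occupies 1 + ceil(n/4) bytes, which is where the roughly 4X density over one-byte-per-character text comes from.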
10541058

1055-
Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))`. This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions, and by convention, the column should be named *_twobit.
1059+
Typically used to populate a BLOB column `C` with e.g.
1060+
1061+
```sql
1062+
INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))
1063+
```
1064+
1065+
This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions, and by convention, the column should be named *_twobit.
1066+
1067+
*Decoding:*
10561068

10571069
=== "SQL"
10581070
``` sql
@@ -1064,11 +1076,11 @@ Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C)
10641076

10651077
Given a two-bit-encoded BLOB value, decode the nucleotide sequence as uppercased TEXT, with `T`'s for `twobit_dna()` and `U`'s for `twobit_rna()`. Given a TEXT value, pass it through unchanged. Given NULL, return NULL. Any other first input is an error.
10661078

1067-
The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.
1079+
The optional `Y` and `Z` arguments can be used to compute [`substr(twobit_dna(X),Y,Z)`](https://sqlite.org/lang_corefunc.html#substr) more efficiently, without decoding the whole sequence. <small>Unfortunately however, [SQLite internals](https://sqlite.org/forum/forumpost/756c1a1e48?t=h) make this operation still liable to use time & memory proportional to the full length of X, not Z. If frequent random access into long sequences is needed, then consider splitting them across multiple rows.</small>
10681080

10691081
Take care to only use BLOBs originally produced by `nucleotides_twobit()`, as other BLOBs may decode to spurious nucleotide sequences. If you `SELECT twobit_dna(C) FROM some_table` on a column with mixed BLOB and TEXT values as suggested above, note that the results actually stored as TEXT preserve their case and T/U letters, unlike decoded BLOBs.
10701082

1071-
**↪ Two-bit sequence length**
1083+
*Length:*
10721084

10731085
=== "SQL"
10741086
``` sql

src/genomicsqlite.cc

Lines changed: 31 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1433,7 +1433,7 @@ static void sqlfn_twobit_rna(sqlite3_context *ctx, int argc, sqlite3_value **arg
14331433
}
14341434

14351435
/**************************************************************************************************
1436-
* parse_genomic_range()
1436+
* parse_genomic_range_*()
14371437
**************************************************************************************************/
14381438

14391439
static uint64_t parse_genomic_range_pos(const string &txt, size_t ofs1, size_t ofs2) {
@@ -1444,15 +1444,15 @@ static uint64_t parse_genomic_range_pos(const string &txt, size_t ofs1, size_t o
14441444
auto c = txt[i];
14451445
if (c >= '0' && c <= '9') {
14461446
if (ans > 922337203685477579ULL) { // (2**63-10)//10
1447-
throw std::runtime_error("parse_genomic_range() position overflow in `" + txt +
1447+
throw std::runtime_error("parse_genomic_range(): position overflow in `" + txt +
14481448
"`");
14491449
}
14501450
ans *= 10;
14511451
ans += c - '0';
14521452
} else if (c == ',') {
14531453
continue;
14541454
} else {
1455-
throw std::runtime_error("parse_genomic_range() can't read `" + txt + "`");
1455+
throw std::runtime_error("parse_genomic_range(): can't read `" + txt + "`");
14561456
}
14571457
}
14581458
return ans;
@@ -1480,26 +1480,36 @@ static std::tuple<string, uint64_t, uint64_t> parse_genomic_range(const string &
14801480
return std::make_tuple(chrom, begin_pos - 1, end_pos);
14811481
}
14821482

1483-
static void sqlfn_parse_genomic_range(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
1483+
static void sqlfn_parse_genomic_range_sequence(sqlite3_context *ctx, int argc,
1484+
sqlite3_value **argv) {
14841485
string txt;
1485-
sqlite3_int64 which_part;
14861486
ARG_TEXT(txt, 0);
1487-
ARG(which_part, 1, SQLITE_INTEGER, int64);
1488-
14891487
try {
14901488
auto t = parse_genomic_range(txt);
14911489
auto &chrom = get<0>(t);
1492-
switch (which_part) {
1493-
case 1:
1494-
return sqlite3_result_text(ctx, chrom.c_str(), chrom.size(), SQLITE_TRANSIENT);
1495-
case 2:
1496-
return sqlite3_result_int64(ctx, get<1>(t));
1497-
case 3:
1498-
return sqlite3_result_int64(ctx, get<2>(t));
1499-
default:
1500-
throw std::runtime_error(
1501-
"parse_genomic_range(): expected part 1, 2, or 3 (parameter 2)");
1502-
}
1490+
return sqlite3_result_text(ctx, chrom.c_str(), chrom.size(), SQLITE_TRANSIENT);
1491+
} catch (std::exception &exn) {
1492+
sqlite3_result_error(ctx, exn.what(), -1);
1493+
}
1494+
}
1495+
1496+
static void sqlfn_parse_genomic_range_begin(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
1497+
string txt;
1498+
ARG_TEXT(txt, 0);
1499+
try {
1500+
auto t = parse_genomic_range(txt);
1501+
return sqlite3_result_int64(ctx, get<1>(t));
1502+
} catch (std::exception &exn) {
1503+
sqlite3_result_error(ctx, exn.what(), -1);
1504+
}
1505+
}
1506+
1507+
static void sqlfn_parse_genomic_range_end(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
1508+
string txt;
1509+
ARG_TEXT(txt, 0);
1510+
try {
1511+
auto t = parse_genomic_range(txt);
1512+
return sqlite3_result_int64(ctx, get<2>(t));
15031513
} catch (std::exception &exn) {
15041514
sqlite3_result_error(ctx, exn.what(), -1);
15051515
}
@@ -1553,7 +1563,9 @@ static int register_genomicsqlite_functions(sqlite3 *db, const char **pzErrMsg,
15531563
{FPNM(twobit_rna), 1, SQLITE_DETERMINISTIC},
15541564
{FPNM(twobit_rna), 2, SQLITE_DETERMINISTIC},
15551565
{FPNM(twobit_rna), 3, SQLITE_DETERMINISTIC},
1556-
{FPNM(parse_genomic_range), 2, SQLITE_DETERMINISTIC}};
1566+
{FPNM(parse_genomic_range_sequence), 1, SQLITE_DETERMINISTIC},
1567+
{FPNM(parse_genomic_range_begin), 1, SQLITE_DETERMINISTIC},
1568+
{FPNM(parse_genomic_range_end), 1, SQLITE_DETERMINISTIC}};
15571569

15581570
int rc;
15591571
for (int i = 0; i < sizeof(fntab) / sizeof(fntab[0]); ++i) {
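
The overflow guard in `parse_genomic_range_pos()` compares the running value against `922337203685477579`, annotated as `(2**63-10)//10`. The following Python sketch (an illustration with hypothetical names, not a port of the C++) checks that arithmetic: once the guard passes, multiplying by 10 and adding a digit of at most 9 cannot exceed the signed 64-bit maximum.

```python
INT64_MAX = 2**63 - 1
GUARD = (2**63 - 10) // 10  # same constant as in the C++ check


def accumulate(digits: str) -> int:
    """Accumulate decimal digits (commas skipped), refusing values past the guard."""
    ans = 0
    for c in digits:
        if c == ",":
            continue
        if not ("0" <= c <= "9"):
            raise ValueError(f"can't read `{digits}`")
        if ans > GUARD:
            raise OverflowError(f"position overflow in `{digits}`")
        ans = ans * 10 + (ord(c) - ord("0"))
    return ans


# The largest value the guard lets through still fits in a signed 64-bit integer.
largest_accepted = GUARD * 10 + 9
assert largest_accepted <= INT64_MAX
print(GUARD, largest_accepted)
```

The guard is slightly conservative: values from 9,223,372,036,854,775,800 through 9,223,372,036,854,775,807 would fit in 64 bits but are rejected, which matches the test case expecting an error for `chr1:1-9,223,372,036,854,775,800`.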

test/test_parse_genomic_range.py

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -5,14 +5,13 @@
55

66
def test_parse_genomic_range():
77
con = genomicsqlite.connect(":memory:")
8-
query = "SELECT parse_genomic_range(?,?)"
98
for (txt, chrom, begin_pos, end_pos) in [
109
("chr1:2,345-06,789", "chr1", 2344, 6789),
1110
("π:1-9,223,372,036,854,775,799", "π", 0, 9223372036854775799),
1211
]:
13-
assert next(con.execute(query, (txt, 1)))[0] == chrom
14-
assert next(con.execute(query, (txt, 2)))[0] == begin_pos
15-
assert next(con.execute(query, (txt, 3)))[0] == end_pos
12+
assert next(con.execute("SELECT parse_genomic_range_sequence(?)", (txt,)))[0] == chrom
13+
assert next(con.execute("SELECT parse_genomic_range_begin(?)", (txt,)))[0] == begin_pos
14+
assert next(con.execute("SELECT parse_genomic_range_end(?)", (txt,)))[0] == end_pos
1615

1716
for txt in [
1817
"",
@@ -30,8 +29,12 @@ def test_parse_genomic_range():
3029
"chr1:2345-deadbeef",
3130
"chr1:1-9,223,372,036,854,775,800",
3231
]:
32+
with pytest.raises(sqlite3.OperationalError) as exc:
33+
con.execute("SELECT parse_genomic_range_sequence(?)", (txt,))
34+
assert "parse_genomic_range():" in str(exc.value)
3335
with pytest.raises(sqlite3.OperationalError):
34-
con.execute(query, (txt, 1))
35-
36-
with pytest.raises(sqlite3.OperationalError):
37-
con.execute(query, ("chr1:2-3", 0))
36+
con.execute("SELECT parse_genomic_range_begin(?)", (txt,))
37+
assert "parse_genomic_range():" in str(exc.value)
38+
with pytest.raises(sqlite3.OperationalError):
39+
con.execute("SELECT parse_genomic_range_end(?)", (txt,))
40+
assert "parse_genomic_range():" in str(exc.value)
