Skip to content

Commit 1fa5d98

Browse files
authored
add in-SQL parse_genomic_range() (#11)
1 parent e6baa32 commit 1fa5d98

5 files changed

Lines changed: 127 additions & 7 deletions

File tree

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ This [SQLite3 loadable extension](https://www.sqlite.org/loadext.html) adds feat
66

77
* genomic range indexing for overlap queries & joins
88
* streaming storage compression (also available [standalone](https://github.com/mlin/sqlite_zstd_vfs))
9+
* in-SQL utility functions, e.g. parsing "chr1:2,345-6,789"
910
* pre-tuned settings for "big data"
1011

1112
This October 2020 poster discusses the context and long-run ambitions:

docs/guide.md

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1026,14 +1026,25 @@ But this plan strongly depends on the contiguity assumption.
10261026
/* genomicsqlite_open("compressed.db", ...); */
10271027
```
10281028

1029+
#### Parse genomic range string
1030+
1031+
The SQL function `parse_genomic_range(txt, part)` processes a string such as "chr1:2,345-6,789" into any of its three parts (chromosome name, begin position, and end position).
1032+
1033+
=== "SQL"
1034+
``` sql
1035+
SELECT parse_genomic_range('chr1:2,345-6,789', 1) -- 'chr1'
1036+
SELECT parse_genomic_range('chr1:2,345-6,789', 2) -- 2344
1037+
SELECT parse_genomic_range('chr1:2,345-6,789', 3) -- 6789
1038+
```
1039+
1040+
Important: [the begin position returned is one less than the text number](https://genome.ucsc.edu/FAQ/FAQtracks#tracks1), while the end position is equal to the text number.
1041+
10291042
#### Two-bit encoding for nucleotide sequences
10301043

10311044
The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.) Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
10321045

10331046
The encoding is case-insensitive and considers `T` and `U` equivalent.
10341047

1035-
**↪ Two-bit encoding**
1036-
10371048
=== "SQL"
10381049
``` sql
10391050
SELECT nucleotides_twobit('TCAG')
@@ -1043,8 +1054,6 @@ Given a TEXT value consisting of characters from the set `ACGTUacgtu`, compute a
10431054

10441055
Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))`. This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions, and by convention, the column should be named *_twobit.
10451056

1046-
**↪ Two-bit decoding**
1047-
10481057
=== "SQL"
10491058
``` sql
10501059
SELECT twobit_dna(nucleotides_twobit('TCAG'))
@@ -1076,8 +1085,6 @@ The Genomics Extension bundles the SQLite developers' [JSON1 extension](https://
10761085

10771086
#### Genomics Extension version
10781087

1079-
**↪ GenomicSQLite Version**
1080-
10811088
=== "SQL"
10821089
``` sql
10831090
SELECT genomicsqlite_version()

docs/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ This [SQLite3 loadable extension](https://www.sqlite.org/loadext.html) supports
88

99
* genomic range indexing for overlap queries & joins
1010
* streaming storage compression using multithreaded [Zstandard](https://facebook.github.io/zstd/)
11+
* in-SQL utility functions, e.g. parsing "chr1:2,345-6,789"
1112
* pre-tuned settings for "big data"
1213

1314
This October 2020 poster discusses the context and long-run ambitions:

src/genomicsqlite.cc

Lines changed: 75 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1432,6 +1432,79 @@ static void sqlfn_twobit_rna(sqlite3_context *ctx, int argc, sqlite3_value **arg
14321432
twobit_nucleotides(ctx, argc, argv, true);
14331433
}
14341434

1435+
/**************************************************************************************************
1436+
* parse_genomic_range()
1437+
**************************************************************************************************/
1438+
1439+
static uint64_t parse_genomic_range_pos(const string &txt, size_t ofs1, size_t ofs2) {
1440+
assert(ofs1 < ofs2);
1441+
assert(ofs2 <= txt.size());
1442+
uint64_t ans = 0;
1443+
for (size_t i = ofs1; i < ofs2; ++i) {
1444+
auto c = txt[i];
1445+
if (c >= '0' && c <= '9') {
1446+
if (ans > 922337203685477579ULL) { // (2**63-10)//10
1447+
throw std::runtime_error("parse_genomic_range() position overflow in `" + txt +
1448+
"`");
1449+
}
1450+
ans *= 10;
1451+
ans += c - '0';
1452+
} else if (c == ',') {
1453+
continue;
1454+
} else {
1455+
throw std::runtime_error("parse_genomic_range() can't read `" + txt + "`");
1456+
}
1457+
}
1458+
return ans;
1459+
}
1460+
1461+
static std::tuple<string, uint64_t, uint64_t> parse_genomic_range(const string &txt) {
1462+
auto p1 = txt.find(':');
1463+
auto p2 = txt.find('-');
1464+
if (p1 == string::npos || p2 == string::npos || p1 < 1 || p2 < p1 + 2 || p2 >= txt.size() - 1) {
1465+
throw std::runtime_error("parse_genomic_range(): can't read `" + txt + "`");
1466+
}
1467+
string chrom = txt.substr(0, p1);
1468+
for (size_t i = 0; i < chrom.size(); ++i) {
1469+
if (std::isspace(chrom[i])) {
1470+
throw std::runtime_error(
1471+
"parse_genomic_range(): invalid sequence/chromosome name in `" + txt + "`");
1472+
}
1473+
}
1474+
auto begin_pos = parse_genomic_range_pos(txt, p1 + 1, p2),
1475+
end_pos = parse_genomic_range_pos(txt, p2 + 1, txt.size());
1476+
if (begin_pos < 1 || begin_pos > end_pos) {
1477+
throw std::runtime_error("parse_genomic_range(): invalid one-based positions in `" + txt +
1478+
"`");
1479+
}
1480+
return std::make_tuple(chrom, begin_pos - 1, end_pos);
1481+
}
1482+
1483+
static void sqlfn_parse_genomic_range(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
1484+
string txt;
1485+
sqlite3_int64 which_part;
1486+
ARG_TEXT(txt, 0);
1487+
ARG(which_part, 1, SQLITE_INTEGER, int64);
1488+
1489+
try {
1490+
auto t = parse_genomic_range(txt);
1491+
auto &chrom = get<0>(t);
1492+
switch (which_part) {
1493+
case 1:
1494+
return sqlite3_result_text(ctx, chrom.c_str(), chrom.size(), SQLITE_TRANSIENT);
1495+
case 2:
1496+
return sqlite3_result_int64(ctx, get<1>(t));
1497+
case 3:
1498+
return sqlite3_result_int64(ctx, get<2>(t));
1499+
default:
1500+
throw std::runtime_error(
1501+
"parse_genomic_range(): expected part 1, 2, or 3 (parameter 2)");
1502+
}
1503+
} catch (std::exception &exn) {
1504+
sqlite3_result_error(ctx, exn.what(), -1);
1505+
}
1506+
}
1507+
14351508
/**************************************************************************************************
14361509
* SQLite loadable extension initialization
14371510
**************************************************************************************************/
@@ -1479,7 +1552,8 @@ static int register_genomicsqlite_functions(sqlite3 *db, const char **pzErrMsg,
14791552
{FPNM(twobit_dna), 3, SQLITE_DETERMINISTIC},
14801553
{FPNM(twobit_rna), 1, SQLITE_DETERMINISTIC},
14811554
{FPNM(twobit_rna), 2, SQLITE_DETERMINISTIC},
1482-
{FPNM(twobit_rna), 3, SQLITE_DETERMINISTIC}};
1555+
{FPNM(twobit_rna), 3, SQLITE_DETERMINISTIC},
1556+
{FPNM(parse_genomic_range), 2, SQLITE_DETERMINISTIC}};
14831557

14841558
int rc;
14851559
for (int i = 0; i < sizeof(fntab) / sizeof(fntab[0]); ++i) {

test/test_parse_genomic_range.py

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
import sqlite3
2+
import pytest
3+
import genomicsqlite
4+
5+
6+
def test_parse_genomic_range():
7+
con = genomicsqlite.connect(":memory:")
8+
query = "SELECT parse_genomic_range(?,?)"
9+
for (txt, chrom, begin_pos, end_pos) in [
10+
("chr1:2,345-06,789", "chr1", 2344, 6789),
11+
("π:1-9,223,372,036,854,775,799", "π", 0, 9223372036854775799),
12+
]:
13+
assert next(con.execute(query, (txt, 1)))[0] == chrom
14+
assert next(con.execute(query, (txt, 2)))[0] == begin_pos
15+
assert next(con.execute(query, (txt, 3)))[0] == end_pos
16+
17+
for txt in [
18+
"",
19+
":",
20+
"-",
21+
":-",
22+
":1-2",
23+
"chr1",
24+
"chr1:0-1",
25+
"chr1:1,234",
26+
"chr1:1,234-",
27+
"chr1:1,234-5,67",
28+
"chr1 :2,345-06,789",
29+
"chr1:2,345-06,789\t",
30+
"chr1:2345-deadbeef",
31+
"chr1:1-9,223,372,036,854,775,800",
32+
]:
33+
with pytest.raises(sqlite3.OperationalError):
34+
con.execute(query, (txt, 1))
35+
36+
with pytest.raises(sqlite3.OperationalError):
37+
con.execute(query, ("chr1:2-3", 0))

0 commit comments

Comments
 (0)