docs/guide.md (13 additions & 6 deletions)
@@ -1026,14 +1026,25 @@ But this plan strongly depends on the contiguity assumption.
/* genomicsqlite_open("compressed.db", ...); */
```
+#### Parse genomic range string
+
+The SQL function `parse_genomic_range(txt, part)` processes a string such as "chr1:2,345-6,789" into any of its three parts (chromosome name, begin position, and end position).
+Important: [the begin position returned is one less than the text number](https://genome.ucsc.edu/FAQ/FAQtracks#tracks1), while the end position is equal to the text number.
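
The coordinate convention described above can be illustrated with a small standalone Python sketch. The helper name `parse_genomic_range_sketch` and its regular expression are hypothetical and only mirror the behaviour stated in the text; this is not the extension's implementation and it does not call `parse_genomic_range` itself.

```python
import re

# Hypothetical helper mirroring the behaviour described in the guide;
# it is NOT the extension's implementation.
_RANGE = re.compile(r"^([^:\s]+):([0-9,]+)-([0-9,]+)$")


def parse_genomic_range_sketch(txt: str) -> tuple[str, int, int]:
    """Return (chromosome, begin, end); begin is one less than the text number."""
    m = _RANGE.match(txt.strip())
    if m is None:
        raise ValueError(f"not a genomic range: {txt!r}")
    chrom = m.group(1)
    # Thousands separators are dropped before converting to integers.
    begin_text = int(m.group(2).replace(",", ""))
    end_text = int(m.group(3).replace(",", ""))
    return chrom, begin_text - 1, end_text


print(parse_genomic_range_sketch("chr1:2,345-6,789"))  # ('chr1', 2344, 6789)
```

The begin position comes back as 2344 rather than 2345, while the end position stays at 6789, matching the zero-based, half-open convention the linked UCSC FAQ describes.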
+
#### Two-bit encoding for nucleotide sequences
The extension supplies SQL functions to pack a DNA/RNA sequence TEXT value into a smaller BLOB value, using two bits per nucleotide. (Review [SQLite Datatypes](https://www.sqlite.org/datatype3.html) on the important differences between TEXT and BLOB values & columns.) Storing a large database of sequences using such BLOBs instead of TEXT can improve application I/O efficiency, with up to 4X more nucleotides cached in the same memory space. It is not, however, expected to greatly shrink the database file on disk, owing to the automatic storage compression.
The encoding is case-insensitive and considers `T` and `U` equivalent.
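
To make the idea of two bits per nucleotide concrete, here is a minimal Python sketch. The code table (A=0, C=1, G=2, T/U=3), the left-aligned byte layout, and the separately passed length are all assumptions chosen for illustration; the extension's actual BLOB format is not described here and may differ.

```python
# Illustrative two-bit packing. The code table and byte layout are assumptions,
# not the extension's actual BLOB format.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}  # T and U share one code
BASES = "ACGT"


def pack_twobit(seq: str) -> bytes:
    """Pack a nucleotide string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()  # case-insensitive
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]  # KeyError for anything outside ACGTU
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_twobit(blob: bytes, length: int) -> str:
    """Decode `length` bases from packed bytes, always emitting DNA letters."""
    bases = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        bases.append(BASES[(blob[i // 4] >> shift) & 3])
    return "".join(bases)
```

In this sketch, `pack_twobit("tcag")` and `pack_twobit("UCAG")` yield the same single byte, and sixteen bases occupy four bytes, which is where the "up to 4X" figure comes from.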
-**↪ Two-bit encoding**
-
=== "SQL"
``` sql
SELECT nucleotides_twobit('TCAG')
@@ -1043,8 +1054,6 @@ Given a TEXT value consisting of characters from the set `ACGTUacgtu`, compute a
Typically used to populate a BLOB column `C` with `INSERT INTO some_table(...,C) VALUES(...,nucleotides_twobit(?))`. This works even if some of the sequences contain `N`s or other characters, in which case those sequences are stored as the original TEXT values. Make sure the column has schema type `BLOB` to avoid spurious coercions; by convention, the column should be named `*_twobit`.
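
The coexistence of BLOB and TEXT values in one BLOB-typed column can be checked with plain SQLite, without loading the extension. In the sketch below, the table `reads`, the column `seq_twobit`, and the placeholder byte are hypothetical stand-ins for what `nucleotides_twobit(?)` would yield; only standard `sqlite3` calls and the built-in `typeof()` function are used.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table; the column is declared BLOB and named *_twobit by convention.
con.execute("CREATE TABLE reads(id INTEGER PRIMARY KEY, seq_twobit BLOB)")

# Stand-ins for the two kinds of value nucleotides_twobit(?) may yield:
packed = bytes([0xD2])  # placeholder bytes, not the extension's real encoding
fallback = "TCNG"       # sequence containing N, kept as TEXT

con.execute("INSERT INTO reads(seq_twobit) VALUES(?)", (packed,))
con.execute("INSERT INTO reads(seq_twobit) VALUES(?)", (fallback,))

# typeof() reports the storage class actually used for each row.
stored_types = [
    row[0]
    for row in con.execute("SELECT typeof(seq_twobit) FROM reads ORDER BY id")
]
print(stored_types)  # ['blob', 'text']
```

Because the column is declared `BLOB`, SQLite applies no type conversion on insert, so each row keeps whichever storage class it was given.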
-**↪ Two-bit decoding**
-
=== "SQL"
``` sql
SELECT twobit_dna(nucleotides_twobit('TCAG'))
@@ -1076,8 +1085,6 @@ The Genomics Extension bundles the SQLite developers' [JSON1 extension](https://