
Commit 74e08b7

revise genomic_range_index_levels() docs on recommended usage
1 parent 8836f3b commit 74e08b7

1 file changed

Lines changed: 21 additions & 17 deletions


docs/guide_gri.md

@@ -160,20 +160,9 @@ By the half-open position convention, this includes features that *abut* as well
160160

161161
#### Level bounds optimization
162162

163-
The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by skipping steps that'd be useless in view of the length distribution of the indexed features. (See Internals for full explanation.)
163+
The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by bounding their search *levels*, skipping steps that'd be useless in view of the overall length distribution of the indexed features. (See [Internals](internals.md) for full explanation.)
164164

165-
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect the appropriate bounds for (the current snapshot of) the table. Example usage:
166-
167-
```sql
168-
SELECT col1, col2, ... FROM exons, genomic_range_index_levels('exons')
169-
WHERE exons._rowid_ IN
170-
genomic_range_rowids('exons', 'chr12', 111803912, 111804012,
171-
_gri_ceiling, _gri_floor)
172-
```
173-
174-
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
175-
176-
Alternatively, your program might first query `genomic_range_index_levels()` alone, then pass the bounds in to subsequent prepared queries, e.g. in Python:
165+
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. This procedure has to analyze the GRI, and the cost of doing so will be worthwhile if used to optimize many subsequent GRI queries (but not for just one or a few). Therefore, a typical program should query `genomic_range_index_levels()` once upfront, then pass the detected bounds in to subsequent prepared queries, e.g. in Python:
177166

178167
```python3
179168
(gri_ceiling, gri_floor) = next(
@@ -190,16 +179,29 @@ for (queryChrom, queryBegin, queryEnd) in queryRanges:
190179
...
191180
```
192181

193-
This bounds detection procedure has a small cost, which will be worthwhile if used to optimize many subsequent GRI queries (but possibly not if just for a few).
194-
195-
**❗ The bounds should be redetected if the min/max feature length may have been changed by inserts or updates to the table. GRI queries with incorrect bounds are liable to produce incomplete results.**
182+
**❗ Don't use the detected level bounds if the table can be modified in the meantime. GRI queries with inappropriate bounds are liable to produce incomplete results.**
196183

197184
Omitting the bounds is always safe, albeit slower. <small>Instead of detecting current bounds, they can be figured manually as follows. Set the integer ceiling to *C*, 0 &lt; *C* &lt; 16, such that all (present & future) indexed features are guaranteed to have lengths &le;16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is &lt;16<sup>7</sup>nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* &gt; 0 such that the minimum possible feature length &gt;16<sup>*F*-1</sup>, if any; or (iii) zero. The default, safe, albeit slower bounds are C=15, F=0.</small>
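The manual calculation described above can be sketched in Python as follows. This is only an illustration of the stated arithmetic, not part of the extension: the helper name `manual_level_bounds` and its parameters are hypothetical, and it does not cover case (i), where a floor was supplied at GRI creation and should be used instead.

```python3
def manual_level_bounds(max_len, min_len=0):
    """
    Hypothetical helper (not part of the extension): derive (ceiling, floor)
    from guaranteed bounds on the lengths of all present & future features.
    """
    if min_len < 0 or max_len < min_len:
        raise ValueError("expected 0 <= min_len <= max_len")
    if max_len > 16**15:
        raise ValueError("max_len exceeds 16**15")
    # ceiling: smallest C, 0 < C < 16, such that every length <= 16**C
    ceiling = 1
    while 16**ceiling < max_len:
        ceiling += 1
    # floor: largest F > 0 such that min_len > 16**(F-1), or zero if none
    floor = 0
    while floor < ceiling and min_len > 16**floor:
        floor += 1
    return ceiling, floor


# Human genome, no lower bound on feature length; the figure used here for
# the longest chromosome (about 249 million nt) is an assumption.
print(manual_level_bounds(248_956_422))
```

If no reliable length guarantees are available, omitting the bounds (equivalently C=15, F=0) remains the safe choice.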
198185

199186
#### Joining tables on range overlap
200187

201188
Suppose we have two tables with genomic features to join on range overlap. Only the "right-hand" table must have a GRI; preferably the smaller of the two. For example, annotating a table of variants with the surrounding exon(s), if any:
202189

190+
``` sql
191+
SELECT variants.*, exons._rowid_
192+
FROM variants LEFT JOIN exons ON exons._rowid_ IN
193+
genomic_range_rowids(
194+
'exons',
195+
variants.chrom,
196+
variants.beginPos,
197+
variants.endPos
198+
)
199+
```
200+
201+
We fill out the GRI query range using the three coordinate columns of the variants table.
202+
203+
We may be able to speed this up by supplying level bounds, as shown above. Optionally, in this case where we expect a "tight loop" of many GRI queries, we can even inline the bounds detection:
204+
203205
``` sql
204206
SELECT variants.*, exons._rowid_
205207
FROM genomic_range_index_levels('exons'),
@@ -214,7 +216,9 @@ FROM genomic_range_index_levels('exons'),
214216
)
215217
```
216218

217-
We fill out the GRI query range using the three coordinate columns of the variants table. The level bounds optimization is highly desirable for the "tight loop" of GRI queries during a join. See also "Advice for big data" below.
219+
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
220+
221+
See also "Advice for big data" below on optimizing storage layout for GRI queries.
218222

219223
### Reference genome metadata
220224
