docs/guide_gri.md
11 additions & 19 deletions
@@ -164,26 +164,20 @@ queryChrom = featureChrom AND
The optional, trailing `ceiling` & `floor` arguments to `genomic_range_rowids()` optimize GRI queries by bounding their search *levels*, skipping steps that'd be useless in view of the overall length distribution of the indexed features. (See [Internals](internals.md) for full explanation.)
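To make the level bounds more concrete, the manual rule stated later in this section (a ceiling *C* with all feature lengths ≤ 16<sup>*C*</sup>, and a floor *F* > 0 with the minimum feature length > 16<sup>*F*-1</sup>, else zero) can be sketched in plain Python. The helper name `manual_gri_bounds` and its arguments are hypothetical and not part of the extension. The sketch ignores any floor supplied at GRI creation, and the chromosome length used below is an assumed figure for the longest human chromosome.

```python
def manual_gri_bounds(min_length, max_length):
    """Hypothetical helper: derive (ceiling, floor) from guaranteed
    minimum and maximum feature lengths, following the manual rule."""
    if not 0 <= min_length <= max_length:
        raise ValueError("expected 0 <= min_length <= max_length")

    # Ceiling: smallest C with 0 < C < 16 such that max_length <= 16**C.
    ceiling = next((c for c in range(1, 16) if max_length <= 16**c), None)
    if ceiling is None:
        raise ValueError("max_length exceeds 16**15")

    # Floor: largest F > 0 such that min_length > 16**(F-1), else zero.
    floor = 0
    for f in range(1, ceiling + 1):
        if min_length > 16 ** (f - 1):
            floor = f
    return ceiling, floor


# Assumed length of the longest human chromosome, in nt (an illustration only).
LONGEST_HUMAN_CHROM = 248_956_422

whole_genome_bounds = manual_gri_bounds(1, LONGEST_HUMAN_CHROM)
short_feature_bounds = manual_gri_bounds(20, 5000)
```

Under these assumptions the first call yields a ceiling of 7, matching the human genome example in the text. When in doubt, omit the bounds and rely on the defaults C=15, F=0, since inappropriate bounds are liable to produce incomplete results.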
-
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. This procedure has to analyze the GRI, and the logarithmic cost of doing so will be worthwhile if used to optimize many subsequent GRI queries (but not for just one or a few). Therefore, a typical program should query `genomic_range_index_levels()` once upfront, then pass the detected bounds in to subsequent prepared queries, e.g. in Python:
+
The extension supplies a SQL helper function `genomic_range_index_levels(tableName)` to detect appropriate level bounds for the current version of the table. Example usage:
-
```python3
-
(gri_ceiling, gri_floor) = next(
-
con.execute("SELECT * FROM genomic_range_index_levels('exons')")
-
)
-
for (queryChrom, queryBegin, queryEnd) in queryRanges:
**❗ Don't use the detected level bounds if the table can be modified in the meantime. GRI queries with inappropriate bounds are liable to produce incomplete results.**
+
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
+
`genomic_range_index_levels()` performs some upfront analysis of the table's GRI upon its first use on any database connection. The cost of this analysis should be worthwhile if it's used to optimize many `genomic_range_rowids()` operations (but not just one or a few). Subsequent uses of `genomic_range_index_levels()` on the same connection & table reuse the first analysis, unless the database changes in the meantime, in which case the analysis must be redone. This suggests using `genomic_range_index_levels()` only once the database is read-only.
-
Omitting the bounds is always safe, albeit slightly slower. <small>Instead of detecting current bounds, they can be figured manually as follows. Set the integer ceiling to *C*, 0 < *C* < 16, such that all (present & future) indexed features are guaranteed to have lengths ≤ 16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is < 16<sup>7</sup> nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* > 0 such that the minimum possible feature length > 16<sup>*F*-1</sup>, if any; or (iii) zero. The default, safe, albeit slower bounds are C=15, F=0.</small>
+
<small>Instead of detecting current bounds, they can be figured manually as follows. Set the integer ceiling to *C*, 0 < *C* < 16, such that all (present & future) indexed features are guaranteed to have lengths ≤ 16<sup>*C*</sup>. For example, if you're querying features on the human genome, then you can set ceiling=7 because the lengthiest chromosome sequence is < 16<sup>7</sup> nt. Set the integer floor *F* to (i) the floor value supplied at GRI creation, if any; (ii) *F* > 0 such that the minimum possible feature length > 16<sup>*F*-1</sup>, if any; or (iii) zero. The safe, default bounds are C=15, F=0. GRI queries with inappropriate bounds are liable to produce incomplete results.</small>
#### Joining tables on range overlap
@@ -202,7 +196,7 @@ FROM variants LEFT JOIN exons ON exons._rowid_ IN
We fill out the GRI query range using the three coordinate columns of the variants table.
-
We may be able to speed this up by supplying level bounds, as shown above. Optionally, in this case where we expect a "tight loop" of many GRI queries, we can even inline the bounds detection:
+
We may be able to speed this up by supplying level bounds, as discussed above:
```sql
SELECT variants.*, exons._rowid_
@@ -218,8 +212,6 @@ FROM genomic_range_index_levels('exons'),
)
```
-
Here `_gri_ceiling` and `_gri_floor` are columns of the single row computed by `genomic_range_index_levels('exons')`.
See also "Advice for big data" below on optimizing storage layout for GRI queries.