
Commit cc174ff

docs: update clickhouse queries for estimated on-disk size and improve clarity (#204)
1 parent 3973966 commit cc174ff

2 files changed

Lines changed: 139 additions & 39 deletions

File tree

docs/admin/configuration/extensions/assistants-evaluation/data-volume-maintenance.md

Lines changed: 125 additions & 39 deletions
@@ -51,6 +51,22 @@ clickhouse-client --password <password_from_above>
 
 ## 1. Disk Usage Analysis
 
+### Database Size Summary
+
+A quick overview of total disk usage and row counts across both databases.
+
+```sql
+SELECT
+    if(database = '', '=== TOTAL ===', database) AS database_name,
+    formatReadableSize(sum(bytes_on_disk)) AS total_size_on_disk,
+    sum(rows) AS total_rows
+FROM system.parts
+WHERE active AND (database IN ('default', 'system'))
+GROUP BY database
+    WITH ROLLUP
+ORDER BY total_rows ASC;
+```
+
 ### Top Tables by Disk Size
 
 This query identifies which tables are consuming the most disk space.
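The `WITH ROLLUP` modifier in the summary query adds a grand-total row whose grouping key is the empty string, which the `if()` call relabels as `=== TOTAL ===`. A minimal Python sketch of that aggregation shape (the byte counts are made-up illustrative numbers, not from a real deployment):

```python
# Emulates the row shape GROUP BY ... WITH ROLLUP produces for the query
# above: one row per database plus a grand-total row with an empty key,
# which is then relabeled the way the if() call does in SQL.

def rollup(sizes: dict[str, int]) -> list[tuple[str, int]]:
    """Per-group rows followed by the ROLLUP grand-total row."""
    rows = [(db, b) for db, b in sizes.items()]
    rows.append(("", sum(sizes.values())))  # ROLLUP total: empty grouping key
    return [("=== TOTAL ===" if db == "" else db, b) for db, b in rows]

# Illustrative sizes in bytes:
print(rollup({"default": 40_000_000, "system": 10_000_000}))
```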
@@ -145,15 +161,15 @@ The following queries help you understand how data is distributed over time and
 
 Choose the appropriate query based on your needs:
 
-- **[Compressed Size by Month (Fast)](#compressed-size-by-month-fast)** – Actual compressed disk usage and row counts by month
-- **[Row Count by Day (Fast)](#by-day-row-count-fast)** – Number of records by day
-- **[Uncompressed Size by Day (Heavy)](#uncompressed-size-by-day-heavy)** – Decompresses data to calculate approximate size. Not actual disk usage – use only for comparing relative data volume between days
+- **[On-Disk Size by Month (Fast)](#on-disk-size-by-month-fast)** – Actual compressed disk usage and row counts by month
+- **[Row Count by Day (Fast)](#row-count-by-day-fast)** – Number of records by day
+- **[Approximate Size by Day (Heavy)](#approximate-size-by-day-heavy)** – Decompresses data to calculate approximate size. Not actual disk usage – use only for comparing relative data volume between days
 
 :::note
-Per-day compressed size is not available because ClickHouse partitions data by month (`PARTITION BY toYYYYMM()`).
+Per-day compressed size is not available because Langfuse partitions data by month (`PARTITION BY toYYYYMM()`).
 :::
 
-#### Compressed Size by Month (Fast)
+#### On-Disk Size by Month (Fast)
 
 Shows actual compressed disk usage by month. Reads partition metadata from `system.parts`.
 
@@ -188,7 +204,7 @@ ORDER BY partition ASC;
 </TabItem>
 <TabItem value="blob_storage" label="Blob Storage Logs">
 
-The `blob_storage_file_log` table does not have `PARTITION BY` in its schema, so compressed size by month cannot be queried from `system.parts`. Use [Row Count by Day](#by-day-row-count-fast) to analyze this table's data distribution.
+The `blob_storage_file_log` table does not have `PARTITION BY` in its schema, so compressed size by month cannot be queried from `system.parts`. Use [Row Count by Day (Fast)](#row-count-by-day-fast) to analyze this table's data distribution.
 
 </TabItem>
 <TabItem value="system_logs" label="System Logs">
@@ -233,7 +249,7 @@ ORDER BY partition ASC;
 </TabItem>
 </Tabs>
 
-#### By Day: Row Count (Fast)
+#### Row Count by Day (Fast)
 
 Shows row count per day. Executes instantly by reading indices only.
 
@@ -302,65 +318,135 @@ ORDER BY day ASC;
 </TabItem>
 </Tabs>
 
-#### Uncompressed Size by Day (Heavy)
+#### Approximate Size by Day (Heavy)
+
+Estimates approximate on-disk size per day using the table's real compression ratio from `system.parts` and the size of the main text fields. The result is slightly lower than actual disk usage because not all columns are measured.
 
 <Tabs groupId="table-type">
 <TabItem value="observations" label="Observations" default>
 
 ```sql
+WITH table_compression AS (
+    SELECT
+        `table`,
+        sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS ratio
+    FROM system.parts
+    WHERE active AND database = 'default' AND `table` = 'observations'
+    GROUP BY `table`
+),
+daily_payload AS (
+    SELECT
+        toDate(start_time) AS day,
+        count() AS rows,
+        sum(length(input) + length(output)) AS raw_text_bytes
+    FROM default.observations
+    GROUP BY day
+)
 SELECT
-    toDate(start_time) AS day,
-    count() AS rows,
-    formatReadableSize(sum(length(toString(input)) + length(toString(output)))) AS approx_size
-FROM default.observations
-GROUP BY day
-ORDER BY day ASC;
+    d.day,
+    d.rows,
+    formatReadableSize(d.raw_text_bytes) AS raw_text_size,
+    formatReadableSize(d.raw_text_bytes / c.ratio) AS estimated_disk_usage
+FROM daily_payload AS d
+CROSS JOIN table_compression AS c
+ORDER BY d.day ASC;
 ```
 
 </TabItem>
 <TabItem value="traces" label="Traces">
 
 ```sql
+WITH table_compression AS (
+    SELECT
+        `table`,
+        sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS ratio
+    FROM system.parts
+    WHERE active AND database = 'default' AND `table` = 'traces'
+    GROUP BY `table`
+),
+daily_payload AS (
+    SELECT
+        toDate(timestamp) AS day,
+        count() AS rows,
+        sum(length(input) + length(output)) AS raw_text_bytes
+    FROM default.traces
+    GROUP BY day
+)
 SELECT
-    toDate(timestamp) AS day,
-    count() AS rows,
-    formatReadableSize(sum(length(toString(input)) + length(toString(output)))) AS approx_size
-FROM default.traces
-GROUP BY day
-ORDER BY day ASC;
+    d.day,
+    d.rows,
+    formatReadableSize(d.raw_text_bytes) AS raw_text_size,
+    formatReadableSize(d.raw_text_bytes / c.ratio) AS estimated_disk_usage
+FROM daily_payload AS d
+CROSS JOIN table_compression AS c
+ORDER BY d.day ASC;
 ```
 
 </TabItem>
 <TabItem value="blob_storage" label="Blob Storage Logs">
 
-The `blob_storage_file_log` table does not have `PARTITION BY` in its schema, so uncompressed size by day cannot be queried from `system.parts`. Use [Row Count by Day](#by-day-row-count-fast) to analyze this table's data distribution.
+The `blob_storage_file_log` table does not have `PARTITION BY` in its schema, so approximate size by day cannot be queried from `system.parts`. Use [Row Count by Day (Fast)](#row-count-by-day-fast) to analyze this table's data distribution.
 
 </TabItem>
 <TabItem value="system_logs" label="System Logs">
 
 You can replace `query_log` with a table from [this list](#system-log-tables).
 
-```sql {5}
+```sql {6,14}
+WITH table_compression AS (
+    SELECT
+        `table`,
+        sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS ratio
+    FROM system.parts
+    WHERE active AND database = 'system' AND `table` = 'query_log'
+    GROUP BY `table`
+),
+daily_payload AS (
+    SELECT
+        event_date AS day,
+        count() AS rows,
+        sum(length(query)) AS raw_text_bytes
+    FROM system.query_log
+    GROUP BY day
+)
 SELECT
-    event_date AS day,
-    count() AS rows,
-    formatReadableSize(sum(length(toString(query)))) AS approx_size
-FROM system.query_log
-GROUP BY day
-ORDER BY day ASC;
+    d.day,
+    d.rows,
+    formatReadableSize(d.raw_text_bytes) AS raw_text_size,
+    formatReadableSize(d.raw_text_bytes / c.ratio) AS estimated_disk_usage
+FROM daily_payload AS d
+CROSS JOIN table_compression AS c
+ORDER BY d.day ASC;
 ```
 
 </TabItem>
 <TabItem value="opentelemetry" label="OpenTelemetry Span Log">
 
 ```sql
+WITH table_compression AS (
+    SELECT
+        `table`,
+        sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS ratio
+    FROM system.parts
+    WHERE active AND database = 'system' AND `table` = 'opentelemetry_span_log'
+    GROUP BY `table`
+),
+daily_payload AS (
+    SELECT
+        finish_date AS day,
+        count() AS rows,
+        sum(length(toString(attribute))) AS raw_text_bytes
+    FROM system.opentelemetry_span_log
+    GROUP BY day
+)
 SELECT
-    finish_date AS day,
-    count() AS rows,
-    formatReadableSize(sum(length(toString(attribute)))) AS approx_size
-FROM system.opentelemetry_span_log
-GROUP BY day
-ORDER BY day ASC;
+    d.day,
+    d.rows,
+    formatReadableSize(d.raw_text_bytes) AS raw_text_size,
+    formatReadableSize(d.raw_text_bytes / c.ratio) AS estimated_disk_usage
+FROM daily_payload AS d
+CROSS JOIN table_compression AS c
+ORDER BY d.day ASC;
 ```
 
 </TabItem>
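The estimate these new queries produce is plain arithmetic: divide the raw text-payload bytes by the table-wide compression ratio taken from `system.parts`. A standalone Python sketch of that calculation, with made-up numbers rather than real deployment figures:

```python
def estimated_disk_usage(raw_text_bytes: int,
                         uncompressed_bytes: int,
                         compressed_bytes: int) -> float:
    """Scale the raw payload size by the table's observed compression ratio,
    mirroring the raw_text_bytes / ratio expression in the SQL above."""
    ratio = uncompressed_bytes / compressed_bytes  # e.g. ~10x for text-heavy tables
    return raw_text_bytes / ratio

# Illustrative: a day with 5 GB of raw input/output text in a table that
# compresses 10:1 occupies roughly 0.5 GB on disk.
print(estimated_disk_usage(5_000_000_000, 80_000_000_000, 8_000_000_000))
```

As the section notes, this undershoots real disk usage slightly because only the main text columns contribute to `raw_text_bytes`.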
@@ -451,7 +537,7 @@ Since deletion is not instant, check the progress here:
 <TabItem value="observations" label="Observations" default>
 
 ```sql
-SELECT command, is_done
+SELECT create_time, command, is_done
 FROM system.mutations
 WHERE table = 'observations'
 ORDER BY create_time DESC
@@ -462,7 +548,7 @@ LIMIT 5;
 <TabItem value="traces" label="Traces">
 
 ```sql
-SELECT command, is_done
+SELECT create_time, command, is_done
 FROM system.mutations
 WHERE table = 'traces'
 ORDER BY create_time DESC
@@ -473,7 +559,7 @@ LIMIT 5;
 <TabItem value="blob_storage" label="Blob Storage Logs">
 
 ```sql
-SELECT command, is_done
+SELECT create_time, command, is_done
 FROM system.mutations
 WHERE table = 'blob_storage_file_log'
 ORDER BY create_time DESC
@@ -486,7 +572,7 @@ LIMIT 5;
 You can replace `query_log` with a table from [this list](#system-log-tables).
 
 ```sql {3}
-SELECT command, is_done
+SELECT create_time, command, is_done
 FROM system.mutations
 WHERE table = 'query_log'
 ORDER BY create_time DESC
@@ -497,7 +583,7 @@ LIMIT 5;
 <TabItem value="opentelemetry" label="OpenTelemetry Span Log">
 
 ```sql
-SELECT command, is_done
+SELECT create_time, command, is_done
 FROM system.mutations
 WHERE table = 'opentelemetry_span_log'
 ORDER BY create_time DESC
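Since `is_done` only flips to 1 once a mutation finishes, a small polling loop saves re-running these queries by hand. A hedged sketch: `run_query` is a hypothetical placeholder for whatever ClickHouse client call you use (it must take a SQL string and return the result as text).

```python
import time

def wait_for_mutations(run_query, table: str,
                       timeout_s: float = 600, poll_s: float = 5) -> bool:
    """Poll system.mutations until no unfinished mutation remains for `table`.

    `run_query` is an assumed callable wrapping your ClickHouse client;
    this sketch is not tied to any specific driver.
    """
    sql = ("SELECT count() FROM system.mutations "
           f"WHERE table = '{table}' AND is_done = 0")
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if int(run_query(sql)) == 0:
            return True   # all mutations for the table are done
        time.sleep(poll_s)
    return False          # timed out with mutations still running

# Example with a stub client that reports no pending mutations:
print(wait_for_mutations(lambda sql: "0", "observations"))
```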

docs/admin/deployment/extensions/02-assistants-evaluation/03-deployment-prerequisites.md

Lines changed: 14 additions & 0 deletions
@@ -225,6 +225,20 @@ GRANT ALL ON SCHEMA public TO langfuse_admin;
 
 To prevent disk overflow, configure [TTL](https://clickhouse.com/docs/guides/developer/ttl) policies in `values.yaml` to automatically remove old data. Default retention: 90 days.
 
+### Recommended TTL by Usage
+
+The table below provides recommended TTL values based on usage level, assuming the default **100 GB** ClickHouse disk size.
+
+| Usage Level  | Active Users | Est. Ingestion | Recommended TTL |
+| ------------ | ------------ | -------------- | --------------- |
+| High usage   | 3,000–4,000  | ~40 GB/day     | 2 days          |
+| Medium usage | ~1,500       | ~10 GB/day     | 9 days          |
+| Low usage    | < 500        | ~1 GB/day      | 90 days         |
+
+:::note
+If your deployment does not fit within the recommended TTL for 100 GB, either lower the TTL or increase the ClickHouse disk size. To measure your actual ingestion rate, see [Data distribution by time period](../../../configuration/extensions/assistants-evaluation/data-volume-maintenance#data-distribution-by-time-period).
+:::
+
 ### Langfuse Tables
 
 Set `retention.langfuse.enabled: true` in `values.yaml`. TTL is configured for the following tables:
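The retention setting above might be combined with a TTL picked from the usage table in `values.yaml`. In this sketch only `retention.langfuse.enabled` is taken from the docs; the nesting and the `days` key are assumptions to illustrate the shape, so check your chart's reference values for the real keys.

```yaml
# Hypothetical values.yaml sketch -- only `retention.langfuse.enabled`
# appears in the docs above; the `days` key and nesting are illustrative
# assumptions, not the chart's documented schema.
retention:
  langfuse:
    enabled: true
    days: 9        # medium usage (~10 GB/day) on a 100 GB disk
```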
