diff --git a/modules/metrics-reference/pages/application-telemetry-metrics.adoc b/modules/metrics-reference/pages/application-telemetry-metrics.adoc
new file mode 100644
index 0000000000..b1ec503410
--- /dev/null
+++ b/modules/metrics-reference/pages/application-telemetry-metrics.adoc
@@ -0,0 +1,103 @@
+= Application Telemetry Metrics
+:description: Couchbase can be configured to collect metrics related to Couchbase SDK calls made from applications.
+:stem: asciimath
+
+[abstract]
+{description}
+
+== Prerequisites
+
+=== Activating Telemetry on Couchbase Server
+
+Before generating metrics from application telemetry, you must enable application telemetry on your Couchbase Server installation.
+You can do this by executing the following `curl` command:
+
+[source, bash, subs="+quotes"]
+----
+curl -u Administrator:password -X POST \
+https://##{host}##:##{port}##/settings/appTelemetry \
+-d enabled=true
+----
+
+For more information on enabling application telemetry, see xref:rest-api:application-telemetry.adoc[].
+
+== Telemetry Counters
+
+Couchbase can record the number of SDK calls made to specific services. The counter metrics all follow the same naming pattern:
+
+[%header]
+|===
+| Metric | Description
+
+| `sdk_\{service}_r_total`
+| The total number of operations.
+
+| `sdk_\{service}_r_canceled`
+| The total number of operations that were canceled.
+
+| `sdk_\{service}_r_timedout`
+| The total number of operations that timed out.
+
+|===
+
+To access the counter for a particular service, replace `\{service}` with one of the following:
+
+* `kv`
+* `query`
+* `search`
+* `analytics`
+* `management`
+* `eventing`
+
+For example, to access the number of calls to the query service that timed out,
+use the metric `sdk_query_r_timedout`.
+
+== Histograms
+
+Histogram metrics use the same service labeling:
+
+
+[%header, cols="2,1"]
+|===
+| Metric | Description
+
+| `sdk_\{service}_retrieval_duration_milliseconds_bucket`
+| Time taken for retrievals.
+
+| `sdk_\{service}_mutation_nondurable`
+| Time taken for non-durable mutations.
+
+| `sdk_\{service}_mutation_durable`
+| Time taken for durable mutations.
+|===
+
+The histogram metrics accept the same `service` values as the counters:
+
+* `kv`
+* `query`
+* `search`
+* `analytics`
+* `management`
+* `eventing`
+
+[#histogram-output]
+.Histogram output
+====
+[source]
+----
+sdk_kv_retrieval_duration_milliseconds_bucket{le="1",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 24054
+sdk_kv_retrieval_duration_milliseconds_bucket{le="10",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 33444
+sdk_kv_retrieval_duration_milliseconds_bucket{le="100",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 100392
+sdk_kv_retrieval_duration_milliseconds_bucket{le="500",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 129389
+sdk_kv_retrieval_duration_milliseconds_bucket{le="1000",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 133988
+sdk_kv_retrieval_duration_milliseconds_bucket{le="2500",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 139823
+sdk_kv_retrieval_duration_milliseconds_bucket{le="+Inf",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 144320
+sdk_kv_retrieval_duration_milliseconds_sum{agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 53423000
+sdk_kv_retrieval_duration_milliseconds_count{agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 144320
+----
+====
+xref:histogram-output[xrefstyle=short] shows the `kv` retrieval-duration histogram for one bucket on one node.
+Each `_bucket` line gives the number of retrievals that completed within the upper bound set by its `le` label.
+The counts are cumulative across latency buckets: for example, requests completed in stem:[le 1] ms total 24,054, those in stem:[le 10] ms total 33,444, and so on.
+
+Each histogram also has two associated metrics: `_sum`, the sum of all observed values (here, the total retrieval time in milliseconds), and `_count`, the total number of observations in the histogram.
diff --git a/modules/metrics-reference/pages/metrics-reference.adoc b/modules/metrics-reference/pages/metrics-reference.adoc
index aba34af7bc..7a81e52478 100644
--- a/modules/metrics-reference/pages/metrics-reference.adoc
+++ b/modules/metrics-reference/pages/metrics-reference.adoc
@@ -26,3 +26,5 @@ In this section, metrics are listed according to service, as follows:
 * xref:metrics-reference:ns-server-metrics.adoc[Cluster Manager Metrics]
 
 * xref:metrics-reference:xdcr-metrics.adoc[XDCR Metrics]
+
+* xref:metrics-reference:application-telemetry-metrics.adoc[Application Telemetry Metrics]
diff --git a/modules/rest-api/pages/application-telemetry.adoc b/modules/rest-api/pages/application-telemetry.adoc
index b5194b9515..30eeef2cd3 100644
--- a/modules/rest-api/pages/application-telemetry.adoc
+++ b/modules/rest-api/pages/application-telemetry.adoc
@@ -11,8 +11,38 @@
 Having Couchbase Server collect telemetry from your applications can help you troubleshoot client issues.
 This telemetry data is useful to diagnose issues such as poor performance or timeouts.
 
-When you enable application telemetry, Couchbase Server advertises to SDK clients that it can collect telemetry data.
-When an SDK client connects to a cluster with application telemetry enabled, it opens a WebSocket connection to a node in the cluster.
+[mermaid]
+.Telemetry Architecture
+----
+flowchart LR
+Client@{shape:rectangle, label: "SDK Client"}
+
+subgraph Node1 [Couchbase Node]
+    direction LR
+    Server1@{shape:rectangle, label: "Couchbase Server"}
+    Prometheus1@{shape:rectangle, label: "Prometheus"}
+    Server1 -----> Prometheus1
+end
+
+subgraph Node2 [Couchbase Node]
+    direction LR
+    Server2@{shape:rectangle, label: "Couchbase Server"}
+    Prometheus2@{shape:rectangle, label: "Prometheus"}
+    Server2 --> Prometheus2
+
+end
+
+Client --> Node1
+Client --> Node2
+Client -. websocket .-> Server2
+Server2 -. metrics forwarding .-> Server1
+
+----
+
+
+When you enable application telemetry, Couchbase Server advertises to SDK clients that it can collect telemetry data.
+When an SDK client connects to a cluster with application telemetry enabled, it opens a WebSocket connection to a node in the cluster.
+
+Clients can send metrics to any node in the cluster; the cluster forwards the metrics on to the other nodes.
+
 Couchbase Server uses this connection to periodically gather telemetry data from the client in Prometheus format.
 
 NOTE: Application telemetry is off by default in Couchbase Server 8.0.
@@ -33,7 +63,6 @@ You cannot enable application telemetry if your cluster is running in mixed mode
 * Your applications must use a recent SDK version that supports application telemetry.
 The following table lists the SDKs that support application telemetry along with the version where they added support.
-
 +
 |===
 | SDK | Minimum Version with Application Telemetry Support
@@ -73,6 +102,41 @@ The following table lists the SDKs that support application telemetry along with
 The default management port is 8091 for unencrypted connections and 18091 for encrypted connections.
 Make sure any firewall rules between your clients and the nodes allow traffic on the management port.
+[IMPORTANT]
+.Enabling telemetry
+====
+
+Telemetry is turned off by default on Couchbase Server.
+
+Use the following `curl` command to view the telemetry state on your server:
+
+[source, bash]
+----
+curl -u Administrator:password -X GET http://localhost:8091/settings/appTelemetry | jq
+----
+
+This returns a JSON object whose `enabled` field reports the telemetry state as `true` or `false`.
+
+[source, json]
+----
+{
+  "enabled": true,
+  "maxScrapeClientsPerNode": 1024,
+  "scrapeIntervalSeconds": 60
+}
+----
+
+Execute the following `curl` command to enable telemetry on your cluster:
+
+[source, bash, subs="+quotes"]
+----
+curl -u Administrator:password -X POST \
+http://localhost:8091/settings/appTelemetry \
+-d enabled=true
+----
+
+====
+
 == HTTP Methods
 
 This API endpoint supports the following methods:
@@ -251,7 +315,7 @@ This object has the same format as the <<#get-status-responses,response from the
 
 `400 Bad Request`::
 Returned if you attempt to enable application telemetry on a cluster that's running in mixed mode where some nodes are running a version earlier than 8.0.
-All of the nodes in the cluster must be running version 8.0 or later to enable application telemetry.
+All the nodes in the cluster must be running version 8.0 or later to enable application telemetry.
 See <> for more requirements.
 
 `403 Forbidden`::
@@ -259,7 +323,7 @@ Returned if you do not have the proper roles to call this API.
 See <<#config-privs>> for a list of the required roles.
 
 `404 Not Found`::
-Returned if you attempt to call the endpoint on a running a version of Couchbase Server earlier than 8.0.
+Returned if you attempt to call the endpoint on a version of Couchbase Server earlier than 8.0.
 
 [#config-examples]
 === Examples
@@ -287,6 +351,22 @@ If successful, the previous command returns the following JSON object containing
 }
 ----
 
+[#observability-tracing]
+== Observability Tracing
+
+The SDKs have tracing enabled by default, but your applications should configure the tracing options to capture the telemetry information you need.
+
+For more information about enabling tracing, see the Java SDK pages:
+
+xref:java-sdk:howtos:observability-tracing.adoc[Observability Tracing]
+
+xref:java-sdk:howtos:observability-metrics.adoc[Observability Metrics]
+
+For more examples, see the application telemetry RFC on GitHub:
+
+https://github.com/couchbaselabs/sdk-rfcs/blob/master/rfc/0084-app-service-level-telemetry.md[]
+
+
 == See Also
 
 * xref:manage:monitor/set-up-prometheus-for-monitoring.adoc[]
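
Reviewer note on the histogram output added in `application-telemetry-metrics.adoc`: the `_bucket` counts are cumulative, so reading them as per-range counts takes one subtraction per bucket, and the mean latency follows from `_sum` divided by `_count`. The sketch below illustrates that arithmetic on the values from the example. It is a minimal illustration, not part of any Couchbase SDK: the helper names `parse_histogram` and `per_bucket_counts` are hypothetical, and the sample keeps only the `le` and `bucket` labels for brevity.

```python
import re

# Values copied from the histogram output example; most labels are omitted.
SAMPLE = """\
sdk_kv_retrieval_duration_milliseconds_bucket{le="1",bucket="travel-sample"} 24054
sdk_kv_retrieval_duration_milliseconds_bucket{le="10",bucket="travel-sample"} 33444
sdk_kv_retrieval_duration_milliseconds_bucket{le="100",bucket="travel-sample"} 100392
sdk_kv_retrieval_duration_milliseconds_bucket{le="500",bucket="travel-sample"} 129389
sdk_kv_retrieval_duration_milliseconds_bucket{le="1000",bucket="travel-sample"} 133988
sdk_kv_retrieval_duration_milliseconds_bucket{le="2500",bucket="travel-sample"} 139823
sdk_kv_retrieval_duration_milliseconds_bucket{le="+Inf",bucket="travel-sample"} 144320
sdk_kv_retrieval_duration_milliseconds_sum{bucket="travel-sample"} 53423000
sdk_kv_retrieval_duration_milliseconds_count{bucket="travel-sample"} 144320
"""

# One sample per line: metric name, label set in braces, then the value.
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_histogram(text):
    """Return (cumulative buckets sorted by upper bound, sum, count)."""
    buckets, total, count = [], None, None
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a sample line
        labels = dict(LABEL_RE.findall(match["labels"]))
        value = float(match["value"])
        name = match["name"]
        if name.endswith("_bucket"):
            # float() accepts "+Inf", so the last bucket sorts to the end.
            buckets.append((float(labels["le"]), value))
        elif name.endswith("_sum"):
            total = value
        elif name.endswith("_count"):
            count = value
    buckets.sort()
    return buckets, total, count


def per_bucket_counts(buckets):
    """Convert cumulative counts into counts per latency range."""
    result, previous = [], 0.0
    for upper, cumulative in buckets:
        result.append((upper, cumulative - previous))
        previous = cumulative
    return result


buckets, total, count = parse_histogram(SAMPLE)
for upper, n in per_bucket_counts(buckets):
    print(f"le={upper:g}: {n:.0f} retrievals")
print(f"mean retrieval time: {total / count:.2f} ms")
```

For the sample values, the per-range counts are 24054, 9390, 66948, 28997, 4599, 5835, and 4497, and the mean retrieval time is about 370.17 ms. The `+Inf` bucket equals `_count`, which is a quick consistency check on any scraped histogram.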