103 changes: 103 additions & 0 deletions modules/metrics-reference/pages/application-telemetry-metrics.adoc
@@ -0,0 +1,103 @@
= Application Telemetry Metrics
:description: Couchbase can be configured to collect metrics related to Couchbase SDK calls made from applications.
:stem: asciimath

[abstract]
{description}

== Prerequisites

=== Activating Telemetry on Couchbase Server

Before generating metrics from application telemetry, you must enable application telemetry on your Couchbase Server installation.
You can do this by executing the following `curl` command:

[source, bash, subs="+quotes"]
----
curl -u Administrator:password -X POST \
https://##{host}##:##{port}##/settings/appTelemetry \
-d enabled=true
----

For more information on enabling application telemetry, see xref:rest-api:application-telemetry.adoc[].

== Telemetry Counters

Couchbase can record the number of SDK calls made for specific services. The metrics follow the same basic patterns:

[%header]
|===
| Metric| Description

| `sdk_\{service}_r_total`
| The total number of operations.

|`sdk_\{service}_r_canceled`
| The total number of operations that were canceled.

| `sdk_\{service}_r_timedout`
| The total number of operations that timed out.

|===

To access the counter for a particular service, you replace `\{service}` with one of the following:

* `kv`
* `query`
* `search`
* `analytics`
* `management`
* `eventing`

For example, the metric that counts the calls that timed out for the query service is `sdk_query_r_timedout`.
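
As a minimal sketch of the naming pattern above, the following Python snippet expands the three counter templates for a given service. The helper name `counter_names` is hypothetical and only illustrates the substitution; it does not query Couchbase Server.

```python
# Service labels and counter suffixes, copied from the table and list above.
SERVICES = ["kv", "query", "search", "analytics", "management", "eventing"]
COUNTER_SUFFIXES = ["r_total", "r_canceled", "r_timedout"]


def counter_names(service):
    """Return the counter metric names for one service (hypothetical helper)."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    return [f"sdk_{service}_{suffix}" for suffix in COUNTER_SUFFIXES]


print(counter_names("query"))
```

Rejecting unknown service labels early is a design choice for this sketch: a misspelled label would otherwise produce a metric name that silently matches nothing.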

== Histograms

The same service labeling is used in the retrieval of histograms:


[%header, cols="2,1"]
|===
| Metric| Description

| `sdk_\{service}_retrieval_duration_milliseconds_bucket`
| Time taken for retrievals.

|`sdk_\{service}_mutation_nondurable`
| Time taken for non-durable mutations.

| `sdk_\{service}_mutation_durable`
| Time taken for durable mutations.
|===

The histogram metrics accept the same `service` values as the counters:

* `kv`
* `query`
* `search`
* `analytics`
* `management`
* `eventing`

[#histogram-output]
.Histogram output
====
[source]
----
sdk_kv_retrieval_duration_milliseconds_bucket{le="1",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 24054
sdk_kv_retrieval_duration_milliseconds_bucket{le="10",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 33444
sdk_kv_retrieval_duration_milliseconds_bucket{le="100",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 100392
sdk_kv_retrieval_duration_milliseconds_bucket{le="500",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 129389
sdk_kv_retrieval_duration_milliseconds_bucket{le="1000",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 133988
sdk_kv_retrieval_duration_milliseconds_bucket{le="2500",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 139823
sdk_kv_retrieval_duration_milliseconds_bucket{le="+Inf",agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 144320
sdk_kv_retrieval_duration_milliseconds_sum{agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 53423000
sdk_kv_retrieval_duration_milliseconds_count{agent="sdk/2.4.5.0",bucket="travel-sample",node="node1",node_uuid="91442eb8202e0e16bbb59624d9ccdb0a"} 144320
----
====
xref:histogram-output[xrefstyle=short] shows the `kv` retrieval times recorded for the `travel-sample` bucket on a single node.
Each `_bucket` line gives the number of retrievals that completed within the upper bound, in milliseconds, given by its `le` label.
The counts are cumulative: 24,054 retrievals completed in stem:[le 1] ms, 33,444 completed in stem:[le 10] ms (a figure that includes the first 24,054), and so on.

Two special metrics accompany each histogram: `_sum` is the sum of all observed values (here, the total retrieval time in milliseconds), and `_count` is the total number of observations, which matches the `le="+Inf"` bucket.
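
To make the histogram arithmetic concrete, here is a minimal Python sketch that parses lines in the format shown in the histogram output, converts the cumulative bucket counts into per-range counts, and derives the mean retrieval time from `_sum` and `_count`. The function names are hypothetical, and the sample keeps only the `le` and `bucket` labels to stay short; this is an illustration under those assumptions, not a general Prometheus parser.

```python
import re

# Abbreviated copy of the histogram output above: only the `le` and `bucket`
# labels are kept so the lines stay short.
SAMPLE = """
sdk_kv_retrieval_duration_milliseconds_bucket{le="1",bucket="travel-sample"} 24054
sdk_kv_retrieval_duration_milliseconds_bucket{le="10",bucket="travel-sample"} 33444
sdk_kv_retrieval_duration_milliseconds_bucket{le="100",bucket="travel-sample"} 100392
sdk_kv_retrieval_duration_milliseconds_bucket{le="500",bucket="travel-sample"} 129389
sdk_kv_retrieval_duration_milliseconds_bucket{le="1000",bucket="travel-sample"} 133988
sdk_kv_retrieval_duration_milliseconds_bucket{le="2500",bucket="travel-sample"} 139823
sdk_kv_retrieval_duration_milliseconds_bucket{le="+Inf",bucket="travel-sample"} 144320
sdk_kv_retrieval_duration_milliseconds_sum{bucket="travel-sample"} 53423000
sdk_kv_retrieval_duration_milliseconds_count{bucket="travel-sample"} 144320
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LE = re.compile(r'le="([^"]+)"')


def parse_histogram(text):
    """Return (cumulative buckets, sum, count) parsed from the sample lines."""
    buckets, total, count = [], None, None
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        name, labels, value = match.groups()
        if name.endswith("_bucket"):
            upper = float(LE.search(labels).group(1))  # "+Inf" parses as infinity
            buckets.append((upper, int(value)))
        elif name.endswith("_sum"):
            total = float(value)
        elif name.endswith("_count"):
            count = int(value)
    return sorted(buckets), total, count


def per_bucket_counts(buckets):
    """Convert cumulative counts into the count that falls inside each range."""
    counts, previous = [], 0
    for upper, cumulative in buckets:
        counts.append((upper, cumulative - previous))
        previous = cumulative
    return counts


buckets, total, count = parse_histogram(SAMPLE)
print(per_bucket_counts(buckets))
print(f"mean retrieval time: {total / count:.2f} ms")
```

Subtracting each cumulative count from the next shows how many retrievals fall between two adjacent `le` bounds, which is often easier to read than the raw cumulative figures.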
2 changes: 2 additions & 0 deletions modules/metrics-reference/pages/metrics-reference.adoc
@@ -26,3 +26,5 @@ In this section, metrics are listed according to service, as follows:
* xref:metrics-reference:ns-server-metrics.adoc[Cluster Manager Metrics]

* xref:metrics-reference:xdcr-metrics.adoc[XDCR Metrics]

* xref:metrics-reference:application-telemetry-metrics.adoc[Application Telemetry Metrics]
90 changes: 85 additions & 5 deletions modules/rest-api/pages/application-telemetry.adoc
@@ -11,8 +11,38 @@
Having Couchbase Server collect telemetry from your applications can help you troubleshoot client issues.
This telemetry data is useful to diagnose issues such as poor performance or timeouts.

When you enable application telemetry, Couchbase Server advertises to SDK clients that it can collect telemetry data.
When an SDK client connects to a cluster with application telemetry enabled, it opens a WebSocket connection to a node in the cluster.
[mermaid]
.Telemetry Architecture
----
flowchart LR
Client@{shape:rectangle, label: "SDK Client"}

subgraph Node1 [Couchbase Node]
direction LR
Server1@{shape:rectangle, label: "Couchbase Server"}
Prometheus1@{shape:rectangle, label: "Prometheus"}
Server1 -----> Prometheus1
end

subgraph Node2 [Couchbase Node]
direction LR
Server2@{shape:rectangle, label: "Couchbase Server"}
Prometheus2@{shape:rectangle, label: "Prometheus"}
Server2 --> Prometheus2

end

Client --> Node1
Client --> Node2
Client -. websocket .-> Server2
Server2 -. metrics forwarding .-> Server1

----


When you enable application telemetry, Couchbase Server advertises to SDK clients that it can collect telemetry data.
When an SDK client connects to a cluster with application telemetry enabled, it opens a WebSocket connection to a node in the cluster. +
Clients can send metrics to any node in the cluster, and the cluster forwards the metrics on to the other nodes. +
Couchbase Server uses this connection to periodically gather telemetry data from the client in Prometheus format.

NOTE: Application telemetry is off by default in Couchbase Server 8.0.
@@ -33,7 +63,6 @@ You cannot enable application telemetry if your cluster is running in mixed mode

* Your applications must use a recent SDK version that supports application telemetry.
The following table lists the SDKs that support application telemetry along with the version where they added support.

+
|===
| SDK | Minimum Version with Application Telemetry Support
@@ -73,6 +102,41 @@ The following table lists the SDKs that support application telemetry along with
The default management port is 8091 for unencrypted connections and 18091 for encrypted connections.
Make sure any firewall rules between your clients and the nodes allow traffic on the management port.

[IMPORTANT]
.Enabling telemetry
====

Telemetry is turned off by default on Couchbase Server.

Use the following `curl` command to view the telemetry state on your server:

[source, bash]
----
curl -u Administrator:password -X GET http://localhost:8091/settings/appTelemetry | jq
----

This returns a JSON object whose `enabled` field reports the telemetry state as `true` or `false`.

[source, json]
----
{
"enabled": true,
"maxScrapeClientsPerNode": 1024,
"scrapeIntervalSeconds": 60
}
----

Execute the following `curl` command to enable telemetry on your cluster.

[source, bash, subs="+quotes"]
----
curl -u Administrator:password -X POST \
http://localhost:8091/settings/appTelemetry \
-d enabled=true
----

====
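
The two `curl` calls above can also be sketched in Python with only the standard library. The function names `build_settings_request` and `telemetry_enabled` are hypothetical, the credentials and host mirror the placeholder values used above, and sending `enabled=false` to turn telemetry off is an assumption that this page does not confirm. The sketch only builds the request and parses a sample response; it does not contact a server.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_settings_request(host, port, user, password, enabled=None):
    """Build (but do not send) a request for /settings/appTelemetry.

    With enabled=None this is a GET that reads the settings. Otherwise it is
    a POST with a form body such as enabled=true. Using enabled=false to
    disable telemetry is an assumption, not something this page documents.
    """
    url = f"http://{host}:{port}/settings/appTelemetry"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    if enabled is None:
        return urllib.request.Request(url, headers=headers, method="GET")
    body = urllib.parse.urlencode({"enabled": "true" if enabled else "false"})
    return urllib.request.Request(url, data=body.encode(), headers=headers, method="POST")


def telemetry_enabled(response_body):
    """Return the `enabled` flag from the JSON settings object."""
    return bool(json.loads(response_body)["enabled"])


# Example response copied from the GET output above.
sample = '{"enabled": true, "maxScrapeClientsPerNode": 1024, "scrapeIntervalSeconds": 60}'
print(telemetry_enabled(sample))

request = build_settings_request("localhost", 8091, "Administrator", "password", enabled=True)
print(request.get_method(), request.full_url, request.data)
```

To actually send a request, pass the returned object to `urllib.request.urlopen` and read the response body before handing it to `telemetry_enabled`.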

== HTTP Methods

This API endpoint supports the following methods:
@@ -251,15 +315,15 @@ This object has the same format as the <<#get-status-responses,response from the

`400 Bad Request`::
Returned if you attempt to enable application telemetry on a cluster that's running in mixed mode where some nodes are running a version earlier than 8.0.
All of the nodes in the cluster must be running version 8.0 or later to enable application telemetry.
All the nodes in the cluster must be running version 8.0 or later to enable application telemetry.
See <<prerequisites>> for more requirements.

`403 Forbidden`::
Returned if you do not have the proper roles to call this API.
See <<#config-privs>> for a list of the required roles.

`404 Not Found`::
Returned if you attempt to call the endpoint on a running a version of Couchbase Server earlier than 8.0.
Returned if you attempt to call the endpoint on a version of Couchbase Server earlier than 8.0.

[#config-examples]
=== Examples
@@ -287,6 +351,22 @@ If successful, the previous command returns the following JSON object containing
}
----

[#observability-tracing]
== Observability Tracing

The SDKs have tracing enabled by default.
Configure the tracing options in your applications so that they capture the telemetry information you need.

For more information about enabling tracing, see the Java SDK pages:

xref:java-sdk:howtos:observability-tracing.adoc[Observability Tracing]

xref:java-sdk:howtos:observability-metrics.adoc[Observability Metrics]

For more examples, see the SDK RFC on the Couchbase GitHub pages:

https://github.com/couchbaselabs/sdk-rfcs/blob/master/rfc/0084-app-service-level-telemetry.md[]


== See Also

* xref:manage:monitor/set-up-prometheus-for-monitoring.adoc[]