Commit ef4a8c6

Support Envoy AI Gateway observability (SWIP-10) (#13772)
Add Envoy AI Gateway as a new monitored layer (`ENVOY_AI_GATEWAY`) in SkyWalking, receiving GenAI metrics and access logs via OTLP push from the AI Gateway.

**Changes:**
- New layer `ENVOY_AI_GATEWAY(46, true)` in `Layer.java`
- MAL rules: 2 rule files (service + instance) with 38 metrics total: aggregates, per-provider and per-model breakdowns including token usage, latency, TTFT, TPOT
- LAL rules: access log sampling (errors, upstream failures, high token cost)
- UI dashboards: root (with doc link), service (Overview/Providers/Models/Log/Instances tabs), instance (Overview/Providers/Models/Log tabs)
- OTel receiver fixes:
  - Convert data point attribute dots to underscores (consistent with resource attributes)
  - Change `LABEL_MAPPINGS` to fallback-only: `service.name` preserved as `service_name` tag
  - Remove unused `service.name → job_name` mapping (MAL checker: 1268/1268 rules pass)
  - OTLP log handler: prefer `service.instance.id` (OTel spec) with fallback to `service.instance`
- `UITemplateInitializer`: register `ENVOY_AI_GATEWAY` template folder
- `SampleFamily`: add `toString()` and `debugDump()` for MAL debugging
- Documentation: setup guide, OTel receiver label conversion, LAL OTLP mapping, marketplace GenAI
- E2e test: docker-compose with `ai-gateway-cli` + Ollama (qwen2.5:0.5b)
1 parent d0cc16a commit ef4a8c6

30 files changed

Lines changed: 2058 additions & 428 deletions

docs/en/changes/changes.md

Lines changed: 7 additions & 1 deletion
@@ -169,7 +169,13 @@
 * Support Virtual-GenAI monitoring.
 * Fix on-demand pod log parsing failure by replacing invalid `DateTimeFormatter` pattern with `ISO_OFFSET_DATE_TIME`.
 * Fix Zipkin receiver compatibility with application/x-protobuf Content-Type.
-
+* Support Envoy AI Gateway observability (SWIP-10): new `ENVOY_AI_GATEWAY` layer with MAL/LAL rules
+for GenAI metrics (token usage, latency, TTFT, TPOT) and access log sampling via OTLP.
+* OTel metric receiver: convert data point attribute dots to underscores (consistent with resource attributes
+and metric names). Label mappings are now fallback-only — explicit `job_name` in resource attributes takes
+precedence over the `service.name` fallback.
+* OTel log handler: prefer `service.instance.id` (OTel spec) over `service.instance` with fallback.
+* Add `SampleFamily.debugDump()` for MAL debugging.

 #### UI
 * Fix the missing icon in new native trace view.

docs/en/concepts-and-designs/lal.md

Lines changed: 13 additions & 0 deletions
@@ -8,6 +8,19 @@ The LAL config files are in YAML format, and are located under directory `lal`.
 set `log-analyzer/default/lalFiles` in the `application.yml` file or set environment variable `SW_LOG_LAL_FILES` to
 activate specific LAL config files.

+## OTLP log attribute mapping
+
+When logs arrive via the OTLP receiver, resource attributes are mapped to `LogData` fields:
+
+| Resource attribute | LogData field | Notes |
+|---|---|---|
+| `service.name` | `service` | SkyWalking service name |
+| `service.instance.id` | `serviceInstance` | OTel standard ([spec](https://opentelemetry.io/docs/specs/semconv/resource/#service)). Falls back to `service.instance` for backward compatibility. |
+| `service.layer` | `layer` | Routes to the LAL rule with matching `layer` declaration |
+
+Log record attributes are available via `tag("attribute_name")` in LAL rules. Attribute keys
+retain their original names (dots are NOT converted to underscores in log attributes).
+
 ## Layer
 Layer should be declared in the LAL script to represent the analysis scope of the logs.
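
The resource attribute mapping added to `lal.md` above can be illustrated with a small Python sketch. This is not the OAP's Java handler; the function names `map_resource_attributes` and `tag` are hypothetical, and treating an empty `service.instance.id` as missing is an assumption made here for simplicity.

```python
from typing import Dict, Optional


def map_resource_attributes(resource: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map OTLP resource attributes to LogData-style fields (illustrative only)."""
    # Prefer the OTel-standard key, then fall back to the legacy key.
    # Assumption: an empty string counts as "not set".
    instance = resource.get("service.instance.id") or resource.get("service.instance")
    return {
        "service": resource.get("service.name"),
        "serviceInstance": instance,
        "layer": resource.get("service.layer"),
    }


def tag(record_attributes: Dict[str, str], name: str) -> Optional[str]:
    """Look up a log record attribute by its original key; dots are kept as-is."""
    return record_attributes.get(name)


# A sender that only sets the legacy `service.instance` attribute.
legacy = map_resource_attributes({
    "service.name": "my-ai-gateway",
    "service.instance": "pod-legacy",
    "service.layer": "ENVOY_AI_GATEWAY",
})

# A sender that sets both keys: the OTel-standard key wins.
modern = map_resource_attributes({
    "service.name": "my-ai-gateway",
    "service.instance.id": "pod-abc123",
    "service.instance": "pod-legacy",
})
```

The second helper mirrors the documented behavior that log record attribute keys keep their dots, so a lookup must use the dotted name rather than an underscore form.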

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+# Envoy AI Gateway Monitoring
+
+## Envoy AI Gateway observability via OTLP
+
+[Envoy AI Gateway](https://aigateway.envoyproxy.io/) is a gateway/proxy for AI/LLM API traffic
+(OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, Google Gemini, etc.) built on top of Envoy Proxy.
+It natively emits GenAI metrics and access logs via OTLP, following
+[OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/).
+
+SkyWalking receives OTLP metrics and logs directly on its gRPC port (11800) — no OpenTelemetry
+Collector is needed between the AI Gateway and SkyWalking OAP.
+
+### Prerequisites
+- [Envoy AI Gateway](https://aigateway.envoyproxy.io/) deployed. See the
+[Envoy AI Gateway getting started](https://aigateway.envoyproxy.io/docs/getting-started/) for installation.
+
+### Data flow
+1. Envoy AI Gateway processes LLM API requests and records GenAI metrics (token usage, latency, TTFT, TPOT).
+2. The AI Gateway pushes metrics and access logs via OTLP gRPC to SkyWalking OAP.
+3. SkyWalking OAP parses metrics with [MAL](../../concepts-and-designs/mal.md) rules and access logs
+with [LAL](../../concepts-and-designs/lal.md) rules.
+
+### Set up
+
+The MAL rules (`envoy-ai-gateway/*`) and LAL rules (`envoy-ai-gateway`) are enabled by default
+in SkyWalking OAP. No OAP-side configuration is needed.
+
+Configure the AI Gateway to push OTLP to SkyWalking by setting these environment variables:
+
+| Env Var | Value | Purpose |
+|---------|-------|---------|
+| `OTEL_SERVICE_NAME` | Per-deployment gateway name (e.g., `my-ai-gateway`) | SkyWalking service name |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://skywalking-oap:11800` | SkyWalking OAP gRPC receiver |
+| `OTEL_EXPORTER_OTLP_PROTOCOL` | `grpc` | OTLP transport |
+| `OTEL_METRICS_EXPORTER` | `otlp` | Enable OTLP metrics push |
+| `OTEL_LOGS_EXPORTER` | `otlp` | Enable OTLP access log push |
+| `OTEL_RESOURCE_ATTRIBUTES` | See below | Routing + instance + layer |
+
+**Required resource attributes** (in `OTEL_RESOURCE_ATTRIBUTES`):
+- `job_name=envoy-ai-gateway` — Fixed routing tag for MAL/LAL rules. Same for all AI Gateway deployments.
+- `service.instance.id=<instance-id>` — Instance identity. In Kubernetes, use the pod name via Downward API.
+- `service.layer=ENVOY_AI_GATEWAY` — Routes access logs to the AI Gateway LAL rules.
+
+**Example:**
+```bash
+OTEL_SERVICE_NAME=my-ai-gateway
+OTEL_EXPORTER_OTLP_ENDPOINT=http://skywalking-oap:11800
+OTEL_EXPORTER_OTLP_PROTOCOL=grpc
+OTEL_METRICS_EXPORTER=otlp
+OTEL_LOGS_EXPORTER=otlp
+OTEL_RESOURCE_ATTRIBUTES=job_name=envoy-ai-gateway,service.instance.id=pod-abc123,service.layer=ENVOY_AI_GATEWAY
+```
+
+### Supported Metrics
+
+SkyWalking observes the AI Gateway as a `LAYER: ENVOY_AI_GATEWAY` service. Each gateway deployment
+is a service, each pod is an instance. Metrics include per-provider and per-model breakdowns.
+
+#### Service Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description |
+|---|---|---|---|
+| Request CPM | calls/min | meter_envoy_ai_gw_request_cpm | Requests per minute |
+| Request Latency Avg | ms | meter_envoy_ai_gw_request_latency_avg | Average request duration |
+| Request Latency Percentile | ms | meter_envoy_ai_gw_request_latency_percentile | P50/P75/P90/P95/P99 |
+| Input Token Rate | tokens/min | meter_envoy_ai_gw_input_token_rate | Input (prompt) tokens per minute |
+| Output Token Rate | tokens/min | meter_envoy_ai_gw_output_token_rate | Output (completion) tokens per minute |
+| TTFT Avg | ms | meter_envoy_ai_gw_ttft_avg | Time to First Token (streaming only) |
+| TTFT Percentile | ms | meter_envoy_ai_gw_ttft_percentile | P50/P75/P90/P95/P99 TTFT |
+| TPOT Avg | ms | meter_envoy_ai_gw_tpot_avg | Time Per Output Token (streaming only) |
+| TPOT Percentile | ms | meter_envoy_ai_gw_tpot_percentile | P50/P75/P90/P95/P99 TPOT |
+
+#### Provider Breakdown Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description |
+|---|---|---|---|
+| Provider Request CPM | calls/min | meter_envoy_ai_gw_provider_request_cpm | Requests by provider |
+| Provider Token Rate | tokens/min | meter_envoy_ai_gw_provider_token_rate | Token rate by provider |
+| Provider Latency Avg | ms | meter_envoy_ai_gw_provider_latency_avg | Latency by provider |
+
+#### Model Breakdown Metrics
+
+| Monitoring Panel | Unit | Metric Name | Description |
+|---|---|---|---|
+| Model Request CPM | calls/min | meter_envoy_ai_gw_model_request_cpm | Requests by model |
+| Model Token Rate | tokens/min | meter_envoy_ai_gw_model_token_rate | Token rate by model |
+| Model Latency Avg | ms | meter_envoy_ai_gw_model_latency_avg | Latency by model |
+| Model TTFT Avg | ms | meter_envoy_ai_gw_model_ttft_avg | TTFT by model |
+| Model TPOT Avg | ms | meter_envoy_ai_gw_model_tpot_avg | TPOT by model |
+
+#### Instance Metrics
+
+All service-level metrics are also available per instance (pod) with `meter_envoy_ai_gw_instance_` prefix,
+including per-provider and per-model breakdowns.
+
+### Access Log Sampling
+
+The LAL rules apply a sampling policy to reduce storage:
+- **Error responses** (HTTP status >= 400) — always persisted.
+- **Upstream failures** — always persisted.
+- **High token cost** (>= 10,000 total tokens) — persisted for cost anomaly detection.
+- Normal successful responses with low token counts are dropped.
+
+The token threshold can be adjusted in `lal/envoy-ai-gateway.yaml`.
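
The access log sampling policy described at the end of the new guide can be sketched as a simple predicate. This is a Python illustration of the documented decision order, not the LAL rule shipped in `lal/envoy-ai-gateway.yaml`; the `AccessLog` class and its field names (`status`, `upstream_failed`, `total_tokens`) are hypothetical stand-ins for whatever the real rule reads from the log record.

```python
from dataclasses import dataclass

# Documented default threshold; the real value lives in the LAL rule file.
TOKEN_THRESHOLD = 10_000


@dataclass
class AccessLog:
    status: int            # HTTP response status code
    upstream_failed: bool  # hypothetical flag for an upstream failure
    total_tokens: int      # input + output tokens for the request


def should_persist(log: AccessLog, token_threshold: int = TOKEN_THRESHOLD) -> bool:
    """Return True if the log would be kept under the documented policy."""
    if log.status >= 400:
        return True  # error responses are always persisted
    if log.upstream_failed:
        return True  # upstream failures are always persisted
    # Otherwise keep only high-token-cost requests; the rest are dropped.
    return log.total_tokens >= token_threshold
```

For example, `should_persist(AccessLog(200, False, 512))` evaluates to `False`, while the same request with `total_tokens=10_000` is kept because the threshold comparison is inclusive.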

docs/en/setup/backend/marketplace.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ SkyWalking provides ready-to-use monitoring capabilities for a wide range of tec
 - **Infrastructure** - Linux and Windows server monitoring
 - **Cloud Services** - AWS EKS, S3, DynamoDB, API Gateway, and more
 - **Gateways** - Nginx, APISIX, Kong monitoring
+- **GenAI** - [Virtual GenAI](../service-agent/virtual-genai.md) for agent-based LLM call monitoring, [Envoy AI Gateway](backend-envoy-ai-gateway-monitoring.md) for infrastructure-side AI traffic observability
 - **Databases** - MySQL, PostgreSQL, Redis, Elasticsearch, MongoDB, ClickHouse, and more
 - **Message Queues** - Kafka, RabbitMQ, Pulsar, RocketMQ, ActiveMQ
 - **Browser** - Real user monitoring for web applications

docs/en/setup/backend/opentelemetry-receiver.md

Lines changed: 20 additions & 1 deletion
@@ -26,7 +26,26 @@ The receiver adds label with key `node_identifier_host_name` to the collected da
 and its value is from `net.host.name` (or `host.name` for some OTLP versions) resource attributes defined in OpenTelemetry proto,
 for identification of the metric data.

-**Notice:** In the resource scope, dots (.) in the attributes' key names are converted to underscores (_), whereas in the metrics scope, they are not converted.
+**Label name conversion:** Dots (`.`) in attribute key names are converted to underscores (`_`) for both
+resource attributes and data point (metric-level) attributes. For example, `gen_ai.token.type` becomes
+`gen_ai_token_type` in MAL rules. Metric names also undergo the same conversion (e.g.,
+`gen_ai.client.token.usage` becomes `gen_ai_client_token_usage`).
+
+**Fallback label mappings:** The following resource attributes are copied to alternative label names
+if the target does not already exist. These are fallback-only — if the target label is already present
+in the resource attributes, the fallback is skipped.
+
+| Source | Target | Notes |
+|---|---|---|
+| `service.name` | `job_name` | The [OTel Collector Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md) automatically converts the Prometheus `job` label to `service.name`. This fallback ensures it is available as `job_name` for MAL rule filtering. |
+| `net.host.name` | `node_identifier_host_name` | Legacy: used by VM/Windows MAL rules |
+| `host.name` | `node_identifier_host_name` | Legacy: used by VM/Windows MAL rules |
+
+When `job_name` is set explicitly in `OTEL_RESOURCE_ATTRIBUTES` (e.g., by Envoy AI Gateway),
+it takes precedence and the `service.name` fallback is skipped.
+
+**Note:** The `net.host.name` and `host.name` mappings are legacy. New integrations should use
+the natural dot-to-underscore conversion (e.g., `host.name` → `host_name` in MAL rules).

 | Description | Configuration File | Data Source |
 |-----------------------------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
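
The label conversion and fallback behavior documented in this hunk can be sketched in Python as follows. It is an illustration under stated assumptions, not the receiver's Java code: the names `FALLBACK_MAPPINGS`, `to_label_name`, and `build_labels` are hypothetical, and the choices to let data point attributes override resource attributes on a name clash and to let `net.host.name` win over `host.name` are assumptions the documentation does not confirm.

```python
from typing import Dict

# (source resource attribute, target label) pairs from the documented table.
FALLBACK_MAPPINGS = [
    ("service.name", "job_name"),
    ("net.host.name", "node_identifier_host_name"),
    ("host.name", "node_identifier_host_name"),
]


def to_label_name(key: str) -> str:
    """Convert an OTel attribute key or metric name to its MAL form."""
    return key.replace(".", "_")


def build_labels(resource_attrs: Dict[str, str],
                 datapoint_attrs: Dict[str, str]) -> Dict[str, str]:
    """Build the label set for one data point (illustrative only)."""
    labels = {to_label_name(k): v for k, v in resource_attrs.items()}
    # Fallback-only: copy the source value only when the target is absent.
    for source, target in FALLBACK_MAPPINGS:
        if source in resource_attrs and target not in labels:
            labels[target] = resource_attrs[source]
    # Assumption: data point attributes override resource attributes on clash.
    labels.update({to_label_name(k): v for k, v in datapoint_attrs.items()})
    return labels


# Envoy AI Gateway case: job_name is explicit, so the fallback is skipped.
explicit = build_labels(
    {"service.name": "my-ai-gateway", "job_name": "envoy-ai-gateway"},
    {"gen_ai.token.type": "input"},
)

# Prometheus-receiver case: only service.name is present, so it is copied.
fallback = build_labels({"service.name": "prom-job"}, {})
```

In the first call `job_name` stays `envoy-ai-gateway` and `service.name` remains available as `service_name`; in the second, `job_name` is filled from `service.name`.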
