Leverage metrics provided by ZDM Proxy
This topic provides detailed information about the metrics captured by ZDM Proxy and explains how to interpret them.
Benefits
ZDM Proxy gathers a large number of metrics that give you deep insight into how it is operating: its communication with client applications and with both clusters, and its request handling.
Having visibility into all aspects of ZDM Proxy’s behavior is extremely important when migrating critical client applications; it helps you build confidence in the process and troubleshoot any issues. For this reason, we strongly encourage you to monitor ZDM Proxy, either by deploying the self-contained monitoring stack provided by ZDM Proxy Automation or by importing the pre-built Grafana dashboards into your own monitoring infrastructure.
Retrieving the ZDM Proxy metrics
ZDM Proxy exposes an HTTP endpoint that returns metrics in the Prometheus format.
ZDM Proxy Automation can deploy and automatically configure Prometheus and Grafana, as explained here. The Grafana dashboards come ready to use, displaying the metrics scraped from the ZDM Proxy instances.
If you already have a Grafana deployment, you can import the two ZDM dashboard files from this ZDM Proxy Automation GitHub location.
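If you want to check the endpoint directly, for example while wiring up your own scraping, you can fetch the metrics page with any HTTP client. The sketch below is a minimal Go example; the host, port, and path are assumptions (14001 and /metrics are commonly used defaults in ZDM Proxy Automation deployments), so adjust them to match your configuration.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Assumed endpoint: replace host, port, and path with the values
	// used in your own ZDM Proxy deployment.
	const metricsURL = "http://zdm-proxy-host:14001/metrics"

	resp, err := http.Get(metricsURL)
	if err != nil {
		log.Fatalf("failed to fetch metrics: %v", err)
	}
	defer resp.Body.Close()

	// Print only the metric sample lines, skipping the # HELP / # TYPE
	// comment lines, to get a quick look at what the proxy exposes.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") {
			continue
		}
		fmt.Println(line)
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("error reading metrics response: %v", err)
	}
}
```

In a normal deployment you do not need to do this yourself: Prometheus scrapes this same endpoint on each ZDM Proxy instance, and the Grafana dashboards query Prometheus rather than the proxies directly.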
Grafana dashboard for ZDM Proxy metrics
There are three groups of metrics in this dashboard:
- Proxy-level metrics
- Node-level metrics
- Asynchronous read requests metrics

Proxy-level metrics
- Latency:
  - Read Latency: Total latency measured by ZDM Proxy per read request, including post-processing such as response aggregation. This metric has two labels, reads_origin and reads_target. The label that has data depends on which cluster is receiving the reads, that is, the current primary cluster.
  - Write Latency: Total latency measured by ZDM Proxy per write request, including post-processing such as response aggregation. This metric is measured as the total latency across both clusters for a single bifurcated write request.
- Throughput (same structure as the latency metrics above):
  - Read Throughput
  - Write Throughput
- In-flight requests
- Number of client connections
- Prepared Statement cache:
  - Cache Misses: a prepared statement was sent to ZDM Proxy but was not in its cache, so the proxy returned an UNPREPARED response to make the driver send the PREPARE request again.
  - Number of cached prepared statements.
- Request Failure Rates: the number of request failures per interval. You can set the interval with the Error Rate interval dashboard variable at the top of the dashboard.
  - Connect Failure Rate: one cluster label with two values, origin and target, indicating the cluster on which the connection attempt failed.
  - Read Failure Rate: one cluster label with two values, origin and target. The label that contains data depends on which cluster is currently the primary, as with the latency and throughput metrics explained above.
  - Write Failure Rate: one failed_on label with three values, origin, target, and both:
    - failed_on=origin: the write request failed on the origin cluster only.
    - failed_on=target: the write request failed on the target cluster only.
    - failed_on=both: the write request failed on both the origin and target clusters.
- Request Failure Counters: the total number of request failures (these counters reset when the ZDM Proxy instance is restarted). See the sketch after this list for one way to inspect these counters outside Grafana.
  - Connect Failure Counters: the same labels as the Connect Failure Rate.
  - Read Failure Counters: the same labels as the Read Failure Rate.
  - Write Failure Counters: the same labels as the Write Failure Rate.

To see error metrics broken down by error type, see the node-level error metrics in the next section.
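The failure counters that back these rate panels are exposed in the standard Prometheus text format, so you can also inspect them outside Grafana. The following is a minimal, illustrative Go sketch that parses the proxy's metrics output and prints every counter whose name suggests a failure, together with its labels. The endpoint and the "fail" substring filter are assumptions, not the exact ZDM Proxy metric names; check your own metrics output for the real names.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Assumed endpoint; adjust host, port, and path to your deployment.
	resp, err := http.Get("http://zdm-proxy-host:14001/metrics")
	if err != nil {
		log.Fatalf("failed to fetch metrics: %v", err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatalf("failed to parse metrics: %v", err)
	}

	// Print every counter whose name mentions "fail" (an assumed filter),
	// along with its labels (for example cluster or failed_on).
	for name, family := range families {
		if !strings.Contains(name, "fail") {
			continue
		}
		for _, m := range family.GetMetric() {
			labels := make([]string, 0, len(m.GetLabel()))
			for _, lp := range m.GetLabel() {
				labels = append(labels, lp.GetName()+"="+lp.GetValue())
			}
			if c := m.GetCounter(); c != nil {
				fmt.Printf("%s{%s} %v\n", name, strings.Join(labels, ","), c.GetValue())
			}
		}
	}
}
```

The Grafana rate panels are computed by Prometheus from these raw counters, so a quick look at the counters themselves is mainly useful for spot checks and troubleshooting a single proxy instance.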
Node-level metrics
- Latency: node-level latency metrics report combined read and write latency per cluster, not per request type. For latency by request type, see Proxy-level metrics.
  - Origin: latency, as measured by ZDM Proxy, up to the point that it received a response from the origin connection.
  - Target: latency, as measured by ZDM Proxy, up to the point that it received a response from the target connection.
- Throughput: as with the node-level latency metrics, reads and writes are combined.
- Number of connections per origin node and per target node.
- Number of Used Stream Ids: tracks the total number of used stream ids ("request ids") per connection type (Origin, Target, and Async).
- Number of errors per error type, per origin node and per target node. Possible values for the error type label:
  - error=client_timeout
  - error=read_failure
  - error=read_timeout
  - error=write_failure
  - error=write_timeout
  - error=overloaded
  - error=unavailable
  - error=unprepared
Asynchronous read requests metrics
These metrics are only recorded if you enable asynchronous dual reads.
These metrics track the following information for asynchronous read requests:
- Latency
- Throughput
- Number of dedicated connections per node for the cluster receiving the asynchronous read requests
- Number of errors per node, separated by error type
Insights from the ZDM Proxy metrics
Some examples of problems that can show up in these metrics:
- Number of client connections close to 1000 per ZDM Proxy instance: by default, ZDM Proxy starts rejecting client connections after it has accepted 1000 of them.
- Continuously increasing Prepared Statement cache metrics: both the cached entries and the cache misses metrics.
- Error metrics, depending on the error type: these need to be evaluated on a case-by-case basis.
Go runtime metrics dashboard and system dashboard
This Grafana dashboard is not as important as the ZDM Proxy dashboard, but it can be useful when troubleshooting performance issues. It shows memory usage, Garbage Collection (GC) duration, open fds (file descriptors, useful for detecting leaked connections), and the number of goroutines.

Some examples of problem areas in these Go runtime metrics:
- A continuously increasing "open fds" metric.
- GC latencies frequently in (or close to) the triple digits of milliseconds.
- Continuously increasing memory usage.
- A continuously increasing number of goroutines.
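The panels in this dashboard come from the standard Go runtime and process metrics exposed by the Prometheus Go client library. If you want to see what those metric families look like in isolation, the sketch below registers the same kinds of collectors in a small standalone Go program; it is an illustration of where metrics such as go_goroutines, go_gc_duration_seconds, and process_open_fds come from, not ZDM Proxy's actual code, and the listen port is arbitrary.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// Go runtime metrics: goroutine count, GC pause durations, memory stats.
	reg.MustRegister(collectors.NewGoCollector())

	// Process metrics: open file descriptors, resident memory, CPU time.
	reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))

	// Serve the metrics; curl localhost:2112/metrics to inspect them.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```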
The ZDM monitoring stack also includes a system-level dashboard with metrics collected by the Prometheus Node Exporter. This dashboard contains hardware and OS-level metrics for the host on which the proxy runs, which can be useful for checking available resources and identifying low-level bottlenecks or issues.