
Leverage metrics provided by ZDM Proxy

This topic describes the metrics captured by the ZDM Proxy and explains how to interpret them.

Benefits

The ZDM Proxy gathers a wide range of metrics that give you deep insight into how it is operating: its communication with client applications and with both clusters, and its request handling.

Visibility into all aspects of the ZDM Proxy’s behavior is extremely important when migrating critical client applications, and it is a great help in building confidence in the process and in troubleshooting any issues. For this reason, we strongly encourage you to monitor the ZDM Proxy, either by deploying the self-contained monitoring stack provided by the ZDM Proxy Automation or by importing the pre-built Grafana dashboards into your own monitoring infrastructure.

Retrieving the ZDM Proxy metrics

ZDM Proxy exposes an HTTP endpoint that returns metrics in the Prometheus format.
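
If you are wiring the proxy into your own Prometheus server rather than using the automation, a minimal scrape configuration might look like the sketch below. The job name and hostnames are placeholders, and port 14001 is an assumption based on the proxy’s default metrics port; check it against your ZDM_METRICS_PORT setting.

    # Sketch of a Prometheus scrape job for ZDM Proxy instances.
    # Hostnames are placeholders; port 14001 assumes the default metrics port.
    scrape_configs:
      - job_name: "zdm-proxy"
        static_configs:
          - targets:
              - "zdm-proxy-0.internal:14001"
              - "zdm-proxy-1.internal:14001"

You can also verify that the endpoint is reachable with a plain HTTP request, for example curl http://zdm-proxy-0.internal:14001/metrics, which returns the metrics in Prometheus text format (the /metrics path is the usual convention; confirm it for your deployment).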

The ZDM Proxy Automation can deploy Prometheus and Grafana and configure them automatically, as explained here. The resulting Grafana dashboards come ready to use, populated with metrics scraped from your ZDM Proxy instances.

If you already have a Grafana deployment, you can instead import the two ZDM dashboard files available at this ZDM Proxy Automation GitHub location.
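
As one way to script the import, Grafana’s HTTP API accepts dashboard JSON. The URL, token, and file name below are placeholders, and depending on how the dashboard file is structured you may need to wrap its contents in the {"dashboard": ...} envelope that the API expects; importing through the Grafana UI works just as well.

    # Hypothetical import of a downloaded ZDM dashboard file via the Grafana API.
    # GRAFANA_URL, GRAFANA_API_TOKEN, and the file name are placeholders.
    curl -X POST "$GRAFANA_URL/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d @zdm-proxy-dashboard.json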

Grafana dashboard for ZDM Proxy metrics

There are three groups of metrics in this dashboard:

  • Proxy-level metrics

  • Node-level metrics

  • Asynchronous read requests metrics

[Image: Grafana dashboard showing the three categories of ZDM Proxy metrics.]

Proxy-level metrics

  • Latency:

    • Read Latency: the total latency measured by the ZDM Proxy for read requests, including post-processing such as response aggregation. This metric has two labels, reads_origin and reads_target; the label that carries data depends on which cluster is receiving the reads, that is, which cluster is currently considered the primary cluster. The primary cluster is configured by the ZDM Proxy Automation through the variable primary_cluster, or directly through the environment variable ZDM_PRIMARY_CLUSTER of the ZDM Proxy (see the configuration sketch after this list).

    • Write Latency: the total latency measured by the ZDM Proxy for write requests, including post-processing such as response aggregation.

  • Throughput (same structure as the previous latency metrics):

    • Read Throughput

    • Write Throughput

  • In-flight requests

  • Number of client connections

  • Prepared Statement cache:

    • Cache Misses: a prepared statement was sent to the ZDM Proxy but was not in its cache, so the proxy returned an UNPREPARED response to make the driver send the PREPARE request again.

    • Number of cached prepared statements.

  • Request Failure Rates: the number of request failures per interval. You can set the interval via the Error Rate interval dashboard variable at the top of the dashboard.

    • Read Failure Rate: a single cluster label with two values, origin and target. The value that contains data depends on which cluster is currently considered the primary (as with the latency and throughput metrics explained above).

    • Write Failure Rate: a single failed_on label with three values: origin, target, and both.

      • failed_on=origin: the write request failed on Origin ONLY.

      • failed_on=target: the write request failed on Target ONLY.

      • failed_on=both: the write request failed on BOTH clusters.

  • Request Failure Counters: the total number of request failures since startup (these counters reset when the ZDM Proxy instance is restarted).

    • Read Failure Counters: same labels as read failure rate.

    • Write Failure Counters: same labels as write failure rate.
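
As referenced in the Read Latency and Read Failure Rate descriptions above, the primary cluster determines which labels carry data. Here is a minimal configuration sketch, assuming the standard ORIGIN and TARGET values; the exact file where the Ansible variable lives depends on your playbook layout.

    # Via the ZDM Proxy Automation (Ansible variable):
    primary_cluster: ORIGIN

    # Or directly via the environment of a manually managed proxy instance:
    export ZDM_PRIMARY_CLUSTER=ORIGIN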

For error metrics broken down by error type, see the node-level error metrics in the next section.

Node-level metrics

  • Latency: metrics in this group are not split by request type as the proxy-level latency metrics are, so reads and writes are measured together:

    • Origin: latency measured by the ZDM Proxy up to the point it received a response from the Origin connection.

    • Target: latency measured by the ZDM Proxy up to the point it received a response from the Target connection.

  • Throughput: as with the node-level latency metrics, reads and writes are measured together.

  • Number of connections per Origin node and per Target node.

  • Number of Used Stream Ids:

    • Tracks the total number of used stream ids ("request ids") per connection type (Origin, Target and Async).

  • Number of errors per error type per Origin node and per Target node. Possible values for the error type label:

    • error=client_timeout

    • error=read_failure

    • error=read_timeout

    • error=write_failure

    • error=write_timeout

    • error=overloaded

    • error=unavailable

    • error=unprepared

Asynchronous read requests metrics

These metrics are specific to asynchronous reads, so they are only populated if asynchronous dual reads are enabled. This is done by setting the ZDM Proxy Automation variable read_mode, or its equivalent environment variable ZDM_READ_MODE, to DUAL_ASYNC_ON_SECONDARY as explained here.
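
A minimal sketch of enabling this mode, using the variable and value named above (where the Ansible variable lives depends on your playbook layout):

    # Via the ZDM Proxy Automation variable:
    read_mode: DUAL_ASYNC_ON_SECONDARY

    # Or directly via the proxy's environment variable:
    export ZDM_READ_MODE=DUAL_ASYNC_ON_SECONDARY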

These metrics track:

  • Latency.

  • Throughput.

  • Number of dedicated connections per node for async reads: whether these are Origin or Target connections depends on the ZDM Proxy configuration; if the primary cluster is Origin, the asynchronous reads are sent to Target, and vice versa.

  • Number of errors per error type per node.

Insights via the ZDM Proxy metrics

Some examples of problems that can manifest in these metrics:

  • Number of client connections close to 1000 per ZDM Proxy instance: by default, the ZDM Proxy starts rejecting client connections after having accepted 1000 of them (see the alerting sketch after this list).

  • Continuously increasing Prepared Statement cache metrics, for both the cached entries and the misses.

  • Error metrics, depending on the error type: these need to be evaluated case by case.
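
One way to act on the first of these signals is a Prometheus alerting rule along the following lines. Note that the metric name zdm_proxy_client_connections is hypothetical, used here only for illustration; look up the actual client-connection metric name in your proxy’s metrics output before using this.

    groups:
      - name: zdm-proxy-alerts                  # illustrative group name
        rules:
          - alert: ZdmProxyClientConnectionsHigh
            # NOTE: hypothetical metric name; replace with the real one from
            # the proxy's metrics endpoint.
            expr: zdm_proxy_client_connections > 900
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "ZDM Proxy instance approaching the default 1000 client connection limit"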

Go runtime metrics dashboard and system dashboard

This Grafana dashboard is not as important as the ZDM Proxy dashboard, but it can be useful for troubleshooting performance issues. It shows memory usage, Garbage Collection (GC) duration, open fds (file descriptors, useful for detecting leaked connections), and the number of goroutines:

[Image: Example Go runtime metrics dashboard.]

Some examples of problem areas in these Go runtime metrics:

  • A continuously increasing “open fds” metric.

  • GC pauses frequently in, or close to, the triple digits of milliseconds.

  • Continuously increasing memory usage.

  • A continuously increasing number of goroutines.
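
If the proxy registers the Prometheus Go client’s default runtime and process collectors, as these dashboard panels suggest, the signals above map onto standard metric names that you can query directly. Treat this as a sketch and confirm the names against your metrics endpoint.

    # Standard Go runtime / process metrics (Prometheus Go client defaults):
    process_open_fds                  # open file descriptors; unbounded growth suggests leaked connections
    go_goroutines                     # goroutine count; unbounded growth suggests a leak
    go_memstats_heap_alloc_bytes      # heap memory currently in use

    # Average GC pause over the last 5 minutes:
    rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m])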

The ZDM monitoring stack also includes a system-level dashboard populated through the Prometheus Node Exporter. This dashboard contains hardware and OS-level metrics for the host on which the proxy runs, which can be useful for checking available resources and identifying low-level bottlenecks or issues.
