Astra DB Serverless metrics reference

Astra DB Serverless health metrics provide insight into database performance and workload distribution.

This reference defines the metrics that Astra captures for your Astra DB Serverless databases and PCU groups (if applicable).

Database metrics definitions

Astra DB Serverless database metrics are aggregated values calculated once per minute, and each metric has a rate1m and a rate5m variant. The rate1m variant is the rate of increase or decrease over a one-minute interval, and the rate5m variant is the rate of increase or decrease over a five-minute interval.
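As a minimal sketch of this naming scheme, the following Python helpers build the rate1m and rate5m names for a base metric and expand the QUANTILE placeholder that some metrics in this reference use. The helper names (rate_variants, expand_quantiles) are hypothetical, and the P95, P90, P75, and P50 spellings are an assumption inferred from the documented P99 examples.

```python
# Hypothetical helpers for building metric names; not part of any Astra API.
RATE_WINDOWS = ("rate1m", "rate5m")

# Assumption: every quantile follows the "P<number>" pattern of the documented P99 examples.
QUANTILES = ("P99", "P95", "P90", "P75", "P50")


def rate_variants(base_name):
    """Return the rate1m and rate5m names for a base metric name."""
    return [f"{base_name}:{window}" for window in RATE_WINDOWS]


def expand_quantiles(template):
    """Replace the QUANTILE placeholder with each quantile, then add both rate windows."""
    names = []
    for quantile in QUANTILES:
        names.extend(rate_variants(template.replace("QUANTILE", quantile)))
    return names


if __name__ == "__main__":
    for name in rate_variants("astra_db_read_requests_failures"):
        print(name)
    for name in expand_quantiles("astra_db_write_latency_seconds_QUANTILE"):
        print(name)
```

A list generated this way can help check that a dashboard or export covers both variants of each metric you care about.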

astra_billing_report_tenant_requests_total:rate1m astra_billing_report_tenant_requests_total:rate5m: The number of requests processed in a given 1-minute or 5-minute interval, calculated as a sum of read and write throughput.

This metric doesn’t represent actual billing data because billing data isn’t transmitted in database health metrics. For billing information and usage reports, see Usage reports.

This metric is only included with a push-based metrics export. It isn’t included in a pull-based metrics import (scrape).
astra_cql_org_apache_cassandra_metrics_Client_connectedNativeClients:rate1m astra_cql_org_apache_cassandra_metrics_Client_connectedNativeClients:rate5m: The number of CQL connections to the database in a given 1-minute or 5-minute interval.
astra_db_rate_limited_requests:rate1m astra_db_rate_limited_requests:rate5m: A calculated rate of change for the number of failed operations due to an Astra DB rate limit. Using these rates, alert if the value is greater than 0 for more than 30 minutes.
astra_db_read_requests_failures:rate1m astra_db_read_requests_failures:rate5m: A calculated rate of change for the number of failed reads. Using these rates, alert if the value is greater than 0. Use a warning-level alert for low amounts and a high-severity alert for larger amounts, potentially determined as a percentage of read throughput.
astra_db_read_requests_timeouts:rate1m astra_db_read_requests_timeouts:rate5m: A calculated rate of change for read timeouts. Timeouts happen when operations against the database take longer than the server-side timeout. Using these rates, alert if the value is greater than 0.
astra_db_read_requests_unavailables:rate1m astra_db_read_requests_unavailables:rate5m: A calculated rate of change for reads where there were not enough data service replicas available to complete the request. Using these rates, alert if the value is greater than 0.
astra_db_write_requests_failures:rate1m astra_db_write_requests_failures:rate5m: A calculated rate of change for the number of failed writes. Cassandra drivers retry failed operations, but significant failures can be problematic. Using these rates, alert if the value is greater than 0. Use a warning-level alert for low amounts and a high-severity alert for larger amounts, potentially determined as a percentage of write throughput.
astra_db_write_requests_timeouts:rate1m astra_db_write_requests_timeouts:rate5m: A calculated rate of change for write timeouts, which occur when operations take longer than the server-side timeout. Using these rates, compare with astra_db_write_requests_failures.
astra_db_write_requests_unavailables:rate1m astra_db_write_requests_unavailables:rate5m: A calculated rate of change for unavailable errors, which occur when the service isn’t available to handle a particular request. Using these rates, compare with astra_db_write_requests_failures.
astra_db_range_requests_failures:rate1m astra_db_range_requests_failures:rate5m: A calculated rate of change for the number of range reads that failed. Cassandra drivers retry failed operations, but significant failures can be problematic. Using these rates, alert if the value is greater than 0. Use a warning-level alert for low amounts and a high-severity alert for larger amounts, potentially determined as a percentage of read throughput.
astra_db_range_requests_timeouts:rate1m astra_db_range_requests_timeouts:rate5m: A calculated rate of change for timeouts, which are a subset of total failures. Use this metric to understand if failures are due to timeouts. Using these rates, compare with astra_db_range_requests_failures.
astra_db_range_requests_unavailables:rate1m astra_db_range_requests_unavailables:rate5m: A calculated rate of change for unavailable errors, which are a subset of total failures. Use this metric to understand if failures are due to unavailable errors. Using these rates, compare with astra_db_range_requests_failures.
astra_db_write_latency_seconds:rate1m astra_db_write_latency_seconds:rate5m: A calculated rate of change for write throughput as an average number of operations per second. Alert based on your application service level objective (SLO).
astra_db_write_latency_seconds_QUANTILE:rate1m astra_db_write_latency_seconds_QUANTILE:rate5m: A calculated rate of change for write latency. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_write_latency_seconds_P99:rate1m. Alert based on your application SLO.
astra_db_write_requests_mutation_size_bytes_QUANTILE:rate1m astra_db_write_requests_mutation_size_bytes_QUANTILE:rate5m: A calculated rate of change for write size. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_write_requests_mutation_size_bytes_P99:rate5m.
astra_db_read_latency_seconds:rate1m astra_db_read_latency_seconds:rate5m: A calculated rate of change for read throughput as an average number of operations per second. Alert based on your application SLO.
astra_db_read_latency_seconds_QUANTILE:rate1m astra_db_read_latency_seconds_QUANTILE:rate5m: A calculated rate of change for read latency. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_read_latency_seconds_P99:rate1m. Alert based on your application SLO.
astra_db_range_latency_seconds:rate1m astra_db_range_latency_seconds:rate5m: A calculated rate of change for range read throughput as an average number of operations per second. Alert based on your application SLO.
astra_db_range_latency_seconds_QUANTILE:rate1m astra_db_range_latency_seconds_QUANTILE:rate5m: A calculated rate of change of range read latency. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_range_latency_seconds_P99. Alert based on your application SLO.
astra_db_read_counter_tombstone:rate1m astra_db_read_counter_tombstone:rate5m: A calculated rate of change for the total number of tombstone reads. Tombstones are markers of deleted records or certain updates, such as collection updates. Monitoring the rate of tombstone reads can help identify potential performance impacts. Alert if the value increases significantly.
astra_db_read_failure_tombstone:rate1m astra_db_read_failure_tombstone:rate5m: A calculated rate of change for the total number of read operations that failed due to hitting the tombstone guardrail failure threshold (tombstone_failure_threshold). This metric is critical for identifying issues that could lead to performance degradation or timeouts. Alert if the value is greater than 0.
astra_db_read_warnings_tombstone:rate1m astra_db_read_warnings_tombstone:rate5m: A calculated rate of change for the total number of warnings generated when reads exceed the tombstone guardrail warning threshold (tombstone_warn_threshold) and approach the failure threshold. This metric helps identify scenarios where read operations are slowed or at risk of slowing. Alert on a significant increase, which can indicate potential read performance issues.
astra_db_cas_read_latency_seconds:rate1m astra_db_cas_read_latency_seconds:rate5m: A calculated rate of change for the count of Compare and Set (CAS) read operations in Lightweight Transactions (LWTs), measuring the throughput of CAS reads as an average number of operations per second. Monitoring this rate is important for understanding the load and performance of read operations that involve conditional checks. Alert on unusual changes, which could signal issues with data access patterns or performance bottlenecks.
astra_db_cas_read_latency_seconds_QUANTILE:rate1m astra_db_cas_read_latency_seconds_QUANTILE:rate5m: A calculated rate of change for CAS read latency distributions in LWTs. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_cas_read_latency_seconds_P99. This metric helps you identify the latency of CAS reads, which is essential for diagnosing potential read performance issues and understanding the distribution of read operation latencies. Alert based on your application SLO.
astra_db_cas_write_latency_seconds:rate1m astra_db_cas_write_latency_seconds:rate5m: A calculated rate of change for the count of Compare and Set (CAS) write operations in LWTs, measuring the throughput of CAS writes as an average number of operations per second. CAS operations are used for atomic read-modify-write operations. Monitoring the rate of these operations helps in understanding the load and performance characteristics of CAS writes. Alert if the rate significantly deviates from expected patterns, indicating potential concurrency or contention issues.
astra_db_cas_write_latency_seconds_QUANTILE:rate1m astra_db_cas_write_latency_seconds_QUANTILE:rate5m: A calculated rate of change for CAS write latency distributions in LWTs. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_cas_write_latency_seconds_P99. This metric provides insights into the latency characteristics of CAS write operations, helping identify latency spikes or trends over time. Alert based on your application SLO.
astra_db_cas_write_unfinished_commit:rate1m astra_db_cas_write_unfinished_commit:rate5m: A calculated rate of change for the total number of CAS write operations in LWTs that did not finish committing. This metric is crucial for detecting issues in the atomicity of write operations, potentially caused by network or node failures. Alert if there’s an increase because this could impact data consistency.
astra_db_cas_write_contention_QUANTILE:rate1m astra_db_cas_write_contention_QUANTILE:rate5m: A calculated rate of change for the distribution of CAS write contention in LWTs. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_cas_write_contention_P99. Contention during CAS write operations can significantly impact performance. This metric helps you understand and diagnose the levels of contention affecting CAS writes. Alert based on your application SLO.
astra_db_cas_read_unfinished_commit:rate1m astra_db_cas_read_unfinished_commit:rate5m: A calculated rate of change for the total number of CAS read operations that encountered unfinished commits. Monitoring this metric is important for identifying issues with read consistency and potential data visibility problems. Alert if there’s an increase, indicating problems with the completion of write operations.
astra_db_cas_read_contention_QUANTILE:rate1m astra_db_cas_read_contention_QUANTILE:rate5m: A calculated rate of change for the distribution of CAS read contention in LWTs. QUANTILE is a histogram quantile of 99, 95, 90, 75, or 50, such as astra_db_cas_read_contention_P99. Contention during CAS reads can indicate performance issues or high levels of concurrent access to the same data. Alert based on your application SLO.
astra_db_cas_read_requests_failures:rate1m astra_db_cas_read_requests_failures:rate5m: A calculated rate of change for the total number of CAS read operations in LWTs that failed. Failures in CAS reads can signal issues with data access or consistency problems. Alert if the rate increases, indicating potential issues affecting the reliability of CAS reads.
astra_db_cas_read_requests_timeouts:rate1m astra_db_cas_read_requests_timeouts:rate5m: A calculated rate of change for the number of CAS read operations in LWTs that timed out. Timeouts can indicate system overload or issues with data access patterns. Monitoring this metric helps in identifying and addressing potential bottlenecks. Alert if there’s an increase in timeouts.
astra_db_cas_read_requests_unavailables:rate1m astra_db_cas_read_requests_unavailables:rate5m: A calculated rate of change for CAS read operations in LWTs that were unavailable. This metric is vital for understanding the availability of the system to handle CAS reads. An increase in unavailability can indicate cluster health issues. Alert if the rate increases.
astra_db_cas_write_requests_failures:rate1m astra_db_cas_write_requests_failures:rate5m: A calculated rate of change for the total number of CAS write operations in LWTs that failed. Failure rates for CAS writes are critical for assessing the reliability and performance of write operations. Alert if there’s a significant increase in failures.
astra_db_cas_write_requests_timeouts:rate1m astra_db_cas_write_requests_timeouts:rate5m: A calculated rate of change for the number of CAS write operations in LWTs that timed out. Write timeouts can significantly impact application performance and user experience. Monitoring this rate is crucial for maintaining system performance. Alert on an upward trend in timeouts.
astra_db_cas_write_requests_unavailables:rate1m astra_db_cas_write_requests_unavailables:rate5m: A calculated rate of change for CAS write operations in LWTs that were unavailable. Increases in this metric can indicate problems with cluster capacity or health, impacting the ability to perform write operations. Alert if there’s an increase, as it could signal critical availability issues.
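The alert guidance in these definitions can be sketched in a few lines of Python. The example assumes you already have one sample per minute for a metric, in chronological order. The function names, the severity labels, and the 1% high_ratio default are hypothetical illustration values, not Astra recommendations.

```python
# Hypothetical alert helpers; thresholds and names are illustrative assumptions.

def sustained_above_zero(samples, minutes=30):
    """Return True if the most recent run of positive samples is longer than `minutes`.

    Assumes `samples` holds one value per minute in chronological order, so a run of
    more than `minutes` positive samples approximates "greater than 0 for more than
    30 minutes", as suggested for astra_db_rate_limited_requests.
    """
    run = 0
    for value in reversed(samples):
        if value > 0:
            run += 1
        else:
            break
    return run > minutes


def failure_severity(failure_rate, throughput_rate, high_ratio=0.01):
    """Classify a failure rate as None, "warn", or "high".

    `throughput_rate` is the matching throughput value, for example
    astra_db_write_latency_seconds:rate1m for write failures.
    `high_ratio` (1% here) is an assumed cutoff; pick one that fits your SLO.
    """
    if failure_rate <= 0:
        return None
    if throughput_rate > 0 and failure_rate / throughput_rate < high_ratio:
        return "warn"
    # Either the ratio is at or above the cutoff, or there is no throughput to compare against.
    return "high"


if __name__ == "__main__":
    print(sustained_above_zero([0.2] * 45))
    print(failure_severity(0.5, 100.0))
```

In a real deployment, this logic usually lives in the alerting rules of your monitoring system rather than in application code.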

PCU group metrics definitions

PCU group metrics are available only through a pull-based metrics import (scrape). These metrics aren’t included with a push-based metrics export.

These are the same metrics reported on the PCU details graphs in the Astra Portal.

All metrics default to 0 if there is no data available.

astra_pcu_group_cpu_utilization: Calculates the total CPU utilization for a PCU group.
astra_pcu_group_cache_utilization: Calculates the total disk utilization for a PCU group.
astra_pcu_group_read_latency_ms: Calculates the median (50th percentile) read latency in milliseconds for a PCU group over the past minute.
astra_pcu_group_min_pcu_count: Reports the minimum capacity for a PCU group.
astra_pcu_group_max_pcu_count: Reports the maximum capacity for a PCU group.
astra_pcu_group_reserved_pcu_count: Reports the reserved capacity for a PCU group.
astra_pcu_group_current_rcu: Calculates the current total RCUs for a PCU group in a given minute.
astra_pcu_group_current_hcu: Calculates the current total HCUs for a PCU group in a given minute.
astra_pcu_group_actual_pcu: Calculates the total cumulative PCUs (RCUs + HCUs) for a PCU group in a given minute.
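As a small illustration of how these values relate, the following Python sketch checks that a scraped astra_pcu_group_actual_pcu value equals the sum of the RCU and HCU values for the same minute, and flags samples where every value is 0, which can mean that no data was available. The PcuGroupSample class and both function names are hypothetical, and the sketch assumes all three values were scraped for the same PCU group and minute.

```python
from dataclasses import dataclass


@dataclass
class PcuGroupSample:
    """Hypothetical container for one scrape of a PCU group's usage metrics."""
    current_rcu: float  # astra_pcu_group_current_rcu
    current_hcu: float  # astra_pcu_group_current_hcu
    actual_pcu: float   # astra_pcu_group_actual_pcu


def actual_matches_components(sample, tolerance=1e-9):
    """Check that actual PCUs equal RCUs + HCUs, within a small tolerance."""
    expected = sample.current_rcu + sample.current_hcu
    return abs(sample.actual_pcu - expected) <= tolerance


def looks_like_no_data(sample):
    """All metrics default to 0 without data, so an all-zero sample may mean missing data."""
    return sample.current_rcu == 0 and sample.current_hcu == 0 and sample.actual_pcu == 0


if __name__ == "__main__":
    sample = PcuGroupSample(current_rcu=2.0, current_hcu=3.0, actual_pcu=5.0)
    print(actual_matches_components(sample), looks_like_no_data(sample))
```

Treating an all-zero sample as possibly missing, rather than as confirmed zero usage, helps avoid misleading dips in dashboards.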
