Monitor streaming tenants

Because Astra Streaming is a managed SaaS offering, some Apache Pulsar metrics aren’t exposed for external integration purposes. At a high level, Astra Streaming only exposes metrics related to namespaces. Metrics that are not directly related to namespaces, such as the BookKeeper ledger and journal metrics and the ZooKeeper metrics, aren’t exposed externally.

Additionally, of the exposed metrics, not all metrics are recommended for external integration.

Pulsar raw metrics

For a complete Pulsar metrics reference, see the Apache Pulsar documentation.

For a complete Astra Streaming metrics reference, see Grafana dashboards for Astra Streaming metrics.

Astra Streaming metrics

Namespace and topic metrics

Astra Streaming exposes both namespace and topic level metrics. Namespace metrics can always be inferred from corresponding topic metrics via metrics aggregation.
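As a rough illustration of that aggregation, the following Python sketch sums per-topic gauge samples into per-namespace totals. The sample values and the `namespace_of` and `namespace_totals` helpers are hypothetical; only the `persistent://tenant/namespace/topic` name layout is taken from the examples on this page.

```python
from collections import defaultdict


def namespace_of(topic: str) -> str:
    """Return 'tenant/namespace' from 'persistent://tenant/namespace/topic'."""
    path = topic.split("://", 1)[1]
    tenant, namespace, _ = path.split("/", 2)
    return f"{tenant}/{namespace}"


def namespace_totals(topic_samples: dict[str, float]) -> dict[str, float]:
    """Sum topic-level values into one value per namespace."""
    totals = defaultdict(float)
    for topic, value in topic_samples.items():
        totals[namespace_of(topic)] += value
    return dict(totals)


# Hypothetical topic-level samples, for example pulsar_msg_backlog values.
samples = {
    "persistent://demo/testns/orders": 120.0,
    "persistent://demo/testns/payments": 30.0,
    "persistent://demo/otherns/audit": 5.0,
}
print(namespace_totals(samples))
```

Summing is only appropriate for additive metrics such as counts, rates, and backlog sizes. Other metric types, such as latency summaries, would need a different aggregation.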

The following table lists recommended namespace and/or topic metrics as a starting point.

| Metric name | Namespace or topic level | Metric type | Note |
|---|---|---|---|
| pulsar_topics_count | Namespace | Gauge | The number of Pulsar topics in a namespace. |
| pulsar_producers_count | Topic | Gauge | The number of active producers of a topic. |
| pulsar_consumers_count | Topic | Gauge | The number of active consumers of a topic. |
| pulsar_subscriptions_count | Topic | Gauge | The number of Pulsar subscriptions of a topic. |
| pulsar_rate_in | Topic | Gauge | The total message rate (messages per second) coming into a topic. |
| pulsar_rate_out | Topic | Gauge | The total message rate (messages per second) coming out of a topic. |
| pulsar_throughput_in | Topic | Gauge | The total throughput (bytes per second) coming into a topic. |
| pulsar_throughput_out | Topic | Gauge | The total throughput (bytes per second) coming out of a topic. |
| pulsar_msg_backlog | Topic | Gauge | The total number of messages in the backlog of a topic. |
| pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
| pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
| pulsar_storage_offloaded_size | Topic | Gauge | The total amount of data (in bytes) of a topic offloaded to tiered storage. |
| pulsar_in_bytes_total | Topic | Counter | The total number of bytes received for a topic. |
| pulsar_out_bytes_total | Topic | Counter | The total number of bytes read from a topic. |
| pulsar_in_messages_total | Topic | Counter | The total number of messages received for a topic. |
| pulsar_out_messages_total | Topic | Counter | The total number of messages read from a topic. |

Replication metrics

When geo-replication is enabled for a particular namespace, a subset of namespace metrics is available specifically for geo-replication purposes. Below is the list of recommended geo-replication metrics as a starting point.

| Metric name | Namespace or topic level | Metric type | Note |
|---|---|---|---|
| pulsar_replication_rate_in | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating from a remote cluster. |
| pulsar_replication_rate_out | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating to a remote cluster. |
| pulsar_replication_throughput_in | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating from a remote cluster. |
| pulsar_replication_throughput_out | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating to a remote cluster. |
| pulsar_replication_backlog | Namespace | Gauge | The total message backlog of the namespace replicating to a remote cluster. |

Subscription metrics

The following table gives the list of recommended subscription metrics as a starting point.

| Metric name | Metric type | Note |
|---|---|---|
| pulsar_subscription_back_log | Gauge | The total backlog (number of messages) for a subscription of a topic. |
| pulsar_subscription_delayed | Gauge | The total number of messages whose dispatch is delayed for a subscription of a topic. |
| pulsar_subscription_msg_rate_redeliver | Gauge | The total message rate (messages per second) being redelivered for a subscription of a topic. |
| pulsar_subscription_unacked_messages | Gauge | The total number of unacknowledged messages for a subscription of a topic. |
| pulsar_subscription_blocked_on_unacked_messages | Gauge | Binary indicator (1 or 0) of whether a subscription of a topic is blocked on unacknowledged messages. |
| pulsar_subscription_msg_rate_out | Gauge | The total message dispatch rate (messages per second) for a subscription of a topic. |
| pulsar_subscription_msg_throughput_out | Gauge | The total message dispatch throughput (bytes per second) for a subscription of a topic. |
| pulsar_subscription_msg_ack_rate | Gauge | The total message acknowledgment rate (messages per second) for a subscription of a topic. |
| pulsar_subscription_msg_rate_expired | Gauge | The total rate of messages (messages per second) expired on a subscription of a topic. |
| pulsar_subscription_total_msg_expired | Gauge | The total number of messages expired on a subscription of a topic. |
| pulsar_subscription_msg_drop_rate | Gauge | The rate of messages (messages per second) dropped on a subscription of a topic. |
| pulsar_subscription_consumers_count | Gauge | The number of connected consumers on a subscription of a topic. |

Function metrics

The following table gives the list of recommended function metrics as a starting point. This is only relevant when Pulsar functions are deployed in Astra Streaming.

| Metric name | Metric type | Note |
|---|---|---|
| pulsar_function_processed_successfully_total | Counter | The total number of messages processed successfully by a function. |
| pulsar_function_received_total | Counter | The total number of messages a function receives. |
| pulsar_function_process_latency_ms | Summary | The process latency (in milliseconds) of a function. |

Source connector metrics

The following table gives the list of recommended source connector metrics as a starting point. This is only relevant when Pulsar source connectors are deployed in Astra Streaming.

| Metric name | Metric type | Note |
|---|---|---|
| pulsar_source_written_total | Counter | The total number of messages processed by a source connector. |
| pulsar_source_received_total | Counter | The total number of messages received by a source connector. |

Sink connector metrics

The following table gives the list of recommended sink connector metrics as a starting point. This is only relevant when Pulsar sink connectors are deployed in Astra Streaming.

| Metric name | Metric type | Note |
|---|---|---|
| pulsar_sink_written_total | Counter | The total number of messages processed by a sink connector. |
| pulsar_sink_received_total | Counter | The total number of messages received by a sink connector. |

Aggregate Astra Streaming metrics

Do not aggregate metrics on shared clusters because one cluster can be shared among multiple organizations. For more information, see Astra Streaming limits and Astra Streaming pricing.

Each externally exposed raw Astra Streaming metric is reported at a very low level: per individual server instance (the exported_instance label) and per topic partition (the topic label). The same raw metric can come from multiple server instances. From an Astra Streaming user’s perspective, monitoring raw metrics directly is rarely useful. Raw metrics need to be aggregated first, for example by averaging or summing them over a period of time.

The following example shows some raw metrics for a Pulsar message backlog (pulsar_msg_backlog) scraped from an Astra Streaming cluster in the Google Cloud us-central1 region:

....
pulsar_msg_backlog{app="pulsar", cluster="pulsar-gcp-uscentral1", component="broker", controller_revision_hash="pulsar-gcp-uscentral1-broker-<hash>f", exported_instance="<ip>:<port>", exported_job="broker", helm_release_name="astraproduction-gcp-pulsar-uscentral1", instance="prometheus-gcp-uscentral1.streaming.datastax.com:443", job="astra-pulsar-metrics-demo", kubernetes_namespace="pulsar", kubernetes_pod_name="pulsar-gcp-uscentral1-broker-3", namespace="demo/testns", prometheus="pulsar/astraproduction-gcp-pulsar-prometheus", prometheus_replica="prometheus-astraproduction-gcp-pulsar-prometheus-0", pulsar_cluster_dns="gcp-uscentral1.streaming.datastax.com", release="astraproduction-gcp-pulsar-uscentral1", statefulset_kubernetes_io_pod_name="pulsar-gcp-uscentral1-broker-3", topic="persistent://demo/testns/raw-partition-0"}
....

To transform raw metrics into a usable state, DataStax recommends the following:

  • Aggregate metrics at the parent topic level, at minimum, instead of at the partition level. In Pulsar, end user applications only deal with messages at the parent topic level; however, internally, Pulsar handles message processing at the partition level.

  • Exclude reported metrics that are associated with Astra Streaming’s system namespaces and topics, which are usually prefixed by two underscores, such as:

    __kafka
    __transaction_producer_state
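The two recommendations above can be sketched in Python as follows. This is a minimal illustration, not DataStax tooling: the `aggregate_by_parent` helper, the sample values, and the instance addresses are hypothetical, and the `-partition-<n>` suffix is assumed from the topic name in the raw metric sample shown earlier.

```python
import re
from collections import defaultdict

# Assumption: partitions are named '<parent>-partition-<n>'.
PARTITION_SUFFIX = re.compile(r"-partition-\d+$")


def parent_topic(topic: str) -> str:
    """Strip the partition suffix to get the parent topic name."""
    return PARTITION_SUFFIX.sub("", topic)


def is_system_topic(topic: str) -> bool:
    """Treat any name containing two consecutive underscores as a system topic."""
    return "__" in topic


def aggregate_by_parent(samples):
    """Sum (topic, exported_instance, value) samples per parent topic."""
    totals = defaultdict(float)
    for topic, _instance, value in samples:
        if is_system_topic(topic):
            continue  # skip Astra Streaming system namespaces and topics
        totals[parent_topic(topic)] += value
    return dict(totals)


# Hypothetical raw samples from two brokers.
raw = [
    ("persistent://demo/testns/raw-partition-0", "10.0.0.1:8080", 40.0),
    ("persistent://demo/testns/raw-partition-1", "10.0.0.2:8080", 60.0),
    ("persistent://demo/__kafka/__consumer_offsets_partition_0", "10.0.0.1:8080", 7.0),
]
print(aggregate_by_parent(raw))
```

In practice you would usually perform this aggregation in your monitoring system's query language rather than in application code. The sketch only makes the grouping and filtering rules explicit.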

PromQL query patterns

PromQL is Prometheus’s simple and powerful query language that you can use to select and aggregate time series data in real time. For more information, see the PromQL documentation.

DataStax recommends the following PromQL query patterns for aggregating raw Astra Streaming metrics. The following examples use the pulsar_msg_backlog raw metric to demonstrate the patterns. In accordance with the recommendations in Aggregate Astra Streaming metrics, the example patterns aggregate metrics at the parent topic level or higher, and they exclude system topics.

Filter system topics

You can use the following expression to filter system topics:

{topic !~ ".*__.*"}

This expression excludes time series whose topic label contains two consecutive underscores. This works because Pulsar system topics and namespaces are usually prefixed by two underscores, such as:

persistent://some_tenant/__kafka/__consumer_offsets_partition_0

This filter is only safe if your applications' namespace and topic names don’t contain double underscores. If they do, their metrics are also excluded by this filter.
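To see which topic names the filter keeps, you can reproduce the match in Python. This is only an approximation for illustration: PromQL anchors regex matchers at both ends, which `re.fullmatch` mirrors here. The `kept_by_filter` helper and the two application topic names are hypothetical.

```python
import re

# Same pattern as the PromQL matcher: topic !~ ".*__.*"
SYSTEM_TOPIC_PATTERN = re.compile(r".*__.*")


def kept_by_filter(topic: str) -> bool:
    """Return True if a series with this topic label survives the filter."""
    # PromQL regex matchers are fully anchored, so fullmatch mirrors `!~`.
    return SYSTEM_TOPIC_PATTERN.fullmatch(topic) is None


topics = [
    "persistent://some_tenant/__kafka/__consumer_offsets_partition_0",  # system topic
    "persistent://some_tenant/app/orders-partition-0",  # application topic
    "persistent://some_tenant/app/user__events",  # application topic with '__'
]
for topic in topics:
    print(topic, kept_by_filter(topic))
```

The third name shows the caveat: an application topic that happens to contain a double underscore is dropped along with the system topics.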

Get the total message backlog of a specific parent topic, excluding system topics

$ptopic is a Grafana dashboard variable that represents a specific parent topic.

sum(pulsar_msg_backlog{topic=~"$ptopic", topic !~ ".*__.*"})

Get the total message backlog of a specific namespace, excluding system topics

$namespace is a Grafana dashboard variable that represents a specific namespace.

sum(pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})

Get the total message backlog of a tenant, excluding system topics

$tenant is a Grafana dashboard variable that represents a specific tenant.

sum(pulsar_msg_backlog{namespace=~"$tenant.+", topic !~ ".*__.*"})

Get the total message backlog of each topic within a specific namespace, excluding system topics

sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})

Get the top 10 message backlog by topic within a specific namespace, excluding system topics

topk (10, sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"}))
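Outside of Grafana, the dashboard variables have to be substituted before a query is sent. The following Python sketch renders the namespace-scoped patterns and builds a request URL for the Prometheus HTTP API instant query endpoint (`/api/v1/query`). It assumes that you run your own Prometheus server that scrapes Astra Streaming metrics. The base URL, the pattern keys, and the `render` and `query_url` helpers are hypothetical.

```python
from string import Template
from urllib.parse import urlencode

# Namespace-scoped patterns from this section, keyed by hypothetical names.
PATTERNS = {
    "namespace_total": 'sum(pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})',
    "per_topic": 'sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})',
    "top10": 'topk (10, sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"}))',
}


def render(pattern: str, **variables: str) -> str:
    """Substitute Grafana-style $variables into a query pattern."""
    return Template(PATTERNS[pattern]).substitute(**variables)


def query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


promql = render("namespace_total", namespace="demo/testns")
print(promql)
# Assumption: a self-managed Prometheus server listening on localhost.
print(query_url("http://localhost:9090", promql))
```

The sketch only builds the URL and sends no request. Authentication and TLS settings depend on your own Prometheus deployment.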

Metrics alerts

Most of the exposed Astra Streaming metrics reflect generic application workload characteristics, such as message rate or throughput, and they are for informational purposes only.

However, DataStax recommends that you monitor the following metrics for unexpected increases:

Metrics for alerting
| Metric name | Aggregate | Metric type | Note |
|---|---|---|---|
| pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
| pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
| pulsar_replication_backlog | Geo-replication | Gauge | The total message backlog of the namespace replicating to a remote cluster. |
| pulsar_subscription_back_log | Subscription | Gauge | The total backlog (number of messages) for a subscription of a topic. |
| pulsar_subscription_delayed | Subscription | Gauge | The total number of messages whose dispatch is delayed for a subscription of a topic. |
| pulsar_subscription_msg_drop_rate | Subscription | Gauge | The rate of messages (messages per second) dropped on a subscription of a topic. |
| pulsar_subscription_unacked_messages | Subscription | Gauge | The total number of unacknowledged messages for a subscription of a topic. |

Alerting rules

In a perfect world, these metrics would always be 0. In reality, these metrics will increase when an application’s workload increases, and then return to normal when the workload decreases.

You can set an alert threshold to be notified when these metrics exceed normal capacity, but this can cause false alarms during expected workload spikes.

Alternatively, you can calculate the metrics' increase rate over a period of time, such as one hour, and then set a threshold based on the rate of increase. For example, if the average message backlog increase rate exceeds the given threshold, an alert is triggered.
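The rate-based approach can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Sample` type, the helper functions, the sample values, and the thresholds are all hypothetical, and the rate is computed only from the first and last sample in the window. In PromQL, functions such as `delta()` serve a similar purpose for gauges.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since an arbitrary start
    value: float      # for example, an aggregated pulsar_msg_backlog value


def average_increase_rate(samples: list[Sample]) -> float:
    """Average change per second between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    first, last = samples[0], samples[-1]
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        return 0.0
    return (last.value - first.value) / elapsed


def should_alert(samples: list[Sample], threshold_per_hour: float) -> bool:
    """Alert when the hourly increase rate exceeds the threshold."""
    return average_increase_rate(samples) * 3600 > threshold_per_hour


# Hypothetical one-hour window of aggregated backlog samples.
window = [Sample(0, 1_000), Sample(1800, 2_500), Sample(3600, 4_600)]
print(average_increase_rate(window) * 3600)
print(should_alert(window, threshold_per_hour=2_000))
```

Because the rule looks at growth rather than absolute size, a backlog that is large but stable does not trigger it, while a backlog that keeps growing does.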

Thresholds for these metrics depend on your application’s routine workloads and requirements. Generally, these values are large positive numbers, ranging from several hundred to several thousand. If you receive too many false alarms, adjust the alert threshold to a higher value.

