Monitoring Streaming Tenants

Because Astra Streaming is a software-as-a-service product, not all Apache Pulsar metrics (see the Pulsar Metrics Reference) are exposed for external integration. At a high level, Astra Streaming exposes only metrics that are related to namespaces. Metrics that are not directly namespace related, such as the BookKeeper ledger and journal metrics and the ZooKeeper metrics, are not exposed externally.

In the following sections, we’ll explore each category of Astra Streaming metrics that is available for external integration, along with the metrics we recommend as a starting point in each category.

Pulsar raw metrics

For a complete Pulsar metrics reference, see the Pulsar Metrics Reference.

For a complete Astra Streaming metrics reference, see Grafana dashboards for Astra Streaming metrics.

Astra Streaming metrics

Namespace and topic metrics

Astra Streaming exposes both namespace and topic level metrics. Namespace metrics can always be inferred from corresponding topic metrics via metrics aggregation.
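As a rough illustration of that aggregation, the following Python sketch sums topic-level values into one value per namespace. The sample tuples, the `sum_by_namespace` helper, and the topic names are hypothetical and exist only for this example; in practice the aggregation would normally be done in the monitoring backend rather than in application code.

```python
from collections import defaultdict

# Hypothetical topic-level samples:
# (namespace label, topic label, pulsar_msg_backlog value).
topic_samples = [
    ("msgenrich/testns", "persistent://msgenrich/testns/raw", 120.0),
    ("msgenrich/testns", "persistent://msgenrich/testns/enriched", 30.0),
    ("msgenrich/otherns", "persistent://msgenrich/otherns/audit", 5.0),
]


def sum_by_namespace(samples):
    """Sum topic-level values into one value per namespace."""
    totals = defaultdict(float)
    for namespace, _topic, value in samples:
        totals[namespace] += value
    return dict(totals)


namespace_backlog = sum_by_namespace(topic_samples)
```

Summing suits count-like gauges such as the backlog; other metrics may call for a different aggregation, such as an average or a maximum.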

The following table lists recommended namespace and/or topic metrics as a starting point.

| Metric name | Level | Metric type | Note |
| --- | --- | --- | --- |
| pulsar_topics_count | Namespace | Gauge | The number of Pulsar topics in a namespace. |
| pulsar_producers_count | Topic | Gauge | The number of active producers of a topic. |
| pulsar_consumers_count | Topic | Gauge | The number of active consumers of a topic. |
| pulsar_subscriptions_count | Topic | Gauge | The number of Pulsar subscriptions of a topic. |
| pulsar_rate_in | Topic | Gauge | The total message rate (messages per second) coming into a topic. |
| pulsar_rate_out | Topic | Gauge | The total message rate (messages per second) coming out of a topic. |
| pulsar_throughput_in | Topic | Gauge | The total throughput (bytes per second) coming into a topic. |
| pulsar_throughput_out | Topic | Gauge | The total throughput (bytes per second) coming out of a topic. |
| pulsar_msg_backlog | Topic | Gauge | The total number of messages in the backlog of a topic. |
| pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
| pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
| pulsar_storage_offloaded_size | Topic | Gauge | The total amount of data (in bytes) of a topic offloaded to tiered storage. |
| pulsar_in_bytes_total | Topic | Counter | The total number of bytes received for a topic. |
| pulsar_out_bytes_total | Topic | Counter | The total number of bytes read from a topic. |
| pulsar_in_messages_total | Topic | Counter | The total number of messages received for a topic. |
| pulsar_out_messages_total | Topic | Counter | The total number of messages read from a topic. |

Replication metrics

When geo-replication is enabled for a particular namespace, a subset of namespace metrics is available specifically for geo-replication purposes. Below is the list of recommended geo-replication metrics as a starting point.

| Metric name | Level | Metric type | Note |
| --- | --- | --- | --- |
| pulsar_replication_rate_in | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating from a remote cluster. |
| pulsar_replication_rate_out | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating to a remote cluster. |
| pulsar_replication_throughput_in | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating from a remote cluster. |
| pulsar_replication_throughput_out | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating to a remote cluster. |
| pulsar_replication_backlog | Namespace | Gauge | The total message backlog of the namespace replicating to a remote cluster. |

Subscription metrics

The following table gives the list of recommended subscription metrics as a starting point.

| Metric name | Metric type | Note |
| --- | --- | --- |
| pulsar_subscription_back_log | Gauge | The total backlog (number of messages) for a subscription of a topic. |
| pulsar_subscription_delayed | Gauge | The total number of messages whose dispatch is delayed for a subscription of a topic. |
| pulsar_subscription_msg_rate_redeliver | Gauge | The total message rate (messages per second) being redelivered for a subscription of a topic. |
| pulsar_subscription_unacked_messages | Gauge | The total number of unacknowledged messages for a subscription of a topic. |
| pulsar_subscription_blocked_on_unacked_messages | Gauge | Binary indicator (1 or 0) of whether a subscription of a topic is blocked on unacknowledged messages. |
| pulsar_subscription_msg_rate_out | Gauge | The total message dispatch rate (messages per second) for a subscription of a topic. |
| pulsar_subscription_msg_throughput_out | Gauge | The total message dispatch throughput (bytes per second) for a subscription of a topic. |
| pulsar_subscription_msg_ack_rate | Gauge | The total message acknowledgment rate (messages per second) for a subscription of a topic. |
| pulsar_subscription_msg_rate_expired | Gauge | The total rate of messages (messages per second) expired on a subscription of a topic. |
| pulsar_subscription_total_msg_expired | Gauge | The total number of messages expired on a subscription of a topic. |
| pulsar_subscription_msg_drop_rate | Gauge | The rate of messages (messages per second) dropped on a subscription of a topic. |
| pulsar_subscription_consumers_count | Gauge | The number of connected consumers on a subscription of a topic. |

Function metrics

The following table gives the list of recommended function metrics as a starting point. This is only relevant when Pulsar functions are deployed in Astra Streaming.

| Metric name | Metric type | Note |
| --- | --- | --- |
| pulsar_function_processed_successfully_total | Counter | The total number of messages processed successfully by a function. |
| pulsar_function_received_total | Counter | The total number of messages a function receives. |
| pulsar_function_process_latency_ms | Summary | The process latency (in milliseconds) of a function. |

Source connector metrics

The following table gives the list of recommended source connector metrics as a starting point. This is only relevant when Pulsar source connectors are deployed in Astra Streaming.

| Metric name | Metric type | Note |
| --- | --- | --- |
| pulsar_source_written_total | Counter | The total number of messages processed by a source connector. |
| pulsar_source_received_total | Counter | The total number of messages received by a source connector. |

Sink connector metrics

The following table gives the list of recommended sink connector metrics as a starting point. This is only relevant when Pulsar sink connectors are deployed in Astra Streaming.

| Metric name | Metric type | Note |
| --- | --- | --- |
| pulsar_sink_written_total | Counter | The total number of messages processed by a sink connector. |
| pulsar_sink_received_total | Counter | The total number of messages received by a sink connector. |

Aggregate Astra Streaming metrics

Each externally exposed raw Astra Streaming metric is reported at a very low level: per server instance (the exported_instance label) and per topic partition (the topic label). The same raw metric can come from multiple server instances. From an Astra Streaming user’s perspective, monitoring raw metrics directly is of limited use. Raw metrics need to be aggregated first, for example by averaging or summing them over a period of time.

Below is an example of a raw metric for the Pulsar message backlog (pulsar_msg_backlog) scraped from an Astra Streaming cluster located in the GCP US Central region:

pulsar_msg_backlog{app="pulsar", cluster="pulsar-gcp-uscentral1", component="broker", controller_revision_hash="pulsar-gcp-uscentral1-broker-<hash>f", exported_instance="<ip>:<port>", exported_job="broker", helm_release_name="astraproduction-gcp-pulsar-uscentral1", instance="prometheus-gcp-uscentral1.streaming.datastax.com:443", job="astra-pulsar-metrics-msgenrich", kubernetes_namespace="pulsar", kubernetes_pod_name="pulsar-gcp-uscentral1-broker-3", namespace="msgenrich/testns", prometheus="pulsar/astraproduction-gcp-pulsar-prometheus", prometheus_replica="prometheus-astraproduction-gcp-pulsar-prometheus-0", pulsar_cluster_dns="gcp-uscentral1.streaming.datastax.com", release="astraproduction-gcp-pulsar-uscentral1", statefulset_kubernetes_io_pod_name="pulsar-gcp-uscentral1-broker-3", topic="persistent://msgenrich/testns/raw-partition-0"}

To make raw metrics like this useful for end users, we recommend the following guidelines when aggregating raw metrics:

  1. Aggregate metrics to at least the parent topic level, instead of at the partition level. In Pulsar, end-user applications deal with messages only at the parent topic level, while Pulsar handles message processing internally at the partition level.

  2. Exclude reported metrics that are associated with Astra Streaming’s system namespaces and topics. These namespaces and topics normally have a name starting with __ (two underscores). For example, when Pulsar’s Kafka protocol handler is enabled (via S4K integration), a system namespace called __kafka is created, containing a system topic called __transaction_producer_state.

  3. Do NOT aggregate metrics with the Astra Streaming Pay As You Go option, since one cluster may be shared among multiple organizations. For more, see Cluster Limits.
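To make guidelines 1 and 2 above concrete, here is a minimal Python sketch that rolls partition-level samples up to the parent topic and skips system topics. It assumes that partitions are identified by a `-partition-<n>` suffix on the topic label, as in the raw metric shown earlier, and that any topic label containing `__` belongs to a system namespace or topic. The helper names, instance addresses, and sample values are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: partitions carry a "-partition-<n>" suffix on the topic label.
PARTITION_SUFFIX = re.compile(r"-partition-\d+$")


def parent_topic(topic):
    """Strip the partition suffix, if any, to get the parent topic name."""
    return PARTITION_SUFFIX.sub("", topic)


def is_system_topic(topic):
    """Heuristic: system namespaces and topics contain "__" in their names."""
    return "__" in topic


def backlog_by_parent_topic(samples):
    """samples: iterable of (exported_instance, topic, value) tuples."""
    totals = defaultdict(float)
    for _instance, topic, value in samples:
        if is_system_topic(topic):
            continue  # guideline 2: leave system topics out
        totals[parent_topic(topic)] += value  # guideline 1: sum partitions
    return dict(totals)


# Hypothetical raw samples reported by two broker instances.
raw_samples = [
    ("10.0.0.1:8080", "persistent://msgenrich/testns/raw-partition-0", 40.0),
    ("10.0.0.2:8080", "persistent://msgenrich/testns/raw-partition-1", 60.0),
    ("10.0.0.1:8080", "persistent://msgenrich/__kafka/__transaction_producer_state", 7.0),
]
parent_backlog = backlog_by_parent_topic(raw_samples)
```

In this sketch the two partitions collapse into a single entry for the parent topic, and the `__kafka` system topic does not contribute to any total.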

PromQL query patterns

Prometheus provides a powerful but easy-to-use query language called PromQL for selecting and aggregating time series data in real time. PromQL syntax is beyond this document’s scope, but the Prometheus documentation is a great place to start.

In the rest of this section, we recommend some PromQL query patterns for aggregating raw Astra Streaming metrics. The examples use one raw metric, pulsar_msg_backlog, for illustration. Following the recommendations above, they aggregate at the parent topic level or above and exclude system topics. System topics are excluded with the label matcher {topic !~ ".*__.*"}, which keeps only the time series whose topic label does not contain __ and drops all others. This works because Pulsar system topics usually have __ as the topic or namespace name prefix (e.g. persistent://<tenant>/__kafka/__consumer_offsets_partition_0). The pattern assumes that user applications don’t have namespaces or topics with __ in their names; if they do, those are filtered out as well.
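The effect of the system-topic filter can be checked outside Prometheus with a short Python sketch. It relies on the fact that Prometheus anchors label regexes at both ends, which `re.fullmatch` mimics here; the `kept_by_filter` helper and the third topic name are hypothetical and serve only to illustrate the caveat about user topics that contain `__`.

```python
import re

# Same regex as in the matcher topic !~ ".*__.*".
SYSTEM_PATTERN = re.compile(r".*__.*")


def kept_by_filter(topic):
    """Return True if a series with this topic label survives the !~ matcher."""
    # Prometheus label regexes must match the whole value; fullmatch mimics that.
    return SYSTEM_PATTERN.fullmatch(topic) is None


topics = [
    "persistent://msgenrich/testns/raw-partition-0",
    "persistent://msgenrich/__kafka/__consumer_offsets_partition_0",
    "persistent://msgenrich/testns/my__topic",  # hypothetical user topic with "__"
]
kept = [t for t in topics if kept_by_filter(t)]
```

Only the first topic is kept: the system topic is dropped as intended, and so is the user topic whose name happens to contain two underscores.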

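Pattern 1: Get the total message backlog of a specific parent topic, excluding system topics. "$ptopic" is a (Grafana dashboard) variable that represents a specific parent topic. Because =~ must match the whole label value, the variable’s value needs to cover the full topic label, including any -partition-<n> suffix.

sum(pulsar_msg_backlog{topic=~"$ptopic", topic !~ ".*__.*"})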

Pattern 2: Get the total message backlog of a specific namespace, excluding system topics. "$namespace" is a (Grafana dashboard) variable that represents a specific namespace.

sum(pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})

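Pattern 3: Get the total message backlog of a tenant, excluding system topics. "$tenant" is a (Grafana dashboard) variable that represents a specific tenant. The namespace label has the form <tenant>/<namespace>, so the / keeps the match from extending to other tenants whose names start with the same characters.

sum(pulsar_msg_backlog{namespace=~"$tenant/.+", topic !~ ".*__.*"})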

Pattern 4: Get the total message backlog of each topic within a specific namespace, excluding system topics.

sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})

Pattern 5: Get the 10 topics with the largest message backlog within a specific namespace, excluding system topics.

topk(10, sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"}))

Metrics to alert on

Most of the exposed Astra Streaming metrics are informational: in most cases, their values simply reflect the characteristics of the application workload. Message rate and throughput are common examples.

However, several metrics need special attention when their values keep increasing. Among the exposed Astra Streaming metrics, these are:

| Metric name | Category | Metric type | Note |
| --- | --- | --- | --- |
| pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
| pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
| pulsar_replication_backlog | Geo-replication | Gauge | The total message backlog of the namespace replicating to a remote cluster. |
| pulsar_subscription_back_log | Subscription | Gauge | The total backlog (number of messages) for a subscription of a topic. |
| pulsar_subscription_delayed | Subscription | Gauge | The total number of messages whose dispatch is delayed for a subscription of a topic. |
| pulsar_subscription_msg_drop_rate | Subscription | Gauge | The rate of messages (messages per second) dropped on a subscription of a topic. |
| pulsar_subscription_unacked_messages | Subscription | Gauge | The total number of unacknowledged messages for a subscription of a topic. |

Alerting rules

In a perfect world, these metrics should always stay at 0, but in reality, these metrics will increase when the application workload becomes heavier. If your system is behaving correctly, these metrics should go down when the application workload drops.

A simple way to trigger an alert on these metrics is to set a threshold which triggers an alert when the metric exceeds it. However, this will probably cause false alarms during workload spikes.

A better approach is calculating the metrics' increase rate over a period of time (e.g. 1 hour) and setting a threshold on the rate of increase. For example, if the average message backlog increase rate exceeds a threshold, an alert is triggered.

The actual threshold values for these metrics are highly dependent on each application’s workload and requirements, but the values should be relatively large positive numbers, e.g. several hundred or several thousand. Otherwise, they may trigger too many false alarms.
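The difference between a fixed threshold and a threshold on the rate of increase can be sketched in Python as follows. The `increase_per_hour` and `should_alert` helpers, the sample series, and the threshold of 1000 messages per hour are all hypothetical; the calculation is a simple first-to-last difference over the window, similar in spirit to what PromQL’s delta() function computes for gauges, and is not a recommendation for specific values.

```python
def increase_per_hour(samples):
    """Average increase per hour between the first and last sample.

    samples: list of (unix_timestamp_seconds, value), oldest first.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must span a positive time window")
    return (v_last - v_first) * 3600 / (t_last - t_first)


def should_alert(samples, threshold_per_hour):
    """Alert only when the backlog grows faster than the threshold."""
    return increase_per_hour(samples) > threshold_per_hour


# Hypothetical backlog series over one hour.
spike = [(0, 0), (900, 5000), (1800, 5200), (3600, 300)]   # spike that drains
steady_growth = [(0, 0), (1800, 2500), (3600, 5000)]       # backlog keeps growing

spike_alert = should_alert(spike, threshold_per_hour=1000)
growth_alert = should_alert(steady_growth, threshold_per_hour=1000)
```

With these numbers, a fixed threshold of 1000 on the raw value would fire during the spike even though the backlog drains again, while the rate-based check fires only for the series that keeps growing.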
