Monitoring Streaming Tenants
Because Astra Streaming is a software-as-a-service product, not all Apache Pulsar metrics (see the Pulsar Metrics Reference) are exposed for external integration. At a high level, Astra Streaming exposes only namespace-related metrics. Metrics that are not directly namespace related, such as the BookKeeper ledger and journal metrics and the ZooKeeper metrics, are not exposed externally.
The following sections cover each Astra Streaming metrics category that is available for external integration, along with the recommended metrics in each category.
Pulsar raw metrics
For a complete Pulsar metrics reference, see the Pulsar Metrics Reference.
For a complete Astra Streaming metrics reference, see Grafana dashboards for Astra Streaming metrics.
Astra Streaming metrics
Namespace and topic metrics
Astra Streaming exposes both namespace and topic level metrics. Namespace metrics can always be inferred from corresponding topic metrics via metrics aggregation.
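To make the aggregation step concrete, here is a minimal Python sketch of inferring a namespace-level value from topic-level samples by summing them. The sample data, the tenant and namespace names, and the helper sum_by_namespace are hypothetical and only illustrate the idea; in practice the same aggregation is typically done in the monitoring system itself.

```python
from collections import defaultdict

# Hypothetical topic-level samples of pulsar_rate_in:
# (namespace label, topic label, value in messages per second).
samples = [
    ("mytenant/ns1", "persistent://mytenant/ns1/orders-partition-0", 12.0),
    ("mytenant/ns1", "persistent://mytenant/ns1/orders-partition-1", 8.0),
    ("mytenant/ns2", "persistent://mytenant/ns2/events", 5.0),
]


def sum_by_namespace(samples):
    """Sum topic-level values into one value per namespace."""
    totals = defaultdict(float)
    for namespace, _topic, value in samples:
        totals[namespace] += value
    return dict(totals)


namespace_rate_in = sum_by_namespace(samples)
print(namespace_rate_in)
```

Summing is appropriate for rates, throughputs, and counts; whether it suits another metric depends on what that metric measures.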
The following table lists recommended namespace and/or topic metrics as a starting point.
Metrics Name | Namespace and/or Topic Level | Metrics Type | Note |
---|---|---|---|
pulsar_topics_count | Namespace | Gauge | The number of Pulsar topics in a namespace. |
pulsar_producers_count | Topic | Gauge | The number of active producers of a topic. |
pulsar_consumers_count | Topic | Gauge | The number of active consumers of a topic. |
pulsar_subscriptions_count | Topic | Gauge | The number of Pulsar subscriptions of a topic. |
pulsar_rate_in | Topic | Gauge | The total message rate (messages per second) coming into a topic. |
pulsar_rate_out | Topic | Gauge | The total message rate (messages per second) coming out of a topic. |
pulsar_throughput_in | Topic | Gauge | The total throughput (bytes per second) coming into a topic. |
pulsar_throughput_out | Topic | Gauge | The total throughput (bytes per second) coming out of a topic. |
pulsar_msg_backlog | Topic | Gauge | The total number of backlogged messages of a topic. |
pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
pulsar_storage_offloaded_size | Topic | Gauge | The total amount of data (in bytes) of a topic offloaded to tiered storage. |
pulsar_in_bytes_total | Topic | Counter | The total number of bytes received for a topic. |
pulsar_out_bytes_total | Topic | Counter | The total number of bytes read from a topic. |
pulsar_in_messages_total | Topic | Counter | The total number of messages received for a topic. |
pulsar_out_messages_total | Topic | Counter | The total number of messages read from a topic. |
Replication metrics
When geo-replication is enabled for a particular namespace, a subset of namespace metrics is available specifically for geo-replication purposes. Below is the list of recommended geo-replication metrics as a starting point.
Metrics Name | Namespace and/or Topic Level | Metrics Type | Note |
---|---|---|---|
pulsar_replication_rate_in | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating from a remote cluster. |
pulsar_replication_rate_out | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating to a remote cluster. |
pulsar_replication_throughput_in | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating from a remote cluster. |
pulsar_replication_throughput_out | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating to a remote cluster. |
pulsar_replication_backlog | Namespace | Gauge | The total message backlog of the namespace replicating to a remote cluster. |
Subscription metrics
The following table gives the list of recommended subscription metrics as a starting point.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_subscription_back_log | Gauge | The total backlog (number of messages) for a subscription of a topic. |
pulsar_subscription_delayed | Gauge | The total number of messages for a subscription of a topic that are delayed for dispatch. |
pulsar_subscription_msg_rate_redeliver | Gauge | The total rate (messages per second) of messages being redelivered for a subscription of a topic. |
pulsar_subscription_unacked_messages | Gauge | The total number of unacknowledged messages for a subscription of a topic. |
pulsar_subscription_blocked_on_unacked_messages | Gauge | Binary indicator (1 or 0) of whether a subscription of a topic is blocked on unacknowledged messages. |
pulsar_subscription_msg_rate_out | Gauge | The total message dispatch rate (messages per second) for a subscription of a topic. |
pulsar_subscription_msg_throughput_out | Gauge | The total message dispatch throughput (bytes per second) for a subscription of a topic. |
pulsar_subscription_msg_ack_rate | Gauge | The total message acknowledgment rate (messages per second) for a subscription of a topic. |
pulsar_subscription_msg_rate_expired | Gauge | The total rate (messages per second) of messages expired on a subscription of a topic. |
pulsar_subscription_total_msg_expired | Gauge | The total number of messages expired on a subscription of a topic. |
pulsar_subscription_msg_drop_rate | Gauge | The rate (messages per second) of messages dropped on a subscription of a topic. |
pulsar_subscription_consumers_count | Gauge | The number of connected consumers on a subscription of a topic. |
Function metrics
The following table gives the list of recommended function metrics as a starting point. This is only relevant when Pulsar functions are deployed in Astra Streaming.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_function_processed_successfully_total | Counter | The total number of messages processed successfully by a function. |
pulsar_function_received_total | Counter | The total number of messages a function receives. |
pulsar_function_process_latency_ms | Summary | The process latency (in milliseconds) of a function. |
Source connector metrics
The following table gives the list of recommended source connector metrics as a starting point. This is only relevant when Pulsar source connectors are deployed in Astra Streaming.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_source_written_total | Counter | The total number of messages written to a Pulsar topic by a source connector. |
pulsar_source_received_total | Counter | The total number of messages received by a source connector. |
Sink connector metrics
The following table gives the list of recommended sink connector metrics as a starting point. This is only relevant when Pulsar sink connectors are deployed in Astra Streaming.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_sink_written_total | Counter | The total number of messages processed by a sink connector. |
pulsar_sink_received_total | Counter | The total number of messages received by a sink connector. |
Aggregate Astra Streaming metrics
Each externally exposed raw Astra Streaming metric is reported at a very low level: per individual server instance (the exported_instance label) and per topic partition (the topic label). The same raw metric can come from multiple server instances. From an Astra Streaming user’s perspective, monitoring raw metrics directly is not very useful. Raw metrics need to be aggregated first, for example by averaging or summing them over a period of time.
Below is an example of a raw metric for the Pulsar message backlog (pulsar_msg_backlog) scraped from an Astra Streaming cluster located in the GCP US Central region:
pulsar_msg_backlog{app="pulsar", cluster="pulsar-gcp-uscentral1", component="broker", controller_revision_hash="pulsar-gcp-uscentral1-broker-<hash>f", exported_instance="<ip>:<port>", exported_job="broker", helm_release_name="astraproduction-gcp-pulsar-uscentral1", instance="prometheus-gcp-uscentral1.streaming.datastax.com:443", job="astra-pulsar-metrics-msgenrich", kubernetes_namespace="pulsar", kubernetes_pod_name="pulsar-gcp-uscentral1-broker-3", namespace="msgenrich/testns", prometheus="pulsar/astraproduction-gcp-pulsar-prometheus", prometheus_replica="prometheus-astraproduction-gcp-pulsar-prometheus-0", pulsar_cluster_dns="gcp-uscentral1.streaming.datastax.com", release="astraproduction-gcp-pulsar-uscentral1", statefulset_kubernetes_io_pod_name="pulsar-gcp-uscentral1-broker-3", topic="persistent://msgenrich/testns/raw-partition-0"}
To make raw metrics like this useful for end users, we recommend the following guidelines when aggregating raw metrics:
- Aggregate metrics to at least the parent topic level, instead of at the partition level. In Pulsar, end-user applications only deal with messages at the parent topic level, while Pulsar internally handles message processing at the partition level.
- Exclude reported metrics that are associated with Astra Streaming’s system namespaces and topics. These namespaces and topics normally have a name starting with __ (two underscores). For example, when Pulsar’s Kafka protocol handler is enabled (via S4K integration), a system namespace __kafka is created, containing a system topic called __transaction_producer_state.
- Do NOT aggregate metrics with the Astra Streaming Pay As You Go option, since one cluster may be shared among multiple organizations. For more, see Cluster Limits.
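The first two guidelines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes partitioned topics carry a -partition-<n> suffix (as in the raw metric shown earlier) and that topic labels have the form persistent://<tenant>/<namespace>/<topic>. The helper names parent_topic, is_system_topic, and backlog_by_parent_topic, and the sample values, are hypothetical.

```python
import re
from collections import defaultdict

# Assumed partition suffix, as seen in topic="...raw-partition-0".
_PARTITION_SUFFIX = re.compile(r"-partition-\d+$")


def parent_topic(topic: str) -> str:
    """Strip the partition suffix so partitions roll up to the parent topic."""
    return _PARTITION_SUFFIX.sub("", topic)


def is_system_topic(topic: str) -> bool:
    """True if the namespace or topic name starts with two underscores."""
    path = topic.split("://", 1)[-1]   # <tenant>/<namespace>/<topic>
    parts = path.split("/")
    return any(part.startswith("__") for part in parts[1:])


def backlog_by_parent_topic(samples):
    """Sum per-partition backlog values per parent topic, skipping system topics."""
    totals = defaultdict(float)
    for topic, value in samples:
        if is_system_topic(topic):
            continue
        totals[parent_topic(topic)] += value
    return dict(totals)


# Hypothetical pulsar_msg_backlog samples: (topic label, value).
samples = [
    ("persistent://msgenrich/testns/raw-partition-0", 10),
    ("persistent://msgenrich/testns/raw-partition-1", 5),
    ("persistent://msgenrich/__kafka/__transaction_producer_state", 99),
]
backlog = backlog_by_parent_topic(samples)
print(backlog)
```

Here the two partitions of raw are combined into one parent-topic value, and the __kafka system topic is dropped before aggregation.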
PromQL query patterns
Prometheus provides a powerful but easy-to-use query language called PromQL for selecting and aggregating time series data in real time. PromQL syntax is beyond this document’s scope, but the Prometheus documentation is a great place to start.
The rest of this section recommends some PromQL query patterns for aggregating raw Astra Streaming metrics. The examples use one raw metric, pulsar_msg_backlog, for illustration. Following the guidelines above, they aggregate at the parent topic level or above and exclude system topics.
System topics are excluded with the label matcher {topic !~ ".*__.*"}, which keeps only the time series whose topic label does not contain __. This works because Pulsar system topics usually have __ as the prefix of the topic or namespace name (e.g. persistent://<tenant>/__kafka/__consumer_offsets_partition_0). The pattern assumes that user applications do not have namespaces or topics with __ in their names; if they do, those are filtered out as well.
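To see which topic labels the exclusion matcher keeps, the following Python sketch mimics its behavior. Prometheus anchors regular expression matchers to the whole label value, so re.fullmatch is used here as the closest equivalent. The topic names and the helper kept_by_filter are hypothetical.

```python
import re

# Same regular expression as in the matcher topic !~ ".*__.*".
SYSTEM_PATTERN = re.compile(r".*__.*")


def kept_by_filter(topic: str) -> bool:
    """Mimic topic !~ ".*__.*": keep a series only if the pattern does not match."""
    # fullmatch mirrors the full anchoring Prometheus applies to regex matchers.
    return SYSTEM_PATTERN.fullmatch(topic) is None


topics = [
    "persistent://mytenant/ns1/orders-partition-0",
    "persistent://mytenant/__kafka/__consumer_offsets_partition_0",
    "persistent://mytenant/ns1/my__topic",  # user topic that happens to contain "__"
]
kept = [t for t in topics if kept_by_filter(t)]
print(kept)
```

The last topic shows the caveat: a user topic with __ anywhere in its name is excluded along with the system topics.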
Pattern 1: Get the total message backlog of a specific parent topic, excluding system topics. "$ptopic" is a (Grafana dashboard) variable that represents a specific parent topic.
sum(pulsar_msg_backlog{topic=~"$ptopic", topic !~ ".*__.*"})
Pattern 2: Get the total message backlog of a specific namespace, excluding system topics. "$namespace" is a (Grafana dashboard) variable that represents a specific namespace.
sum(pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})
Pattern 3: Get the total message backlog of a tenant, excluding system topics. "$tenant" is a (Grafana dashboard) variable that represents a specific tenant.
sum(pulsar_msg_backlog{namespace=~"$tenant.+", topic !~ ".*__.*"})
Pattern 4: Get the total message backlog of each topic within a specific namespace, excluding system topics.
sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})
Pattern 5: Get the top 10 message backlog by topic within a specific namespace, excluding system topics.
topk(10, sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"}))
Metrics to alert on
Most of the exposed Astra Streaming metrics are for informational purposes only; in most cases, their values simply reflect the characteristics of the application workload. Message rate and throughput are common examples.
There are, however, several metrics that need special attention when their values keep increasing. Among the exposed Astra Streaming metrics, these are:
Metrics Name | Metrics Category | Metrics Type | Note |
---|---|---|---|
pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
pulsar_replication_backlog | Geo-replication | Gauge | The total message backlog of the namespace replicating to a remote cluster. |
pulsar_subscription_back_log | Subscription | Gauge | The total backlog (number of messages) for a subscription of a topic. |
pulsar_subscription_delayed | Subscription | Gauge | The total number of messages for a subscription of a topic that are delayed for dispatch. |
pulsar_subscription_msg_drop_rate | Subscription | Gauge | The rate (messages per second) of messages dropped on a subscription of a topic. |
pulsar_subscription_unacked_messages | Subscription | Gauge | The total number of unacknowledged messages for a subscription of a topic. |
Alerting rules
In a perfect world, these metrics should always stay at 0, but in reality, these metrics will increase when the application workload becomes heavier. If your system is behaving correctly, these metrics should go down when the application workload drops.
A simple way to trigger an alert on these metrics is to set a threshold which triggers an alert when the metric exceeds it. However, this will probably cause false alarms during workload spikes.
A better approach is calculating the metrics' increase rate over a period of time (e.g. 1 hour) and setting a threshold on the rate of increase. For example, if the average message backlog increase rate exceeds a threshold, an alert is triggered.
The actual threshold values for these metrics are highly dependent on each application’s workload and requirements, but they should be relatively large positive numbers, e.g. several hundred or several thousand. Otherwise, they may trigger too many false alarms.
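As an illustration of the rate-of-increase approach, the following Python sketch compares a backlog that spikes and then drains with one that keeps growing. It is a simplified sketch, not a prescribed alerting rule: the helpers increase_rate and should_alert, the one-hour window, the sample values, and the threshold of 1000 messages per hour are all hypothetical.

```python
def increase_rate(samples):
    """Average increase per second between the first and last sample.

    samples: list of (unix_timestamp_seconds, value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def should_alert(samples, threshold_per_hour):
    """Alert when the average hourly increase exceeds the threshold."""
    return increase_rate(samples) * 3600 > threshold_per_hour


# Hypothetical message backlog samples over one hour.
spike = [(0, 100), (1800, 5000), (3600, 150)]      # workload spike that drains again
growing = [(0, 100), (1800, 2600), (3600, 5100)]   # backlog that keeps growing

print(should_alert(spike, 1000))
print(should_alert(growing, 1000))
```

A static threshold of 1000 on the raw value would have fired for both series at the 30-minute mark, while the rate-based check only flags the backlog that is still growing at the end of the window.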
What’s next?
- For a list of exposed endpoints for Astra Streaming metrics, see Grafana dashboards for Astra Streaming metrics.
- To scrape Astra Streaming metrics in Kubernetes with an external Prometheus server, see External Prometheus and Grafana Integration.
- To scrape Astra Streaming metrics into New Relic, see New Relic Integration.