Monitor streaming tenants
Because Astra Streaming is a managed SaaS offering, some Apache Pulsar metrics aren’t exposed for external integration purposes. At a high level, Astra Streaming only exposes metrics related to namespaces. Metrics that are not directly related to namespaces, such as the BookKeeper ledger and journal metrics and the ZooKeeper metrics, aren’t exposed externally.
Additionally, not all of the exposed metrics are recommended for external integration.
Pulsar raw metrics
For a complete Pulsar metrics reference, see:
For a complete Astra Streaming metrics reference, see Grafana dashboards for Astra Streaming metrics.
Astra Streaming metrics
Namespace and topic metrics
Astra Streaming exposes both namespace and topic level metrics. Namespace metrics can always be inferred from corresponding topic metrics via metrics aggregation.
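As a rough illustration of inferring a namespace metric from topic metrics, the following Python sketch sums topic-level values by the tenant/namespace part of each topic name. The sample values and the helper names (`namespace_of`, `sum_by_namespace`) are hypothetical, and summing is assumed to be the right aggregation for the metric in question (for example, rates or counts).

```python
from collections import defaultdict


def namespace_of(topic: str) -> str:
    """Return 'tenant/namespace' from a full topic name such as
    'persistent://demo/testns/raw-partition-0'."""
    # Drop the scheme ('persistent://' or 'non-persistent://'), then keep
    # the first two path segments: tenant and namespace.
    _, _, path = topic.partition("://")
    tenant, namespace, _ = path.split("/", 2)
    return f"{tenant}/{namespace}"


def sum_by_namespace(topic_values: dict[str, float]) -> dict[str, float]:
    """Roll topic-level values up to one total per namespace."""
    totals: dict[str, float] = defaultdict(float)
    for topic, value in topic_values.items():
        totals[namespace_of(topic)] += value
    return dict(totals)


# Hypothetical pulsar_rate_in values (messages per second), keyed by topic.
samples = {
    "persistent://demo/testns/orders": 12.0,
    "persistent://demo/testns/payments": 3.5,
    "persistent://demo/otherns/audit": 1.0,
}
namespace_rate_in = sum_by_namespace(samples)
print(namespace_rate_in)
```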
The following table lists recommended namespace and/or topic metrics as a starting point.
Metrics Name | Namespace and/or Topic Level | Metrics Type | Note |
---|---|---|---|
pulsar_topics_count | Namespace | Gauge | The number of Pulsar topics of a namespace. |
pulsar_producers_count | Topic | Gauge | The number of active producers of a topic. |
pulsar_consumers_count | Topic | Gauge | The number of active consumers of a topic. |
pulsar_subscriptions_count | Topic | Gauge | The number of Pulsar subscriptions of a topic. |
pulsar_rate_in | Topic | Gauge | The total message rate (messages per second) coming into a topic. |
pulsar_rate_out | Topic | Gauge | The total message rate (messages per second) coming out of a topic. |
pulsar_throughput_in | Topic | Gauge | The total throughput (bytes per second) coming into a topic. |
pulsar_throughput_out | Topic | Gauge | The total throughput (bytes per second) coming out of a topic. |
pulsar_msg_backlog | Topic | Gauge | The total number of messages in the backlog of a topic. |
pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
pulsar_storage_offloaded_size | Topic | Gauge | The total amount of data (in bytes) of a topic offloaded to tiered storage. |
pulsar_in_bytes_total | Topic | Counter | The total number of bytes received for a topic. |
pulsar_out_bytes_total | Topic | Counter | The total number of bytes read from a topic. |
pulsar_in_messages_total | Topic | Counter | The total number of messages received for a topic. |
pulsar_out_messages_total | Topic | Counter | The total number of messages read from a topic. |
Replication metrics
When geo-replication is enabled for a particular namespace, a subset of namespace metrics is available specifically for geo-replication purposes. Below is the list of recommended geo-replication metrics as a starting point.
Metrics Name | Namespace and/or Topic Level | Metrics Type | Note |
---|---|---|---|
pulsar_replication_rate_in | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating from a remote cluster. |
pulsar_replication_rate_out | Namespace | Gauge | The total message rate (messages per second) of the namespace replicating to a remote cluster. |
pulsar_replication_throughput_in | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating from a remote cluster. |
pulsar_replication_throughput_out | Namespace | Gauge | The total throughput (bytes per second) of the namespace replicating to a remote cluster. |
pulsar_replication_backlog | Namespace | Gauge | The total message backlog of the namespace replicating to a remote cluster. |
Subscription metrics
The following table gives the list of recommended subscription metrics as a starting point.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_subscription_back_log | Gauge | The total backlog (number of messages) for a subscription of a topic. |
pulsar_subscription_delayed | Gauge | The total number of messages whose dispatch is delayed for a subscription of a topic. |
pulsar_subscription_msg_rate_redeliver | Gauge | The total message rate (messages per second) being redelivered for a subscription of a topic. |
pulsar_subscription_unacked_messages | Gauge | The total number of unacknowledged messages for a subscription of a topic. |
pulsar_subscription_blocked_on_unacked_messages | Gauge | Binary indicator (1 or 0) of whether a subscription of a topic is blocked on unacknowledged messages. |
pulsar_subscription_msg_rate_out | Gauge | The total message dispatch rate (messages per second) for a subscription of a topic. |
pulsar_subscription_msg_throughput_out | Gauge | The total message dispatch throughput (bytes per second) for a subscription of a topic. |
pulsar_subscription_msg_ack_rate | Gauge | The total message acknowledgment rate (messages per second) for a subscription of a topic. |
pulsar_subscription_msg_rate_expired | Gauge | The total rate of messages (messages per second) expired on a subscription of a topic. |
pulsar_subscription_total_msg_expired | Gauge | The total number of messages expired on a subscription of a topic. |
pulsar_subscription_msg_drop_rate | Gauge | The rate of messages (messages per second) dropped on a subscription of a topic. |
pulsar_subscription_consumers_count | Gauge | The number of connected consumers on a subscription of a topic. |
Function metrics
The following table gives the list of recommended function metrics as a starting point. This is only relevant when Pulsar functions are deployed in Astra Streaming.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_function_processed_successfully_total | Counter | The total number of messages processed successfully by a function. |
pulsar_function_received_total | Counter | The total number of messages a function receives. |
pulsar_function_process_latency_ms | Summary | The process latency (in milliseconds) of a function. |
Source connector metrics
The following table gives the list of recommended source connector metrics as a starting point. This is only relevant when Pulsar source connectors are deployed in Astra Streaming.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_source_written_total | Counter | The total number of messages processed by a source connector. |
pulsar_source_received_total | Counter | The total number of messages received by a source connector. |
Sink connector metrics
The following table gives the list of recommended sink connector metrics as a starting point. This is only relevant when Pulsar sink connectors are deployed in Astra Streaming.
Metrics Name | Metrics Type | Note |
---|---|---|
pulsar_sink_written_total | Counter | The total number of messages processed by a sink connector. |
pulsar_sink_received_total | Counter | The total number of messages received by a sink connector. |
Aggregate Astra Streaming metrics
Do not aggregate metrics on shared clusters because one cluster can be shared among multiple organizations. For more information, see Astra Streaming limits and Astra Streaming pricing.
Each externally exposed raw Astra Streaming metric is reported at a very low level: per server instance (the exported_instance label) and per topic partition (the topic label). The same raw metric can come from multiple server instances. From an Astra Streaming user’s perspective, monitoring raw metrics directly is rarely useful. Raw metrics need to be aggregated first, for example, by averaging or summing them over a period of time.
The following example shows a raw metric for a Pulsar message backlog (pulsar_msg_backlog) scraped from an Astra Streaming cluster in the Google Cloud us-central1 region:
....
pulsar_msg_backlog{app="pulsar", cluster="pulsar-gcp-uscentral1", component="broker", controller_revision_hash="pulsar-gcp-uscentral1-broker-<hash>f", exported_instance="<ip>:<port>", exported_job="broker", helm_release_name="astraproduction-gcp-pulsar-uscentral1", instance="prometheus-gcp-uscentral1.streaming.datastax.com:443", job="astra-pulsar-metrics-demo", kubernetes_namespace="pulsar", kubernetes_pod_name="pulsar-gcp-uscentral1-broker-3", namespace="demo/testns", prometheus="pulsar/astraproduction-gcp-pulsar-prometheus", prometheus_replica="prometheus-astraproduction-gcp-pulsar-prometheus-0", pulsar_cluster_dns="gcp-uscentral1.streaming.datastax.com", release="astraproduction-gcp-pulsar-uscentral1", statefulset_kubernetes_io_pod_name="pulsar-gcp-uscentral1-broker-3", topic="persistent://demo/testns/raw-partition-0"}
....
To transform raw metrics into a usable state, DataStax recommends the following:
- Aggregate metrics at the parent topic level, at minimum, instead of at the partition level. In Pulsar, end user applications only deal with messages at the parent topic level; however, internally, Pulsar handles message processing at the partition level.
- Exclude reported metrics that are associated with Astra Streaming’s system namespaces and topics, which are usually prefixed by two underscores, such as __kafka and __transaction_producer_state.
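The two recommendations above can be sketched in Python for the case where raw samples are processed outside of Prometheus. This is a minimal sketch under stated assumptions: the sample values and broker names are made up, the helper names are hypothetical, and partitions are assumed to follow the `<parent>-partition-<N>` naming seen in the raw metric example.

```python
import re
from collections import defaultdict

# Partition topics are named "<parent>-partition-<N>".
_PARTITION_SUFFIX = re.compile(r"-partition-\d+$")


def parent_topic(topic: str) -> str:
    """Strip the partition suffix to get the parent topic name."""
    return _PARTITION_SUFFIX.sub("", topic)


def is_system_topic(topic: str) -> bool:
    """Treat any name containing two consecutive underscores as a system topic."""
    return "__" in topic


def backlog_by_parent_topic(samples):
    """Sum (labels, value) samples per parent topic, skipping system topics.

    Summing also merges samples for the same topic that were reported
    by different server instances.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        topic = labels["topic"]
        if is_system_topic(topic):
            continue
        totals[parent_topic(topic)] += value
    return dict(totals)


# Hypothetical pulsar_msg_backlog samples with a reduced label set.
raw_samples = [
    ({"topic": "persistent://demo/testns/raw-partition-0", "exported_instance": "broker-a"}, 40.0),
    ({"topic": "persistent://demo/testns/raw-partition-1", "exported_instance": "broker-b"}, 25.0),
    ({"topic": "persistent://demo/__kafka/__consumer_offsets_partition_0", "exported_instance": "broker-a"}, 7.0),
]
parent_backlog = backlog_by_parent_topic(raw_samples)
print(parent_backlog)
```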
PromQL query patterns
PromQL is Prometheus’s simple and powerful query language that you can use to select and aggregate time series data in real time. For more information, see the PromQL documentation.
DataStax recommends the following PromQL query patterns for aggregating raw Astra Streaming metrics.
The following examples use the pulsar_msg_backlog raw metric to demonstrate the patterns.
In accordance with the recommendations in Aggregate Astra Streaming metrics, the example patterns aggregate messages at the parent topic level or higher and they exclude system topics.
Filter system topics
You can use the following expression to filter system topics:
{topic !~ ".*__.*"}
This expression excludes time series whose topic label contains two consecutive underscores. This works because Pulsar system topics and namespaces are usually prefixed by two underscores, such as:
persistent://some_tenant/__kafka/__consumer_offsets_partition_0
To use this expression, make sure that your applications' namespace and topic names don’t contain double underscores. If they do, they are also excluded by this filter.
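To see which topic names the filter keeps, the following Python sketch applies the same pattern to a few hypothetical topic names. PromQL regex matchers are fully anchored, so `re.fullmatch` is used here as an approximation of `!~`. The second topic is an application topic that happens to contain a double underscore, which shows the caveat described above.

```python
import re

# Same pattern as the PromQL matcher: topic !~ ".*__.*"
SYSTEM_TOPIC_PATTERN = re.compile(r".*__.*")


def kept_by_filter(topic: str) -> bool:
    """Return True when the topic would pass the negative matcher."""
    # fullmatch approximates PromQL's anchored regex matching.
    return SYSTEM_TOPIC_PATTERN.fullmatch(topic) is None


# Hypothetical topic label values.
topics = [
    "persistent://demo/testns/orders",
    "persistent://demo/testns/order__events",
    "persistent://some_tenant/__kafka/__consumer_offsets_partition_0",
]
kept = [t for t in topics if kept_by_filter(t)]
print(kept)
```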
Get the total message backlog of a specific parent topic, excluding system topics
$ptopic is a Grafana dashboard variable that represents a specific parent topic.
sum(pulsar_msg_backlog{topic=~"$ptopic", topic !~ ".*__.*"})
Get the total message backlog of a specific namespace, excluding system topics
$namespace is a Grafana dashboard variable that represents a specific namespace.
sum(pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})
Get the total message backlog of a tenant, excluding system topics
$tenant is a Grafana dashboard variable that represents a specific tenant.
sum(pulsar_msg_backlog{namespace=~"$tenant.+", topic !~ ".*__.*"})
Get the total message backlog of each topic within a specific namespace, excluding system topics
sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"})
Get the top 10 message backlog by topic within a specific namespace, excluding system topics
topk (10, sum by(topic) (pulsar_msg_backlog{namespace=~"$namespace", topic !~ ".*__.*"}))
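Outside of Grafana, the dashboard variables have to be substituted by hand. The following Python sketch renders two of the patterns above for a given namespace and builds a URL for the instant query endpoint (`/api/v1/query`) of the Prometheus HTTP API. It assumes the metrics are stored in a Prometheus server that you can query. The base URL and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Matcher that excludes system topics, as recommended above.
SYSTEM_TOPIC_FILTER = 'topic !~ ".*__.*"'


def namespace_backlog_query(namespace: str) -> str:
    """Total message backlog of a namespace, excluding system topics."""
    return (
        f'sum(pulsar_msg_backlog{{namespace=~"{namespace}", '
        f"{SYSTEM_TOPIC_FILTER}}})"
    )


def top_backlog_query(namespace: str, limit: int = 10) -> str:
    """Top message backlogs by topic within a namespace."""
    return (
        f"topk({limit}, sum by(topic) "
        f'(pulsar_msg_backlog{{namespace=~"{namespace}", '
        f"{SYSTEM_TOPIC_FILTER}}}))"
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


query = namespace_backlog_query("demo/testns")
# Hypothetical Prometheus server address.
url = instant_query_url("https://prometheus.example.com/", query)
print(query)
print(url)
```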
Metrics alerts
Most of the exposed Astra Streaming metrics reflect generic application workload characteristics, such as message rate or throughput, and they are for informational purposes only.
However, DataStax recommends that you monitor the following metrics for unexpected increases:
Metrics Name | Aggregate | Metrics Type | Note |
---|---|---|---|
pulsar_storage_size | Topic | Gauge | The total storage size (in bytes) of a topic. |
pulsar_storage_backlog_size | Topic | Gauge | The total backlog size (in bytes) of a topic. |
pulsar_replication_backlog | Geo-replication | Gauge | The total message backlog of the namespace replicating to a remote cluster. |
pulsar_subscription_back_log | Subscription | Gauge | The total backlog (number of messages) for a subscription of a topic. |
pulsar_subscription_delayed | Subscription | Gauge | The total number of messages whose dispatch is delayed for a subscription of a topic. |
pulsar_subscription_msg_drop_rate | Subscription | Gauge | The rate of messages (messages per second) dropped on a subscription of a topic. |
pulsar_subscription_unacked_messages | Subscription | Gauge | The total number of unacknowledged messages for a subscription of a topic. |
Alerting rules
In a perfect world, these metrics would always be 0.
In reality, these metrics will increase when an application’s workload increases, and then return to normal when the workload decreases.
You can set an alert threshold to be notified when these metrics exceed normal capacity, but this can cause false alarms during expected workload spikes.
Alternatively, you can calculate the metrics' increase rate over a period of time, such as one hour, and then set a threshold based on the rate of increase. For example, if the average message backlog increase rate exceeds the given threshold, an alert is triggered.
Thresholds for these metrics depend on your application’s routine workloads and requirements. Generally, these values are large positive numbers, in the hundreds or thousands. If you receive too many false alarms, adjust the alert threshold to a higher value.
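The rate-based approach can be sketched in Python as follows. This is a minimal sketch, not a definitive alerting implementation: the sample values, the thresholds, and the function names are hypothetical, and the samples are assumed to be already aggregated backlog values ordered by time.

```python
def increase_rate(samples):
    """Average increase per second between the first and last sample.

    samples is a list of (unix_seconds, value) pairs ordered by time.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time window")
    return (v1 - v0) / (t1 - t0)


def should_alert(samples, threshold_per_hour):
    """Alert when the hourly rate of increase exceeds the threshold."""
    return increase_rate(samples) * 3600 > threshold_per_hour


# Hypothetical aggregated backlog values over a one-hour window.
backlog_samples = [(0, 1_000), (1_800, 1_900), (3_600, 4_600)]
hourly_increase = increase_rate(backlog_samples) * 3600
print(hourly_increase, should_alert(backlog_samples, 2_000))
```

With these values, the backlog grows by 3,600 messages over the hour, so a threshold of 2,000 messages per hour triggers an alert while a threshold of 5,000 does not. In PromQL, the delta() function computes a comparable difference for gauges over a range.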