Scale observability components in Mission Control
Mission Control automatically configures and collects telemetry data as cluster topologies change during scaling, node replacement, and other operations. In larger or busier environments, adjust observability capacity to keep monitoring reliable. This guide explains how to assess demand, decide when to scale, and tune each observability component.
Run the commands in this guide against the control plane cluster where Mission Control observability components are deployed.
Observability sizing
Mission Control provides observability components to monitor and manage the health of your deployments. These components include Vector for metrics collection, Grafana for visualization, Loki for log aggregation, and Mimir for long-term storage and query execution.
You can configure Mission Control to run observability components on dedicated Kubernetes nodes labeled with mission-control.datastax.com/role: platform.
This separation prevents platform services from competing with database workloads for resources.
When you scale observability components, ensure your platform nodes have sufficient resources.
For more information, see Mission Control installation requirements and Pin workloads to specific hosts in shared clusters.
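For example, assuming a node named NODE_NAME that you want to dedicate to platform services, you can apply the label with kubectl:

kubectl label node NODE_NAME mission-control.datastax.com/role=platform

Replace NODE_NAME with the name of the Kubernetes node that should run observability workloads.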
Proper observability sizing depends on two primary factors: ingestion and retrieval.
For ingestion, the number of database nodes and tables determines the volume of telemetry you collect. Each node emits metrics, and each table increases the number of metrics per node. These factors directly affect the resources required to ingest and process telemetry data.
For retrieval, dashboard traffic, query complexity, and retention policies determine the resources you need. The observability stack must efficiently store and retrieve monitoring data for dashboards, alerting, and operational investigation.
Determine your cluster metrics
Before you scale observability components, gather cluster metrics and classify the deployment as Small, Medium, Large, or Enterprise. Enterprise deployments are large-scale deployments (typically 45+ nodes) that require additional configuration and planning beyond the Large baseline. For more information, see Enterprise-scale considerations.
Count database nodes
To count the total number of database nodes across all regions and datacenters, query the MissionControlCluster object in the control plane:
kubectl get missioncontrolcluster -A -o json | jq '[.items[].status.datacenters[].size] | add'
Count datacenters and regions
To count the number of datacenters across all regions, query the MissionControlCluster object in the control plane:
kubectl get missioncontrolcluster -A -o json | jq '[.items[].status.datacenters[]] | length'
Count tables
To count the number of tables across all keyspaces:
- Use the CLI

  kubectl exec -it POD_NAME -n NAMESPACE -- cqlsh -u USERNAME -p PASSWORD -e "SELECT COUNT(*) FROM system_schema.tables;"

  Replace the following:

  - POD_NAME: Your pod name
  - NAMESPACE: Your namespace
  - USERNAME: Your database username
  - PASSWORD: Your database password

- Use the Mission Control UI

  In the UI, open the CQL console and run this query:

  SELECT COUNT(*) FROM system_schema.tables;
Check current resource usage
To view current CPU and memory usage for observability components:
# Vector Aggregator
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=aggregator
# Grafana
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=grafana
# Loki
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=loki
# Mimir
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=mimir
Replace MC_NAMESPACE with the namespace where you deployed the Mission Control Helm chart.
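To compare observed usage against the configured requests and limits, one option is to print them with custom columns. This example reuses the Mimir label selector from the previous commands; adjust the selector for other components:

kubectl get pods -n MC_NAMESPACE -l app.kubernetes.io/name=mimir -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'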
Check metrics series count
To determine the number of metric series in Mimir:
kubectl exec -it MIMIR_QUERY_FRONTEND_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/prometheus/api/v1/status/tsdb | jq '.data.headStats.numSeries'
Replace the following:

- MIMIR_QUERY_FRONTEND_POD: Your Mimir query frontend pod name, for example, mimir-query-frontend-0.
- MC_NAMESPACE: The namespace where you deployed the Mission Control Helm chart.
Check log ingestion rate
To check the current log ingestion rate in Loki:
kubectl exec -it LOKI_READ_POD -n MC_NAMESPACE -- wget -qO- http://localhost:3100/metrics | grep loki_distributor_bytes_received_total
Replace the following:

- LOKI_READ_POD: Your Loki read pod name, for example, loki-read-0.
- MC_NAMESPACE: The namespace where you deployed the Mission Control Helm chart.
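Because loki_distributor_bytes_received_total is a cumulative counter, its growth rate is more informative than the raw value. If this metric is scraped into Mimir, a Grafana panel using a standard PromQL rate expression, such as the following sketch, charts bytes received per second:

rate(loki_distributor_bytes_received_total[5m])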
When to scale
Scale observability components before infrastructure changes create pressure or when performance degrades and resources tighten. For example, if you plan to deploy a new cluster with 50 tables, pre-scale the observability stack so existing clusters keep reporting metrics without interruption.
| Symptom | Primary component | Recommended action |
|---|---|---|
| Metrics collection lag or drops | Vector or an upstream dependency | Increase resources, adjust batches and timeouts, or increase buffer size and backpressure tolerance. If Loki is slow, Vector buffers can fill even when Mimir is healthy. |
| Request timeouts, event drops, or frequent source send cancelled messages | Vector | Enable compression and increase buffer size. |
| Slow or unresponsive dashboards | Grafana | Increase resources or deploy multiple stateless Grafana instances with a shared backend. |
| Log ingestion lag or dropped logs | Loki | Scale read, write, or backend pods and review ingestion rate limits. |
| Slow or failed metric queries | Mimir | Add queriers and review compaction, retention, and query frontend performance. |
| HTTP 429 errors, out-of-order sample rejections, or increasing discarded samples | Mimir | Increase ingestion_rate and ingestion_burst_size. |
Vector component scaling
Scale Vector when collection or forwarding no longer keeps up with telemetry volume.
Scale Vector when you observe any of the following conditions:
- CPU consistently above 80%
- Memory pressure above 80% of allocated memory
- Metrics drops or delays
- Slow processing times
- Request timeouts every 1 to 2 minutes
- Source send cancelled messages
- Event drops due to backpressure
Vector monitoring and scaling reference
| Metric or signal | Threshold | Action |
|---|---|---|
|  | Rate < 10000 per second | Scale resources |
|  | > 0 per minute | Investigate configuration |
|  | > 80% of capacity | Increase buffer size |
|  | p95 > 1 second | Optimize processing |
|  | Drops or stalls | Enable compression and increase buffer size |
| Request timeouts | Every 1 to 2 minutes | Enable compression and increase buffer size |
| Source send cancelled messages | Frequent occurrences | Check backpressure and downstream rate limits |
Scale Vector vertically by increasing CPU and memory, or horizontally by adding instances behind proper load balancing. Tune replicas, resources, batch sizes, timeouts, buffer capacity, compression, and maximum buffer size. Target about 50% CPU utilization.
High-volume Vector configuration: For deployments with 45 or more nodes, enable compression and increase buffer size to prevent request timeouts and pod restarts. For additional enterprise-scale guidance, see Enterprise-scale considerations.
Vector baseline configuration
- Small

  For deployments with one region, up to three nodes, and up to 1000 tables.

  vector:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    replicas: 1
    config:
      sources:
        metrics:
          batch:
            max_events: 1000
            timeout_secs: 30
      sinks:
        prometheus:
          buffer:
            max_events: 5000

- Medium

  For deployments with two regions, up to six nodes, and up to 2000 tables.

  vector:
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
    replicas: 2
    config:
      sources:
        metrics:
          batch:
            max_events: 2000
            timeout_secs: 60
      sinks:
        prometheus:
          buffer:
            max_events: 10000

- Large

  For deployments with more than two regions, more than six nodes, and more than 2000 tables.

  vector:
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
    replicas: 3
    config:
      sources:
        metrics:
          batch:
            max_events: 5000
            timeout_secs: 120
      sinks:
        prometheus:
          buffer:
            max_events: 20000
        vector_aggregator:
          compression: true
          buffer:
            max_size: 2147483648

  Use compression and larger buffers to prevent request timeouts and pod restarts in high-volume deployments.
Vector troubleshooting
Use the following guidance to troubleshoot and configure Vector:
- If metrics are dropped, check resource limits and batch sizes.
- If processing is slow, increase CPU or reduce batch sizes.
- If memory usage is high, increase memory or adjust buffer sizes.
- Place these entries under the vector key in values.yaml.
- For upstream configuration keys, see the Vector configuration documentation.
- For general product information, see the Vector documentation.
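For example, a minimal values.yaml excerpt that lowers the metrics batch size, one of the adjustments listed above, might look like the following sketch. The keys mirror the baseline configurations in this guide; apply the change with a Helm upgrade of your Mission Control release.

vector:
  config:
    sources:
      metrics:
        batch:
          max_events: 1000   # smaller batches reduce per-batch processing cost
          timeout_secs: 30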
Grafana component scaling
Scale Grafana when dashboard rendering, query load, or user traffic exceed current capacity.
Scale Grafana when you observe any of the following conditions:
- Dashboard load times longer than 5 seconds
- Frequent panel timeouts
- CPU consistently above 70%
- Memory pressure above 80% of allocated memory
Grafana monitoring and scaling reference
| Metric | Threshold | Action |
|---|---|---|
|  | p95 > 1 second | Optimize dashboards |
|  | p95 > 5 seconds | Reduce panel complexity |
|  | > 10 per hour | Consider rate limiting |
|  | p95 > 2 seconds | Investigate authentication |
Scale Grafana vertically by increasing CPU and memory, or horizontally by deploying multiple stateless instances with a shared backend. Tune replicas, resources, cache behavior, and query timeouts. Target 50 to 100 concurrent users per instance and about 70% CPU utilization.
Grafana baseline configuration
- Small

  grafana:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    replicas: 1

- Medium

  grafana:
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
    replicas: 2

- Large

  grafana:
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
    replicas: 3
Grafana troubleshooting
Use the following guidance to troubleshoot and configure Grafana:
- If dashboards are slow, check backend performance and network latency.
- If panels time out, increase timeout settings or reduce query complexity.
- If memory usage is high, increase memory or reduce dashboard refresh rates.
- Place these entries under the grafana key in values.yaml.
- For upstream configuration keys, see the Grafana Helm chart repo.
- For product-specific guidance, see Grafana in the Helm installation guide.
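For example, a values.yaml excerpt that adds a second stateless Grafana replica, assuming dashboards and data sources are provisioned from a shared backend rather than stored locally, follows the same shape as the Medium baseline above:

grafana:
  replicas: 2
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 1
      memory: 2Gi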
Loki component scaling
Scale Loki when log ingestion, storage demand, or query load exceed current read, write, or backend capacity.
Scale Loki when you observe any of the following conditions:
- Log drops
- Slow ingestion
- Querier timeouts
Loki monitoring and scaling reference
| Metric | Threshold | Action |
|---|---|---|
|  | > 80% of limit | Scale ingesters |
|  | > 80% of limit | Adjust chunk size |
|  | > 1 GB/s | Scale distributors |
|  | p95 > 5 seconds | Optimize queries |
Scale Loki vertically by increasing CPU and memory, or horizontally by scaling read, write, and backend pods. Tune storage, retention periods, ingestion rate limits, and compaction behavior.
Loki baseline configuration
- Small

  loki:
    enabled: true
    loki:
      storage:
        bucketNames:
          chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 7d
        ingestion_rate_mb: 4
        ingestion_burst_size_mb: 6
    read:
      persistence:
        enabled: true
        size: 5Gi
        storageClassName: ""
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    write:
      persistence:
        enabled: true
        size: 5Gi
        storageClassName: ""
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    backend:
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    gateway:
      replicas: 1
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    chunksCache:
      enabled: false
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    resultsCache:
      enabled: false
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi

- Medium

  loki:
    enabled: true
    loki:
      storage:
        bucketNames:
          chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 15d
        ingestion_rate_mb: 8
        ingestion_burst_size_mb: 12
    read:
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: ""
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 1
          memory: 2Gi
    write:
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: ""
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 1
          memory: 2Gi
    backend:
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 1
          memory: 2Gi
    gateway:
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    chunksCache:
      enabled: false
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    resultsCache:
      enabled: false
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi

- Large

  loki:
    enabled: true
    loki:
      storage:
        bucketNames:
          chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 30d
        ingestion_rate_mb: 16
        ingestion_burst_size_mb: 24
    read:
      persistence:
        enabled: true
        size: 20Gi
        storageClassName: ""
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 4Gi
    write:
      persistence:
        enabled: true
        size: 20Gi
        storageClassName: ""
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 4Gi
    backend:
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 4Gi
    gateway:
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 512Mi
    chunksCache:
      enabled: true
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    resultsCache:
      enabled: true
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
Loki troubleshooting
Use the following guidance to troubleshoot and configure Loki:
- If logs are dropped, check ingestion rate limits and storage capacity.
- If queries are slow, optimize query patterns and increase querier resources.
- If memory usage is high, increase memory or reduce the retention period.
- Place these entries under the loki key in values.yaml.
- For upstream configuration keys, see the Loki Helm chart repo.
- For sizing guidance, see size Loki in the Grafana documentation.
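For example, if logs are dropped because of rate limiting, a values.yaml excerpt that raises the ingestion limits, mirroring the Medium baseline above, might look like this sketch:

loki:
  loki:
    limits_config:
      ingestion_rate_mb: 8         # per-tenant ingestion rate limit
      ingestion_burst_size_mb: 12  # allow short bursts above the rate limit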
Mimir component scaling
Scale Mimir when metrics ingestion, storage demand, or query load exceed the capacity of distributors, ingesters, or query components.
Scale Mimir when you observe any of the following conditions:
- Query slowness
- Alertmanager lag
- Ingester overloads
- HTTP 429 errors from distributors
- Out-of-order sample rejections
- Increasing discarded samples
Mimir monitoring and scaling reference
| Metric or signal | Threshold | Action |
|---|---|---|
|  | > 192k per ingester, or 80% of the 240k target | Scale ingesters |
|  | > 80% of limit | Adjust chunk size |
|  | Rate approaching the configured ingestion_rate limit | Scale distributors |
|  | > 80% of capacity | Scale distributors |
|  | p95 > 5 seconds | Optimize queries or scale queriers |
|  | > 10 consistently | Scale query frontend or queriers |
| Distributor CPU utilization | > 80% | Add distributor replicas |
| HTTP 429 responses | Any occurrences | Increase ingestion_rate and ingestion_burst_size |
| Out-of-order sample rejections | Samples delayed more than 5 minutes | Check Vector timeouts and increase Mimir rate limits |
| Discarded samples | Increasing | Investigate rate limiting or out-of-order issues |
| Ingester active items | Consistently high | Scale ingesters or increase memory |
Scale Mimir vertically by increasing CPU and memory, or horizontally by adding ingesters and queriers. Enable compaction and retention controls, and use object storage for long-term retention.
Tune the following Mimir components and settings:
- alertmanager: replicas, resources, sharding ring replication factor
- ingester: replicas, storage backend, persistent volume size, resources, ring replication factor
- store_gateway: persistent volume, replicas, resources
- compactor: retention period, persistent volume, replicas, resources
- distributor: replicas, resources, ring replication factor
- querier: replicas, resources
- query_frontend: replicas, resources
- Mimir limits: ingestion burst size, ingestion rate, maximum label names per series, out-of-order time window
Target the following performance metrics:
- 240,000 series per ingester at 50% utilization
- Size distributors by ingestion rate and target 50% CPU utilization at 2 cores per instance
- Size storage for the retention period and metrics cardinality
- 250 queries per second per query frontend
- 10 queries per second per querier
Mimir baseline configuration
- Small

  mimir:
    alertmanager:
      enabled: true
      extraArgs:
        alertmanager-storage.backend: local
        alertmanager-storage.local.path: /etc/alertmanager/config
        alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
        alertmanager.sharding-ring.replication-factor: "1"
      extraVolumeMounts:
        - mountPath: /etc/alertmanager/config
          name: alertmanager-config
        - mountPath: /alertmanager
          name: alertmanager-config-tmp
      extraVolumes:
        - name: alertmanager-config
          secret:
            secretName: alertmanager-config
        - emptyDir: {}
          name: alertmanager-config-tmp
      persistentVolume:
        accessModes:
          - ReadWriteOnce
        enabled: "1"
        size: 5Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    ingester:
      extraArgs:
        ingester.max-global-series-per-user: "0"
        ingester.ring.replication-factor: "1"
      persistentVolume:
        size: 10Gi
      replicas: "1"
      resources:
        limits:
          cpu: 2
          memory: 5Gi
        requests:
          cpu: 1
          memory: 2Gi
    store_gateway:
      persistentVolume:
        size: 10Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    compactor:
      extraArgs:
        compactor.blocks-retention-period: 15d
      persistentVolume:
        enabled: "1"
        size: 10Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 2Gi
    distributor:
      extraArgs:
        ingester.ring.replication-factor: "1"
      replicas: "1"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    querier:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    query_frontend:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    nginx:
      replicas: "1"
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    overrides_exporter:
      replicas: "1"
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    query_scheduler:
      replicas: "1"
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
    ruler:
      replicas: "1"
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
    mimir:
      structuredConfig:
        activity_tracker:
          filepath: /data/activity.log
        limits:
          ingestion_burst_size: 50000
          ingestion_rate: 25000
          max_label_names_per_series: 60
          out_of_order_time_window: 2m

- Medium

  For deployments approaching 45 or more nodes, monitor for HTTP 429 errors and consider increasing ingestion_rate to 250000 and ingestion_burst_size to 500000. For more information, see Enterprise-scale considerations.

  mimir:
    alertmanager:
      enabled: true
      extraArgs:
        alertmanager-storage.backend: local
        alertmanager-storage.local.path: /etc/alertmanager/config
        alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
        alertmanager.sharding-ring.replication-factor: "2"
      extraVolumeMounts:
        - mountPath: /etc/alertmanager/config
          name: alertmanager-config
        - mountPath: /alertmanager
          name: alertmanager-config-tmp
      extraVolumes:
        - name: alertmanager-config
          secret:
            secretName: alertmanager-config
        - emptyDir: {}
          name: alertmanager-config-tmp
      persistentVolume:
        accessModes:
          - ReadWriteOnce
        enabled: "1"
        size: 10Gi
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    ingester:
      extraArgs:
        ingester.max-global-series-per-user: "0"
        ingester.ring.replication-factor: "1"
      persistentVolume:
        size: 30Gi
      replicas: "3"
      resources:
        limits:
          cpu: 2
          memory: 5Gi
        requests:
          cpu: 1
          memory: 2Gi
    store_gateway:
      persistentVolume:
        size: 30Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    compactor:
      extraArgs:
        compactor.blocks-retention-period: 30d
      persistentVolume:
        enabled: "1"
        size: 30Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 2Gi
    distributor:
      extraArgs:
        ingester.ring.replication-factor: "1"
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    querier:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    query_frontend:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    nginx:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    overrides_exporter:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 128Mi
    query_scheduler:
      replicas: "1"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 128Mi
    ruler:
      replicas: "1"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 128Mi
    mimir:
      structuredConfig:
        activity_tracker:
          filepath: /data/activity.log
        limits:
          ingestion_burst_size: 100000
          ingestion_rate: 50000
          max_label_names_per_series: 120
          out_of_order_time_window: 5m

- Large

  These higher rate limits help prevent HTTP 429 errors and cascading failures in large deployments.

  mimir:
    alertmanager:
      enabled: true
      extraArgs:
        alertmanager-storage.backend: local
        alertmanager-storage.local.path: /etc/alertmanager/config
        alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
        alertmanager.sharding-ring.replication-factor: "3"
      extraVolumeMounts:
        - mountPath: /etc/alertmanager/config
          name: alertmanager-config
        - mountPath: /alertmanager
          name: alertmanager-config-tmp
      extraVolumes:
        - name: alertmanager-config
          secret:
            secretName: alertmanager-config
        - emptyDir: {}
          name: alertmanager-config-tmp
      persistentVolume:
        accessModes:
          - ReadWriteOnce
        enabled: "1"
        size: 20Gi
      replicas: "3"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    ingester:
      extraArgs:
        ingester.max-global-series-per-user: "0"
        ingester.ring.replication-factor: "1"
      persistentVolume:
        size: 50Gi
      replicas: "9"
      resources:
        limits:
          cpu: 2
          memory: 5Gi
        requests:
          cpu: 1
          memory: 2Gi
    store_gateway:
      persistentVolume:
        size: 50Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    compactor:
      extraArgs:
        compactor.blocks-retention-period: 60d
      persistentVolume:
        enabled: "1"
        size: 50Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 2Gi
    distributor:
      extraArgs:
        ingester.ring.replication-factor: "1"
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    querier:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    query_frontend:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    nginx:
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 512Mi
    overrides_exporter:
      replicas: "2"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    query_scheduler:
      replicas: "2"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 200m
          memory: 256Mi
    ruler:
      replicas: "2"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 200m
          memory: 256Mi
    mimir:
      structuredConfig:
        activity_tracker:
          filepath: /data/activity.log
        limits:
          ingestion_burst_size: 500000
          ingestion_rate: 250000
          max_label_names_per_series: 240
          out_of_order_time_window: 10m
        distributor:
          instance_limits:
            max_ingestion_rate: 300000
        ingester:
          instance_limits:
            max_ingestion_rate: 300000
    runtimeConfig:
      overrides:
        anonymous:
          ingestion_rate: 500000
          ingestion_burst_size: 1000000
Mimir troubleshooting
Use the following guidance to troubleshoot and configure Mimir:
- If queries are slow, check query frontend and store gateway performance.
- If ingestion is slow, check distributor and ingester performance.
- If memory usage is high, increase memory or optimize compaction settings.
- If you see HTTP 429 errors, increase ingestion_rate and ingestion_burst_size.
- If Mimir rejects out-of-order samples, check for pipeline backpressure, investigate Vector timeouts, and increase Mimir rate limits.
- If you observe cascading failures, address the root cause by increasing Mimir ingestion limits and enabling Vector compression.
- If ingesters show high memory usage with many active items, scale ingesters or increase memory.
- Place these entries under the mimir key in values.yaml.
- For upstream configuration keys, see the Mimir Helm chart repo.
- For general product information, see the Mimir documentation.
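For example, if metric queries remain slow after reviewing the query frontend and store gateway, a values.yaml excerpt that adds querier capacity, following the key layout of the baselines above, might look like this sketch:

mimir:
  querier:
    replicas: "2"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi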
Cross-component reference
This section provides best practices and troubleshooting guidance for managing observability components across your Mission Control deployment.
Best practices
Follow these best practices for observability component management:
- Monitor resource usage and adjust limits as needed.
- Use dashboards to monitor ingestion rates and resource usage.
- Set alerts on CPU, memory, and data-drop metrics.
- Scale proactively before adding new database clusters.
- Enable rate limits and burst protection to avoid overloading the system.
Consider the following when scaling:
- Performance optimization

  Optimize observability component performance by implementing query optimization techniques and resource management strategies.

- Query optimization

  Optimize queries with the following techniques:

  - Use appropriate time ranges for queries.
  - Implement query caching where possible.
  - Use rate() and increase() for counter metrics.
  - Avoid high-cardinality labels.
  - Use recording rules for complex queries (see the sketch after this list).

- Storage optimization

  Optimize storage with the following strategies:

  - Implement data lifecycle policies.
  - Use appropriate retention periods.
  - Enable compression where available.
  - Consider tiered storage for long-term data.
  - Monitor and adjust chunk sizes.

- Resource optimization

  Optimize resources with the following approaches:

  - Right-size resource requests and limits.
  - Implement proper pod scheduling.
  - Use node affinity for critical components.
  - Monitor and adjust resource quotas.
  - Implement proper garbage collection.
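As referenced in the query optimization techniques above, a recording rule precomputes an expensive expression so dashboards read the cheaper recorded series. The following is a minimal sketch in the standard Prometheus/Mimir rule format; the metric and rule names are placeholders, not names from your deployment:

groups:
  - name: observability-recording-rules
    rules:
      # Precompute a 5-minute rate so dashboards query the recorded series
      # instead of re-evaluating the expression on every refresh.
      - record: job:example_requests:rate5m
        expr: rate(example_requests_total[5m])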
Quick reference
Use the following tables as a quick reference for component sizing and troubleshooting common issues.
Component sizing
| Component | Small (≤3 nodes, ≤1000 tables) | Medium (≤6 nodes, ≤2000 tables) | Large (>6 nodes, >2000 tables) |
|---|---|---|---|
| Vector | 1 replica + 1 CPU, 2Gi | 2 replicas + 2 CPU, 4Gi | 3 replicas + 4 CPU, 8Gi |
| Grafana | 1 replica + 1 CPU, 2Gi | 2 replicas + 2 CPU, 4Gi | 3 replicas + 4 CPU, 8Gi |
| Loki | 3 replicas + 6 CPU, 12Gi | 6 replicas + 12 CPU, 24Gi | 9 replicas + 24 CPU, 48Gi |
| Mimir Distributor | 1 replica + 2 CPU, 2Gi | 2 replicas + 4 CPU, 4Gi | 2 replicas + 4 CPU, 4Gi |
| Mimir Ingester | 1 replica + 2 CPU, 5Gi | 3 replicas + 6 CPU, 15Gi | 9 replicas + 18 CPU, 45Gi |
| Mimir Querier | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi |
| Mimir Query Frontend | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi |
| Mimir Compactor | 1 replica + 1 CPU, 4Gi | 1 replica + 1 CPU, 4Gi | 1 replica + 1 CPU, 4Gi |
| Mimir Store Gateway | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi |
| Mimir Alertmanager | 1 replica + 1 CPU, 2Gi | 2 replicas + 2 CPU, 2Gi | 3 replicas + 6 CPU, 6Gi |
Storage requirements
| Component | Small | Medium | Large |
|---|---|---|---|
| Loki, per component | 5Gi read, 5Gi write | 10Gi read, 10Gi write | 20Gi read, 20Gi write |
| Mimir Ingester | 10Gi per replica | 30Gi per replica | 50Gi per replica |
| Mimir Store Gateway | 10Gi | 30Gi | 50Gi |
| Mimir Compactor | 10Gi | 30Gi | 50Gi |
| Mimir Alertmanager | 5Gi | 10Gi | 20Gi |
Metrics capacity
| Deployment size | Estimated series count | Ingester replicas | Target series per ingester |
|---|---|---|---|
| Small | Up to 240k series | 1 | 240k |
| Medium | Up to 720k series | 3 | 240k |
| Large | Up to 2.16M series | 9 | 240k |
Enterprise-scale considerations
For enterprise-scale deployments, additional planning and configuration are required to handle high-volume telemetry data and ensure system reliability.
Scaling guidelines for enterprises
For enterprise-scale deployments, use the following formulas to calculate required resources.
- Mimir ingester replicas

  Ingester replicas = Total series ÷ 240,000. Round up to the nearest integer.

- Mimir distributor replicas

  Distributor replicas = Ingestion rate (samples per second) ÷ 50,000. Round up to the nearest integer, and target 50% CPU utilization at 2 cores per instance.

- Storage per ingester

  Storage = (Series count × Average sample size × Retention period) ÷ Ingester count. Add a 20% buffer for overhead.
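As a worked example, applying these formulas to the enterprise deployment described later in this section (6.38 million series ingested at 218,722 samples per second) gives:

Ingester replicas = ceil(6,380,000 ÷ 240,000) = 27
Distributor replicas = ceil(218,722 ÷ 50,000) = 5

These counts match the enterprise production Mimir configuration example below.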
Architecture and best practices
Consider the following architectural approaches and best practices for enterprise deployments:
- Deploy components across multiple regions in an active-active configuration.
- Distribute data using sharding strategies.
- Deploy multiple component instances with load balancing.
- Use separate clusters for development, staging, and production workloads.
- Implement object storage for long-term metrics and log retention.
- Monitor and alert on all components using the thresholds in this guide.
- Use a service mesh for traffic management and security.
- Implement backup and disaster recovery strategies.
- Assign dedicated teams to manage different observability components.
- Review and optimize resource allocation based on usage patterns.
- Scale proactively before adding new database clusters.
- Use the formulas above to estimate resource needs before deployment.
Enterprise production Mimir configuration example
For this example, consider an enterprise production deployment with the following characteristics:
- Database nodes: 180
- Multi-region nodes: Yes
- Metrics series: 6.38 million
- Ingestion rate: 218,722 samples per second
Assume the following performance targets are set:
- Series per ingester: 240k
- CPU utilization on distributors: 50%
- Query rate on the query frontend: 250 queries per second
- Query rate for each querier: 10 queries per second
To meet these performance targets, you might use the following Mimir configuration. This configuration requires 70 CPU cores and 173Gi memory.
| Component | Replicas | Resources per replica |
|---|---|---|
| Distributor | 5 | 2 cores, 2Gi memory |
| Ingester | 27 | 2 cores, 5Gi memory |
| Compactor | 1 | 1 core, 4Gi memory |
| Query Frontend | 1 | 1 core, 1Gi memory |
| Querier | 1 | 1 core, 1Gi memory |
| Store Gateway | 1 | 1 core, 1Gi memory |
| Alertmanager | 2 | 1 core, 1Gi memory |
Resolve Mimir and Vector rate limiting issues
Large deployments with the following characteristics may experience rate limiting issues:
- 40+ database nodes across multiple racks
- High metrics volume exceeding default rate limits
- Symptoms including HTTP 429 errors, out-of-order sample rejections, and Vector timeouts
Mimir rate limit increases
Apply these critical configuration changes to resolve cascading failures:
mimir:
mimir:
structuredConfig:
limits:
ingestion_rate: 250000
ingestion_burst_size: 500000
distributor:
instance_limits:
max_ingestion_rate: 300000
ingester:
instance_limits:
max_ingestion_rate: 300000
runtimeConfig:
overrides:
anonymous:
ingestion_rate: 500000
ingestion_burst_size: 1000000
Vector configuration for stability
vector:
config:
sinks:
vector_aggregator:
compression: true
buffer:
max_size: 2147483648
Expected outcomes
This configuration can help avoid the following issues:
- HTTP 429 rate-limit errors
- Out-of-order sample rejections
- Vector request timeouts that previously occurred every 1 to 2 minutes
- Pod restart cycles
- Cascading failures in the metrics pipeline
After applying these changes, clusters that previously showed these symptoms have run stably, with no pod restarts, for 4 to 6 days.
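To confirm the fix, one check is whether Mimir is still discarding samples. The exact metric name can vary by Mimir version, and the pod name and port below are placeholders; adjust them for your deployment:

kubectl exec -it MIMIR_DISTRIBUTOR_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/metrics | grep -i discarded_samples

Replace MIMIR_DISTRIBUTOR_POD with a Mimir distributor pod name and MC_NAMESPACE with the namespace where you deployed the Mission Control Helm chart. A counter that stops increasing indicates that rate limiting is no longer discarding samples.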