Scale observability components in Mission Control

Mission Control automatically configures and collects telemetry data as cluster topologies change during scaling, node replacement, and other operations. In larger or busier environments, adjust observability capacity to keep monitoring reliable. This guide explains how to assess demand, decide when to scale, and tune each observability component.

Run the commands in this guide against the control plane cluster where Mission Control observability components are deployed.

Prerequisites

  • Administrative access to your Mission Control deployment

  • jq installed on your system

  • kubectl configured to access your Mission Control cluster

  • Access to monitoring dashboards

  • An understanding of your current resource usage patterns

  • Knowledge of your expected workload growth

Observability sizing

Mission Control provides observability components to monitor and manage the health of your deployments. These components include Vector for metrics collection, Grafana for visualization, Loki for log aggregation, and Mimir for long-term storage and query execution.

You can configure Mission Control to run observability components on dedicated Kubernetes nodes labeled with mission-control.datastax.com/role: platform. This separation prevents platform services from competing with database workloads for resources. When you scale observability components, ensure your platform nodes have sufficient resources. For more information, see Mission Control installation requirements and Pin workloads to specific hosts in shared clusters.
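
For example, you can label an existing node with a command like the following. Replace NODE_NAME with the name of one of your Kubernetes nodes.

kubectl label node NODE_NAME mission-control.datastax.com/role=platform   # NODE_NAME is a placeholder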

Proper observability sizing depends on two primary factors: ingestion and retrieval.

For ingestion, the number of database nodes and tables determines the volume of telemetry you collect. Each node emits metrics, and each table increases the number of metrics per node. These factors directly affect the resources required to ingest and process telemetry data.

For retrieval, dashboard traffic, query complexity, and retention policies determine the resources you need. The observability stack must efficiently store and retrieve monitoring data for dashboards, alerting, and operational investigation.

Determine your cluster metrics

Before you scale observability components, gather cluster metrics and classify the deployment as Small, Medium, Large, or Enterprise. Enterprise deployments are large-scale deployments (typically 45+ nodes) that require additional configuration and planning beyond the Large baseline. For more information, see Enterprise-scale considerations.

Count database nodes

To count the total number of database nodes across all regions and datacenters, query the MissionControlCluster object in the control plane:

kubectl get missioncontrolcluster -A -o json | jq '[.items[].status.datacenters[].size] | add'

Count datacenters and regions

To count the number of datacenters across all regions, query the MissionControlCluster object in the control plane:

kubectl get missioncontrolcluster -A -o json | jq '[.items[].status.datacenters[]] | length'

Count tables

To count the number of tables across all keyspaces:

Use the CLI
kubectl exec -it POD_NAME -n NAMESPACE -- cqlsh -u USERNAME -p PASSWORD -e "SELECT COUNT(*) FROM system_schema.tables;"

Replace the following:

  • POD_NAME: The name of a database pod in the cluster

  • NAMESPACE: The namespace that contains the database pod

  • USERNAME: Your database username

  • PASSWORD: Your database password

Use the Mission Control UI

In the UI, open the CQL console and run this query:

SELECT COUNT(*) FROM system_schema.tables;

Check current resource usage

To view current CPU and memory usage for observability components:

# Vector Aggregator
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=aggregator

# Grafana
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=grafana

# Loki
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=loki

# Mimir
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=mimir

Replace MC_NAMESPACE with the namespace where you deployed the Mission Control Helm chart.

Check metrics series count

To determine the number of metric series in Mimir:

kubectl exec -it MIMIR_QUERY_FRONTEND_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/prometheus/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | length'

Replace the following:

  • MIMIR_QUERY_FRONTEND_POD: Your Mimir query frontend pod name. For example, mimir-query-frontend-0.

  • MC_NAMESPACE: The namespace where you deployed the Mission Control Helm chart.

Check log ingestion rate

To check the current log ingestion rate in Loki:

kubectl exec -it LOKI_READ_POD -n MC_NAMESPACE -- wget -qO- http://localhost:3100/metrics | grep loki_distributor_bytes_received_total

Replace the following:

  • LOKI_READ_POD: Your Loki read pod name. For example, loki-read-0.

  • MC_NAMESPACE: The namespace where you deployed the Mission Control Helm chart.
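
Because loki_distributor_bytes_received_total is a cumulative counter, a single reading shows the total bytes received since the pod started, not a rate. As a minimal sketch, you can sample the counter twice and estimate bytes per second. The placeholders are the same as in the command above.

# Sketch: sample the counter twice, 60 seconds apart, and estimate bytes per second.
# LOKI_READ_POD and MC_NAMESPACE are placeholders.
sample_bytes() {
  kubectl exec LOKI_READ_POD -n MC_NAMESPACE -- wget -qO- http://localhost:3100/metrics \
    | awk '/^loki_distributor_bytes_received_total/ {sum += $NF} END {printf "%.0f\n", sum}'
}
START=$(sample_bytes)
sleep 60
END=$(sample_bytes)
echo "Approximate log ingestion rate: $(( (END - START) / 60 )) bytes per second"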

When to scale

Scale observability components before infrastructure changes create pressure or when performance degrades and resources tighten. For example, if you plan to deploy a new cluster with 50 tables, pre-scale the observability stack so existing clusters keep reporting metrics without interruption.

Symptom | Primary component | Recommended action
--- | --- | ---
Metrics collection lag or drops | Vector or an upstream dependency | Increase resources, adjust batches and timeouts, or increase buffer size and backpressure tolerance. If Loki is slow, Vector buffers can fill even when Mimir is healthy.
Request timeouts, event drops, or frequent Source send cancelled messages | Vector | Enable compression and increase buffer max_size to 2GB in high-volume environments.
Slow or unresponsive dashboards | Grafana | Increase resources or deploy multiple stateless Grafana instances with a shared backend.
Log ingestion lag or dropped logs | Loki | Scale read, write, or backend pods and review ingestion rate limits.
Slow or failed metric queries | Mimir | Add queriers and review compaction, retention, and query frontend performance.
HTTP 429 errors, out-of-order sample rejections, or increasing discarded samples | Mimir | Increase ingestion_rate and ingestion_burst_size, and add distributor or ingester instance limits. Out-of-order rejections often indicate downstream backpressure.

Vector component scaling

Scale Vector when collection or forwarding no longer keeps up with telemetry volume.

Scale Vector when you observe any of the following conditions:

  • CPU consistently above 80%

  • Memory pressure above 80% of allocated memory

  • Metrics drops or delays

  • Slow processing times

  • Request timeouts every 1 to 2 minutes

  • Source send cancelled messages

  • Event drops due to backpressure

Vector monitoring and scaling reference

Metric or signal | Threshold | Action
--- | --- | ---
vector_component_events_processed_total | Rate < 10000 per second | Scale resources
vector_component_errors_total | > 0 per minute | Investigate configuration
vector_component_buffer_events | > 80% of capacity | Increase buffer size
vector_component_processing_duration_seconds | p95 > 1 second | Optimize processing
vector_component_sent_event_bytes_total | Drops or stalls | Enable compression and increase buffer size
Request timeouts | Every 1 to 2 minutes | Enable compression and increase buffer max_size to 2GB
Source send cancelled | Frequent occurrences | Check backpressure and downstream rate limits

Scale Vector vertically by increasing CPU and memory, or horizontally by adding instances behind proper load balancing. Tune replicas, resources, batch sizes, timeouts, buffer capacity, compression, and maximum buffer size. Target about 50% CPU utilization.
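
These settings are Helm chart values in values.yaml. As a minimal sketch, you can apply an override file with helm upgrade while preserving your other values. RELEASE_NAME, CHART, and MC_NAMESPACE are placeholders for your release name, chart reference, and namespace.

# Sketch: apply Vector overrides while keeping all other chart values.
# RELEASE_NAME, CHART, and MC_NAMESPACE are placeholders for your deployment.
helm upgrade RELEASE_NAME CHART \
  --namespace MC_NAMESPACE \
  --reuse-values \
  --values vector-overrides.yaml   # file containing the vector: settings shown in the following sections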

High-volume Vector configuration

For deployments with 45 or more nodes, enable compression and increase buffer size to prevent request timeouts and pod restarts. For additional enterprise-scale guidance, see Enterprise-scale considerations.

vector:
  config:
    sinks:
      vector_aggregator:
        compression: true
        buffer:
          max_size: 2147483648  # 2GB for high-volume environments

Vector baseline configuration

The resources.limits configuration is optional. You can define only resources.requests if you prefer not to set resource limits.

Small

For deployments with one region, up to three nodes, and up to 1000 tables.

vector:
  resources:
    limits:
      cpu: 1
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
  replicas: 1
  config:
    sources:
      metrics:
        batch:
          max_events: 1000
          timeout_secs: 30
    sinks:
      prometheus:
        buffer:
          max_events: 5000
Medium

For deployments with two regions, up to six nodes, and up to 2000 tables.

vector:
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 1
      memory: 2Gi
  replicas: 2
  config:
    sources:
      metrics:
        batch:
          max_events: 2000
          timeout_secs: 60
    sinks:
      prometheus:
        buffer:
          max_events: 10000
Large

For deployments with more than two regions, more than six nodes, and more than 2000 tables.

vector:
  resources:
    limits:
      cpu: 4
      memory: 8Gi
    requests:
      cpu: 2
      memory: 4Gi
  replicas: 3
  config:
    sources:
      metrics:
        batch:
          max_events: 5000
          timeout_secs: 120
    sinks:
      prometheus:
        buffer:
          max_events: 20000
      vector_aggregator:
        compression: true
        buffer:
          max_size: 2147483648

Use compression and larger buffers to prevent request timeouts and pod restarts in high-volume deployments.

Vector troubleshooting

Use the following guidance to troubleshoot and configure Vector:

  • If metrics are dropped, check resource limits and batch sizes.

  • If processing is slow, increase CPU or reduce batch sizes.

  • If memory usage is high, increase memory or adjust buffer sizes.

  • Place these entries under the vector key in values.yaml.

  • For upstream configuration keys, see Vector configuration documentation.

  • For general product information, see Vector documentation.

Grafana component scaling

Scale Grafana when dashboard rendering, query load, or user traffic exceed current capacity.

Scale Grafana when you observe any of the following conditions:

  • Dashboard load times longer than 5 seconds

  • Frequent panel timeouts

  • CPU consistently above 70%

  • Memory pressure above 80% of allocated memory

Grafana monitoring and scaling reference

Metric | Threshold | Action
--- | --- | ---
grafana_api_request_duration_seconds | p95 > 1 second | Optimize dashboards
grafana_dashboard_rendering_duration_seconds | p95 > 5 seconds | Reduce panel complexity
grafana_api_user_signups | > 10 per hour | Consider rate limiting
grafana_api_login_post_seconds | p95 > 2 seconds | Investigate authentication

Scale Grafana vertically by increasing CPU and memory, or horizontally by deploying multiple stateless instances with a shared backend. Tune replicas, resources, cache behavior, and query timeouts. Target 50 to 100 concurrent users per instance and about 70% CPU utilization.
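
One way to check the request-latency metrics in the table above is to port-forward Grafana and read its built-in Prometheus endpoint. This is a sketch: GRAFANA_SERVICE and MC_NAMESPACE are placeholders, and it assumes Grafana serves /metrics on its default HTTP port (3000).

# Sketch: inspect Grafana's Prometheus metrics through a temporary port-forward.
kubectl port-forward -n MC_NAMESPACE svc/GRAFANA_SERVICE 3000:3000 &
PF_PID=$!
sleep 2
curl -s http://localhost:3000/metrics | grep grafana_api
kill "$PF_PID"   # stop the port-forward when finished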

Grafana baseline configuration

The resources.limits configuration is optional. You can define only resources.requests if you prefer not to set resource limits.

Small
grafana:
  resources:
    limits:
      cpu: 1
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
  replicas: 1
Medium
grafana:
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 1
      memory: 2Gi
  replicas: 2
Large
grafana:
  resources:
    limits:
      cpu: 4
      memory: 8Gi
    requests:
      cpu: 2
      memory: 4Gi
  replicas: 3

Grafana troubleshooting

Use the following guidance to troubleshoot and configure Grafana:

  • If dashboards are slow, check backend performance and network latency.

  • If panels time out, increase timeout settings or reduce query complexity.

  • If memory usage is high, increase memory or reduce dashboard refresh rates.

  • Place these entries under the grafana key in values.yaml.

  • For upstream configuration keys, see Grafana Helm chart repo.

  • For product-specific guidance, see Grafana in the Helm installation guide.

Loki component scaling

Scale Loki when log ingestion, storage demand, or query load exceed current read, write, or backend capacity.

Scale Loki when you observe any of the following conditions:

  • Log drops

  • Slow ingestion

  • Querier timeouts

Loki monitoring and scaling reference

Metric | Threshold | Action
--- | --- | ---
loki_ingester_memory_streams | > 80% of limit | Scale ingesters
loki_ingester_memory_chunks | > 80% of limit | Adjust chunk size
loki_distributor_bytes_received_total | > 1 GB/s | Scale distributors
loki_querier_request_duration_seconds | p95 > 5 seconds | Optimize queries

Scale Loki vertically by increasing CPU and memory, or horizontally by scaling read, write, and backend pods. Tune storage, retention periods, ingestion rate limits, and compaction behavior.
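
For example, you can compare loki_ingester_memory_streams from the table above against its limit. In the simple scalable deployment, the write pods run the ingesters. This sketch uses the same placeholder conventions as the earlier commands.

# Sketch: check ingester stream counts on a Loki write pod.
# LOKI_WRITE_POD and MC_NAMESPACE are placeholders. For example, loki-write-0.
kubectl exec -it LOKI_WRITE_POD -n MC_NAMESPACE -- wget -qO- http://localhost:3100/metrics | grep loki_ingester_memory_streams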

Loki baseline configuration

The resources.limits configuration is optional. You can define only resources.requests if you prefer not to set resource limits.

Small
loki:
  enabled: true
  loki:
    storage:
      bucketNames:
        chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 7d
        ingestion_rate_mb: 4
        ingestion_burst_size_mb: 6
  read:
    persistence:
      enabled: true
      size: 5Gi
      storageClassName: ""
    replicas: 1
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
  write:
    persistence:
      enabled: true
      size: 5Gi
      storageClassName: ""
    replicas: 1
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
  backend:
    replicas: 1
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
  gateway:
    replicas: 1
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 100m
        memory: 128Mi
  chunksCache:
    enabled: false
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 100m
        memory: 128Mi
  resultsCache:
    enabled: false
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 100m
        memory: 128Mi
Medium
loki:
  enabled: true
  loki:
    storage:
      bucketNames:
        chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 15d
        ingestion_rate_mb: 8
        ingestion_burst_size_mb: 12
  read:
    persistence:
      enabled: true
      size: 10Gi
      storageClassName: ""
    replicas: 2
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
  write:
    persistence:
      enabled: true
      size: 10Gi
      storageClassName: ""
    replicas: 2
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
  backend:
    replicas: 2
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
  gateway:
    replicas: 1
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 256Mi
  chunksCache:
    enabled: false
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 256Mi
  resultsCache:
    enabled: false
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 256Mi
Large
loki:
  enabled: true
  loki:
    storage:
      bucketNames:
        chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 30d
        ingestion_rate_mb: 16
        ingestion_burst_size_mb: 24
  read:
    persistence:
      enabled: true
      size: 20Gi
      storageClassName: ""
    replicas: 3
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
  write:
    persistence:
      enabled: true
      size: 20Gi
      storageClassName: ""
    replicas: 3
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
  backend:
    replicas: 3
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
  gateway:
    replicas: 2
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 512Mi
  chunksCache:
    enabled: true
    replicas: 2
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
  resultsCache:
    enabled: true
    replicas: 2
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi

Loki troubleshooting

Use the following guidance to troubleshoot and configure Loki:

  • If logs are dropped, check ingestion rate limits and storage capacity.

  • If queries are slow, optimize query patterns and increase querier resources.

  • If memory usage is high, increase memory or reduce retention period.

  • Place these entries under the loki key in values.yaml.

  • For upstream configuration keys, see Loki Helm chart repo.

  • For sizing guidance, see size Loki in the Grafana documentation.

Mimir component scaling

Scale Mimir when metrics ingestion, storage demand, or query load exceed the capacity of distributors, ingesters, or query components.

Scale Mimir when you observe any of the following conditions:

  • Slow metric queries

  • Alertmanager lag

  • Ingester overload

  • HTTP 429 errors from distributors

  • Out-of-order sample rejections

  • Increasing discarded samples

Mimir monitoring and scaling reference

Metric or signal | Threshold | Action
--- | --- | ---
cortex_ingester_memory_series | > 192k per ingester, or 80% of the 240k target | Scale ingesters
cortex_ingester_memory_chunks | > 80% of limit | Adjust chunk size
cortex_distributor_received_samples_total | Rate approaching the ingestion_rate limit | Scale distributors
cortex_distributor_inflight_requests | > 80% of capacity | Scale distributors
cortex_querier_request_duration_seconds | p95 > 5 seconds | Optimize queries or scale queriers
cortex_query_frontend_queue_length | > 10 consistently | Scale query frontend or queriers
Distributor CPU utilization | > 80% | Add distributor replicas
HTTP 429 responses | Any occurrences | Increase ingestion_rate and ingestion_burst_size
Out-of-order sample rejections | Samples delayed more than 5 minutes | Check Vector timeouts and increase Mimir rate limits
cortex_discarded_samples_total | Increasing | Investigate rate limiting or out-of-order issues
Ingester active items | Consistently high | Scale ingesters or increase memory

Scale Mimir vertically by increasing CPU and memory, or horizontally by adding ingesters and queriers. Enable compaction and retention controls, and use object storage for long-term retention.
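
For example, to compare the per-ingester series count against the 240,000-series target, you can read cortex_ingester_memory_series directly from an ingester pod. This is a sketch: MIMIR_INGESTER_POD and MC_NAMESPACE are placeholders, and it assumes ingesters expose metrics on port 8080 like the query frontend shown earlier.

# Sketch: check the in-memory series count on one Mimir ingester.
# Related *_created and *_removed counters can also appear in the output.
kubectl exec -it MIMIR_INGESTER_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/metrics | grep '^cortex_ingester_memory_series'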

Tune the following Mimir components and settings:

  • alertmanager: replicas, resources, sharding ring replication factor

  • ingester: replicas, storage backend, persistent volume size, resources, ring replication factor

  • store_gateway: persistent volume, replicas, resources

  • compactor: retention period, persistent volume, replicas, resources

  • distributor: replicas, resources, ring replication factor

  • querier: replicas, resources

  • query_frontend: replicas, resources

  • Mimir limits: ingestion burst size, ingestion rate, maximum label names per series, out-of-order time window

Target the following performance metrics:

  • 240,000 series per ingester at 50% utilization

  • Size distributors by ingestion rate and target 50% CPU utilization at 2 cores per instance

  • Size storage for retention period and metrics cardinality

  • 250 queries per second per query frontend

  • 10 queries per second per querier

Mimir baseline configuration

The resources.limits configuration is optional. You can define only resources.requests if you prefer not to set resource limits.

Small
mimir:
  alertmanager:
    enabled: true
    extraArgs:
      alertmanager-storage.backend: local
      alertmanager-storage.local.path: /etc/alertmanager/config
      alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
      alertmanager.sharding-ring.replication-factor: "1"
    extraVolumeMounts:
    - mountPath: /etc/alertmanager/config
      name: alertmanager-config
    - mountPath: /alertmanager
      name: alertmanager-config-tmp
    extraVolumes:
    - name: alertmanager-config
      secret:
        secretName: alertmanager-config
    - emptyDir: {}
      name: alertmanager-config-tmp
    persistentVolume:
      accessModes:
      - ReadWriteOnce
      enabled: "1"
      size: 5Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
  ingester:
    extraArgs:
      ingester.max-global-series-per-user: "0"
      ingester.ring.replication-factor: "1"
    persistentVolume:
      size: 10Gi
    replicas: "1"
    resources:
      limits:
        cpu: 2
        memory: 5Gi
      requests:
        cpu: 1
        memory: 2Gi
  store_gateway:
    persistentVolume:
      size: 10Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  compactor:
    extraArgs:
      compactor.blocks-retention-period: 15d
    persistentVolume:
      enabled: "1"
      size: 10Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
  distributor:
    extraArgs:
      ingester.ring.replication-factor: "1"
    replicas: "1"
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi
  querier:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  query_frontend:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  nginx:
    replicas: "1"
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 100m
        memory: 128Mi
  overrides_exporter:
    replicas: "1"
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 100m
        memory: 128Mi
  query_scheduler:
    replicas: "1"
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 128Mi
  ruler:
    replicas: "1"
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 128Mi
  mimir:
    structuredConfig:
      activity_tracker:
        filepath: /data/activity.log
      limits:
        ingestion_burst_size: 50000
        ingestion_rate: 25000
        max_label_names_per_series: 60
        out_of_order_time_window: 2m
Medium

For deployments approaching 45 or more nodes, monitor for HTTP 429 errors and consider raising ingestion_rate to 250000 and ingestion_burst_size to 500000, as in the Large configuration. For more information, see Enterprise-scale considerations.

mimir:
  alertmanager:
    enabled: true
    extraArgs:
      alertmanager-storage.backend: local
      alertmanager-storage.local.path: /etc/alertmanager/config
      alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
      alertmanager.sharding-ring.replication-factor: "2"
    extraVolumeMounts:
    - mountPath: /etc/alertmanager/config
      name: alertmanager-config
    - mountPath: /alertmanager
      name: alertmanager-config-tmp
    extraVolumes:
    - name: alertmanager-config
      secret:
        secretName: alertmanager-config
    - emptyDir: {}
      name: alertmanager-config-tmp
    persistentVolume:
      accessModes:
      - ReadWriteOnce
      enabled: "1"
      size: 10Gi
    replicas: "2"
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi
  ingester:
    extraArgs:
      ingester.max-global-series-per-user: "0"
      ingester.ring.replication-factor: "1"
    persistentVolume:
      size: 30Gi
    replicas: "3"
    resources:
      limits:
        cpu: 2
        memory: 5Gi
      requests:
        cpu: 1
        memory: 2Gi
  store_gateway:
    persistentVolume:
      size: 30Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  compactor:
    extraArgs:
      compactor.blocks-retention-period: 30d
    persistentVolume:
      enabled: "1"
      size: 30Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
  distributor:
    extraArgs:
      ingester.ring.replication-factor: "1"
    replicas: "2"
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi
  querier:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  query_frontend:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  nginx:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 256Mi
  overrides_exporter:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 128Mi
  query_scheduler:
    replicas: "1"
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 128Mi
  ruler:
    replicas: "1"
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 128Mi
  mimir:
    structuredConfig:
      activity_tracker:
        filepath: /data/activity.log
      limits:
        ingestion_burst_size: 100000
        ingestion_rate: 50000
        max_label_names_per_series: 120
        out_of_order_time_window: 5m
Large

These higher rate limits help prevent HTTP 429 errors and cascading failures in large deployments.

mimir:
  alertmanager:
    enabled: true
    extraArgs:
      alertmanager-storage.backend: local
      alertmanager-storage.local.path: /etc/alertmanager/config
      alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
      alertmanager.sharding-ring.replication-factor: "3"
    extraVolumeMounts:
    - mountPath: /etc/alertmanager/config
      name: alertmanager-config
    - mountPath: /alertmanager
      name: alertmanager-config-tmp
    extraVolumes:
    - name: alertmanager-config
      secret:
        secretName: alertmanager-config
    - emptyDir: {}
      name: alertmanager-config-tmp
    persistentVolume:
      accessModes:
      - ReadWriteOnce
      enabled: "1"
      size: 20Gi
    replicas: "3"
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi
  ingester:
    extraArgs:
      ingester.max-global-series-per-user: "0"
      ingester.ring.replication-factor: "1"
    persistentVolume:
      size: 50Gi
    replicas: "9"
    resources:
      limits:
        cpu: 2
        memory: 5Gi
      requests:
        cpu: 1
        memory: 2Gi
  store_gateway:
    persistentVolume:
      size: 50Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  compactor:
    extraArgs:
      compactor.blocks-retention-period: 60d
    persistentVolume:
      enabled: "1"
      size: 50Gi
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
  distributor:
    extraArgs:
      ingester.ring.replication-factor: "1"
    replicas: "2"
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi
  querier:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  query_frontend:
    replicas: "1"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi
  nginx:
    replicas: "2"
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 512Mi
  overrides_exporter:
    replicas: "2"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 256Mi
  query_scheduler:
    replicas: "2"
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 200m
        memory: 256Mi
  ruler:
    replicas: "2"
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 200m
        memory: 256Mi
  mimir:
    structuredConfig:
      activity_tracker:
        filepath: /data/activity.log
      limits:
        ingestion_burst_size: 500000
        ingestion_rate: 250000
        max_label_names_per_series: 240
        out_of_order_time_window: 10m
      distributor:
        instance_limits:
          max_ingestion_rate: 300000
      ingester:
        instance_limits:
          max_ingestion_rate: 300000
  runtimeConfig:
    overrides:
      anonymous:
        ingestion_rate: 500000
        ingestion_burst_size: 1000000

Mimir troubleshooting

Use the following guidance to troubleshoot and configure Mimir:

  • If queries are slow, check query frontend and store gateway performance.

  • If ingestion is slow, check distributor and ingester performance.

  • If memory usage is high, increase memory or optimize compaction settings.

  • If you see HTTP 429 errors, increase ingestion_rate and ingestion_burst_size.

  • If Mimir rejects out-of-order samples, check for pipeline backpressure, investigate Vector timeouts, and increase Mimir rate limits.

  • If you observe cascading failures, address the root cause by increasing Mimir ingestion limits and enabling Vector compression.

  • If ingesters show high memory usage with many active items, scale ingesters or increase memory.

  • Place these entries under the mimir key in values.yaml.

  • For upstream configuration keys, see Mimir Helm chart repo.

  • For general product information, see Mimir documentation.

Cross-component reference

This section provides best practices and troubleshooting guidance for managing observability components across your Mission Control deployment.

Best practices

Follow these best practices for observability component management:

  • Monitor resource usage and adjust limits as needed.

  • Use dashboards to monitor ingestion rates and resource usage.

  • Set alerts on CPU, memory, and data-drop metrics.

  • Scale proactively before adding new database clusters.

  • Enable rate limits and burst protection to avoid overloading the system.

In addition, consider the following optimization areas when you scale:

Performance optimization

Optimize observability component performance by implementing query optimization techniques and resource management strategies.

Query optimization

Optimize queries with the following techniques:

  • Use appropriate time ranges for queries.

  • Implement query caching where possible.

  • Use rate() and increase() for counter metrics, as shown in the sketch after this list.

  • Avoid high-cardinality labels.

  • Use recording rules for complex queries.
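
As a sketch of the rate() guidance above, you can run an ad hoc PromQL query through the Mimir query frontend's Prometheus-compatible API, using the same /prometheus prefix and placeholders as earlier in this guide. METRIC_NAME is a placeholder for any counter metric.

# Sketch: compute a per-second rate over a 5-minute window for a counter metric.
kubectl exec -it MIMIR_QUERY_FRONTEND_POD -n MC_NAMESPACE -- wget -qO- 'http://localhost:8080/prometheus/api/v1/query?query=rate(METRIC_NAME[5m])'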

Storage optimization

Optimize storage with the following strategies:

  • Implement data lifecycle policies.

  • Use appropriate retention periods.

  • Enable compression where available.

  • Consider tiered storage for long-term data.

  • Monitor and adjust chunk sizes.

Resource optimization

Optimize resources with the following approaches:

  • Right-size resource requests and limits.

  • Implement proper pod scheduling.

  • Use node affinity for critical components.

  • Monitor and adjust resource quotas.

  • Implement proper garbage collection.

Quick reference

Use the following tables as a quick reference for component sizing and troubleshooting common issues.

Component sizing

Component | Small (≤3 nodes, ≤1000 tables) | Medium (≤6 nodes, ≤2000 tables) | Large (>6 nodes, >2000 tables)
--- | --- | --- | ---
Vector | 1 replica, 1 CPU, 2Gi | 2 replicas, 2 CPU, 4Gi | 3 replicas, 4 CPU, 8Gi
Grafana | 1 replica, 1 CPU, 2Gi | 2 replicas, 2 CPU, 4Gi | 3 replicas, 4 CPU, 8Gi
Loki | 3 replicas, 6 CPU, 12Gi | 6 replicas, 12 CPU, 24Gi | 9 replicas, 24 CPU, 48Gi
Mimir Distributor | 1 replica, 2 CPU, 2Gi | 2 replicas, 4 CPU, 4Gi | 2 replicas, 4 CPU, 4Gi
Mimir Ingester | 1 replica, 2 CPU, 5Gi | 3 replicas, 6 CPU, 15Gi | 9 replicas, 18 CPU, 45Gi
Mimir Querier | 1 replica, 1 CPU, 1Gi | 1 replica, 1 CPU, 1Gi | 1 replica, 1 CPU, 1Gi
Mimir Query Frontend | 1 replica, 1 CPU, 1Gi | 1 replica, 1 CPU, 1Gi | 1 replica, 1 CPU, 1Gi
Mimir Compactor | 1 replica, 1 CPU, 4Gi | 1 replica, 1 CPU, 4Gi | 1 replica, 1 CPU, 4Gi
Mimir Store Gateway | 1 replica, 1 CPU, 1Gi | 1 replica, 1 CPU, 1Gi | 1 replica, 1 CPU, 1Gi
Mimir Alertmanager | 1 replica, 1 CPU, 2Gi | 2 replicas, 2 CPU, 2Gi | 3 replicas, 6 CPU, 6Gi

Storage requirements

Component | Small | Medium | Large
--- | --- | --- | ---
Loki, per component | 5Gi read, 5Gi write | 10Gi read, 10Gi write | 20Gi read, 20Gi write
Mimir Ingester | 10Gi per replica | 30Gi per replica | 50Gi per replica
Mimir Store Gateway | 10Gi | 30Gi | 50Gi
Mimir Compactor | 10Gi | 30Gi | 50Gi
Mimir Alertmanager | 5Gi | 10Gi | 20Gi

Metrics capacity

Deployment size | Estimated series count | Ingester replicas | Target series per ingester
--- | --- | --- | ---
Small | Up to 240k series | 1 | 240k
Medium | Up to 720k series | 3 | 240k
Large | Up to 2.16M series | 9 | 240k

Enterprise-scale considerations

For enterprise-scale deployments, additional planning and configuration are required to handle high-volume telemetry data and ensure system reliability.

Scaling guidelines for enterprises

For enterprise-scale deployments, use the following formulas to calculate required resources.

Mimir ingester replicas:

  Ingester replicas = Total series ÷ 240,000

Mimir distributor replicas:

  Distributor replicas = (Ingestion rate ÷ 50,000) × 2

  Round up to the nearest integer and target 50% CPU utilization.

Storage per ingester:

  Storage = (Series count × Average sample size × Retention period) ÷ Ingester count

  Add a 20% buffer for overhead.
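
As a worked sketch of the ingester formula, using the series count from the enterprise example in the next section:

# Worked example (sketch only): size ingesters for roughly 6.38 million active series.
SERIES=6380000
TARGET_PER_INGESTER=240000
# Round up: ceil(SERIES / TARGET_PER_INGESTER)
echo "Ingester replicas: $(( (SERIES + TARGET_PER_INGESTER - 1) / TARGET_PER_INGESTER ))"   # prints 27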

Architecture and best practices

Consider the following architectural approaches and best practices for enterprise deployments:

  • Deploy components across multiple regions in an active-active configuration.

  • Distribute data using sharding strategies.

  • Deploy multiple component instances with load balancing.

  • Use separate clusters for development, staging, and production workloads.

  • Implement object storage for long-term metrics and log retention.

  • Monitor and alert on all components using the thresholds in this guide.

  • Use a service mesh for traffic management and security.

  • Implement backup and disaster recovery strategies.

  • Assign dedicated teams to manage different observability components.

  • Review and optimize resource allocation based on usage patterns.

  • Scale proactively before adding new database clusters.

  • Use the formulas above to estimate resource needs before deployment.

Enterprise production Mimir configuration example

For this example, consider an enterprise production deployment with the following characteristics:

  • Database nodes: 180

  • Multi-region nodes: Yes

  • Metrics series: 6.38 million

  • Ingestion rate: 218,722 samples per second

Assume the following performance targets are set:

  • Series per ingester: 240k

  • CPU utilization on distributors: 50%

  • Query rate on the query frontend: 250 queries per second

  • Query rate for each querier: 10 queries per second

To meet these performance targets, you might use the following Mimir configuration. This configuration requires 70 CPU cores and 173Gi memory.

Component | Replicas | Resources per replica
--- | --- | ---
Distributor | 5 | 2 cores, 2Gi memory
Ingester | 27 | 2 cores, 5Gi memory
Compactor | 1 | 1 core, 4Gi memory
Query Frontend | 1 | 1 core, 1Gi memory
Querier | 1 | 1 core, 1Gi memory
Store Gateway | 1 | 1 core, 1Gi memory
Alertmanager | 2 | 1 core, 1Gi memory

Resolve Mimir and Vector rate limiting issues

Large deployments with the following characteristics may experience rate limiting issues:

  • 40+ database nodes across multiple racks

  • High metrics volume exceeding default rate limits

  • Symptoms including HTTP 429 errors, out-of-order sample rejections, and Vector timeouts

Mimir rate limit increases

Apply these critical configuration changes to resolve cascading failures:

mimir:
  mimir:
    structuredConfig:
      limits:
        ingestion_rate: 250000
        ingestion_burst_size: 500000
      distributor:
        instance_limits:
          max_ingestion_rate: 300000
      ingester:
        instance_limits:
          max_ingestion_rate: 300000
  runtimeConfig:
    overrides:
      anonymous:
        ingestion_rate: 500000
        ingestion_burst_size: 1000000

Vector configuration for stability

vector:
  config:
    sinks:
      vector_aggregator:
        compression: true
        buffer:
          max_size: 2147483648

Expected outcomes

This configuration can help avoid the following issues:

  • HTTP 429 rate-limit errors

  • Out-of-order sample rejections

  • Vector request timeouts that previously occurred every 1 to 2 minutes

  • Pod restart cycles

  • Cascading failures in the metrics pipeline

After you apply these changes, the cluster typically runs stably, with no pod restarts over periods of 4 to 6 days.
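
To confirm that rate limiting is resolved, you can watch the discarded-samples counter after applying the new limits. This is a sketch: MIMIR_DISTRIBUTOR_POD and MC_NAMESPACE are placeholders, and it assumes the distributor exposes metrics on port 8080 like the other Mimir components in this guide.

# Sketch: the cortex_discarded_samples_total counters should stop increasing after the change.
kubectl exec -it MIMIR_DISTRIBUTOR_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/metrics | grep cortex_discarded_samples_total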
