Scale observability components in Mission Control
Mission Control automatically configures and collects telemetry data as cluster topologies change during scaling, node replacement, and other operations. In larger or busier environments, adjust observability capacity to keep monitoring reliable. This guide explains how to assess demand, decide when to scale, and tune each observability component.
Run the commands in this guide against the control plane cluster where Mission Control observability components are deployed.
Observability sizing
Mission Control provides observability components to monitor and manage the health of your deployments. These components include Vector for metrics collection, Grafana for visualization, Loki for log aggregation, and Mimir for long-term storage and query execution.
You can configure Mission Control to run observability components on dedicated Kubernetes nodes labeled with mission-control.datastax.com/role: platform.
This separation prevents platform services from competing with database workloads for resources.
When you scale observability components, ensure your platform nodes have sufficient resources.
For more information, see Mission Control installation requirements and Pin workloads to specific hosts in shared clusters.
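For example, assuming a node named NODE_NAME that you want to dedicate to platform services, you can apply the label with kubectl:

kubectl label node NODE_NAME mission-control.datastax.com/role=platform

Replace NODE_NAME with the name of the Kubernetes node that should run observability workloads.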
Proper observability sizing depends on two primary factors: ingestion and retrieval.
For ingestion, the number of database nodes and tables determines the volume of telemetry you collect. Each node emits metrics, and each table increases the number of metrics per node. These factors directly affect the resources required to ingest and process telemetry data.
For retrieval, dashboard traffic, query complexity, and retention policies determine the resources you need. The observability stack must efficiently store and retrieve monitoring data for dashboards, alerting, and operational investigation.
Determine your cluster metrics
Before you scale observability components, gather cluster metrics and classify the deployment as Small, Medium, Large, or Enterprise. Enterprise deployments are large-scale deployments (typically 45+ nodes) that require additional configuration and planning beyond the Large baseline. For more information, see Enterprise-scale considerations.
Count database nodes
To count the total number of database nodes across all regions and datacenters, query the MissionControlCluster object in the control plane:
kubectl get missioncontrolcluster -A -o json | jq '[.items[].status.datacenters[].size] | add'
Count datacenters and regions
To count the number of datacenters across all regions, query the MissionControlCluster object in the control plane:
kubectl get missioncontrolcluster -A -o json | jq '[.items[].status.datacenters[]] | length'
Count tables
To count the number of tables across all keyspaces:
- Use the CLI

  kubectl exec -it POD_NAME -n NAMESPACE -- cqlsh -u USERNAME -p PASSWORD -e "SELECT COUNT(*) FROM system_schema.tables;"

  Replace the following:

  - POD_NAME: Your pod name
  - NAMESPACE: Your namespace
  - USERNAME: Your database username
  - PASSWORD: Your database password

- Use the Mission Control UI

  In the UI, open the CQL console and run this query:

  SELECT COUNT(*) FROM system_schema.tables;
Check current resource usage
To view current CPU and memory usage for observability components:
# Vector Aggregator
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=aggregator
# Grafana
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=grafana
# Loki
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=loki
# Mimir
kubectl top pod -n MC_NAMESPACE -l app.kubernetes.io/name=mimir
Replace MC_NAMESPACE with the namespace where you deployed the Mission Control Helm chart.
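To compare observed usage against the configured requests and limits, one option is to print them with custom columns. This example reuses the Mimir label selector from the previous commands; adjust the selector for other components:

kubectl get pods -n MC_NAMESPACE -l app.kubernetes.io/name=mimir -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'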
Check metrics series count
To determine the number of metric series in Mimir:
kubectl exec -it MIMIR_QUERY_FRONTEND_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/prometheus/api/v1/status/tsdb | jq '.data.headStats.numSeries'
Replace the following:

- MIMIR_QUERY_FRONTEND_POD: Your Mimir query frontend pod name, for example, mimir-query-frontend-0.
- MC_NAMESPACE: The namespace where you deployed the Mission Control Helm chart.
Check log ingestion rate
To check the current log ingestion rate in Loki:
kubectl exec -it LOKI_READ_POD -n MC_NAMESPACE -- wget -qO- http://localhost:3100/metrics | grep loki_distributor_bytes_received_total
Replace the following:

- LOKI_READ_POD: Your Loki read pod name, for example, loki-read-0.
- MC_NAMESPACE: The namespace where you deployed the Mission Control Helm chart.
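Because loki_distributor_bytes_received_total is a cumulative counter, its growth rate is more informative than the raw value. If this metric is scraped into Mimir, a Grafana panel using a standard PromQL rate expression, such as the following sketch, charts bytes received per second:

rate(loki_distributor_bytes_received_total[5m])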
When to scale
Scale observability components before infrastructure changes create pressure or when performance degrades and resources tighten. For example, if you plan to deploy a new cluster with 50 tables, pre-scale the observability stack so existing clusters keep reporting metrics without interruption.
| Symptom | Primary component | Recommended action |
|---|---|---|
| Metrics collection lag or drops | Vector or an upstream dependency | Increase resources, adjust batches and timeouts, or increase buffer size and backpressure tolerance. If Loki is slow, Vector buffers can fill even when Mimir is healthy. |
| Request timeouts, event drops, or frequent source send cancelled messages | Vector | Enable compression and increase buffer size. |
| Slow or unresponsive dashboards | Grafana | Increase resources or deploy multiple stateless Grafana instances with a shared backend. |
| Log ingestion lag or dropped logs | Loki | Scale read, write, or backend pods and review ingestion rate limits. |
| Slow or failed metric queries | Mimir | Add queriers and review compaction, retention, and query frontend performance. |
| HTTP 429 errors, out-of-order sample rejections, or increasing discarded samples | Mimir | Increase ingestion_rate and ingestion_burst_size. |
Vector component scaling
Scale Vector when collection or forwarding no longer keeps up with telemetry volume.
Scale Vector when you observe any of the following conditions:
- CPU consistently above 80%
- Memory pressure above 80% of allocated memory
- Metrics drops or delays
- Slow processing times
- Request timeouts every 1 to 2 minutes
- Source send cancelled messages
- Event drops due to backpressure
Vector monitoring and scaling reference
| Metric or signal | Threshold | Action |
|---|---|---|
|  | Rate < 10000 per second | Scale resources |
|  | > 0 per minute | Investigate configuration |
|  | > 80% of capacity | Increase buffer size |
|  | p95 > 1 second | Optimize processing |
|  | Drops or stalls | Enable compression and increase buffer size |
| Request timeouts | Every 1 to 2 minutes | Enable compression and increase buffer size |
| Source send cancelled messages | Frequent occurrences | Check backpressure and downstream rate limits |
Scale Vector vertically by increasing CPU and memory, or horizontally by adding instances behind proper load balancing. Tune replicas, resources, batch sizes, timeouts, buffer capacity, compression, and maximum buffer size. Target about 50% CPU utilization.
High-volume Vector configuration: For deployments with 45 or more nodes, enable compression and increase buffer size to prevent request timeouts and pod restarts. For additional enterprise-scale guidance, see Enterprise-scale considerations.
Vector baseline configuration
- Small

  For deployments with one region, up to three nodes, and up to 1000 tables.

  vector:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    replicas: 1
    config:
      sources:
        metrics:
          batch:
            max_events: 1000
            timeout_secs: 30
      sinks:
        prometheus:
          buffer:
            max_events: 5000

- Medium

  For deployments with two regions, up to six nodes, and up to 2000 tables.

  vector:
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
    replicas: 2
    config:
      sources:
        metrics:
          batch:
            max_events: 2000
            timeout_secs: 60
      sinks:
        prometheus:
          buffer:
            max_events: 10000

- Large

  For deployments with more than two regions, more than six nodes, and more than 2000 tables.

  vector:
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
    replicas: 3
    config:
      sources:
        metrics:
          batch:
            max_events: 5000
            timeout_secs: 120
      sinks:
        prometheus:
          buffer:
            max_events: 20000
        vector_aggregator:
          compression: true
          buffer:
            max_size: 2147483648

  Use compression and larger buffers to prevent request timeouts and pod restarts in high-volume deployments.
Vector troubleshooting
Use the following guidance to troubleshoot and configure Vector:
- If metrics are dropped, check resource limits and batch sizes.
- If processing is slow, increase CPU or reduce batch sizes.
- If memory usage is high, increase memory or adjust buffer sizes.
- Place these entries under the vector key in values.yaml.
- For upstream configuration keys, see the Vector configuration documentation.
- For general product information, see the Vector documentation.
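For example, a minimal values.yaml excerpt that lowers the metrics batch size, one of the adjustments listed above, might look like the following sketch. The keys mirror the baseline configurations in this guide; apply the change with a Helm upgrade of your Mission Control release.

vector:
  config:
    sources:
      metrics:
        batch:
          max_events: 1000   # smaller batches reduce per-batch processing cost
          timeout_secs: 30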
Grafana component scaling
Scale Grafana when dashboard rendering, query load, or user traffic exceed current capacity.
Scale Grafana when you observe any of the following conditions:
- Dashboard load times longer than 5 seconds
- Frequent panel timeouts
- CPU consistently above 70%
- Memory pressure above 80% of allocated memory
Grafana monitoring and scaling reference
| Metric | Threshold | Action |
|---|---|---|
|  | p95 > 1 second | Optimize dashboards |
|  | p95 > 5 seconds | Reduce panel complexity |
|  | > 10 per hour | Consider rate limiting |
|  | p95 > 2 seconds | Investigate authentication |
Scale Grafana vertically by increasing CPU and memory, or horizontally by deploying multiple stateless instances with a shared backend. Tune replicas, resources, cache behavior, and query timeouts. Target 50 to 100 concurrent users per instance and about 70% CPU utilization.
Grafana baseline configuration
- Small

  grafana:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    replicas: 1

- Medium

  grafana:
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 1
        memory: 2Gi
    replicas: 2

- Large

  grafana:
    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 2
        memory: 4Gi
    replicas: 3
Grafana troubleshooting
Use the following guidance to troubleshoot and configure Grafana:
- If dashboards are slow, check backend performance and network latency.
- If panels time out, increase timeout settings or reduce query complexity.
- If memory usage is high, increase memory or reduce dashboard refresh rates.
- Place these entries under the grafana key in values.yaml.
- For upstream configuration keys, see the Grafana Helm chart repo.
- For product-specific guidance, see Grafana in the Helm installation guide.
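For example, a values.yaml excerpt that adds a second stateless Grafana replica, assuming dashboards and data sources are provisioned from a shared backend rather than stored locally, follows the same shape as the Medium baseline above:

grafana:
  replicas: 2
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 1
      memory: 2Gi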
Loki component scaling
Scale Loki when log ingestion, storage demand, or query load exceed current read, write, or backend capacity.
Scale Loki when you observe any of the following conditions:
- Log drops
- Slow ingestion
- Querier timeouts
Loki monitoring and scaling reference
| Metric | Threshold | Action |
|---|---|---|
|  | > 80% of limit | Scale ingesters |
|  | > 80% of limit | Adjust chunk size |
|  | > 1 GB/s | Scale distributors |
|  | p95 > 5 seconds | Optimize queries |
Scale Loki vertically by increasing CPU and memory, or horizontally by scaling read, write, and backend pods. Tune storage, retention periods, ingestion rate limits, and compaction behavior.
Loki baseline configuration
- Small

  loki:
    enabled: true
    loki:
      storage:
        bucketNames:
          chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 7d
        ingestion_rate_mb: 4
        ingestion_burst_size_mb: 6
    read:
      persistence:
        enabled: true
        size: 5Gi
        storageClassName: ""
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    write:
      persistence:
        enabled: true
        size: 5Gi
        storageClassName: ""
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    backend:
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    gateway:
      replicas: 1
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    chunksCache:
      enabled: false
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    resultsCache:
      enabled: false
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi

- Medium

  loki:
    enabled: true
    loki:
      storage:
        bucketNames:
          chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 15d
        ingestion_rate_mb: 8
        ingestion_burst_size_mb: 12
    read:
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: ""
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 1
          memory: 2Gi
    write:
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: ""
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 1
          memory: 2Gi
    backend:
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 1
          memory: 2Gi
    gateway:
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    chunksCache:
      enabled: false
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    resultsCache:
      enabled: false
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi

- Large

  loki:
    enabled: true
    loki:
      storage:
        bucketNames:
          chunks: my_loki_chunks_bucket
      limits_config:
        retention_period: 30d
        ingestion_rate_mb: 16
        ingestion_burst_size_mb: 24
    read:
      persistence:
        enabled: true
        size: 20Gi
        storageClassName: ""
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 4Gi
    write:
      persistence:
        enabled: true
        size: 20Gi
        storageClassName: ""
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 4Gi
    backend:
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
        requests:
          cpu: 2
          memory: 4Gi
    gateway:
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 512Mi
    chunksCache:
      enabled: true
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    resultsCache:
      enabled: true
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
Loki troubleshooting
Use the following guidance to troubleshoot and configure Loki:
- If logs are dropped, check ingestion rate limits and storage capacity.
- If queries are slow, optimize query patterns and increase querier resources.
- If memory usage is high, increase memory or reduce the retention period.
- Place these entries under the loki key in values.yaml.
- For upstream configuration keys, see the Loki Helm chart repo.
- For sizing guidance, see size Loki in the Grafana documentation.
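For example, if logs are dropped because of rate limiting, a values.yaml excerpt that raises the ingestion limits, mirroring the Medium baseline above, might look like this sketch:

loki:
  loki:
    limits_config:
      ingestion_rate_mb: 8         # per-tenant ingestion rate limit
      ingestion_burst_size_mb: 12  # allow short bursts above the rate limit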
Mimir component scaling
Scale Mimir when metrics ingestion, storage demand, or query load exceed the capacity of distributors, ingesters, or query components.
Scale Mimir when you observe any of the following conditions:
- Query slowness
- Alertmanager lag
- Ingester overloads
- HTTP 429 errors from distributors
- Out-of-order sample rejections
- Increasing discarded samples
Mimir monitoring and scaling reference
| Metric or signal | Threshold | Action |
|---|---|---|
|  | > 192k per ingester, or 80% of the 240k target | Scale ingesters |
|  | > 80% of limit | Adjust chunk size |
|  | Rate approaching the configured ingestion_rate limit | Scale distributors |
|  | > 80% of capacity | Scale distributors |
|  | p95 > 5 seconds | Optimize queries or scale queriers |
|  | > 10 consistently | Scale query frontend or queriers |
| Distributor CPU utilization | > 80% | Add distributor replicas |
| HTTP 429 responses | Any occurrences | Increase ingestion_rate and ingestion_burst_size |
| Out-of-order sample rejections | Samples delayed more than 5 minutes | Check Vector timeouts and increase Mimir rate limits |
| Discarded samples | Increasing | Investigate rate limiting or out-of-order issues |
| Ingester active items | Consistently high | Scale ingesters or increase memory |
Scale Mimir vertically by increasing CPU and memory, or horizontally by adding ingesters and queriers. Enable compaction and retention controls, and use object storage for long-term retention.
Tune the following Mimir components and settings:
- alertmanager: replicas, resources, sharding ring replication factor
- ingester: replicas, storage backend, persistent volume size, resources, ring replication factor
- store_gateway: persistent volume, replicas, resources
- compactor: retention period, persistent volume, replicas, resources
- distributor: replicas, resources, ring replication factor
- querier: replicas, resources
- query_frontend: replicas, resources
- Mimir limits: ingestion burst size, ingestion rate, maximum label names per series, out-of-order time window
Target the following performance metrics:
- 240,000 series per ingester at 50% utilization
- Size distributors by ingestion rate and target 50% CPU utilization at 2 cores per instance
- Size storage for the retention period and metrics cardinality
- 250 queries per second per query frontend
- 10 queries per second per querier
Mimir baseline configuration
- Small

  mimir:
    alertmanager:
      enabled: true
      extraArgs:
        alertmanager-storage.backend: local
        alertmanager-storage.local.path: /etc/alertmanager/config
        alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
        alertmanager.sharding-ring.replication-factor: "1"
      extraVolumeMounts:
        - mountPath: /etc/alertmanager/config
          name: alertmanager-config
        - mountPath: /alertmanager
          name: alertmanager-config-tmp
      extraVolumes:
        - name: alertmanager-config
          secret:
            secretName: alertmanager-config
        - emptyDir: {}
          name: alertmanager-config-tmp
      persistentVolume:
        accessModes:
          - ReadWriteOnce
        enabled: "1"
        size: 5Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
    ingester:
      extraArgs:
        ingester.max-global-series-per-user: "0"
        ingester.ring.replication-factor: "1"
      persistentVolume:
        size: 10Gi
      replicas: "1"
      resources:
        limits:
          cpu: 2
          memory: 5Gi
        requests:
          cpu: 1
          memory: 2Gi
    store_gateway:
      persistentVolume:
        size: 10Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    compactor:
      extraArgs:
        compactor.blocks-retention-period: 15d
      persistentVolume:
        enabled: "1"
        size: 10Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 2Gi
    distributor:
      extraArgs:
        ingester.ring.replication-factor: "1"
      replicas: "1"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    querier:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    query_frontend:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    nginx:
      replicas: "1"
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    overrides_exporter:
      replicas: "1"
      resources:
        limits:
          cpu: 500m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 128Mi
    query_scheduler:
      replicas: "1"
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
    ruler:
      replicas: "1"
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
    mimir:
      structuredConfig:
        activity_tracker:
          filepath: /data/activity.log
        limits:
          ingestion_burst_size: 50000
          ingestion_rate: 25000
          max_label_names_per_series: 60
          out_of_order_time_window: 2m

- Medium

  For deployments approaching 45 or more nodes, monitor for HTTP 429 errors and consider increasing ingestion_rate to 250000 and ingestion_burst_size to 500000. For more information, see Enterprise-scale considerations.

  mimir:
    alertmanager:
      enabled: true
      extraArgs:
        alertmanager-storage.backend: local
        alertmanager-storage.local.path: /etc/alertmanager/config
        alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
        alertmanager.sharding-ring.replication-factor: "2"
      extraVolumeMounts:
        - mountPath: /etc/alertmanager/config
          name: alertmanager-config
        - mountPath: /alertmanager
          name: alertmanager-config-tmp
      extraVolumes:
        - name: alertmanager-config
          secret:
            secretName: alertmanager-config
        - emptyDir: {}
          name: alertmanager-config-tmp
      persistentVolume:
        accessModes:
          - ReadWriteOnce
        enabled: "1"
        size: 10Gi
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    ingester:
      extraArgs:
        ingester.max-global-series-per-user: "0"
        ingester.ring.replication-factor: "1"
      persistentVolume:
        size: 30Gi
      replicas: "3"
      resources:
        limits:
          cpu: 2
          memory: 5Gi
        requests:
          cpu: 1
          memory: 2Gi
    store_gateway:
      persistentVolume:
        size: 30Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    compactor:
      extraArgs:
        compactor.blocks-retention-period: 30d
      persistentVolume:
        enabled: "1"
        size: 30Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 2Gi
    distributor:
      extraArgs:
        ingester.ring.replication-factor: "1"
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    querier:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    query_frontend:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    nginx:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    overrides_exporter:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 128Mi
    query_scheduler:
      replicas: "1"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 128Mi
    ruler:
      replicas: "1"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 128Mi
    mimir:
      structuredConfig:
        activity_tracker:
          filepath: /data/activity.log
        limits:
          ingestion_burst_size: 100000
          ingestion_rate: 50000
          max_label_names_per_series: 120
          out_of_order_time_window: 5m

- Large

  These higher rate limits help prevent HTTP 429 errors and cascading failures in large deployments.

  mimir:
    alertmanager:
      enabled: true
      extraArgs:
        alertmanager-storage.backend: local
        alertmanager-storage.local.path: /etc/alertmanager/config
        alertmanager.configs.fallback: /etc/alertmanager/config/default.yml
        alertmanager.sharding-ring.replication-factor: "3"
      extraVolumeMounts:
        - mountPath: /etc/alertmanager/config
          name: alertmanager-config
        - mountPath: /alertmanager
          name: alertmanager-config-tmp
      extraVolumes:
        - name: alertmanager-config
          secret:
            secretName: alertmanager-config
        - emptyDir: {}
          name: alertmanager-config-tmp
      persistentVolume:
        accessModes:
          - ReadWriteOnce
        enabled: "1"
        size: 20Gi
      replicas: "3"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    ingester:
      extraArgs:
        ingester.max-global-series-per-user: "0"
        ingester.ring.replication-factor: "1"
      persistentVolume:
        size: 50Gi
      replicas: "9"
      resources:
        limits:
          cpu: 2
          memory: 5Gi
        requests:
          cpu: 1
          memory: 2Gi
    store_gateway:
      persistentVolume:
        size: 50Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    compactor:
      extraArgs:
        compactor.blocks-retention-period: 60d
      persistentVolume:
        enabled: "1"
        size: 50Gi
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 2Gi
    distributor:
      extraArgs:
        ingester.ring.replication-factor: "1"
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi
    querier:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    query_frontend:
      replicas: "1"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 512Mi
    nginx:
      replicas: "2"
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 512Mi
    overrides_exporter:
      replicas: "2"
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 256Mi
    query_scheduler:
      replicas: "2"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 200m
          memory: 256Mi
    ruler:
      replicas: "2"
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 200m
          memory: 256Mi
    mimir:
      structuredConfig:
        activity_tracker:
          filepath: /data/activity.log
        limits:
          ingestion_burst_size: 500000
          ingestion_rate: 250000
          max_label_names_per_series: 240
          out_of_order_time_window: 10m
        distributor:
          instance_limits:
            max_ingestion_rate: 300000
        ingester:
          instance_limits:
            max_ingestion_rate: 300000
    runtimeConfig:
      overrides:
        anonymous:
          ingestion_rate: 500000
          ingestion_burst_size: 1000000
Mimir troubleshooting
Use the following guidance to troubleshoot and configure Mimir:
- If queries are slow, check query frontend and store gateway performance.
- If ingestion is slow, check distributor and ingester performance.
- If memory usage is high, increase memory or optimize compaction settings.
- If you see HTTP 429 errors, increase ingestion_rate and ingestion_burst_size.
- If Mimir rejects out-of-order samples, check for pipeline backpressure, investigate Vector timeouts, and increase Mimir rate limits.
- If you observe cascading failures, address the root cause by increasing Mimir ingestion limits and enabling Vector compression.
- If ingesters show high memory usage with many active items, scale ingesters or increase memory.
- Place these entries under the mimir key in values.yaml.
- For upstream configuration keys, see the Mimir Helm chart repo.
- For general product information, see the Mimir documentation.
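For example, if metric queries remain slow after reviewing the query frontend and store gateway, a values.yaml excerpt that adds querier capacity, following the key layout of the baselines above, might look like this sketch:

mimir:
  querier:
    replicas: "2"
    resources:
      limits:
        cpu: 1
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 512Mi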
Cross-component reference
This section provides best practices and troubleshooting guidance for managing observability components across your Mission Control deployment.
Best practices
Follow these best practices for observability component management:
- Monitor resource usage and adjust limits as needed.
- Use dashboards to monitor ingestion rates and resource usage.
- Set alerts on CPU, memory, and data-drop metrics.
- Scale proactively before adding new database clusters.
- Enable rate limits and burst protection to avoid overloading the system.
Consider the following when scaling:
- Performance optimization

  Optimize observability component performance by implementing query optimization techniques and resource management strategies.

- Query optimization

  Optimize queries with the following techniques:

  - Use appropriate time ranges for queries.
  - Implement query caching where possible.
  - Use rate() and increase() for counter metrics.
  - Avoid high-cardinality labels.
  - Use recording rules for complex queries (see the sketch after this list).

- Storage optimization

  Optimize storage with the following strategies:

  - Implement data lifecycle policies.
  - Use appropriate retention periods.
  - Enable compression where available.
  - Consider tiered storage for long-term data.
  - Monitor and adjust chunk sizes.

- Resource optimization

  Optimize resources with the following approaches:

  - Right-size resource requests and limits.
  - Implement proper pod scheduling.
  - Use node affinity for critical components.
  - Monitor and adjust resource quotas.
  - Implement proper garbage collection.
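As referenced in the query optimization techniques above, a recording rule precomputes an expensive expression so dashboards read the cheaper recorded series. The following is a minimal sketch in the standard Prometheus/Mimir rule format; the metric and rule names are placeholders, not names from your deployment:

groups:
  - name: observability-recording-rules
    rules:
      # Precompute a 5-minute rate so dashboards query the recorded series
      # instead of re-evaluating the expression on every refresh.
      - record: job:example_requests:rate5m
        expr: rate(example_requests_total[5m])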
Quick reference
Use the following tables as a quick reference for component sizing and troubleshooting common issues.
Component sizing
| Component | Small (≤3 nodes, ≤1000 tables) | Medium (≤6 nodes, ≤2000 tables) | Large (>6 nodes, >2000 tables) |
|---|---|---|---|
| Vector | 1 replica + 1 CPU, 2Gi | 2 replicas + 2 CPU, 4Gi | 3 replicas + 4 CPU, 8Gi |
| Grafana | 1 replica + 1 CPU, 2Gi | 2 replicas + 2 CPU, 4Gi | 3 replicas + 4 CPU, 8Gi |
| Loki | 3 replicas + 6 CPU, 12Gi | 6 replicas + 12 CPU, 24Gi | 9 replicas + 24 CPU, 48Gi |
| Mimir Distributor | 1 replica + 2 CPU, 2Gi | 2 replicas + 4 CPU, 4Gi | 2 replicas + 4 CPU, 4Gi |
| Mimir Ingester | 1 replica + 2 CPU, 5Gi | 3 replicas + 6 CPU, 15Gi | 9 replicas + 18 CPU, 45Gi |
| Mimir Querier | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi |
| Mimir Query Frontend | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi |
| Mimir Compactor | 1 replica + 1 CPU, 4Gi | 1 replica + 1 CPU, 4Gi | 1 replica + 1 CPU, 4Gi |
| Mimir Store Gateway | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi | 1 replica + 1 CPU, 1Gi |
| Mimir Alertmanager | 1 replica + 1 CPU, 2Gi | 2 replicas + 2 CPU, 2Gi | 3 replicas + 6 CPU, 6Gi |
Storage requirements
| Component | Small | Medium | Large |
|---|---|---|---|
| Loki, per component | 5Gi read, 5Gi write | 10Gi read, 10Gi write | 20Gi read, 20Gi write |
| Mimir Ingester | 10Gi per replica | 30Gi per replica | 50Gi per replica |
| Mimir Store Gateway | 10Gi | 30Gi | 50Gi |
| Mimir Compactor | 10Gi | 30Gi | 50Gi |
| Mimir Alertmanager | 5Gi | 10Gi | 20Gi |
Metrics capacity
| Deployment size | Estimated series count | Ingester replicas | Target series per ingester |
|---|---|---|---|
| Small | Up to 240k series | 1 | 240k |
| Medium | Up to 720k series | 3 | 240k |
| Large | Up to 2.16M series | 9 | 240k |
Enterprise-scale considerations
For enterprise-scale deployments, additional planning and configuration are required to handle high-volume telemetry data and ensure system reliability.
Scaling guidelines for enterprises
For enterprise-scale deployments, use the following formulas to calculate required resources.
- Mimir ingester replicas

  Ingester replicas = Total series ÷ 240,000. Round up to the nearest integer.

- Mimir distributor replicas

  Distributor replicas = Ingestion rate (samples per second) ÷ 50,000. Round up to the nearest integer, and target 50% CPU utilization at 2 cores per instance.

- Storage per ingester

  Storage = (Series count × Average sample size × Retention period) ÷ Ingester count. Add a 20% buffer for overhead.
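As a worked example, applying these formulas to the enterprise deployment described later in this section (6.38 million series ingested at 218,722 samples per second) gives:

Ingester replicas = ceil(6,380,000 ÷ 240,000) = 27
Distributor replicas = ceil(218,722 ÷ 50,000) = 5

These counts match the enterprise production Mimir configuration example below.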
Architecture and best practices
Consider the following architectural approaches and best practices for enterprise deployments:
- Deploy components across multiple regions in an active-active configuration.
- Distribute data using sharding strategies.
- Deploy multiple component instances with load balancing.
- Use separate clusters for development, staging, and production workloads.
- Implement object storage for long-term metrics and log retention.
- Monitor and alert on all components using the thresholds in this guide.
- Use a service mesh for traffic management and security.
- Implement backup and disaster recovery strategies.
- Assign dedicated teams to manage different observability components.
- Review and optimize resource allocation based on usage patterns.
- Scale proactively before adding new database clusters.
- Use the formulas above to estimate resource needs before deployment.
Enterprise production Mimir configuration example
For this example, consider an enterprise production deployment with the following characteristics:
- Database nodes: 180
- Multi-region nodes: Yes
- Metrics series: 6.38 million
- Ingestion rate: 218,722 samples per second
Assume the following performance targets are set:
- Series per ingester: 240k
- CPU utilization on distributors: 50%
- Query rate on the query frontend: 250 queries per second
- Query rate for each querier: 10 queries per second
To meet these performance targets, you might use the following Mimir configuration. This configuration requires 70 CPU cores and 173Gi memory.
| Component | Replicas | Resources per replica |
|---|---|---|
| Distributor | 5 | 2 cores, 2Gi memory |
| Ingester | 27 | 2 cores, 5Gi memory |
| Compactor | 1 | 1 core, 4Gi memory |
| Query Frontend | 1 | 1 core, 1Gi memory |
| Querier | 1 | 1 core, 1Gi memory |
| Store Gateway | 1 | 1 core, 1Gi memory |
| Alertmanager | 2 | 1 core, 1Gi memory |
Resolve Mimir and Vector rate limiting issues
Large deployments with the following characteristics may experience rate limiting issues:
- 40+ database nodes across multiple racks
- High metrics volume exceeding default rate limits
- Symptoms including HTTP 429 errors, out-of-order sample rejections, and Vector timeouts
Mimir rate limit increases
Apply these critical configuration changes to resolve cascading failures:
mimir:
mimir:
structuredConfig:
limits:
ingestion_rate: 250000
ingestion_burst_size: 500000
distributor:
instance_limits:
max_ingestion_rate: 300000
ingester:
instance_limits:
max_ingestion_rate: 300000
runtimeConfig:
overrides:
anonymous:
ingestion_rate: 500000
ingestion_burst_size: 1000000
Vector configuration for stability
vector:
config:
sinks:
vector_aggregator:
compression: true
buffer:
max_size: 2147483648
Expected outcomes
This configuration can help avoid the following issues:
- HTTP 429 rate-limit errors
- Out-of-order sample rejections
- Vector request timeouts that previously occurred every 1 to 2 minutes
- Pod restart cycles
- Cascading failures in the metrics pipeline
After applying these changes, clusters that previously showed these symptoms have run stably, with no pod restarts, for 4 to 6 days.
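To confirm the fix, one check is whether Mimir is still discarding samples. The exact metric name can vary by Mimir version, and the pod name and port below are placeholders; adjust them for your deployment:

kubectl exec -it MIMIR_DISTRIBUTOR_POD -n MC_NAMESPACE -- wget -qO- http://localhost:8080/metrics | grep -i discarded_samples

Replace MIMIR_DISTRIBUTOR_POD with a Mimir distributor pod name and MC_NAMESPACE with the namespace where you deployed the Mission Control Helm chart. A counter that stops increasing indicates that rate limiting is no longer discarding samples.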