Speculative query execution

Speculative queries are used to preemptively start a second execution of a query against another node, before the first node has replied or returned an error.

Speculative queries are used to preemptively start a second execution of a query against another node, before the first node has replied or returned an error. Sometimes a node may be slow to respond. Queries sent to that node will experience increased latency.

To mitigate this situation, use speculative execution to preemptively start a second execution of the query against another node, before the first node has replied or returned an error. If the second node replies faster, that response is returned to the client and the first execution is cancelled. If the first execution is cancelled, the driver ignores the response, but the request still interacts with the server.

Figure: Speculative execution with the second node responding first

Speculative execution with the second node responding first

The first node might reply just after the second execution call was started. In this case, the second preemptive execution is cancelled. For applications that use speculative execution, the data from whichever node replies faster is returned to the client.

Figure: Speculative execution with the first node responding first

Speculative execution with the first node responding first

Configuration

Speculative execution is disabled by default in the DSE drivers. See the driver-specific links for how to enable and configure speculative execution.

Table 1. Speculative execution for drivers
C/C++ C# Java Node.js PHP (not supported) Python Ruby (not supported)

Tuning and practical details

The goal of speculative execution is to improve the overall latency of an application. However, too many speculative executions increase the load on the cluster. If speculative executions are used to avoid sending queries to unhealthy nodes, a healthy node should rarely reach resource limits. Drivers provide a configurable delay threshold at which speculative executions will be sent. In order to determine an appropriate threshold for your application, benchmark the healthy platform state (all nodes are up and under a normal load), monitoring the response latencies. Based on these results, use the latency at a high percentile (p99.9) as the speculative executions threshold.

Alternatively, when low latency is the highest priority and the cluster can handle the increased throughput, set the threshold to 0, which effectively always performs speculative executions.

Most drivers surface metrics for speculative executions that can be used to observe the speculative executions frequency.

Query idempotence

If a query is not idempotent, the driver will never schedule speculative executions for the query, because there is no way to guarantee that only one coordinator will apply the mutation.

Retries

Turning on speculative executions doesn’t change the driver’s retry behavior. Each parallel execution triggers retries independently. The only impact is that all executions of the same query always share the same query plan, so each node will be used by no more than one execution.

Figure: Query plan used in speculative execution

Query plan used in speculative execution

Stream ID exhaustion

One effect of speculative executions is that many requests get cancelled, which can lead to a phenomenon called stream ID exhaustion. Each TCP connection can handle multiple simultaneous requests, identified by a unique number called a stream ID. See Connection pooling. When a request gets cancelled, the stream ID for that request cannot be used immediately because the response for that request may be returned later from the server once it fulfills the request. If this happens often, the number of available stream IDs diminishes over time. When the available stream IDs goes below a given threshold, the connection is closed and a new connection is created. If requests are often cancelled, connections will be recycled at a high rate.

In practice, exhausting all stream IDs on a connection should not occur. Each connection can reference 32768 stream IDs. Most driver implementations are configured by default to only send 1000 requests per connection.

Most drivers provide a metric for observing the number of inflight requests for a node. Some drivers provide a metric for observing the number of orphaned stream IDs. Monitor these metrics to ensure that stream ID exhaustion is not occurring.

If stream ID exhaustion is occurring, the typical solution is to add more capacity to the cluster, or adjust the system settings of the nodes to avoid resource limits.

Request ordering

Ordering issues are only a problem when using server-side timestamps. All recent versions of the DataStax drivers use client-side timestamps with exception of the C++, PHP, and Ruby drivers. Unless the driver is explicitly configured to use server-side timestamps, this section does not apply. See Query timestamps for details.

For example, the following query is run with speculative execution and server-side timestamps enabled.

INSERT INTO my_table (k, v) VALUES (1, 1);

When the first execution is slow, a second execution is triggered. Finally, the first execution completes, so the second execution is cancelled. However, cancelling an execution only means that the driver stops waiting for the server’s response. The request could still be active.

Suppose that while the second request is still active after the driver canceled the execution, the following query is run and completes successfully.

DELETE from my_table where k = 1;

The second request of the INSERT query finally reaches its target node, which applies the INSERT. The row that was successfully deleted is back, despite the driver canceling the second request.

Important: To avoid this scenario, use client-side timestamps.