Engine options
Use the engine options to tune general performance and throughput for dsbulk operations.
For query-level tuning, results paging, and other driver performance settings, see Driver options.
Synopsis
The standard form for engine options is --engine.KEY VALUE:
- KEY: The specific option to configure, such as the dryRun option.
- VALUE: The value for the option, such as a string, number, or Boolean. HOCON syntax rules apply unless otherwise noted. For more information, see Escape and quote DSBulk command line arguments.
Short and long forms
On the command line, you can specify options in short form (if available), standard form, or long form.
For all engine options, the long form is the standard form with a dsbulk. prefix, such as --dsbulk.engine.dryRun.
The following examples show the same command with different forms of the dryRun option:
# Short form
dsbulk load -dryRun -url filename.csv -k ks1 -t table1
# Standard form
dsbulk load --engine.dryRun -url filename.csv -k ks1 -t table1
# Long form
dsbulk load --dsbulk.engine.dryRun -url filename.csv -k ks1 -t table1
In configuration files, you must use the long form with the dsbulk. prefix.
For example:
dsbulk.engine.dryRun = true
--engine.dataSizeSamplingEnabled
Whether to use data size sampling to optimize the DSBulk execution engine.
- true (default): Enable data size sampling for applicable dsbulk load commands. This option only applies to dsbulk load commands that read from a file, directory, or HTTP/HTTPS URL. Data size sampling never occurs for dsbulk unload, dsbulk count, or dsbulk load commands that read from standard input (-url - or -url stdin:/).
- false: Disable data size sampling. Set --engine.dataSizeSamplingEnabled false if your data source cannot be read multiple times.
To perform data size sampling, DSBulk invokes the connector to read a few sample records and estimate the total data size. Then, DSBulk invokes the connector again to read the entire body of data and perform the load operation. Because DSBulk must invoke the connector twice, data size sampling is only possible if the data source can be read multiple times. DSBulk can’t complete the load operation if it cannot reread the entire data source, such as short-lived ephemeral files and single-use URLs.
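For example, the following illustrative command disables data size sampling when loading from a source that can be read only once; the URL, keyspace, and table names are placeholders:
# Disable sampling for a source that cannot be reread
dsbulk load --engine.dataSizeSamplingEnabled false -url https://example.com/export.csv -k ks1 -t table1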
--engine.dryRun (-dryRun)
Whether to run a dsbulk load command as a dry run (simulation) without actually modifying the database.
- false (default): Disable dry-run mode. The load operation runs normally, and the data is written to the database.
- true: Enable dry-run mode so you can test a load operation without writing data to the database.
--engine.executionId
Set a unique identifier for each dsbulk load, dsbulk unload, and dsbulk count operation.
- null (default), unspecified, or empty: The DSBulk engine automatically generates identifiers in the format WORKFLOW_TIMESTAMP, such as LOAD_20240315-142530-123456:
  - WORKFLOW: The operation type, either LOAD, UNLOAD, or COUNT.
  - TIMESTAMP: The UTC time that the workflow started, formatted as uuuuMMdd-HHmmss-SSSSSS, with microsecond precision, if available, or millisecond precision otherwise.
- User-defined string: A user-supplied identifier, such as an environment variable or template. If the execution ID is user-defined, you must guarantee its uniqueness. Duplicate execution IDs can cause commands to fail. To specify a template, the format must comply with formatting rules for String.format(), and it can use the following variables:
  - %1$s: A variable representing the workflow type. At runtime, this resolves to LOAD, UNLOAD, or COUNT.
  - %2$t: A variable representing the UTC time that the workflow started, formatted as uuuuMMdd-HHmmss-SSSSSS, with microsecond precision, if available, or millisecond precision otherwise.
  - %3$s: A variable representing the JVM process PID for the operation. This parameter isn’t available on all operating systems. If the PID cannot be determined, DSBulk inserts a random integer instead.
--engine.executionId "%1$s-%2$t" is the same as the default identifier format (WORKFLOW_TIMESTAMP). You can use this template to amend the default identifier format. For example, to add a test- prefix to the default identifier format, set --engine.executionId "test-%1$s-%2$t".
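The following illustrative commands set a literal execution ID and a template-based execution ID; the file, keyspace, and table names are placeholders, and you may need to adjust quoting for your shell (see Escape and quote DSBulk command line arguments):
# Literal execution ID; you must guarantee uniqueness
dsbulk load --engine.executionId "nightly-load-001" -url filename.csv -k ks1 -t table1
# Template that adds a test- prefix to the default WORKFLOW_TIMESTAMP format
dsbulk load --engine.executionId "test-%1$s-%2$t" -url filename.csv -k ks1 -t table1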
--engine.maxConcurrentQueries (-maxConcurrentQueries)
The maximum number of concurrent queries to run in parallel.
This option is a safeguard that can prevent a command from starting more concurrent queries than the cluster can handle. You can also use this option to regulate throughput if latency is too high.
Allowed values for --engine.maxConcurrentQueries include the following:
- AUTO (default): DSBulk calculates an optimal number of concurrent queries according to the number of available cores and the operation being executed. Typically, the resulting value is no more than 8 times the number of available cores.
- NC: A special syntax that you can use to set the value as a multiple of the number of available cores for a given operation. For example, if you set -maxConcurrentQueries 10C and there are 8 cores, then there can be 80 parallel queries (10 * 8 = 80).
- Positive integer: Specify an exact number of concurrent queries allowed to run in parallel, such as 100.
To check the --engine.maxConcurrentQueries setting, set -verbosity 2, run a load, unload, or count command, and then check the operation.log file for a line starting with Using read concurrency: or Using write concurrency:.
For the purposes of this option, batch statements are considered one query.
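For example, the following illustrative command caps a load at 100 concurrent queries and raises verbosity so that the effective concurrency is recorded in operation.log; the file, keyspace, and table names are placeholders:
# Limit concurrency and log the effective read/write concurrency
dsbulk load -url filename.csv -k ks1 -t table1 -maxConcurrentQueries 100 -verbosity 2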
--s3.clientCacheSize
When using the urlfile option with AWS S3 URLs, DSBulk creates an S3 client for each bucket specified in the S3 URLs.
DSBulk caches the S3 clients to prevent them from being recreated unnecessarily when processing many S3 URLs that target the same buckets.
If all of your S3 URLs target the same bucket, then the same S3 client is used for each URL, and the cache contains only one entry.
The size of the S3 client cache is controlled by the --s3.clientCacheSize (--dsbulk.s3.clientCacheSize) option, and the default is 20 entries.
The default value is arbitrary, and it only needs to be changed when loading from many different S3 buckets in a single command.
For more information, see Load from AWS S3 and --connector.csv.urlfile or --connector.json.urlfile.
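For example, if a urlfile references more than 20 distinct S3 buckets, you might raise the cache size accordingly. The urlfile name, cache size, keyspace, and table names below are illustrative:
# Increase the S3 client cache when loading from many buckets
dsbulk load --connector.csv.urlfile urls.txt --s3.clientCacheSize 40 -k ks1 -t table1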