Engine options

Specify engine options for the dsbulk command.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-dryRun, --engine.dryRun, --dsbulk.engine.dryRun { true | false }

Enable or disable dry-run mode, a test mode that runs the command but does not load data. Not applicable for unloading.

Default: false

--engine.executionId, --dsbulk.engine.executionId string

A unique identifier to attribute to each execution. When unspecified or empty, the engine automatically generates identifiers of the form workflow_timestamp, where:

workflow stands for the workflow type (LOAD, UNLOAD, etc.);
timestamp is the current timestamp formatted as uuuuMMdd-HHmmss-SSSSSS (see Patterns for Formatting and Parsing in Oracle Java documentation) in UTC, with microsecond precision if available, and millisecond precision otherwise.

When this identifier is user-supplied, it is important to guarantee its uniqueness; failing to do so may result in execution failures.

It is also possible to provide templates here. Any format compliant with the formatting rules of String.format() is accepted, and can contain the following parameters:

%1$s : the workflow type (LOAD, UNLOAD, etc.);
%2$t : the current time (with microsecond precision if available, and millisecond precision otherwise);
%3$s : the JVM process PID (this parameter might not be available on some operating systems; if its value cannot be determined, a random integer will be inserted instead).

Default: null
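The identifier rules above can be illustrated with plain Java. This is a minimal sketch, not DataStax Bulk Loader code: the class and method names are hypothetical, and it assumes that the second template argument is a java.time.ZonedDateTime and that the PID is passed as a string. Note that in String.format() a %2$t parameter must be followed by a date/time conversion character, such as Y, m, or d.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of DataStax Bulk Loader.
final class ExecutionIdSketch {

    // Timestamp pattern quoted in the option description.
    static final DateTimeFormatter DEFAULT_TIMESTAMP =
            DateTimeFormatter.ofPattern("uuuuMMdd-HHmmss-SSSSSS", Locale.ROOT);

    // Builds an identifier of the documented default form: workflow_timestamp.
    static String defaultId(String workflow, ZonedDateTime nowUtc) {
        return workflow + "_" + DEFAULT_TIMESTAMP.format(nowUtc);
    }

    // Expands a template with the three documented parameters:
    // %1$s = workflow type, %2$t = current time, %3$s = PID.
    // Assumption: the time is supplied as a ZonedDateTime and the PID as a String.
    static String fromTemplate(String template, String workflow, ZonedDateTime nowUtc, String pid) {
        return String.format(Locale.ROOT, template, workflow, nowUtc, pid);
    }

    public static void main(String[] args) {
        // Fixed instant so the output is reproducible.
        ZonedDateTime now = ZonedDateTime.of(2024, 3, 5, 10, 15, 30, 123_456_000, ZoneOffset.UTC);

        System.out.println(defaultId("LOAD", now));
        // prints LOAD_20240305-101530-123456

        System.out.println(fromTemplate("%1$s-%2$tY%2$tm%2$td-pid%3$s", "LOAD", now, "12345"));
        // prints LOAD-20240305-pid12345
    }
}
```

A template that includes both the time and the PID, as in the second call, makes collisions between executions started at the same moment less likely.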

-maxConcurrentQueries, --engine.maxConcurrentQueries, --dsbulk.engine.maxConcurrentQueries string

The maximum number of queries that can be executed in parallel.

This option acts as a safeguard to prevent more queries executing in parallel than the cluster can handle, or to regulate throughput when latencies get too high. Batch statements count as one query.

When using continuous paging, also make sure to set this number to a value less than or equal to the number of nodes in the local datacenter multiplied by the value configured server-side for continuous_paging.max_concurrent_sessions in the cassandra.yaml configuration file (60 by default); otherwise some requests might be rejected.
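The continuous paging bound is simple arithmetic, sketched below in Java. The class and method names are hypothetical and exist only for this illustration; the inputs are the number of nodes in the local datacenter and the server-side continuous_paging.max_concurrent_sessions value.

```java
// Hypothetical helper, for illustration only; not part of DataStax Bulk Loader.
final class ContinuousPagingLimit {

    // Upper bound described above: local-datacenter nodes multiplied by
    // continuous_paging.max_concurrent_sessions.
    static int upperBound(int localDcNodes, int maxConcurrentSessions) {
        return Math.multiplyExact(localDcNodes, maxConcurrentSessions);
    }

    // Caps a requested concurrency at that upper bound.
    static int clamp(int requested, int localDcNodes, int maxConcurrentSessions) {
        return Math.min(requested, upperBound(localDcNodes, maxConcurrentSessions));
    }

    public static void main(String[] args) {
        // 3 local nodes with the default of 60 sessions each.
        System.out.println(upperBound(3, 60));  // prints 180
        System.out.println(clamp(256, 3, 60));  // prints 180
        System.out.println(clamp(64, 3, 60));   // prints 64
    }
}
```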

The special syntax NC can be used to specify a number that is a multiple of the number of available cores. For example, if the number of cores is 8, then 0.5C = 0.5 * 8 = 4 concurrent queries.
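The NC arithmetic can be sketched in Java as follows. This is an illustration under stated assumptions, not the parser that DataStax Bulk Loader uses: the class name is hypothetical, and both the truncation of fractional results and the floor of 1 are assumptions. The special value AUTO is not handled here.

```java
// Hypothetical helper, for illustration only; not part of DataStax Bulk Loader.
final class CoreMultiple {

    // Resolves either a plain integer ("32") or a multiple of cores ("0.5C").
    static int resolve(String value, int cores) {
        String v = value.trim();
        if (v.endsWith("C")) {
            double multiplier = Double.parseDouble(v.substring(0, v.length() - 1));
            // Assumption: truncate toward zero and never go below 1.
            return Math.max(1, (int) (multiplier * cores));
        }
        return Integer.parseInt(v);
    }

    public static void main(String[] args) {
        System.out.println(resolve("0.5C", 8)); // prints 4
        System.out.println(resolve("2C", 8));   // prints 16
        System.out.println(resolve("32", 8));   // prints 32
    }
}
```

On a real machine, the core count could be taken from Runtime.getRuntime().availableProcessors().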

The default value is AUTO. With this special value, DataStax Bulk Loader® optimizes the number of concurrent queries according to the number of available cores, and the operation being executed. The actual value usually ranges from the number of cores to eight times that number.

Starting in 1.6.0, using maxConcurrentQueries is the preferred way to regulate DataStax Bulk Loader throughput. Avoid using the following settings to regulate throughput:

dsbulk.executor.maxInFlight
dsbulk.executor.maxPerSecond

Those two settings are still supported. However, their default values changed in 1.6.0 to -1 (disabled). The settings create semaphores and thus block the driver under high contention. The new setting, --dsbulk.engine.maxConcurrentQueries, achieves the same effect without blocking the driver.

Also, the executor.continuousPaging.maxConcurrentQueries setting is deprecated. Use engine.maxConcurrentQueries instead. If executor.continuousPaging.maxConcurrentQueries is provided, DataStax Bulk Loader 1.6.0 and later ignores it and logs a warning.

To check the current engine.maxConcurrentQueries setting, set the logging verbosity to 2 (-verbosity 2). See -verbosity, --log.verbosity, --dsbulk.log.verbosity { 0 | 1 | 2 } for option details. Then look in the operation.log file for a line starting with "Using read concurrency:" or "Using write concurrency:".
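The log check can be automated with a few lines of Java. This is a minimal sketch with a hypothetical class name. It matches the two markers anywhere in a line, on the assumption that log lines may carry a timestamp or level prefix. The sample lines are placeholders, not real DataStax Bulk Loader output.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; not part of DataStax Bulk Loader.
final class ConcurrencyLogScan {

    // Markers named in the option description.
    static final List<String> MARKERS =
            List.of("Using read concurrency:", "Using write concurrency:");

    // Returns every line that contains one of the markers.
    static List<String> findConcurrencyLines(List<String> logLines) {
        return logLines.stream()
                .filter(line -> MARKERS.stream().anyMatch(line::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder lines; the text after each marker is not real log output.
        List<String> sample = List.of(
                "some unrelated line",
                "Using write concurrency: <value>",
                "another unrelated line");
        findConcurrencyLines(sample).forEach(System.out::println);
        // prints Using write concurrency: <value>
    }
}
```

To scan a real file, the lines could be read with java.nio.file.Files.readAllLines and passed to findConcurrencyLines.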

Default: AUTO
