Batch options

Use the batch options to specify how the Cassandra Query Language (CQL) statements generated from a dsbulk load command are grouped before being written to the database.

These options only apply to dsbulk load operations. They are ignored for dsbulk unload and dsbulk count operations.

Synopsis

The standard form for batch options is --batch.KEY VALUE:

KEY: The specific option to configure, such as the mode option.
VALUE: The value for the option, such as a string, number, or Boolean.

HOCON syntax rules apply unless otherwise noted. For more information, see Escape and quote DSBulk command line arguments.

Short and long forms

On the command line, you can specify options in short form (if available), standard form, or long form.

For all batch options, the long form is the standard form with a dsbulk. prefix. For example, the long form for --batch.mode is --dsbulk.batch.mode.

In configuration files, you must use the long form, such as dsbulk.batch.mode = "PARTITION_KEY".
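The relationship between the standard form, the long form, and the configuration file entry can be sketched in Python. The helper names below are hypothetical and are not part of DSBulk; the sketch only applies the prefix rule described above, and it assumes string values.

```python
def to_long_form(standard_option: str) -> str:
    """Turn a standard-form option such as --batch.mode into its long form."""
    if not standard_option.startswith("--batch."):
        raise ValueError(f"not a standard-form batch option: {standard_option}")
    # The long form is the standard form with a "dsbulk." prefix after the dashes.
    return "--dsbulk." + standard_option[2:]


def to_config_line(standard_option: str, value: str) -> str:
    """Render a configuration file entry, which always uses the long form."""
    key = to_long_form(standard_option)[2:]  # drop the leading "--"
    return f'{key} = "{value}"'


print(to_long_form("--batch.mode"))
print(to_config_line("--batch.mode", "PARTITION_KEY"))
```

The first call prints --dsbulk.batch.mode, and the second prints the configuration file entry shown above.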

Tune batch performance

DataStax recommends that you follow these steps to tune DSBulk batching:

Start with the default of --batch.mode "PARTITION_KEY".
If the average batch size is close to 1, increase --batch.bufferSize, and adjust --batch.maxBatchStatements or --batch.maxSizeInBytes if needed.
If increasing --batch.bufferSize doesn’t improve batch performance, set the batch mode to REPLICA_SET, and set --batch.maxBatchStatements or --batch.maxSizeInBytes to low values to help avoid timeout errors.

REPLICA_SET doesn’t perform well on all clusters. If it doesn’t help on your cluster, leave --batch.mode unchanged and adjust --batch.maxBatchStatements or --batch.maxSizeInBytes instead.
Continue to improve throughput by gradually increasing --batch.maxBatchStatements or --batch.maxSizeInBytes.
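The tuning steps above could translate into a sequence of commands like the ones this Python sketch prints. It only builds the command lines and does not run them. The keyspace, table, and file names are hypothetical, the numeric values (512 and 8) are illustrative rather than recommended, and the -k, -t, and -url arguments are assumed to be how your load command identifies its target and input.

```python
import shlex

# Hypothetical keyspace, table, and input file; replace with your own.
BASE = ["dsbulk", "load", "-k", "my_keyspace", "-t", "my_table", "-url", "data.csv"]


def tuning_command(mode, buffer_size=None, max_statements=None, max_bytes=None):
    """Build (but do not run) a dsbulk load command for one tuning attempt."""
    args = BASE + ["--batch.mode", mode]
    if buffer_size is not None:
        args += ["--batch.bufferSize", str(buffer_size)]
    if max_statements is not None:
        args += ["--batch.maxBatchStatements", str(max_statements)]
    if max_bytes is not None:
        args += ["--batch.maxSizeInBytes", str(max_bytes)]
    return args


attempts = [
    tuning_command("PARTITION_KEY"),                   # start with the default mode
    tuning_command("PARTITION_KEY", buffer_size=512),  # then try a larger buffer
    tuning_command("REPLICA_SET", max_statements=8),   # then REPLICA_SET with a low limit
]
for args in attempts:
    print(shlex.join(args))
```

Compare the throughput and average batch size of each attempt before moving on to the next one.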

--batch.mode

To disable batching, set --batch.mode "DISABLED".

Set the grouping mode:

PARTITION_KEY (default, recommended): Groups statements that have the same partition key.
REPLICA_SET: Groups statements that have the same replica set. This mode can perform better with small clusters and lower replication factors. For large clusters and high replication factors, this mode is typically equal to or worse than PARTITION_KEY.
DISABLED: Disables statement batching.
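To build intuition for what grouping by partition key means, here is a conceptual Python sketch. It is not DSBulk's implementation: it ignores buffering and replica sets, and the function name and sample rows are hypothetical. It groups rows by key, caps each batch at a maximum number of statements, and reports the average batch size.

```python
from collections import defaultdict


def batch_by_partition_key(rows, max_statements):
    """Group rows by partition key, then split each group into capped batches."""
    groups = defaultdict(list)
    for partition_key, payload in rows:
        groups[partition_key].append(payload)

    batches = []
    for partition_key, payloads in groups.items():
        for start in range(0, len(payloads), max_statements):
            batches.append((partition_key, payloads[start:start + max_statements]))
    return batches


# Hypothetical rows as (partition key, payload) pairs.
rows = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("c", 5)]
batches = batch_by_partition_key(rows, max_statements=2)
average = sum(len(payloads) for _, payloads in batches) / len(batches)
print(batches)
print(f"average batch size: {average:.2f}")
```

With these sample rows, most partition keys occur only once, so the average batch size stays close to 1. Data with few repeated partition keys benefits little from PARTITION_KEY batching.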

--batch.maxBatchStatements

Set the maximum number of statements that a batch can contain. For an unlimited number of statements per batch, set this option to 0 or a negative number.

The ideal value depends on two factors:

The size of the data being loaded: If each statement writes a large amount of data, use smaller batches.
The batch mode: In PARTITION_KEY mode, you can use larger batches. In REPLICA_SET mode, smaller batches usually perform better.

If --batch.mode is PARTITION_KEY or REPLICA_SET, you must set at least one of --batch.maxBatchStatements or --batch.maxSizeInBytes to a positive number.

Default: 32
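The requirement that a batching mode needs at least one positive limit can be expressed as a small pre-flight check. This is a sketch of the rule as stated above, not DSBulk's own validation code; the function name is hypothetical, and the parameter defaults mirror the documented defaults of 32 and -1.

```python
def check_batch_limits(mode, max_batch_statements=32, max_size_in_bytes=-1):
    """Raise ValueError if a batching mode is configured without any positive limit."""
    batching = mode in ("PARTITION_KEY", "REPLICA_SET")
    has_limit = max_batch_statements > 0 or max_size_in_bytes > 0
    if batching and not has_limit:
        raise ValueError(
            "set --batch.maxBatchStatements or --batch.maxSizeInBytes "
            "to a positive number when batching is enabled"
        )


check_batch_limits("PARTITION_KEY")                                  # defaults: accepted
check_batch_limits("REPLICA_SET", max_batch_statements=0,
                   max_size_in_bytes=65536)                          # size limit only: accepted
check_batch_limits("DISABLED", max_batch_statements=0)               # no batching: accepted
```

Calling check_batch_limits("PARTITION_KEY", max_batch_statements=0) raises ValueError, because both limits are then unlimited.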

--batch.maxSizeInBytes

Set the maximum data size per batch. This is the number of bytes required to encode all the data to be persisted, not including the overhead generated by the native protocol, such as headers and frames. Be aware that the heuristic used to compute data sizes isn’t perfectly accurate, and it can sometimes underestimate the actual size.

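For self-managed clusters (not Astra DB databases), this value must be less than or equal to the value of batch_size_fail_threshold_in_kb in cassandra.yaml. Note that the cassandra.yaml setting is expressed in kilobytes, whereas this option is expressed in bytes.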

For an unlimited data size per batch, set this option to 0 or a negative number.

If --batch.mode is PARTITION_KEY or REPLICA_SET, you must set at least one of --batch.maxBatchStatements or --batch.maxSizeInBytes to a positive number.

Default: -1 (unlimited)
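Because the batch size limit is given in bytes and batch_size_fail_threshold_in_kb is given in kilobytes, a quick unit conversion helps when comparing the two. The Python sketch below assumes that 1 KB equals 1024 bytes and uses 50 as an example threshold; check your own cassandra.yaml for the actual value. The function names are hypothetical.

```python
def fail_threshold_in_bytes(batch_size_fail_threshold_in_kb):
    """Convert the cassandra.yaml threshold to bytes (assumes 1 KB = 1024 bytes)."""
    return batch_size_fail_threshold_in_kb * 1024


def fits_fail_threshold(max_size_in_bytes, batch_size_fail_threshold_in_kb):
    """Check a positive size limit against the server-side threshold."""
    if max_size_in_bytes <= 0:
        raise ValueError("an unlimited size cannot be compared with the threshold")
    return max_size_in_bytes <= fail_threshold_in_bytes(batch_size_fail_threshold_in_kb)


threshold_kb = 50  # example value only
print(fail_threshold_in_bytes(threshold_kb))
print(fits_fail_threshold(40_000, threshold_kb))
print(fits_fail_threshold(60_000, threshold_kb))
```

Since the size heuristic can underestimate the actual batch size, consider leaving some headroom below the threshold instead of matching it exactly.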

--batch.bufferSize

Set the buffer size to use for flushing batched statements.

DataStax recommends that you set this option to a multiple of --batch.maxBatchStatements. For example, if --batch.maxBatchStatements is set to 50, you could set --batch.bufferSize to 100 (2 * --batch.maxBatchStatements). Higher values consume more memory, often without any noticeable performance gain.

If --batch.maxBatchStatements is unlimited, consider setting --batch.bufferSize to a fixed value, such as 1000, to avoid unbounded memory usage.

If --batch.bufferSize is set to 0 or a negative number, the buffer size is automatically set to 4 times the value of --batch.maxBatchStatements. This rule also applies when --batch.bufferSize isn’t set, because the option defaults to -1. For example, if --batch.maxBatchStatements is set to 0 or a negative number, then the buffer size is also unlimited (4 * unlimited = unlimited). If --batch.maxBatchStatements is set to a positive number, such as 32, then the buffer size is a fixed value: 4 * 32 = 128.

Default: -1 (4 times the value of --batch.maxBatchStatements)
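The buffer size rule can be worked through in a few lines of Python. This is a sketch of the rule as described in this section, not DSBulk's code; the function name is hypothetical, and None stands in for an unbounded buffer.

```python
def effective_buffer_size(buffer_size=-1, max_batch_statements=32):
    """Return the buffer size that results from the rule, or None when unbounded."""
    if buffer_size > 0:
        return buffer_size            # an explicit positive value is used as-is
    if max_batch_statements <= 0:
        return None                   # 4 * unlimited = unlimited
    return 4 * max_batch_statements   # implicit value


print(effective_buffer_size())                                          # 128
print(effective_buffer_size(buffer_size=100, max_batch_statements=50))  # 100
print(effective_buffer_size(max_batch_statements=0))                    # None
```

The last case is the one to watch: with no statement limit and no explicit buffer size, nothing bounds the buffer, which is why a fixed value such as 1000 is suggested.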

Deprecated batch options

--batch.maxBatchSize: Deprecated. Use --batch.maxSizeInBytes and --batch.maxBatchStatements instead.
