Batch options
Use the batch options to specify how the CQL statements generated from a dsbulk load command are grouped before being written to the database.
These options only apply to dsbulk load operations.
They are ignored for dsbulk unload and dsbulk count operations.
Synopsis
The standard form for batch options is --batch.KEY VALUE:
- KEY: The specific option to configure, such as the mode option.
- VALUE: The value for the option, such as a string, number, or Boolean.

HOCON syntax rules apply unless otherwise noted. For more information, see Escape and quote DSBulk command line arguments.
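As an illustration of the standard form, the following sketch assembles a load command with two batch options and prints it instead of executing it. The input file, keyspace, and table names (export.csv, ks1, table1) are hypothetical placeholders.

```shell
# Hypothetical input file, keyspace, and table; substitute your own.
batch_opts='--batch.mode PARTITION_KEY --batch.maxBatchStatements 32'
load_cmd="dsbulk load -url export.csv -k ks1 -t table1 $batch_opts"

# Print the assembled command for review rather than running it.
echo "$load_cmd"
```

Each batch option follows the same --batch.KEY VALUE pattern, so further options can be appended to batch_opts in the same way.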
Short and long forms
On the command line, you can specify options in short form (if available), standard form, or long form.
For all batch options, the long form is the standard form with a dsbulk. prefix.
For example, the long form for --batch.mode is --dsbulk.batch.mode.
In configuration files, you must use the long form, such as dsbulk.batch.mode = "PARTITION_KEY".
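A minimal sketch of the long form in a configuration file follows. The file name dsbulk-batch.conf is a hypothetical example, and the values shown are the documented defaults. The sketch assumes the file is later passed to dsbulk with its configuration file option (typically -f).

```shell
# Write a small configuration file that uses the long form of each option.
# The file name is a hypothetical example.
cat > dsbulk-batch.conf <<'EOF'
dsbulk.batch.mode = "PARTITION_KEY"
dsbulk.batch.maxBatchStatements = 32
EOF

# Equivalent command-line forms of the first setting:
#   standard: --batch.mode "PARTITION_KEY"
#   long:     --dsbulk.batch.mode "PARTITION_KEY"
cat dsbulk-batch.conf
```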
Tune batch performance
DataStax recommends that you follow these steps to tune DSBulk batching:
- Start with the default of --batch.mode "PARTITION_KEY".
- If the average batch size is close to 1, increase --batch.bufferSize, and adjust --batch.maxBatchStatements or --batch.maxSizeInBytes if needed.
- If increasing --batch.bufferSize doesn’t improve batch performance, set the batch mode to REPLICA_SET, and set --batch.maxBatchStatements or --batch.maxSizeInBytes to low values to help avoid timeout errors. REPLICA_SET isn’t performant for all clusters. As an alternative, try adjusting --batch.maxBatchStatements or --batch.maxSizeInBytes without changing --batch.mode.
- Continue to improve throughput by gradually increasing --batch.maxBatchStatements or --batch.maxSizeInBytes.
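The progression above can be sketched as a sequence of option sets to try, one load run at a time. All numeric values are illustrative assumptions rather than recommendations, and tuning_step is a hypothetical helper that only prints options; it doesn't run dsbulk. The "..." stands for your own connection and mapping options.

```shell
# Print the batch options to try at each tuning step.
# All numeric values are placeholders chosen for illustration.
tuning_step() {
  case "$1" in
    1) echo '--batch.mode PARTITION_KEY' ;;
    2) echo '--batch.mode PARTITION_KEY --batch.bufferSize 256 --batch.maxBatchStatements 64' ;;
    3) echo '--batch.mode REPLICA_SET --batch.maxBatchStatements 8' ;;
    4) echo '--batch.mode REPLICA_SET --batch.maxBatchStatements 16' ;;
    *) echo "unknown step: $1" >&2; return 1 ;;
  esac
}

for step in 1 2 3 4; do
  echo "step $step: dsbulk load ... $(tuning_step "$step")"
done
```

Comparing throughput between runs is what tells you whether to keep a change or revert it before moving to the next step.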
--batch.mode
To disable batching, set --batch.mode to DISABLED.
Set the grouping mode:
- PARTITION_KEY (default, recommended): Groups statements that have the same partition key.
- REPLICA_SET: Groups statements that have the same replica set. This mode can perform better with small clusters and lower replication factors. For large clusters and high replication factors, this mode is typically equal to or worse than PARTITION_KEY.
- DISABLED: Disable statement batching.
--batch.maxBatchStatements
Set the maximum number of statements that a batch can contain.
For an unlimited number of statements per batch, set this option to 0 or a negative number.
The ideal value depends on two factors:
- The size of the data being loaded: If each statement writes a large amount of data, use smaller batches.
- The batch mode: In PARTITION_KEY mode, you can use larger batches. In REPLICA_SET mode, smaller batches usually perform better.
If --batch.mode is PARTITION_KEY or REPLICA_SET, you must set at least one of --batch.maxBatchStatements or --batch.maxSizeInBytes to a positive number.
Default: 32
--batch.maxSizeInBytes
Set the maximum data size per batch. This is the number of bytes required to encode all the data to be persisted, not including the overhead generated by the native protocol, such as headers and frames. Be aware that the heuristic used to compute data sizes isn’t perfectly accurate, and it can sometimes underestimate the actual size.
For self-managed clusters (not Astra DB databases), this value must be less than or equal to the value of batch_size_fail_threshold_in_kb in cassandra.yaml.
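As a rough sketch of this constraint, the following converts a batch_size_fail_threshold_in_kb value to bytes so that it can be used as a ceiling for --batch.maxSizeInBytes. The threshold of 50 is an assumed example value (check your own cassandra.yaml), and the conversion assumes 1 KB = 1024 bytes.

```shell
# Assumed example value; read the real one from your cassandra.yaml.
fail_threshold_kb=50

# Convert kilobytes to bytes (assuming 1 KB = 1024 bytes).
max_size_in_bytes=$((fail_threshold_kb * 1024))

echo "--batch.maxSizeInBytes $max_size_in_bytes"
```

Because the size heuristic can underestimate the actual size, choosing a value somewhat below this ceiling leaves headroom.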
For an unlimited data size per batch, set this option to 0 or a negative number.
If --batch.mode is PARTITION_KEY or REPLICA_SET, you must set at least one of --batch.maxBatchStatements or --batch.maxSizeInBytes to a positive number.
Default: -1 (unlimited)
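The rule that ties the two limits to the batch mode can be expressed as a small pre-flight check. This is a sketch only: batch_limits_valid is a hypothetical helper, not part of DSBulk, and it assumes that DISABLED mode requires neither limit.

```shell
# batch_limits_valid MODE MAX_BATCH_STATEMENTS MAX_SIZE_IN_BYTES
# Returns 0 if the combination satisfies the rule, 1 otherwise.
batch_limits_valid() {
  mode=$1
  max_statements=$2
  max_bytes=$3
  case "$mode" in
    PARTITION_KEY|REPLICA_SET)
      # At least one of the two limits must be a positive number.
      [ "$max_statements" -gt 0 ] || [ "$max_bytes" -gt 0 ]
      ;;
    DISABLED)
      # Assumption: no limit is needed when batching is disabled.
      return 0
      ;;
    *)
      return 1
      ;;
  esac
}

batch_limits_valid PARTITION_KEY 32 -1 && echo "defaults are valid"
batch_limits_valid REPLICA_SET 0 -1 || echo "both limits unlimited: invalid"
```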
--batch.bufferSize
A number specifying the buffer size to use for flushing batched statements.
DataStax recommends that you set this option to a multiple of --batch.maxBatchStatements.
For example, if --batch.maxBatchStatements is set to 50, you could set --batch.bufferSize to 100 (2 * --batch.maxBatchStatements).
Higher values consume more memory, often without any noticeable performance gain.
If --batch.maxBatchStatements is unlimited, consider setting --batch.bufferSize to a fixed value, such as 1000, to avoid unbounded memory usage.
If --batch.bufferSize is set to 0 or a negative number, the buffer size is automatically set to 4 times the value of --batch.maxBatchStatements.
This rule is also triggered if --batch.bufferSize isn’t set because it defaults to -1.
For example, if --batch.maxBatchStatements is set to 0 or a negative number, then the buffer size is also unlimited (4 * unlimited = unlimited).
If --batch.maxBatchStatements is set to a positive number, such as 32, then the buffer size is a fixed value, such as 4 * 32 = 128.
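The buffer size rule described above can be sketched as a small function. effective_buffer_size is a hypothetical helper written for illustration; it mirrors the documented behavior but isn't DSBulk's own code.

```shell
# effective_buffer_size BUFFER_SIZE MAX_BATCH_STATEMENTS
# Prints the buffer size that results from the documented rule.
effective_buffer_size() {
  buffer_size=$1
  max_statements=$2
  if [ "$buffer_size" -gt 0 ]; then
    # An explicit positive value is used as given.
    echo "$buffer_size"
  elif [ "$max_statements" -gt 0 ]; then
    # 0 or negative: 4 times the maximum number of statements per batch.
    echo $((4 * max_statements))
  else
    # 4 * unlimited = unlimited
    echo "unlimited"
  fi
}

effective_buffer_size -1 32   # prints 128
effective_buffer_size 100 50  # prints 100
effective_buffer_size -1 0    # prints unlimited
```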
Deprecated batch options
- --batch.maxBatchSize: Deprecated. Use batch.maxSizeInBytes and batch.maxBatchStatements.