Batch options
Specify batch options for the dsbulk command.
Batch options specify how statements are grouped before they are written during a load operation.
These options do not apply to unload operations.
--batch.bufferSize, --dsbulk.batch.bufferSize number
The buffer size to use for flushing batched statements.
This option should be set to a multiple of maxBatchStatements, such as 2 or 4 times its value.
Higher values consume more memory and usually do not result in any noticeable performance gain.
When set to less than or equal to zero, the buffer size is implicitly set to 4 times maxBatchStatements.
Default: -1
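As a rough illustration of the sizing guidance above, the following shell sketch derives bufferSize as a multiple of maxBatchStatements and prints the resulting command line instead of running it. The input file, keyspace, and table names (export.csv, ks1, table1) are hypothetical, and the multiplier of 4 is an assumption that mirrors the implicit default.

```shell
#!/bin/sh
# Sketch only: derive bufferSize from maxBatchStatements.
max_batch_statements=32   # documented default for --batch.maxBatchStatements
multiplier=4              # assumption: same factor that is applied implicitly
buffer_size=$((max_batch_statements * multiplier))

# Hypothetical input file, keyspace, and table; replace with your own.
cmd="dsbulk load -url export.csv -k ks1 -t table1 --batch.maxBatchStatements $max_batch_statements --batch.bufferSize $buffer_size"

# Print the command instead of executing it, so it can be reviewed first.
echo "$cmd"
```

With these values the explicit setting matches what the implicit default would produce; a different multiplier, such as 2, would trade some batching opportunity for lower memory use.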
--batch.maxBatchSize number
Deprecated.
Instead use --batch.maxSizeInBytes and --batch.maxBatchStatements.
--batch.maxSizeInBytes, --dsbulk.batch.maxSizeInBytes number
The maximum data size that a batch can hold.
This is the number of bytes required to encode all the data to be persisted, without counting the overhead generated by the native protocol (headers, frames, and so on).
The value specified should be less than or equal to the value that has been configured server-side for the option batch_size_fail_threshold_in_kb in cassandra.yaml (that option is expressed in kilobytes, whereas this one is expressed in bytes).
The heuristic used to compute data sizes is not 100% accurate and sometimes underestimates the actual size. For more information, refer to the topic Cassandra.yaml configuration file.
When set to a value less than or equal to zero, the maximum data size is considered unlimited.
At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.
Default: -1
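Because the server-side threshold is given in kilobytes and maxSizeInBytes in bytes, a small conversion can help when choosing a value. The sketch below is illustrative only: the threshold of 50 is an assumed value to be replaced with the one from your cassandra.yaml, 1 KB is assumed to be 1024 bytes, and the 80 percent headroom is an arbitrary margin to allow for the size heuristic underestimating.

```shell
#!/bin/sh
# Sketch only: convert a server-side threshold in KB to a byte limit for dsbulk.
fail_threshold_kb=50   # assumption: batch_size_fail_threshold_in_kb on your cluster
headroom_percent=80    # assumption: margin, since data sizes can be underestimated

threshold_bytes=$((fail_threshold_kb * 1024))
max_size_in_bytes=$((threshold_bytes * headroom_percent / 100))

# Print the option to append to a dsbulk load command.
echo "--batch.maxSizeInBytes $max_size_in_bytes"
```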
--batch.maxBatchStatements, --dsbulk.batch.maxBatchStatements number
The maximum number of statements that a batch can contain. The ideal value depends on two factors:
- The data being loaded: the larger the data, the smaller the batches should be.
- The batch mode: when PARTITION_KEY is used, larger batches are acceptable, whereas when REPLICA_SET is used, smaller batches usually perform better.
When set to a value less than or equal to zero, the maximum number of statements is considered unlimited.
At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.
Default: 32
--batch.mode, --dsbulk.batch.mode string
The grouping mode. Valid values are:
- DISABLED: Disables statement batching.
- PARTITION_KEY: Groups together statements that share the same partition key. This is the default mode, and the preferred one.
- REPLICA_SET: Groups together statements that share the same replica set. This mode might yield better results for small clusters and lower replication factors, but tends to perform equally well or worse than PARTITION_KEY for larger clusters or high replication factors.
When tuning DataStax Bulk Loader for batching, the recommended approach is:
- Start with PARTITION_KEY.
- If the average batch size is close to 1, try increasing bufferSize.
- If increasing bufferSize does not help, switch to REPLICA_SET and set maxBatchStatements or maxSizeInBytes to low values, which may avoid timeouts or errors.
- To improve throughput, increase maxBatchStatements or maxSizeInBytes.
Default: PARTITION_KEY
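The tuning approach for batching could be tried out with a sequence of commands along the following lines. This is a sketch under stated assumptions, not a prescribed procedure: the input file, keyspace, and table names are hypothetical, and the numeric values (256, 8, 16) are arbitrary examples. The script only prints the candidate commands so that each one can be reviewed and run separately.

```shell
#!/bin/sh
# Sketch only: print one candidate dsbulk command per tuning step.
# Hypothetical input file, keyspace, and table; replace with your own.
base="dsbulk load -url export.csv -k ks1 -t table1"

# Step 1: start with the default mode.
step1="$base --batch.mode PARTITION_KEY"
# Step 2: average batch size close to 1, so try a larger buffer (256 is arbitrary).
step2="$step1 --batch.bufferSize 256"
# Step 3: no improvement, so switch mode and keep batches small (8 is arbitrary).
step3="$base --batch.mode REPLICA_SET --batch.maxBatchStatements 8"
# Step 4: raise the limit gradually to improve throughput (16 is arbitrary),
# shown here for the REPLICA_SET case.
step4="$base --batch.mode REPLICA_SET --batch.maxBatchStatements 16"

printf '%s\n' "$step1" "$step2" "$step3" "$step4"
```

Changing one option at a time between runs makes it easier to attribute any difference in throughput or error rate to that option.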