Specify batch options for DataStax Bulk Loader.
Batch options specify how statements are grouped before they are written during a load operation.
These options do not apply to unloading.
--batch.bufferSize, --dsbulk.batch.bufferSize number
The buffer size to use for flushing batched statements.
This option should be set to a multiple of maxBatchStatements, such as 2 or 4 times its value.
Higher values consume more memory and usually do not result in any noticeable performance gain.
When set to a value less than or equal to zero, the buffer size is implicitly set to 4 times maxBatchStatements.
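As a rough illustration of the sizing rule for this option, the following Python sketch computes the buffer size that would be in effect. The helper names are hypothetical and are not part of DataStax Bulk Loader; the sketch assumes that the fallback for a non-positive value is 4 times maxBatchStatements and that maxBatchStatements itself is positive.

```python
def effective_buffer_size(buffer_size: int, max_batch_statements: int) -> int:
    """Return the buffer size implied by the documented fallback rule.

    Hypothetical helper for illustration only; assumes max_batch_statements > 0.
    """
    if buffer_size <= 0:
        # Assumed fallback: a non-positive value means 4 times maxBatchStatements.
        return 4 * max_batch_statements
    return buffer_size


def is_multiple_of_batch_limit(buffer_size: int, max_batch_statements: int) -> bool:
    """Check the recommendation that bufferSize be a multiple of maxBatchStatements."""
    return max_batch_statements > 0 and buffer_size % max_batch_statements == 0


if __name__ == "__main__":
    size = effective_buffer_size(-1, 32)
    print(size, is_multiple_of_batch_limit(size, 32))
```

A check like this can be run in a wrapper script before passing --batch.bufferSize on the command line.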
--batch.maxSizeInBytes, --dsbulk.batch.maxSizeInBytes number
The maximum data size that a batch can hold.
This is the number of bytes required to encode all the data to be persisted, without counting the overhead generated by the native protocol (headers, frames, and so on).
The value specified should be less than or equal to the value configured server-side for the batch_size_fail_threshold_in_kb option in cassandra.yaml.
The heuristic used to compute data sizes is not fully accurate and sometimes underestimates the actual size.
For more information, refer to the topic Cassandra.yaml configuration file.
When set to a value less than or equal to zero, the maximum data size is considered unlimited.
At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.
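The constraints described for this option can be checked before a load is started. The following Python sketch is a hypothetical pre-flight check, not part of DataStax Bulk Loader: the function names are illustrative, the conversion assumes that one kilobyte in batch_size_fail_threshold_in_kb equals 1024 bytes, and the limits are assumed to be irrelevant when the mode is DISABLED.

```python
def threshold_in_bytes(batch_size_fail_threshold_in_kb: int) -> int:
    """Convert the server-side threshold to bytes (assumes 1 KB = 1024 bytes)."""
    return batch_size_fail_threshold_in_kb * 1024


def check_batch_limits(
    mode: str,
    max_size_in_bytes: int,
    max_batch_statements: int,
    batch_size_fail_threshold_in_kb: int,
) -> None:
    """Raise ValueError if the settings contradict the documented constraints."""
    if mode == "DISABLED":
        return  # Assumption: with batching off, the limits are not used.
    if max_size_in_bytes <= 0 and max_batch_statements <= 0:
        raise ValueError(
            "at least one of maxBatchStatements or maxSizeInBytes must be positive"
        )
    limit = threshold_in_bytes(batch_size_fail_threshold_in_kb)
    if max_size_in_bytes > limit:
        raise ValueError(
            f"maxSizeInBytes ({max_size_in_bytes}) exceeds the server threshold ({limit})"
        )
```

Because the size heuristic can underestimate, passing this check does not guarantee that the server accepts every batch.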
--batch.maxBatchStatements, --dsbulk.batch.maxBatchStatements number
The maximum number of statements that a batch can contain. The ideal value depends on two factors:
The data being loaded: the larger the data, the smaller the batches should be.
The batch mode: when PARTITION_KEY is used, larger batches are acceptable, whereas when REPLICA_SET is used, smaller batches usually perform better.
When set to a value less than or equal to zero, the maximum number of statements is considered unlimited.
At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.
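To make the interaction between the statement limit and the size limit concrete, the following Python sketch groups statements by partition key and starts a new batch whenever either limit would be exceeded. It is a toy model written for this explanation, not the grouping algorithm that DataStax Bulk Loader uses; the function name and the (partition key, size in bytes) representation are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Statement = Tuple[str, int]  # (partition key, encoded size in bytes)


def group_into_batches(
    statements: Iterable[Statement],
    max_batch_statements: int,
    max_size_in_bytes: int,
) -> List[List[Statement]]:
    """Toy model: group by partition key, splitting on either limit.

    Non-positive limits are treated as unlimited.
    """
    by_key: Dict[str, List[Statement]] = defaultdict(list)
    for stmt in statements:
        by_key[stmt[0]].append(stmt)

    batches: List[List[Statement]] = []
    for group in by_key.values():
        current: List[Statement] = []
        current_bytes = 0
        for stmt in group:
            too_many = 0 < max_batch_statements <= len(current)
            too_big = 0 < max_size_in_bytes < current_bytes + stmt[1]
            # Start a new batch when adding this statement would break a limit.
            if current and (too_many or too_big):
                batches.append(current)
                current, current_bytes = [], 0
            current.append(stmt)
            current_bytes += stmt[1]
        if current:
            batches.append(current)
    return batches


if __name__ == "__main__":
    rows = [("a", 100), ("a", 100), ("a", 100), ("b", 100)]
    print([len(b) for b in group_into_batches(rows, 2, 0)])
```

In this model, larger rows reach the size limit sooner, which is one way to see why larger data calls for smaller batches.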
--batch.mode, --dsbulk.batch.mode string
The grouping mode. Valid values are:
DISABLED: Disables statement batching.
PARTITION_KEY: Groups together statements that share the same partition key. This is the default mode, and the preferred one.
REPLICA_SET: Groups together statements that share the same replica set. This mode might yield better results for small clusters and lower replication factors, but tends to perform equally well or worse than PARTITION_KEY for larger clusters or high replication factors.
When tuning DataStax Bulk Loader for batching, the recommended approach is:
If the average batch size is close to 1, try increasing bufferSize.
If increasing bufferSize does not help, switch to REPLICA_SET and set maxBatchStatements or maxSizeInBytes to low values, which may avoid timeouts or errors.
To improve throughput, increase