Batch options

Batch options for the dsbulk command control how statements are grouped into batches before they are written during a load operation. They do not apply to unload operations.

--batch.bufferSize, --dsbulk.batch.bufferSize number

The buffer size to use for flushing batched statements. This option should be set to a multiple of maxBatchStatements, such as 2 or 4 times its value. Higher values consume more memory and usually do not result in any noticeable performance gain. When set to less than or equal to zero, the buffer size is implicitly set to 4 times maxBatchStatements.

Default: -1
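The fallback rule above can be sketched in a few lines of Python. This only illustrates the documented rule and is not DataStax Bulk Loader's implementation. The function name effective_buffer_size is hypothetical. The text above does not cover the case where maxBatchStatements is itself unlimited, so the sketch rejects that case rather than guessing.

```python
def effective_buffer_size(buffer_size: int, max_batch_statements: int) -> int:
    """Return the buffer size implied by the rule described above.

    Hypothetical helper for illustration only; not part of DSBulk.
    """
    if buffer_size > 0:
        # An explicit positive value is used as given.
        return buffer_size
    if max_batch_statements <= 0:
        # Not covered by the documented rule, so this sketch refuses to guess.
        raise ValueError("maxBatchStatements must be positive to derive a buffer size")
    # Zero or negative: fall back to 4 times maxBatchStatements.
    return 4 * max_batch_statements


print(effective_buffer_size(-1, 32))  # both options left at their defaults
print(effective_buffer_size(64, 32))  # explicit buffer size is kept
```

With both options left at their defaults (bufferSize -1, maxBatchStatements 32), the rule yields a buffer of 128 statements.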

--batch.maxBatchSize number

Deprecated. Use --batch.maxSizeInBytes and --batch.maxBatchStatements instead.

--batch.maxSizeInBytes, --dsbulk.batch.maxSizeInBytes number

The maximum data size that a batch can hold, in bytes. This is the number of bytes required to encode all the data to be persisted, not counting the overhead generated by the native protocol (headers, frames, and so on). The value should be less than or equal to the value configured server-side for the batch_size_fail_threshold_in_kb option in cassandra.yaml. Note that the server-side option is expressed in kilobytes, whereas this option is expressed in bytes. For more information, refer to the topic Cassandra.yaml configuration file.

The heuristic used to compute data sizes is not 100% accurate and sometimes underestimates the actual size.

When set to a value less than or equal to zero, the maximum data size is considered unlimited. At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.

Default: -1
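Because the server-side threshold is given in kilobytes and maxSizeInBytes is given in bytes, a small conversion helps when choosing a value. The Python sketch below is illustrative only. The helper name, the assumption that one kilobyte is 1024 bytes, and the 80 percent headroom (chosen here because the size heuristic can underestimate) are all assumptions of this example, not values prescribed by DataStax Bulk Loader.

```python
def max_size_in_bytes_for(threshold_kb: int, headroom_percent: int = 80) -> int:
    """Suggest a maxSizeInBytes value that stays under a server-side threshold.

    Hypothetical helper: threshold_kb is the batch_size_fail_threshold_in_kb
    value read from cassandra.yaml; headroom_percent is an arbitrary margin.
    """
    if threshold_kb <= 0:
        raise ValueError("threshold_kb must be positive")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be between 1 and 100")
    # Assumption: 1 KB = 1024 bytes. Integer arithmetic avoids rounding surprises.
    return threshold_kb * 1024 * headroom_percent // 100


# Example threshold of 50 KB, keeping 20 percent of it as a safety margin.
print(max_size_in_bytes_for(50))
```

The result could then be passed as the value of --batch.maxSizeInBytes. Adjust the margin to the data being loaded.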

--batch.maxBatchStatements, --dsbulk.batch.maxBatchStatements number

The maximum number of statements that a batch can contain. The ideal value depends on two factors:

  1. The data being loaded: the larger the data, the smaller the batches should be.

  2. The batch mode: when PARTITION_KEY is used, larger batches are acceptable, whereas when REPLICA_SET is used, smaller batches usually perform better.

When set to a value less than or equal to zero, the maximum number of statements is considered unlimited. At least one of maxBatchStatements or maxSizeInBytes must be set to a positive value when batching is enabled.

Default: 32
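The interaction between the two limits can be sketched in Python as follows. This is a simplified model and not DataStax Bulk Loader's batching code: statements are represented only by their encoded size in bytes, the function name split_into_batches is hypothetical, and placing an oversized statement alone in its own batch is a choice made for this sketch.

```python
from typing import Iterable, List


def split_into_batches(
    sizes: Iterable[int],
    max_batch_statements: int = 32,
    max_size_in_bytes: int = -1,
) -> List[List[int]]:
    """Split statement sizes (in bytes) into batches honoring both limits.

    A limit less than or equal to zero is treated as unlimited, but at least
    one of the two limits must be positive.
    """
    if max_batch_statements <= 0 and max_size_in_bytes <= 0:
        raise ValueError(
            "at least one of maxBatchStatements or maxSizeInBytes must be positive"
        )
    batches: List[List[int]] = []
    batch: List[int] = []
    batch_bytes = 0
    for size in sizes:
        too_many = max_batch_statements > 0 and len(batch) + 1 > max_batch_statements
        too_big = max_size_in_bytes > 0 and batch_bytes + size > max_size_in_bytes
        if batch and (too_many or too_big):
            # Adding this statement would break a limit: close the current batch.
            batches.append(batch)
            batch, batch_bytes = [], 0
        batch.append(size)
        batch_bytes += size
    if batch:
        batches.append(batch)
    return batches


# Five 100-byte statements, at most two statements per batch.
print(split_into_batches([100] * 5, max_batch_statements=2))
# Three 400-byte statements, no statement limit, at most 1000 bytes per batch.
print(split_into_batches([400] * 3, max_batch_statements=-1, max_size_in_bytes=1000))
```

The first call produces batches of 2, 2, and 1 statements. The second produces one batch of two statements (800 bytes) followed by a batch holding the remaining statement.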

--batch.mode, --dsbulk.batch.mode string

The grouping mode. Valid values are:

  • DISABLED: Disables statement batching.

  • PARTITION_KEY: Groups together statements that share the same partition key. This is the default mode, and the preferred one.

  • REPLICA_SET: Groups together statements that share the same replica set. This mode might yield better results for small clusters and lower replication factors, but tends to perform equally well or worse than PARTITION_KEY for larger clusters or high replication factors.

When tuning DataStax Bulk Loader for batching, the recommended approach is:

  1. Start with PARTITION_KEY.

  2. If the average batch size is close to 1, try increasing bufferSize.

  3. If increasing bufferSize does not help, switch to REPLICA_SET and set maxBatchStatements or maxSizeInBytes to low values, which may avoid timeouts or errors.

  4. To improve throughput, increase maxBatchStatements or maxSizeInBytes.

Default: PARTITION_KEY
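The effect of the tuning steps on the average batch size can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions and does not reproduce DataStax Bulk Loader's internals: statements are represented by integer partition keys, the buffer is processed in fixed-size chunks, and the mapping from partition key to replica set (key // 4) is invented for the example. The function name average_batch_size is hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence


def average_batch_size(
    partition_keys: Sequence[int],
    group_key: Callable[[int], Hashable],
    buffer_size: int,
    max_batch_statements: int = 32,
) -> float:
    """Group each buffer of statements by group_key and return the mean batch size."""
    if buffer_size <= 0 or max_batch_statements <= 0:
        raise ValueError("buffer_size and max_batch_statements must be positive here")
    batch_sizes: List[int] = []
    for start in range(0, len(partition_keys), buffer_size):
        # Only statements that sit in the same buffer can be batched together.
        groups: Dict[Hashable, List[int]] = defaultdict(list)
        for key in partition_keys[start:start + buffer_size]:
            groups[group_key(key)].append(key)
        for group in groups.values():
            # Each group is cut into batches of at most max_batch_statements.
            for i in range(0, len(group), max_batch_statements):
                batch_sizes.append(len(group[i:i + max_batch_statements]))
    if not batch_sizes:
        return 0.0
    return sum(batch_sizes) / len(batch_sizes)


# 64 statements cycling over 8 partition keys.
keys = [i % 8 for i in range(64)]

# PARTITION_KEY-style grouping with a small buffer: every batch has one statement.
print(average_batch_size(keys, lambda k: k, buffer_size=8))
# Same grouping with a larger buffer: more statements per partition key meet.
print(average_batch_size(keys, lambda k: k, buffer_size=32))
# REPLICA_SET-style grouping (made-up mapping: keys 0-3 and 4-7 share replicas).
print(average_batch_size(keys, lambda k: k // 4, buffer_size=8))
```

In this model, the small buffer gives an average batch size of 1.0, the larger buffer raises it to 4.0, and grouping by the invented replica set also reaches 4.0 with the small buffer. Real workloads depend on key distribution and cluster topology, so treat the numbers as illustrative.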

