dsbulk

DataStax Bulk Loader provides the dsbulk command for loading, unloading, and counting data to or from:

  • DataStax Astra DB

  • DataStax Enterprise (DSE) 5.1 and 6.8 databases

  • Open source Apache Cassandra® 2.1 and later databases

The three subcommands, load, unload, and count, are straightforward. Each requires either the keyspace and table options or a schema.query. The load and unload subcommands also require a designated data source (CSV or JSON).
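As a minimal sketch of the three subcommands, the shell functions below wrap each invocation so that the required arguments are explicit. The function names are hypothetical, and any keyspace, table, or path you pass in (for example ks1, table1, export.csv) is a placeholder of your own choosing. The -url short option is the one used elsewhere on this page.

```shell
# Load a CSV file or directory into keyspace.table.
# Usage: load_table export.csv ks1 table1
load_table() {
  dsbulk load -url "$1" -k "$2" -t "$3"
}

# Unload keyspace.table to the given destination.
# Usage: unload_table /tmp/unload ks1 table1
unload_table() {
  dsbulk unload -url "$1" -k "$2" -t "$3"
}

# Count the rows of keyspace.table.
# Usage: count_table ks1 table1
count_table() {
  dsbulk count -k "$1" -t "$2"
}

# Count using a query instead of a keyspace and table pair.
# Usage: count_query 'SELECT id FROM ks1.table1'
count_query() {
  dsbulk count --schema.query "$1"
}
```

The last function illustrates the alternative form: a schema.query takes the place of the keyspace and table options.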

A wide variety of options are also available to help you tailor how DataStax Bulk Loader operates. Each option either has a defined default value or takes a value inferred from the input data (when loading) or from the database data (when unloading). The options described here are grouped functionally, so that additional requirements can be noted. For example, when loading or unloading CSV data, the connector.csv.url option must be set to the path or URL of the CSV data file.

The standalone tool is launched using the command dsbulk from within the bin directory of your distribution. The tool also provides inline help for all settings. A configuration file specifying option values can be used, or options can be specified on the command line. Options specified on the command line will override the configuration file option settings.
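The precedence rule can be sketched as follows, assuming the configuration file is passed with the -f option and that settings are nested under a dsbulk block in HOCON. The function names, the file contents, and the names ks1, table1, and export.csv are hypothetical placeholders.

```shell
# Write a small example configuration file to the path given as $1.
write_example_conf() {
  cat > "$1" <<'EOF'
dsbulk {
  connector.name = csv
  connector.csv.url = "export.csv"
  schema.keyspace = ks1
  schema.table = table1
}
EOF
}

# Load using the file ($1), but override the table from the command line ($2).
# The command-line value wins over schema.table in the file.
load_with_override() {
  dsbulk load -f "$1" -t "$2"
}
```

For example, running write_example_conf my.conf followed by load_with_override my.conf table2 would load into table2 rather than table1, while the keyspace and CSV location still come from the file.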

Synopsis

dsbulk ( load | unload | count ) [options]
  (( -k | --keyspace ) keyspace_name
  ( -t | --table ) table_name)
  | ( --schema.query string )
  [ help | --help ]

Legend

Syntax conventions and their descriptions:

  • Italics: Variable value. Replace with a user-defined value.

  • [ ]: Optional. Square brackets ([ ]) surround optional command arguments. Do not type the square brackets.

  • ( ): Group. Parentheses ( ( ) ) identify a group to choose from. Do not type the parentheses.

  • |: Or. A vertical bar (|) separates alternative elements. Type any one of the elements. Do not type the vertical bar.

  • [ -- ]: Separate the command line options from the command arguments with two hyphens ( -- ). This syntax is useful when arguments might be mistaken for command line options.

General use

Get general help about dsbulk and the common options:

dsbulk help

Get help about particular dsbulk options, such as the connector.csv options, using the help subcommand:

dsbulk help connector.csv

Run dsbulk -c csv with the --help option to see its short options, along with the general help:

dsbulk -c csv --help

Display the version number:

dsbulk --version

Escaping and Quoting Command Line Arguments

When supplied on the command line, all option values are expected to be in valid HOCON syntax. Control characters, the backslash character, and the double-quote character all need to be properly escaped. For example, \t is the escape sequence for the tab character, whereas \\ is the escape sequence for the backslash character:

dsbulk load -delim '\t' -url 'C:\\Users\\My Folder'

In general, string values containing special characters also need to be properly quoted with double-quotes, as required by the HOCON syntax:

dsbulk load -url '"C:\\Users\\My Folder"'

However, when the expected type of an option is a string, the surrounding double-quotes can be omitted for convenience. Note the absence of the double-quotes in the first example. Similarly, when an argument is a list, the surrounding square brackets can be omitted, making the following two lines equivalent:

dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'

The same applies for arguments of type map: it is possible to omit the surrounding curly braces, making the following two lines equivalent:

dsbulk load --connector.json.deserializationFeatures '{ USE_BIG_DECIMAL_FOR_FLOATS : true }'
dsbulk load --connector.json.deserializationFeatures 'USE_BIG_DECIMAL_FOR_FLOATS : true'

This syntactic sugar is only available for command line arguments of type string, list, or map. All other option types, as well as all options specified in a configuration file, must be fully compliant with HOCON syntax, and it is the user’s responsibility to ensure that such options are properly escaped and quoted.
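Because the shell applies its own quoting rules before dsbulk parses the value as HOCON, it can help to check what the shell actually passes along. The sketch below uses a hypothetical helper, show_args, that prints each argument as a program would receive it. It only illustrates POSIX shell quoting and does not run dsbulk.

```shell
# Print each argument on its own line, in brackets, exactly as received.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Single quotes: backslashes and inner double-quotes reach the program untouched.
show_args -delim '\t' -url '"C:\\Users\\My Folder"'

# Double quotes: the shell itself collapses each \\ to a single \ first.
show_args -url "C:\\Users\\My Folder"
```

With single quotes the value arrives with both backslashes intact, which is why the examples above use them. With double quotes the shell reduces each \\ to a single backslash, so the value is no longer escaped the way HOCON expects.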

Detection of write failures

In the Cassandra documentation, you may have encountered one or more of the following terms, all of which have the same meaning:

  • Lightweight Transactions (LWT), used in this topic

  • Compare-And-Set (CAS)

  • Paxos protocol

DataStax Bulk Loader detects any failures due to failed LWT write operations. In 1.3.2 or later, records that could not be inserted are shown in two files:

  • paxos.bad is the "bad file" devoted to LWT write failures.

  • paxos-errors.log is the debug file devoted to LWT write failures.

DataStax Bulk Loader also writes any failed records to one of the following files in the operation’s directory, depending on when the failure occurred. If the failure occurred while:

  • parsing data, the records are written to connector.bad.

  • mapping data to the supported DSE, DataStax Astra, or Apache Cassandra databases, the records are written to mapping.bad.

  • inserting data into any of those supported databases, the records are written to load.bad.

The operation’s directory is the logs subdirectory under the location from which you ran the dsbulk command.
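To check quickly whether any records failed, a small helper along these lines can list every .bad file under the logs directory with its line count. The function name list_bad_files is hypothetical, and the default path assumes the command is run from the same location as dsbulk. A line count only approximates the number of failed records when each record occupies a single line.

```shell
# List each .bad file under a logs directory, prefixed by its line count.
# Usage: list_bad_files [logs_dir]   (defaults to ./logs)
list_bad_files() {
  logs_dir="${1:-logs}"
  find "$logs_dir" -type f -name '*.bad' | sort | while IFS= read -r file; do
    # tr strips the padding that some wc implementations add.
    printf '%s %s\n' "$(wc -l < "$file" | tr -d ' ')" "$file"
  done
}
```

An empty result means no connector.bad, mapping.bad, load.bad, or paxos.bad file was found under that directory.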
