dsbulk

Bulk loading and unloading tool for DSE and DDAC databases.

The DataStax Bulk Loader, dsbulk, loads data from a variety of sources and unloads data from DataStax Enterprise (DSE) or DataStax Distribution of Apache Cassandra™ (DDAC) databases for transfer, use, or storage. The load and unload subcommands are straightforward. Both require either the keyspace and table options or the schema.query option, plus a data source.

A wide variety of options are also available to help you tailor how DataStax Bulk Loader operates. These options have defined default values or values inferred from the input data, if the operation is loading, or from the database data, if the operation is unloading. The options described here are grouped functionally, so that additional requirements can be noted. For example, if loading or unloading CSV data, the connector.csv.url option must be set, specifying the path or URL of the CSV data file used for loading or unloading.

The standalone tool is launched using the command dsbulk from within the bin directory of your distribution. The tool also provides inline help for all settings. A configuration file specifying option values can be used, or options can be specified on the command line. Options specified on the command line will override the configuration file option settings.
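
As a rough illustration of the precedence rule above, a configuration file might look like the following sketch. The file name load.conf, the paths, and the names ks1 and table1 are hypothetical. The dsbulk { } wrapper, the long option names schema.keyspace and schema.table, and the -f flag for choosing the file are assumptions not stated on this page, so check them against the inline help for your version.

```
# load.conf (hypothetical file name), written in HOCON syntax
dsbulk {
  connector.name = "csv"
  connector.csv.url = "/data/export.csv"
  schema.keyspace = "ks1"
  schema.table = "table1"
}

# Invocation (assumed -f flag). The -url value given on the command line
# takes precedence over connector.csv.url in the file:
#   dsbulk load -f load.conf -url /data/other.csv
```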

Synopsis

dsbulk ( load | unload | count ) [options]
  (( -k | --keyspace ) keyspace_name 
  ( -t | --table ) table_name) 
  | ( --schema.query string )
  [ help | --help ]

Table 1. Legend
Syntax conventions    Description
Italics               Variable value. Replace with a user-defined value.
[ ]                   Optional. Square brackets ( [ ] ) surround optional command arguments. Do not type the square brackets.
( )                   Group. Parentheses ( ( ) ) identify a group to choose from. Do not type the parentheses.
|                     Or. A vertical bar ( | ) separates alternative elements. Type any one of the elements. Do not type the vertical bar.
[ -- ]                Separate the command line options from the command arguments with two hyphens ( -- ). This syntax is useful when arguments might be mistaken for command line options.

General use

Get general help about dsbulk and the common options:
dsbulk help
Get help about a particular group of dsbulk options, such as the connector.csv options, using the help subcommand:
dsbulk help connector.csv
Run dsbulk -c csv with the --help option to see the connector's short options, along with the general help:
dsbulk -c csv --help
Display the version number:
dsbulk --version

Escaping and Quoting Command Line Arguments

When supplied on the command line, all option values are expected to be in valid HOCON syntax. In particular, control characters, the backslash character, and the double-quote character must be properly escaped. For example, \t is the escape sequence for the tab character, and \\ is the escape sequence for the backslash character:
dsbulk load -delim '\t' -url 'C:\\Users\\My Folder'
In general, string values containing special characters also need to be properly quoted with double-quotes, as required by the HOCON syntax:
dsbulk load -url '"C:\\Users\\My Folder"'
However, when the expected type of an option is a string, the surrounding double-quotes can be omitted for convenience, which is why the first example has none. Similarly, when an argument is a list, the surrounding square brackets can be omitted, making the following two lines equivalent:
dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'
The same applies to arguments of type map: the surrounding curly braces can be omitted, making the following two lines equivalent:
dsbulk load --connector.json.deserializationFeatures '{ USE_BIG_DECIMAL_FOR_FLOATS : true }'
dsbulk load --connector.json.deserializationFeatures 'USE_BIG_DECIMAL_FOR_FLOATS : true'

This syntactic sugar is only available for command line arguments of type string, list, or map. All other option types, as well as all options specified in a configuration file, must be fully compliant with HOCON syntax, and it is the user's responsibility to ensure that such options are properly escaped and quoted.
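
To see why the examples above wrap each value in single quotes, the following sketch prints the arguments exactly as a POSIX shell would hand them to dsbulk. It does not call dsbulk, and the variable names delim_arg and url_arg are illustrative only.

```shell
# Single quotes stop the shell from interpreting backslashes and
# double quotes, so the HOCON text reaches the program unchanged.
delim_arg='\t'
url_arg='"C:\\Users\\My Folder"'

# printf with %s prints its argument literally (no escape processing).
printf '%s\n' "$delim_arg"
printf '%s\n' "$url_arg"
```

The first line of output is the two characters \t, and the second still contains the double quotes and the doubled backslashes. Unescaping is therefore left to the HOCON parser rather than the shell.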

Detection of write failures

In the Cassandra documentation, you may have encountered one or more of the following terms, all of which have the same meaning:
  • Lightweight Transactions (LWT), used in this topic
  • Compare-And-Set (CAS)
  • Paxos protocol
DataStax Bulk Loader detects any failures due to failed LWT write operations. In 1.3.2 or later, records that could not be inserted are shown in two files:
  • paxos.bad is the "bad file" devoted to LWT write failures.
  • paxos-errors.log is the debug file devoted to LWT write failures.
DataStax Bulk Loader also writes any failed records to one of the following files in the operation's directory, depending on when the failure occurred:
  • If the failure occurred while parsing data, the records are written to connector.bad.
  • If the failure occurred while mapping data to DSE or DDAC database fields, the records are written to mapping.bad.
  • If the failure occurred while inserting data into DSE or DDAC database tables, the records are written to load.bad.
The operation's directory is the logs subdirectory under the location from which you ran the dsbulk command.
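
After a run, a small helper like the following sketch can summarize which of the bad files listed above are present and non-empty. The function name report_bad_files is hypothetical, and the line count is only an approximation of the number of failed records, because it assumes one record per line.

```shell
# Report which "bad files" exist and are non-empty in a directory.
# Usage: report_bad_files DIR
report_bad_files() {
  dir=$1
  for name in connector.bad mapping.bad load.bad paxos.bad; do
    # -s is true only when the file exists and has a size greater than zero.
    if [ -s "$dir/$name" ]; then
      # Count lines; tr strips the padding some wc implementations add.
      count=$(wc -l < "$dir/$name" | tr -d ' ')
      echo "$name: $count"
    fi
  done
}
```

For example, report_bad_files logs inspects a logs directory under the current location. Pass whichever directory holds the operation's files in your setup.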