dsbulk
DataStax Bulk Loader provides the dsbulk command for loading, unloading, and counting data to or from:

- Hyper-Converged Database (HCD) 1.0 databases
- DataStax Enterprise (DSE) 5.1, 6.8, and 6.9 databases
- Open source Apache Cassandra® 2.1 and later databases
The three subcommands, load, unload, and count, are straightforward. Each subcommand requires either the keyspace and table options, or a schema.query. The load and unload subcommands also require a designated data source (CSV or JSON).
A wide variety of options are also available to help you tailor how DataStax Bulk Loader operates. These options have default values, or values inferred from the input data (when loading) or from the database data (when unloading). The options described here are grouped functionally, so that additional requirements can be noted. For example, when loading or unloading CSV data, the connector.csv.url option must be set to the path or URL of the CSV data file.
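As a sketch of how these required pieces fit together, a minimal CSV load and the corresponding unload might look like the following; the keyspace, table, and file paths are placeholders:

```shell
# Load rows from a local CSV file into ks1.table1.
# -url is the short form of connector.csv.url; ks1, table1,
# and export.csv are placeholder names for this example.
dsbulk load -url export.csv -k ks1 -t table1

# The reverse operation: unload the same table to CSV files
# in the /tmp/export directory.
dsbulk unload -url /tmp/export -k ks1 -t table1
```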
The standalone tool is launched with the dsbulk command from the bin directory of your distribution. The tool also provides inline help for all settings. You can specify option values in a configuration file or on the command line; options specified on the command line override the configuration file settings.
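For instance, assuming a HOCON settings file named my-settings.conf (the file name and option values here are illustrative), a command-line option can override a value from the file:

```shell
# my-settings.conf (HOCON) might contain:
#   dsbulk {
#     connector.csv.url = "export.csv"
#     schema.keyspace = "ks1"
#     schema.table = "table1"
#   }

# -f loads the settings file; the -delim value given on the
# command line overrides any delimiter set in the file.
dsbulk load -f my-settings.conf -delim '|'
```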
Synopsis
dsbulk ( load | unload | count ) [options]
    ( ( -k | --keyspace ) keyspace_name ( -t | --table ) table_name )
    | ( --schema.query string )
    [ help | --help ]
| Syntax conventions | Description |
|---|---|
| *Italics* | Variable value. Replace with a user-defined value. |
| `[ ]` | Optional. Square brackets surround optional command arguments. Do not type the square brackets. |
| `( )` | Group. Parentheses identify a group to choose from. Do not type the parentheses. |
| `\|` | Or. A vertical bar separates alternative elements. Type any one of the elements. Do not type the vertical bar. |
| `--` | Separate the command line options from the command arguments with two hyphens. |
General use
Get general help about dsbulk and the common options:
dsbulk help
Get help about particular dsbulk options, such as the connector.csv options, using the help subcommand:
dsbulk help connector.csv
Run dsbulk -c csv with the --help option to see its short options, along with the general help:
dsbulk -c csv --help
Display the version number:
dsbulk --version
Escaping and Quoting Command Line Arguments
When supplied via the command line, all option values are expected to be in valid HOCON syntax. In particular, control characters, the backslash character, and the double-quote character all need to be properly escaped. For example, \t is the escape sequence for the tab character, whereas \\ is the escape sequence for the backslash character:
dsbulk load -delim '\t' -url 'C:\\Users\\My Folder'
In general, string values containing special characters also need to be properly quoted with double-quotes, as required by the HOCON syntax:
dsbulk load -url '"C:\\Users\\My Folder"'
However, when the expected type of an option is a string, the surrounding double-quotes can be omitted for convenience; note their absence in the -delim example above. Similarly, when an argument is a list, the surrounding square brackets can be omitted, making the following two lines equivalent:
dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'
The same applies for arguments of type map: it is possible to omit the surrounding curly braces, making the following two lines equivalent:
dsbulk load --connector.json.deserializationFeatures '{ USE_BIG_DECIMAL_FOR_FLOATS : true }'
dsbulk load --connector.json.deserializationFeatures 'USE_BIG_DECIMAL_FOR_FLOATS : true'
This syntactic sugar is only available for command line arguments of type string, list, or map. All other option types, as well as all options specified in a configuration file, must be fully compliant with HOCON syntax, and it is the user’s responsibility to ensure that such options are properly escaped and quoted.
Detection of write failures
In the Cassandra documentation, you may have encountered one or more of the following terms, all of which have the same meaning:

- Lightweight Transactions (LWT), used in this topic
- Compare-And-Set (CAS)
- Paxos protocol
DataStax Bulk Loader detects any failures due to failed LWT write operations. In version 1.3.2 and later, records that could not be inserted are recorded in two files:

- paxos.bad is the "bad file" devoted to LWT write failures.
- paxos-errors.log is the debug file devoted to LWT write failures.
DataStax Bulk Loader also writes any failed records to one of the following files in the operation’s directory, depending on when the failure occurred. If the failure occurred while:

- parsing data, the records are written to connector.bad.
- mapping data to the supported DSE, DataStax Astra, or Apache Cassandra databases, the records are written to mapping.bad.
- inserting data into any of those supported databases, the records are written to load.bad.
The operation’s directory is the logs subdirectory under the location from which you ran the dsbulk command.
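To locate failed records after a run, you can list the operation’s subdirectory under logs. The directory name below is purely illustrative, since dsbulk generates a unique name per operation:

```shell
# Each run creates its own subdirectory under logs/;
# the name shown here is hypothetical.
ls logs/LOAD_20240101-120000-000000
# Typical contents: operation.log plus, when failures occurred,
# connector.bad, mapping.bad, or load.bad.
```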