dsbulk

Bulk loading and unloading tool for DataStax Enterprise

The DataStax data loader dsbulk can be used both to load data into DataStax Enterprise from a variety of sources and to unload data from DataStax Enterprise for transfer, use, or storage elsewhere. The load and unload subcommands are straightforward: both require either the keyspace and table options or a schema.query, plus a data source.

A wide variety of options are also available to tailor how DataStax Bulk Loader operates. These options have defined default values, or values inferred from the input data (when loading) or from the database data (when unloading). The options described here are grouped functionally, so that additional requirements can be noted. For example, when loading or unloading CSV data, the connector.csv.url option must be set to the path or URL of the CSV data file.

The standalone tool is launched with the dsbulk command from the bin directory of your distribution. The tool also provides inline help for all settings. Option values can be supplied in a configuration file, on the command line, or both; options specified on the command line override the corresponding settings in the configuration file.
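For reference, a configuration file uses HOCON syntax and the same option names as the long-form command line options. A minimal sketch, assuming a CSV load into a hypothetical keyspace ks1 and table table1 (adjust paths and names to your environment):
# /tmp/dsbulk_load.conf -- illustrative only
dsbulk {
  connector.name = "csv"
  connector.csv.url = "/tmp/export.csv"
  schema.keyspace = "ks1"
  schema.table = "table1"
}
The file is then passed with the -f option, for example dsbulk load -f /tmp/dsbulk_load.conf.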

Synopsis

dsbulk ( load | unload | count ) [options]
  (( -k | --keyspace ) keyspace_name 
  ( -t | --table ) table_name) 
  | ( --schema.query string )
  [ help | --help ]

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).
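For example, the following two commands are equivalent; the first uses short forms and the second the corresponding long forms (ks1, table1, and export.csv are placeholder names):
dsbulk load -k ks1 -t table1 -url export.csv
dsbulk load --schema.keyspace ks1 --schema.table table1 --connector.csv.url export.csv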

Table 1. Legend
Syntax conventions Description
UPPERCASE Literal keyword.
Lowercase Not literal.
Italics Variable value. Replace with a valid option or user-defined value.
[ ] Optional. Square brackets ( [ ] ) surround optional command arguments. Do not type the square brackets.
( ) Group. Parentheses ( ( ) ) identify a group to choose from. Do not type the parentheses.
| Or. A vertical bar ( | ) separates alternative elements. Type any one of the elements. Do not type the vertical bar.
... Repeatable. An ellipsis ( ... ) indicates that you can repeat the syntax element as often as required.
'Literal string' Single quotation ( ' ) marks must surround literal strings in dsbulk statements. Use single quotation marks to preserve upper case.
{ key:value } Map collection. Braces ( { } ) enclose map collections or key value pairs. A colon separates the key and the value.
<datatype1,datatype2> Set, list, map, or tuple. Angle brackets ( < > ) enclose data types in a set, list, map, or tuple. Separate the data types with a comma.
[ -- ] Separate the command line options from the command arguments with two hyphens ( -- ). This syntax is useful when arguments might be mistaken for command line options.

General use

Get general help about dsbulk and the common options:
dsbulk help
Get help about particular dsbulk options, such as connector.csv options using the help subcommand:
dsbulk help connector.csv
Run dsbulk with -c csv and the --help option to see the short options for the csv connector, along with the general help:
dsbulk -c csv --help

Escaping and Quoting Command Line Arguments

When supplied via the command line, all option values are expected to be in valid HOCON syntax. In particular, control characters, the backslash character, and the double-quote character all need to be properly escaped. For example, \t is the escape sequence for the tab character, whereas \\ is the escape sequence for the backslash character:
dsbulk load -delim '\t' -url 'C:\\Users\\My Folder'
In general, string values containing special characters also need to be properly quoted with double-quotes, as required by the HOCON syntax:
dsbulk load -url '"C:\\Users\\My Folder"'
However, when the expected type of an option is a string, the surrounding double-quotes can be omitted for convenience, which is why the url value in the first example above is not double-quoted. Similarly, when an argument is a list, the surrounding square brackets can be omitted, making the following two lines equivalent:
dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'
The same applies to arguments of type map: the surrounding curly braces can be omitted, making the following two lines equivalent:
dsbulk load --connector.json.deserializationFeatures '{ USE_BIG_DECIMAL_FOR_FLOATS : true }'
dsbulk load --connector.json.deserializationFeatures 'USE_BIG_DECIMAL_FOR_FLOATS : true'

This syntactic sugar is only available for command line arguments of type string, list, or map. All other option types, as well as all options specified in a configuration file, must be fully compliant with HOCON syntax; it is the user's responsibility to ensure that such options are properly escaped and quoted.
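As an illustrative sketch of this difference, the null strings list from the earlier example may be abbreviated on the command line, but in a configuration file it must be written in full HOCON form, with the surrounding brackets (and with quotes where values contain special characters):
dsbulk {
  codec.nullStrings = [ "NIL", "NULL" ]
}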

Detection of CAS write failures

In the Cassandra documentation, you may have encountered one or more of the following terms, all of which have the same meaning:
  • Lightweight Transactions (LWT)
  • Compare-And-Set (CAS), used in this topic
  • Paxos protocol
DataStax Bulk Loader detects failures caused by unsuccessful CAS write operations. In DataStax Bulk Loader 1.3.2 and later, records that could not be inserted are written to two files:
  • paxos.bad is the "bad file" devoted to CAS write failures.
  • paxos-errors.log is the debug file devoted to CAS write failures.
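For illustration, a conditional insert such as the following performs CAS writes, and any records rejected because the condition was not met are reported in the two files above. The keyspace, table, and column names here are hypothetical:
dsbulk load -url export.csv -query 
  "INSERT INTO ks1.table1 (id, value) VALUES (:id, :value) IF NOT EXISTS"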

Loading data examples

Examples of loading data using dsbulk

Load CSV data read from stdin into table table1 in keyspace ks1:
dsbulk load -k ks1 -t table1
Use the configuration file /tmp/dsbulk_load.conf to load the file export.csv to table table1 in keyspace ks1:
dsbulk load -f /tmp/dsbulk_load.conf --connector.csv.url export.csv -k ks1 -t table1
Load the file export.csv to table table1 in keyspace ks1 using the short form option for url:
dsbulk load -url export.csv -k ks1 -t table1
Load the file export.csv to table table1 in keyspace ks1 using the short form option for url and the tab character as a field delimiter:
dsbulk load -k ks1 -t table1 -url export.csv -delim '\t'
Load the file /tmp/export.csv to table table1 in keyspace ks1 using the long form option for url:
dsbulk load --connector.csv.url file:///tmp/export.csv -k ks1 -t table1
Note: file:///tmp/export.csv is loaded from localhost, hence the empty host in the file:// designation.
Load table table1 in keyspace ks1 from a gzipped CSV file by unzipping it to stdout and piping to stdin of dsbulk:
gzcat table1.csv.gz | dsbulk load -k ks1 -t table1
Specify a few hosts (initial contact points) that belong to the desired cluster and load from a local file, without headers. Map field indices of the input to table columns with -m:
dsbulk load -url ~/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header false -m '0=col1,1=col3'
Specify port 9876 for the cluster hosts and load from an external source URL:
dsbulk load -url https://192.168.1.100/data/export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9876
Load all CSV files from a directory. The files do not have a header row, so -header false is specified. Map field indices of the input to table columns with -m:
dsbulk load -url ~/export-dir -k ks1 -t table1 -header false -m '0=col1,1=col3'
Load data using a CQL query and include TTL values from the input data:
dsbulk load -query 
  "INSERT INTO dsbulkblog.iris_with_id
    (sepal_length,sepal_width,petal_length,petal_width,species,id) 
    VALUES (:sepal_length,:sepal_width,:petal_length,:petal_width,:species,:id) 
    USING TTL :ttl_to_use" 
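For this query, each input record is expected to provide a ttl_to_use field. A hypothetical header row and data row for the input CSV might look like:
sepal_length,sepal_width,petal_length,petal_width,species,id,ttl_to_use
5.1,3.5,1.4,0.2,Iris-setosa,0,86400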
Load data using a field-to-column mapping. The example also shows how a function call can be applied to one or more fields, mapping the result to a single column named odd_column:
dsbulk load 
  -url /tmp/dsbulkblog/iris.csv 
  -k dsbulkblog 
  -t iris_with_id 
  --schema.mapping "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length, 
    petal_width = petal_width, species = species, id = id, max(sepal_length, petal_length) = odd_column"
Use the default port for the cluster hosts, with the keyspace, table, and mapping set in conf/application.conf:
dsbulk load -url https://192.168.1.100/data/export.csv -h '10.200.1.3,10.200.1.4'
Load table table1 in keyspace ks1 from a CSV file, where double-quote characters in fields are escaped with a double-quote; for example, "f1","value with ""quotes"" and more" is a line in the CSV file:
dsbulk load -url ~/export.csv -k ks1 -t table1 -escape '\"'
Loading collections to a table allows some helpful simplifications. Collections inside a CSV file can contain either valid JSON or simpler, non-compliant JSON. For example, for the following table:
CREATE TABLE t1 (col1 set<int> PRIMARY KEY, col2 list<int>, col3 map<text,int>);
This pipe-delimited CSV file contains valid JSON and could be loaded:
col1|col2|col3
[1,2,3]|[1,2,3]|{"key1":1,"key2":2}
or this similar CSV file could also be loaded:
col1|col2|col3
1,2,3|1,2,3|"key1":1,"key2":2
Notice that in the second file the surrounding brackets and braces of the valid JSON are omitted from all three columns.
Important: Simplification of the JSON data will not work for nested collections. If your table has a column col4 of type list<list<int>>, only the outermost structure can omit the surrounding characters. For example:
col1|col2|col3|col4
[1,2,3]|[1,2,3]|{"key1":1,"key2":2}|[[1,2,3],[4,5,6]]
can be simplified to:
col1|col2|col3|col4
1,2,3|1,2,3|"key1":1,"key2":2|[1,2,3],[4,5,6]
but the inner list items cannot be simplified further.
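As a sketch, a table with such a nested column could be defined as follows; note that in CQL the inner collection must be frozen (the table name t2 is hypothetical):
CREATE TABLE t2 (col1 set<int> PRIMARY KEY, col2 list<int>, col3 map<text,int>, col4 list<frozen<list<int>>>);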

Unloading data examples

Examples of unloading data using dsbulk

Unloading is the process of extracting data from DataStax Enterprise into a CSV or JSON file. Many options are common to loading and unloading.

Unload data to stdout from the ks1.table1 table in a cluster with a localhost contact point. Column names in the table map to field names in the data. Field names must be emitted in a header row in the output:
dsbulk unload -k ks1 -t table1
Unload data to stdout from the ks1.table1 table and gzip the result:
dsbulk unload -k ks1 -t table1 | gzip > table1.gz
Unload data to a local directory (which may not yet exist):
dsbulk unload -url ~/data-export -k ks1 -t table1
Unload and count data from a materialized view named health_data_mv to a local directory:
dsbulk unload count -k healthcare -t health_data_mv -url ~/export-mv
Note: In DataStax Bulk Loader 1.3.2 and later, you can use dsbulk unload count and specify a materialized view on the -t or --table option. Previously, you could only use dsbulk unload count with tables. For related information, refer to Creating a materialized view.
Unload data using a search index:
dsbulk unload -query 
  "SELECT id, petal_length, petal_width, sepal_length, sepal_width, species 
    FROM dsbulkblog.iris_with_id 
    WHERE solr_query = '{\\\"q\\\": \\\"species:Iris-setosa\\\"}'"
 --executor.continuousPaging.enabled false
where a search index exists for the table iris_with_id. Continuous paging must be disabled because DSE Search does not work with that feature. The double-quotes inside the solr_query string must be double-escaped.
Restriction: Faceted queries are not currently supported.
Unload data using a CQL query with function calls:
dsbulk unload -query 
  "SELECT id, species, writetime(species) AS TIMESTAMP, ttl(species) AS TTL FROM dsbulkblog.iris_with_id" 

Exit codes

Exit codes for dsbulk

The dsbulk command returns an exit code to the calling process. The following table maps each integer value to its status:
Integer value Status value
0 STATUS_OK
1 STATUS_COMPLETED_WITH_ERRORS
2 STATUS_ABORTED_TOO_MANY_ERRORS
3 STATUS_ABORTED_FATAL_ERROR
4 STATUS_INTERRUPTED
5 STATUS_CRASHED
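Because these codes are returned to the calling process, a wrapper script can branch on them. A minimal sketch using a load command from the earlier examples:
dsbulk load -k ks1 -t table1 -url export.csv
ret=$?
if [ "$ret" -ne 0 ]; then
  echo "dsbulk exited with status $ret" >&2
  exit "$ret"
fi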