Getting Started

Getting Started with dsbulk.

This guide demonstrates the key features of dsbulk to help a new user get started.


Procedure

Simple loading without a configuration file

  1. Loading CSV data with a dsbulk load command:

    Specify two hosts (initial contact points) that belong to the desired cluster, and load the local file export.csv, which includes a header row, into keyspace ks1 and table table1:

    dsbulk load -url export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header true
    The url option can designate the path to a resource, such as a local file, or a web URL from which to read or write data.
    Specify an external source of data, as well as a port for the cluster hosts:
    dsbulk load -url https://svr/data/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -port 9876

    Load CSV data from stdin as it is generated by a script, generate_data (a hypothetical sketch of such a script follows this step). The data is loaded into keyspace ks1 and table table1 in a cluster with a localhost contact point (the default when no hosts are defined). Unless specified otherwise, field names are read from a header row in the input file.

    generate_data | dsbulk load -url stdin:/ -k ks1 -t table1
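
    The generate_data script is not part of dsbulk; any program that writes CSV (including a header row) to stdout will work. A minimal, hypothetical sketch in bash:

    #!/usr/bin/env bash
    # generate_data: emit a CSV header followed by sample rows on stdout.
    # Placeholder data only; substitute your real data source.
    echo "name,age,email"
    for i in 1 2 3; do
      echo "user${i},$((20 + i)),user${i}@example.com"
    done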

Simple unloading without a configuration file

  1. Unloading CSV data with a dsbulk unload command:
    Specify the external file to which the data from keyspace ks1 and table table1 is written:
    dsbulk unload -url myData.csv -k ks1 -t table1
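
    If no -url is given, dsbulk writes unloaded rows to stdout by default (check the documentation for your dsbulk version), so the output can be redirected or piped:

    dsbulk unload -k ks1 -t table1 > table1_export.csv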

Creating a configuration file

  1. A configuration file that sets values for dsbulk is written in a simple format, one option per line:
    ############ MyConfFile.conf ############
    
    dsbulk {
       # The name of the connector to use
       connector.name = "csv"
       # CSV field delimiter
       connector.csv.delimiter = "|"
       # The keyspace to connect to
       schema.keyspace = "myKeyspace"
       # The table to connect to
       schema.table = "myTable"
       # The field-to-column mapping
       schema.mapping = "0=name, 1=age, 2=email" 
    }
    Tip: Settings in the configuration file always start with the dsbulk prefix, while on the command line this prefix must be omitted. To avoid confusion, configuration files are formatted with the equivalent Human-Optimized Config Object Notation (HOCON) syntax: dsbulk { connector.name = "csv" ... }. For information about this syntax, refer to the HOCON specification.
    To use the configuration file, specify -f filename, where filename is the configuration file:
    dsbulk load -f myConfFile.conf -url export.csv -k ks1 -t table1
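
    For comparison, the same settings can be passed directly on the command line, without the dsbulk prefix. The shortcuts -c, -delim, and -m stand for connector.name, connector.csv.delimiter, and schema.mapping, respectively (run dsbulk help to confirm the shortcuts available in your version):

    dsbulk load -url export.csv -c csv -delim '|' -k myKeyspace -t myTable -m '0=name, 1=age, 2=email'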

Using SSL with dsbulk

  1. To use SSL with dsbulk, first refer to the DSE Security docs to set up SSL. The SSL options can be specified on the command line, but given the long option names, a configuration file is usually more convenient:
    dsbulk {
       driver.ssl.keystore.password = cassandra
       driver.ssl.keystore.path = "/Users/johndoe/tmp/ssl/keystore.node0"
       driver.ssl.provider = OpenSSL
       driver.ssl.truststore.password = dserocks
       driver.ssl.truststore.path = "/Users/johndoe/tmp/ssl/truststore.node0"
    }
    The command is:
    dsbulk load -f mySSLFile.conf -url file1.csv -k ks1 -t table1
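
    The same settings can also be given as long options on the command line, with the dsbulk prefix omitted as noted in the tip above (the paths and passwords are the example values from the file):

    dsbulk load -url file1.csv -k ks1 -t table1 \
      --driver.ssl.provider OpenSSL \
      --driver.ssl.keystore.path /Users/johndoe/tmp/ssl/keystore.node0 \
      --driver.ssl.keystore.password cassandra \
      --driver.ssl.truststore.path /Users/johndoe/tmp/ssl/truststore.node0 \
      --driver.ssl.truststore.password dserocks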

Printing cluster information

  1. When you enable verbose logging with --log.verbosity 2 on the dsbulk command, DataStax Bulk Loader prints basic information about the associated cluster. These data points often help when investigating load or unload issues.
    The output log format is:
    Partitioner: name-of-partitioner
    Total number of hosts: number
    DataCenters: list-of-datacenter-names
    Hosts: list-of-hosts, formatted as follows: address, dseVersion, cassandraVersion, dataCenter
    The output is sorted by ascending IP address. If the cluster comprises more than 100 nodes, the remaining nodes are not listed; instead, DataStax Bulk Loader prints (Other nodes omitted).
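
    For example, the following command enables verbose logging on the earlier load; the cluster information that follows illustrates the format above, with entirely hypothetical values:

    dsbulk load -url export.csv -k ks1 -t table1 --log.verbosity 2

    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Total number of hosts: 3
    DataCenters: dc1
    Hosts: 10.200.1.3, 6.8.26, 3.11.5, dc1
           10.200.1.4, 6.8.26, 3.11.5, dc1
           10.200.1.5, 6.8.26, 3.11.5, dc1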

    Log messages are written only to the main log file, operation.log, and to standard error; nothing is printed to stdout. For information about log levels, refer to Logging Options.