Getting Started


To help you get started, this topic describes the key features of DataStax Bulk Loader and its dsbulk command.


Procedure

Simple loading without a configuration file

  1. Loading CSV or JSON data with a dsbulk load command:

    Specify two hosts (initial contact points) that belong to the desired cluster and load from a local file export.csv with headers into keyspace ks1 and table table1:

    dsbulk load -url export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header true
    The -url option designates the path to a resource, such as a local file or a web URL, from which data is read (load) or to which it is written (unload).
    Specify an external source of data, as well as a port for the cluster hosts:
    dsbulk load -url https://svr/data/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -port 9876
    Specify a file that contains a list of well-formed URLs for the CSV or JSON data files to load:
    dsbulk load --connector.json.urlfile "my/local/multiple-input-data-urls.txt" -k ks1 -t table1 -h '10.200.1.3'

    Load CSV data from stdin as it is generated by a script, generate_data. The data is loaded into keyspace ks1 and table table1 in a cluster with a localhost contact point (the default when no hosts are specified). By default, field names are read from a header row in the input file.

    generate_data | dsbulk load -url stdin:/ -k ks1 -t table1
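    As an illustration, generate_data could be any program that writes header-first CSV to stdout. A minimal sketch (the script name and columns are hypothetical, not part of dsbulk):

```shell
# Hypothetical generate_data: writes a CSV header row, then data rows, to stdout.
# Anything producing this shape can be piped into "dsbulk load -url stdin:/ ...".
generate_data() {
  printf 'id,name\n'                 # header row (read as field names by default)
  for i in 1 2 3; do
    printf '%s,user%s\n' "$i" "$i"   # data rows
  done
}
generate_data
```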

Simple unloading without a configuration file

  1. Unloading CSV data with a dsbulk unload command:
    Specify the external file to which the data from keyspace ks1 and table table1 is written:
    dsbulk unload -url myData.csv -k ks1 -t table1

Creating a configuration file

  1. A configuration file sets dsbulk options in a simple format, one option per line:
    ############ myConfFile.conf ############
    
    dsbulk {
       # The name of the connector to use
       connector.name = "csv"
       # CSV field delimiter
       connector.csv.delimiter = "|"
       # The keyspace to connect to
       schema.keyspace = "myKeyspace"
       # The table to connect to
       schema.table = "myTable"
       # The field-to-column mapping
       schema.mapping = "0=name, 1=age, 2=email" 
       # Auth credentials
       driver.auth.username = "myAccount4"
       driver.auth.password = "dse#r0cks!"
    }
    Tip: Settings in the config file always start with the dsbulk prefix, while on the command line, this prefix must be omitted. To avoid confusion, configuration files are formatted with the following equivalent Human-Optimized Config Object Notation (HOCON) syntax: dsbulk { connector.name = "csv" ... }. For information about HOCON syntax, refer to this specification.
    Important: DataStax recommends specifying username and password credentials in a configuration file, as shown in the example above, instead of on the command line. When the credentials are supplied in a configuration file, they are not exposed in process listings such as ps -ef | grep dsbulk.
    To use the configuration file, specify -f filename, where filename is the configuration file:
    dsbulk load -f myConfFile.conf -url export.csv -k ks1 -t table1

Using SSL with dsbulk

  1. To use SSL with dsbulk, first refer to DSE Security docs to set up SSL. While the SSL options can be specified on the command line, using a configuration file is recommended. Also, as explained in the section above, be sure to enclose the configuration options in a dsbulk { … } block:
    dsbulk {
       ...
       driver.ssl.keystore.password = "cassandra"
       driver.ssl.keystore.path = "/Users/johndoe/tmp/ssl/keystore.node0"
       driver.ssl.provider = OpenSSL
       driver.ssl.truststore.password = "dse#r0cks!"
       driver.ssl.truststore.path = "/Users/johndoe/tmp/ssl/truststore.node0"
    }
    Tip: Enclose passwords that contain special characters in quotes. Example:
    driver.ssl.truststore.password = "dse#r0cks!"
    The command is:
    dsbulk load -f mySSLFile.conf -url file1.csv -k ks1 -t table1

Printing cluster information

  1. When you enable verbose logging by using --log.verbosity 2 on the dsbulk command, DataStax Bulk Loader prints basic information about the associated cluster. These data points often help when you are investigating any load or unload issues.
    The output log format is:
    Partitioner: name-of-partitioner
    Total number of hosts: number
    DataCenters: list-of-datacenter-names
    Hosts: list-of-hosts, formatted as follows: address, dseVersion, cassandraVersion, dataCenter
    The output is sorted by ascending IP address. If the cluster comprises more than 100 nodes, only the first 100 are listed; in place of the rest, DataStax Bulk Loader prints (Other nodes omitted).

    Log messages are written only to the main log file, operation.log, and to standard error; nothing from the log is printed to stdout. For information about log levels, refer to Logging Options.
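
    If you need to pull just these summary lines out of operation.log, for example when scripting around a run, a grep such as the following works. The log excerpt here is fabricated purely to illustrate the documented format; real values come from your cluster:

```shell
# Sketch: extract the cluster-summary lines from dsbulk's operation.log.
# The excerpt below is illustrative only, mirroring the format described above.
log=$(mktemp)
cat > "$log" <<'EOF'
Partitioner: Murmur3Partitioner
Total number of hosts: 3
DataCenters: dc1
Hosts: 10.200.1.3, 6.8.0, 3.11.3, dc1
EOF
grep -E '^(Partitioner|Total number of hosts|DataCenters|Hosts):' "$log"
```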

Counting data in DSE tables

  1. You can use the dsbulk count command to return information about the loaded data.
    For example, the following command returns information about the partition data used in the cycling.comments table. The results are organized as follows:
    1. Left column: partition key value
    2. Middle column: number of rows using that partition key value
    3. Right column: the partition's percentage of rows compared to the total number of rows in the table
    dsbulk count -k cycling -t comments --stats.modes partitions --stats.numPartitions 50
    Operation directory: /home/automaton/cycling/logs/COUNT_20190424-213840-954894
    total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms
       31 |      0 |     74 | 0.00 |   0.02 | 27.59 | 31.33 |  31.33
    Operation COUNT_20190424-213840-954894 completed successfully in 0 seconds.
    fb372533-eb95-4bb4-8685-6ef61e994caa 5 16.13
    8566eb59-07df-43b1-a21b-666a3c08c08a 4 12.90
    c7fceba0-c141-4207-9494-a29f9809de6f 4 12.90
    e7ae5cf3-d358-4d99-b900-85902fda9bb0 4 12.90
    6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 3 9.68
    9011d3be-d35c-4a8d-83f7-a3c543789ee7 2 6.45
    95addc4c-459e-4ed7-b4b5-472f19a67995 2 6.45
    38ab64b6-26cc-4de9-ab28-c257cf011659 2 6.45
    5b6962dd-3f90-4c93-8f61-eabfa4a803e2 1 3.23
    c4b65263-fe58-4846-83e8-f0e1c13d518f 1 3.23
    e7cd5752-bc0d-4157-a80f-7523add8dbcd 1 3.23
    6d5f1663-89c0-45fc-8cfd-60a373b01622 1 3.23
    220844bf-4860-49d6-9a4b-6b5d3a79cbfb 1 3.23
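    As a quick sanity check of the percentage column, each value is the partition's row count divided by the total row count (31 in the output above) times 100. For example, with awk:

```shell
# Recompute the right-hand percentage column from the per-partition row counts.
# Input lines are "partition-key rows"; output appends rows / total * 100,
# where total = 31 as in the count output above.
printf 'fb372533-eb95-4bb4-8685-6ef61e994caa 5\n6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 3\n' |
  awk '{ printf "%s %d %.2f\n", $1, $2, $2 / 31 * 100 }'
# prints ... 5 16.13 and ... 3 9.68, matching the output above
```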
    Additional options are provided with the dsbulk count command. Refer to Count options.