Loading data examples

This topic presents examples of using the dsbulk load command to write CSV or JSON data to a database table.

Databases supported by DataStax Bulk Loader

DataStax Bulk Loader® supports the use of the dsbulk load, dsbulk unload, and dsbulk count commands with:

  • DataStax Astra DB

  • DataStax Enterprise (DSE) 5.1 and 6.8 databases

  • Open source Apache Cassandra® 2.1 and later databases

dsbulk load examples

Load CSV data read from stdin into table table1 in keyspace ks1:

dsbulk load -k ks1 -t table1

Use the configuration file /tmp/dsbulk_load.conf to load the file export.csv to table table1 in keyspace ks1:

dsbulk load -f /tmp/dsbulk_load.conf --connector.csv.url export.csv -k ks1 -t table1

Load the file export.csv to table table1 in keyspace ks1 using the short form option for url:

dsbulk load -url export.csv -k ks1 -t table1

Load the file export.csv to table table1 in keyspace ks1 using the short form option for url and the tab character as a field delimiter:

dsbulk load -k ks1 -t table1 -url export.csv -delim '\t'

Load the file /tmp/export.csv to table table1 in keyspace ks1 using the long form option for url:

dsbulk load --connector.csv.url file:///tmp/export.csv -k ks1 -t table1

The URL file:///tmp/export.csv loads from the local host, hence the empty host component in the file:// designation.

Load table table1 in keyspace ks1 from a gzipped CSV file by unzipping it to stdout and piping to stdin of dsbulk:

gzcat table1.csv.gz | dsbulk load -k ks1 -t table1
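
If gzcat is not available on your system, gunzip -c is an equivalent way to decompress to stdout (a minimal alternative, assuming the same file name):

gunzip -c table1.csv.gz | dsbulk load -k ks1 -t table1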

Specify a few hosts (initial contact points) that belong to the desired cluster and load from a local file, without headers. Map field indices of the input to table columns with -m:

dsbulk load -url ~/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header false -m '0=col1,1=col3'
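
For reference, with -header false and the mapping '0=col1,1=col3', a headerless input row such as the following (hypothetical values) writes its first field to col1 and its second field to col3:

10,hello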

Specify port 9876 for the cluster hosts and load from an external source URL:

dsbulk load -url https://192.168.1.100/data/export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9876

Load all CSV files from a directory. The files do not have a header row, so pass -header false. Map field indices of the input to table columns with -m:

dsbulk load -url ~/export-dir -k ks1 -t table1 -header false -m '0=col1,1=col3'

Specify a file that contains a list of multiple, well-formed URLs for the CSV or JSON data files to load:

dsbulk load --connector.json.urlfile "my/local/multiple-input-data-urls.txt" -k ks1 -t table1 -h '10.200.1.3'
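
The URL file itself is plain text with one URL per line. A hypothetical example of its contents:

https://192.168.1.100/data/export1.json
https://192.168.1.100/data/export2.json
file:///tmp/export3.json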

Load data using a CQL query and include TTL values from the input data:

dsbulk load -query \
  "INSERT INTO dsbulkblog.iris_with_id
    (sepal_length,sepal_width,petal_length,petal_width,species,id)
    VALUES (:sepal_length,:sepal_width,:petal_length,:petal_width,:species,:id)
    USING TTL :ttl_to_use"
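
For the TTL to be applied, the input data must include a field named ttl_to_use containing the TTL in seconds. A hypothetical CSV input consistent with the query above (values are illustrative):

sepal_length,sepal_width,petal_length,petal_width,species,id,ttl_to_use
5.1,3.5,1.4,0.2,Iris-setosa,0,86400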

Load data using a field-to-column mapping. The example also shows how a function call can be applied to one or more fields before mapping the result to a single column named odd_column:

dsbulk load \
  -url /tmp/dsbulkblog/iris.csv \
  -k dsbulkblog \
  -t iris_with_id \
  --schema.mapping "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length,
    petal_width = petal_width, species = species, id = id, max(sepal_length, petal_length) = odd_column"
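
The mapping above implies that the target table has a column named odd_column to receive the result of the max() function call. A hypothetical schema consistent with this mapping (column names from the example, types assumed):

CREATE TABLE dsbulkblog.iris_with_id (
  id int PRIMARY KEY,
  sepal_length double,
  sepal_width double,
  petal_length double,
  petal_width double,
  species text,
  odd_column double
);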

Load from an external source URL, with the default port for the cluster hosts and with the keyspace, table, and mapping set in conf/application.conf:

dsbulk load -url https://192.168.1.100/data/export.csv -h '10.200.1.3,10.200.1.4'

Load table table1 in keyspace ks1 from a CSV file, where double-quote characters in fields are escaped with a double-quote; for example, "f1","value with ""quotes"" and more" is a line in the CSV file:

dsbulk load -url ~/export.csv -k ks1 -t table1 -escape '\"'

Loading collections into a table supports a helpful simplification: collections inside a CSV file can contain either valid JSON or a simpler, non-compliant JSON format. For example, consider the following table:

CREATE TABLE t1 (col1 frozen<set<int>> PRIMARY KEY, col2 list<int>, col3 map<text,int>);

This pipe-delimited CSV file contains valid JSON and could be loaded:

col1|col2|col3
[1,2,3]|[1,2,3]|{"key1":1,"key2":2}

Or, you could load the same data in simplified form:

col1|col2|col3
1,2,3|1,2,3|"key1":1,"key2":2

Notice that in the simplified form, the surrounding brackets and braces are omitted from each column.
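
In either form, the pipe delimiter must be specified when loading. A minimal sketch, assuming the data is saved as /tmp/t1.csv and that table t1 lives in keyspace ks1:

dsbulk load -url /tmp/t1.csv -k ks1 -t t1 -delim '|'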

This simplification does not extend fully to nested collections: if your table has a column col4 of type list<list<int>>, only the outermost structure can omit the surrounding characters.

For example:

col1|col2|col3|col4
[1,2,3]|[1,2,3]|{"key1":1,"key2":2}|[[1,2,3],[4,5,6]]

You can simplify the prior example, as follows:

col1|col2|col3|col4
1,2,3|1,2,3|"key1":1,"key2":2|[1,2,3],[4,5,6]

However, the inner-list items cannot be simplified further.
