Load data
Loading is the process of using the dsbulk load command to write data from a CSV or JSON file into a database table.
The load and unload commands share many of the same options.
When loading from a CSV file, column names in the CSV data map to column names in the table. When loading from a JSON file, field names in the JSON data map to column names in the table.
Load locally without authentication
At minimum, you must specify the file (-url) you want to import, the keyspace (-k, --schema.keyspace), and the table (-t, --schema.table) where you want to load the data.
The minimal command loads data to a local cluster (localhost contact point) without authentication:
dsbulk load -url filename.csv -k ks1 -t table1
Load with authentication
Pass the relevant options with your load commands if your cluster requires authentication or uses SSL encryption.
You can pass the values directly or use a configuration file.
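Instead of repeating connection settings on every command, you can collect them in a configuration file and pass it with -f. The following is a minimal sketch; the file name is illustrative, and the settings shown assume DSBulk's standard HOCON configuration keys:
# myconfig.conf (hypothetical file name): HOCON settings read by DSBulk
dsbulk {
  schema.keyspace = "ks1"
  schema.table = "table1"
  connector.csv.url = "filename.csv"
}
Then reference the file when you run the load:
dsbulk load -f myconfig.conf -u username -p password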
For DSE, HCD, and Cassandra, if your database requires authentication, provide the username and password using the -u and -p options, respectively:
dsbulk load -url filename.csv -k ks1 -t table1 \
-u username -p password
If the cluster is remote, include driver options such as the contact points (-h) and port (-port):
dsbulk load -url filename.csv -k ks1 -t table1 \
-u username -p password -h '10.200.1.3, 10.200.1.4' -port 9876
If your cluster has both authentication and SSL enabled, pass -u, -p, and the SSL options.
For example:
dsbulk load -h '["fe80::f861:3eff:fe1d:9d7a"]' -u myaccount -p mypassword \
--driver.auth.provider DsePlainTextAuthProvider \
--driver.ssl.provider JDK \
--driver.ssl.keystore.path /etc/dse/keystores/client.keystore \
--driver.ssl.keystore.password mysslkspass \
--driver.ssl.truststore.path /etc/dse/keystores/client.truststore \
--driver.ssl.truststore.password myssltrustpass \
-k mykeyspace -t mytable -url filename.csv
For Astra DB, connect using the database’s Secure Connect Bundle (SCB) and an application token:
dsbulk load -url filename.csv -k ks1 -t table1 \
-b "path/to/SCB.zip" -u token -p AstraCS:...
The expected connection credentials are:
- -b: Provide the path to the database’s SCB zip file. The SCB includes certificates and key files for SSL-encrypted connections as well as information about the database’s contact points.
- -u: Set to the literal string token.
- -p: Provide an application token. DataStax recommends using secure references to tokens, such as environment variables, rather than specifying them directly on the command line.
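As a sketch of that recommendation, export the token to an environment variable (the variable name here is illustrative) and reference it in the command instead of pasting the token inline:
export ASTRA_DB_TOKEN="AstraCS:..."
dsbulk load -url filename.csv -k ks1 -t table1 \
-b "path/to/SCB.zip" -u token -p "$ASTRA_DB_TOKEN"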
Set the load source
Use the -url option to specify the path and filename from which you want to load data.
This can be a path on the local file system or a valid URL.
You can specify a filename only, path only, or both filename and path.
- Load a local file stored relative to your DSBulk installation directory:
dsbulk load -url myData.csv -k ks1 -t table1
When loading a local file, the -url value expands to the long form file://hostname/path/to/file. For example, loading from the tmp directory on localhost would be file:///tmp/export.csv.
- Load a file from a different directory on the local file system:
dsbulk load -url ~/data-export/myData.csv -k ks1 -t table1
- Load remote files by providing a valid URL, such as an S3 path or a remote hostname and path, for example -url https://hostname/path/to/filename.csv:
dsbulk load -url https://192.168.1.100/data/export.csv -h '10.200.1.3,10.200.1.4'
Load multiple files
- Load multiple files from a list of URLs: Specify a file that contains a list of well-formed URLs to CSV or JSON data files, and DSBulk loads all of those files:
dsbulk load --connector.json.urlfile "multiple-input-data-urls.txt" -k ks1 -t table1
- Load all files from a directory:
dsbulk load -url ~/export-dir -k ks1 -t table1
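If the export directory contains nested subdirectories, you may be able to include them with the CSV connector's recursive setting. This is a sketch only; the --connector.csv.recursive option name is an assumption about the connector settings rather than something shown above:
dsbulk load -url ~/export-dir -k ks1 -t table1 --connector.csv.recursive true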
Load CSV data from stdin
You can use -url stdin:/ to load CSV data from stdin as it is generated by a script.
This example runs a script called generate_data and pipes its output to a dsbulk load command:
generate_data | dsbulk load -url stdin:/ -k ks1 -t table1
stdin is also the default import location if -url is omitted.
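Because stdin is the default import location, the same pipeline can also be written without the -url option:
generate_data | dsbulk load -k ks1 -t table1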
Load a compressed file
There are two ways to load compressed files:
- For supported file types, use the --connector.csv|json.compression option to specify the compression type, which lets DSBulk decompress the file as it is loaded (see the sketch after this list).
- Extract the compressed file to stdout, and then pipe it from stdin to your dsbulk load command:
gzcat table1.csv.gz | dsbulk load -k ks1 -t table1
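The following is a minimal sketch of the first approach for a gzip-compressed CSV file; the file name is illustrative, and it assumes gzip is among the compression types the CSV connector supports:
dsbulk load -url table1.csv.gz -k ks1 -t table1 --connector.csv.compression gzip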
Set headers or mappings
By default, DSBulk attempts to read the column or field names from a header row in the input file.
Use -header true|false to indicate whether the first row of a CSV file contains column names.
Use -m (--schema.mapping) to map field names or indices in the input file to column names in the table:
- Map CSV columns to table columns:
dsbulk load -url export.csv -k ks1 -t table1 -header false -m '0=col1,1=col3'
- Map JSON fields to table columns.
The following example also shows how a function call can be used to apply a function to one or more fields before mapping the result to a single column named odd_column.
dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --schema.mapping "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length, petal_width = petal_width, species = species, id = id, max(sepal_length, petal_length) = odd_column"
Set a delimiter
Use -delim to specify a field delimiter character, such as '\t' for Tab:
dsbulk load -k ks1 -t table1 -url export.csv -delim '\t'
Set escaping
Use -escape to set handling for escaped characters.
The following example uses -escape '\"' to indicate that double quotes within fields are escaped with another double quote, as in the line "f1","value with ""quotes"" and more" in the CSV file:
dsbulk load -url ~/export.csv -k ks1 -t table1 -escape '\"'
Set TTL
For more complex handling, you can pass a CQL INSERT statement in the -query option.
A valuable way to use this option is to set time-to-live (TTL) values for the data you are loading.
For example:
dsbulk load -query \
"INSERT INTO dsbulkblog.iris_with_id
(sepal_length,sepal_width,petal_length,petal_width,species,id)
VALUES (:sepal_length,:sepal_width,:petal_length,:petal_width,:species,:id)
USING TTL :ttl_to_use"
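To spot-check that the TTL was applied, you can query the remaining TTL of a regular (non-primary-key) column in cqlsh. This sketch assumes the table and columns from the example above:
SELECT id, species, TTL(species) FROM dsbulkblog.iris_with_id LIMIT 5;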
Load CSV data into a graph vertex label table
For DSE Graph tables, pass graph and vertex options with your load command.
The following example loads graph data into the graph graph1 and the vertex label table vertex_label1:
dsbulk load -url ~/data/vertices/person.csv -g graph1 -v vertex_label1 \
-delim '|' -header true --schema.allowMissingFields true
Load collections
DSBulk includes a helpful simplification when loading collection data (the map, list, and set data types).
Collections inside a CSV file can contain valid JSON or a simpler, non-compliant JSON format.
For example, given the following table:
CREATE TABLE t1 (col1 frozen<set<int>> PRIMARY KEY, col2 list<int>, col3 map<text,int>);
The following pipe-delimited CSV file contains valid JSON, and you can load it into the table:
col1|col2|col3
[1,2,3]|[1,2,3]|{"key1":1,"key2":2}
The same CSV data can also be loaded in the following format, without the square brackets and curly braces:
col1|col2|col3
1,2,3|1,2,3|"key1":1,"key2":2
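As a sketch, assuming either version of the file is saved as /tmp/collections.csv (an illustrative path) and that table t1 was created in keyspace ks1, the load command only needs to specify the pipe delimiter:
dsbulk load -url /tmp/collections.csv -k ks1 -t t1 -delim '|'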
This simplification doesn’t work for nested collections. If a column’s type is itself a collection of collections, its values must be written as valid JSON; the simplified format can’t represent the nesting.
Load data into static and non-static columns
Use these steps to load database tables that contain static and non-static columns:
- Create a dataset containing only the partition key columns and the static columns. For example, create a keyspace and a table:
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE demo.mytable (
  pk1 text,
  pk2 text,
  cc1 text,
  cc2 text,
  static1 text static,
  static2 text static,
  nonstatic1 text,
  nonstatic2 text,
  PRIMARY KEY ((pk1, pk2), cc1, cc2)
);
- Create a file containing the static data to load. For example, static-data.csv:
pk1,pk2,static1,static2
a,b,y,z
By default, DataStax Bulk Loader treats the first row of the file as a CSV header.
- Load the static data with DataStax Bulk Loader using the -query parameter and the following format:
-query "INSERT INTO t1 (pk1, pk2, ..., static1, static2, ...)"
For example:
dsbulk load -k demo -url static-data.csv \
-query "INSERT INTO demo.mytable (pk1,pk2,static1,static2) \
VALUES (:pk1,:pk2,:static1,:static2)"
- Create another dataset containing the non-static data, including the entire primary key (partition keys and clustering columns) and the regular columns. For example, non-static-data.csv:
pk1,pk2,cc1,cc2,nonstatic1,nonstatic2
a,b,c,d,e,f
- Load the non-static data using the -query parameter and the following format:
-query "INSERT INTO t1 (pk1, pk2, ..., cc1, cc2, ..., nonstatic1, nonstatic2, ...)"
For example:
dsbulk load -k demo -url non-static-data.csv \
-query "INSERT INTO demo.mytable (pk1,pk2,cc1,cc2,nonstatic1,nonstatic2) \
VALUES (:pk1,:pk2,:cc1,:cc2,:nonstatic1,:nonstatic2)"
- With the data loading operations complete, you can query the table and retrieve both the static and non-static data:
SELECT * FROM demo.mytable;
Result:
 pk1 | pk2 | cc1 | cc2 | static1 | static2 | nonstatic1 | nonstatic2
-----+-----+-----+-----+---------+---------+------------+------------
   a |   b |   c |   d |       y |       z |          e |          f