Load data
Loading is the process of using the dsbulk load command to write data from a CSV or JSON file into a database table.
The load and unload commands share many of the same options.
When loading from a CSV file, column names in the CSV data map to column names in the table. When loading from a JSON file, field names in the JSON data map to column names in the table.
Load locally without authentication
At minimum, you must specify the file (-url) you want to import, the keyspace (-k, --schema.keyspace), and the table (-t, --schema.table) where you want to load the data.
The minimal command loads data to a local cluster (localhost contact point) without authentication:
dsbulk load -url filename.csv -k ks1 -t table1
Load with authentication
Pass the relevant options with your load commands if your cluster requires authentication or uses SSL encryption.
You can pass the values directly or use a configuration file.
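Instead of repeating connection settings on every command, you can collect them in a configuration file and pass it with -f. The following is a minimal sketch; the file name is illustrative, and the settings shown assume DSBulk's standard HOCON configuration keys:
# myconfig.conf (hypothetical file name): HOCON settings read by DSBulk
dsbulk {
  schema.keyspace = "ks1"
  schema.table = "table1"
  connector.csv.url = "filename.csv"
}
Then reference the file when you run the load:
dsbulk load -f myconfig.conf -u username -p password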
For DSE, HCD, and Cassandra, if your database requires authentication, provide the username and password using the -u and -p options, respectively:
dsbulk load -url filename.csv -k ks1 -t table1 \
-u username -p password
If the cluster is remote, include driver options such as the contact points (-h) and port (-port):
dsbulk load -url filename.csv -k ks1 -t table1 \
-u username -p password -h '10.200.1.3, 10.200.1.4' -port 9876
If your cluster has both authentication and SSL enabled, pass -u, -p, and the SSL options.
For example:
dsbulk load -h '["fe80::f861:3eff:fe1d:9d7a"]' -u myaccount -p mypassword \
--driver.auth.provider DsePlainTextAuthProvider \
--driver.ssl.provider JDK \
--driver.ssl.keystore.path /etc/dse/keystores/client.keystore \
--driver.ssl.keystore.password mysslkspass \
--driver.ssl.truststore.path /etc/dse/keystores/client.truststore \
--driver.ssl.truststore.password myssltrustpass \
-k mykeyspace -t mytable -url filename.csv
For Astra DB, connect using the database’s Secure Connect Bundle (SCB) and an application token:
dsbulk load -url filename.csv -k ks1 -t table1 \
-b "path/to/SCB.zip" -u token -p AstraCS:...
The expected connection credentials are:
- -b: Provide the path to the database’s SCB zip file. The SCB includes certificates and key files for SSL-encrypted connections as well as information about the database’s contact points.
- -u: Set to the literal string token.
- -p: Provide an application token. DataStax recommends using secure references to tokens, such as environment variables, rather than specifying them directly on the command line.
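As a sketch of that recommendation, export the token to an environment variable (the variable name here is illustrative) and reference it in the command instead of pasting the token inline:
export ASTRA_DB_TOKEN="AstraCS:..."
dsbulk load -url filename.csv -k ks1 -t table1 \
-b "path/to/SCB.zip" -u token -p "$ASTRA_DB_TOKEN"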
Set the load source
Use the -url option to specify the path and filename from which you want to load data.
This can be a path on the local file system or a valid URL.
You can specify a filename only, path only, or both filename and path.
- Load a local file stored relative to your DSBulk installation directory:
dsbulk load -url myData.csv -k ks1 -t table1
When loading a local file, the -url value expands to the long form file://hostname/path/to/file. For example, loading from the tmp directory on localhost would be file:///tmp/export.csv.
- Load a file from a different directory on the local file system:
dsbulk load -url ~/data-export/myData.csv -k ks1 -t table1
- Load remote files by providing a valid URL, such as an S3 path or a remote hostname and path, for example -url https://hostname/path/to/filename.csv:
dsbulk load -url https://192.168.1.100/data/export.csv -h '10.200.1.3,10.200.1.4'
Load multiple files
- Load multiple files from a list of URLs: Specify a file that contains a list of well-formed URLs to CSV or JSON data files, and DSBulk loads all of those files:
dsbulk load --connector.json.urlfile "multiple-input-data-urls.txt" -k ks1 -t table1
- Load all files from a directory:
dsbulk load -url ~/export-dir -k ks1 -t table1
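If the export directory contains nested subdirectories, you may be able to include them with the CSV connector's recursive setting. This is a sketch only; the --connector.csv.recursive option name is an assumption about the connector settings rather than something shown above:
dsbulk load -url ~/export-dir -k ks1 -t table1 --connector.csv.recursive true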
Load CSV data from stdin
You can use -url stdin:/ to load CSV data from stdin as it is generated by a script.
This example runs a script called generate_data and pipes its output to a dsbulk load command:
generate_data | dsbulk load -url stdin:/ -k ks1 -t table1
stdin is also the default import location if -url is omitted.
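Because stdin is the default import location, the same pipeline can also be written without the -url option:
generate_data | dsbulk load -k ks1 -t table1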
Load a compressed file
There are two ways to load compressed files:
- For supported file types, use the --connector.csv|json.compression option to specify the compression type, which lets DSBulk decompress the file as it is loaded (see the sketch after this list).
- Extract the compressed file to stdout, and then pipe it from stdin to your dsbulk load command:
gzcat table1.csv.gz | dsbulk load -k ks1 -t table1
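The following is a minimal sketch of the first approach for a gzip-compressed CSV file; the file name is illustrative, and it assumes gzip is among the compression types the CSV connector supports:
dsbulk load -url table1.csv.gz -k ks1 -t table1 --connector.csv.compression gzip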
Set headers or mappings
By default, DSBulk attempts to read the column or field names from a header row in the input file.
Use -header true|false to indicate whether the first row of a CSV file contains column names.
Use -m (--schema.mapping) to map field names or indices in the input file to column names in the table:
- Map CSV columns to table columns:
dsbulk load -url export.csv -k ks1 -t table1 -header false -m '0=col1,1=col3'
- Map JSON fields to table columns.
The following example also shows how a function call can be used to apply a function to one or more fields before mapping the result to a single column named odd_column.
dsbulk load -url /tmp/dsbulkblog/iris.csv -k dsbulkblog -t iris_with_id --schema.mapping "sepal_length = sepal_length, sepal_width = sepal_width, petal_length = petal_length, petal_width = petal_width, species = species, id = id, max(sepal_length, petal_length) = odd_column"
Set a delimiter
Use -delim to specify a field delimiter character, such as '\t' for Tab:
dsbulk load -k ks1 -t table1 -url export.csv -delim '\t'
Set escaping
Use -escape to set handling for escaped characters.
The following example uses -escape '\"' to indicate that double quotes within fields are escaped with another double quote, as in the line "f1","value with ""quotes"" and more" in the CSV file:
dsbulk load -url ~/export.csv -k ks1 -t table1 -escape '\"'
Set TTL
For more complex handling, you can pass a CQL INSERT statement in the -query option.
A valuable way to use this option is to set time-to-live (TTL) values for the data you are loading.
For example:
dsbulk load -query \
"INSERT INTO dsbulkblog.iris_with_id
(sepal_length,sepal_width,petal_length,petal_width,species,id)
VALUES (:sepal_length,:sepal_width,:petal_length,:petal_width,:species,:id)
USING TTL :ttl_to_use"
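To spot-check that the TTL was applied, you can query the remaining TTL of a regular (non-primary-key) column in cqlsh. This sketch assumes the table and columns from the example above:
SELECT id, species, TTL(species) FROM dsbulkblog.iris_with_id LIMIT 5;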
Load CSV data into a graph vertex label table
For DSE Graph tables, pass graph and vertex options with your load command.
The following example loads graph data into the graph graph1 and the vertex label table vertex_label1:
dsbulk load -url ~/data/vertices/person.csv -g graph1 -v vertex_label1 \
-delim '|' -header true --schema.allowMissingFields true
Load collections
DSBulk includes a helpful simplification when loading collection data (the map, list, and set data types).
Collections inside a CSV file can contain valid JSON or a simpler, non-compliant JSON format.
For example, given the following table:
CREATE TABLE t1 (col1 frozen<set<int>> PRIMARY KEY, col2 list<int>, col3 map<text,int>);
The following pipe-delimited CSV file contains valid JSON, and you can load it into the table:
col1|col2|col3
[1,2,3]|[1,2,3]|{"key1":1,"key2":2}
The same CSV data can also be loaded in the following format, without the square brackets and curly braces:
col1|col2|col3
1,2,3|1,2,3|"key1":1,"key2":2
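As a sketch, assuming either version of the file is saved as /tmp/collections.csv (an illustrative path) and that table t1 was created in keyspace ks1, the load command only needs to specify the pipe delimiter:
dsbulk load -url /tmp/collections.csv -k ks1 -t t1 -delim '|'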
This simplification doesn’t work for nested collections. If a column’s type is itself a collection of collections, its values must be written as valid JSON; the simplified format can’t represent the nesting.
Load data into static and non-static columns
Use these steps to load database tables that contain static and non-static columns:
- Create a dataset containing only the partition key columns and the static columns. For example, create a keyspace and a table:
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE TABLE demo.mytable (
  pk1 text,
  pk2 text,
  cc1 text,
  cc2 text,
  static1 text static,
  static2 text static,
  nonstatic1 text,
  nonstatic2 text,
  PRIMARY KEY ((pk1, pk2), cc1, cc2)
);
- Create a file containing the static data to load. For example, static-data.csv:
pk1,pk2,static1,static2
a,b,y,z
By default, DataStax Bulk Loader treats the first row of the file as a CSV header.
- Load the static data with DataStax Bulk Loader using the -query parameter and the following format:
-query "INSERT INTO t1 (pk1, pk2, ..., static1, static2, ...)"
For example:
dsbulk load -k demo -url static-data.csv \
-query "INSERT INTO demo.mytable (pk1,pk2,static1,static2) \
VALUES (:pk1,:pk2,:static1,:static2)"
- Create another dataset containing the non-static data, including the entire primary key (partition keys and clustering columns) and the regular columns. For example, non-static-data.csv:
pk1,pk2,cc1,cc2,nonstatic1,nonstatic2
a,b,c,d,e,f
- Load the non-static data using the -query parameter and the following format:
-query "INSERT INTO t1 (pk1, pk2, ..., cc1, cc2, ..., nonstatic1, nonstatic2, ...)"
For example:
dsbulk load -k demo -url non-static-data.csv \
-query "INSERT INTO demo.mytable (pk1,pk2,cc1,cc2,nonstatic1,nonstatic2) \
VALUES (:pk1,:pk2,:cc1,:cc2,:nonstatic1,:nonstatic2)"
- With the data loading operations complete, you can query the table and retrieve both the static and non-static data:
SELECT * FROM demo.mytable;
Result:
 pk1 | pk2 | cc1 | cc2 | static1 | static2 | nonstatic1 | nonstatic2
-----+-----+-----+-----+---------+---------+------------+------------
   a |   b |   c |   d |       y |       z |          e |          f