Common options
Some options are commonly required to use dsbulk. In the following list, required options are designated.
The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).
--version
Show the program’s version number and exit.
Default: unspecified
-f filename
Load options from the given file rather than from dsbulk_home/conf/application.conf.
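A minimal sketch of such a file, in the same HOCON syntax as application.conf; the file name, keyspace, and table shown here are placeholders:
# my_settings.conf: settings use the same dsbulk prefix as application.conf
dsbulk {
  connector.name = csv
  schema.keyspace = ks1
  schema.table = table1
}
Pass the file on the command line:
dsbulk load -f my_settings.conf -url export.csv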
Default: unspecified
-c, --connector.name, --dsbulk.connector.name { csv | json }
The name of the connector to use.
Supported: dsbulk load and dsbulk unload operations.
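For example, to load a JSON file instead of CSV (the keyspace and table names here are illustrative):
dsbulk load -c json -url export.json -k ks1 -t table1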
Default: csv
-b, --driver.basic.cloud.secure-connect-bundle secure-connect-database-name.zip
Specifies the path to a secure connect bundle used to connect with a DataStax Astra database. The specified location must be a path on the local filesystem or a valid URL. Download the secure connect bundle for a DataStax Astra database from the DataStax Cloud console.
The following examples show different methods of indicating the path to the secure connect bundle:
"/path/to/secure-connectdatabase-name.zip" # Path on *Nix systems
"./path/to/secure-connectdatabase-name.zip" # Path on *Nix relative to working directory
"~/path/to/secure-connectdatabase-name.zip" # Path on *Nix relative to home directory
"C:\\path\\to\\secure-connectdatabase-name.zip" # Path on Microsoft Windows systems
# You must escape backslashes in HOCON
"file:/path/to/secure-connectdatabase-name.zip" # URL with file protocol
"http://host.com/secure-connectdatabase-name.zip" # URL with HTTP protocol
If a secure connect bundle is specified using this parameter, any of the following options are ignored and a warning is logged:
- Contact points
- Consistency level other than LOCAL_QUORUM (only for loading operations)
- SSL configurations
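For example, to load a local CSV file into a DataStax Astra database using a bundle together with the database credentials (the bundle path, credentials, keyspace, and table shown are placeholders):
dsbulk load -b /path/to/secure-connect-database-name.zip -u client_id -p client_secret -k ks1 -t table1 -url export.csv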
Default: none
-k, --schema.keyspace, --dsbulk.schema.keyspace string
Keyspace used for loading or unloading data.
Do not quote keyspace names, and note that they are case sensitive.
Either this keyspace option or, for graph data, the graph option is required if query is not specified, or if the query is not qualified with a keyspace name.
Default: null
-m, --schema.mapping, --dsbulk.schema.mapping string
The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
- Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2 are the zero-based indices of fields in the source data, and col1, col2, col3 are bound variable names in the insert statement.
- A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
- Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data, and col1, col2, col3 are bound variable names in the insert statement.
- A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3.
To specify that a field should be used as the timestamp (write time) or as the TTL (time to live) of the inserted row, use the specially named fake columns writetime() and ttl(): fieldA = writetime(), fieldB = ttl().
Starting in DataStax Bulk Loader 1.8.0, the special tokens __ttl and __timestamp of earlier releases are deprecated; use the ttl() and writetime() function calls instead.
Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit with codec.epoch. The latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing duration in seconds and must use the format specified in codec.number.
To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (for example, now() = c4).
Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (for example, field1 = now()).
Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).
In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.
To specify that all like-named fields be mapped, except for c2, use: * = -c2.
To skip c2 and c3, use: * = [-c2, -c3].
Any identifier, field or column, that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just as you would do in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it).
Unlike CQL grammar, unquoted identifiers will not be lowercased by DataStax Bulk Loader.
An identifier such as MyColumn1 will match a column named MyColumn1, and not mycolumn1.
The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records; others can only produce mapped ones, while others are capable of producing both indexed and mapped records at the same time. Refer to the connector’s documentation to know which kinds of mapping it supports.
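For example, the following load maps two differently named source fields explicitly and stores a third field as the row's write time; the file, keyspace, table, field, and column names here are illustrative:
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = col1, fieldB = col2, created = writetime()'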
Default: null
-url, --connector.{csv|json}.url, --dsbulk.connector.{csv|json}.url string
The URL or path of the resources to read from or write to.
Possible options are - (representing stdin for reading and stdout for writing) and file (a file path). File URLs can also be expressed as simple paths without the file prefix. A directory of files can also be specified.
The following examples show different ways to use this parameter:
Specify a few hosts (initial contact points) that belong to the desired cluster and load from a local file, without headers. Map field indices of the input to table columns with -m:
dsbulk load -url ~/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header false -m '0=col1,1=col3'
Specify port 9876 for the cluster hosts and load from an external source URL:
dsbulk load -url https://192.168.1.100/data/export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9876
Load all CSV files from a directory. The files do not have a header row (-header false). Map field indices of the input to table columns with -m:
dsbulk load -url ~/export-dir -k ks1 -t table1 -header false -m '0=col1,1=col3'
See Loading data examples for more examples.
Default: -
-delim, --connector.csv.delimiter, --dsbulk.connector.csv.delimiter string
The character or characters to use as the field delimiter. Field delimiters containing more than one character are accepted, such as '||'.
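For example, to load a pipe-delimited file (the file, keyspace, and table names here are illustrative):
dsbulk load -url export.psv -k ks1 -t table1 -delim '|'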
Default: , (a comma)
-header, --connector.csv.header, --dsbulk.connector.csv.header { true|false }
Specify whether the files to read or write begin with a header line.
If enabled for loading, the first non-empty line in every file assigns field names for each record, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3.
If disabled for loading, records do not contain field names, only field indexes: 0 = col1, 1 = col2, 2 = col3.
For unloading, if this setting is enabled, each file begins with a header line; if disabled, it does not.
This option applies to all files loaded or unloaded.
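For example, to unload a table to a directory of CSV files that each begin with a header line (the keyspace, table, and directory names here are illustrative):
dsbulk unload -k ks1 -t table1 -url ~/out-dir -header true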
Default: true
-h, --driver.basic.contact-points, --datastax-java-driver.basic.contact-points host_name(s)
The contact points to use for the initial connection to the cluster.
These are addresses of Cassandra nodes that the driver uses to discover the cluster topology. Only one contact point is required (the driver retrieves the addresses of the other nodes automatically), but it is good practice to provide more than one: if the only contact point you supply is unavailable, the driver cannot initialize itself correctly.
This must be a list of strings, with each contact point specified as host or host:port. If the host is specified without a port, the default port specified in basic.default-port will be used.
Apache Cassandra 3.0 and earlier and DataStax Enterprise (DSE) 6.7 and earlier require all nodes in a cluster to share the same port.
Valid examples of contact points are:
- IPv4 addresses with ports: [ "192.168.0.1:9042", "192.168.0.2:9042" ]
- IPv4 addresses without ports: [ "192.168.0.1", "192.168.0.2" ]
- IPv6 addresses with ports: [ "fe80:0:0:0:f861:3eff:fe1d:9d7b:9042", "fe80:0:0:f861:3eff:fe1d:9d7b:9044:9042" ]
- IPv6 addresses without ports: [ "fe80:0:0:0:f861:3eff:fe1d:9d7b", "fe80:0:0:f861:3eff:fe1d:9d7b:9044" ]
- Host names with ports: [ "host1.com:9042", "host2.com:9042" ]
- Host names without ports: [ "host1.com", "host2.com" ]
If the host is a DNS name that resolves to multiple A-records, all the corresponding addresses will be used. Do not use localhost as a host name, because it resolves to both IPv4 and IPv6 addresses on some platforms. The port for all hosts must be specified with driver.port.
Be sure to enclose address strings that contain special characters in quotes, as shown in these examples:
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7a"]' -query "SELECT * from foo.bar;"
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7b","fe80::f861:3eff:fe1d:9d7c"]'
-query "SELECT * from foo1.bar1;"
The heuristic to determine whether a contact point is in the form host or host:port is not 100% accurate for some IPv6 addresses. Avoid ambiguous IPv6 addresses such as fe80::f861:3eff:fe1d:1234, because such a string could be interpreted either as the combination of IP fe80::f861:3eff:fe1d with port 1234, or as IP fe80::f861:3eff:fe1d:1234 without a port. In such cases, DataStax Bulk Loader does not change the contact point. To avoid this issue, provide IPv6 addresses in full form. For example, instead of fe80::f861:3eff:fe1d:1234, provide fe80:0:0:0:0:f861:3eff:fe1d:1234, so that the string is parsed as IP fe80:0:0:0:0:f861:3eff:fe1d with port 1234.
On cloud deployments, DataStax Bulk Loader automatically sets this option to an empty list, because contact points are not allowed to be explicitly provided when connecting to DataStax Astra databases.
Default: 127.0.0.1
-port, --driver.basic.default-port, --datastax-java-driver.basic.default-port port_number
The port to use for basic.contact-points when a host is specified without a port.
All nodes in a cluster must accept connections on the same port number.
Default: 9042