Common options
Some options are commonly required to use dsbulk. In the following list, required options are designated. The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).
--version
Show the program’s version number and exit.
Default: unspecified
-f filename
Load options from the given file rather than from dsbulk_home/conf/application.conf.
Default: unspecified
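For example, to run a load with options read from a custom configuration file (the file path shown is a placeholder):
dsbulk load -f /path/to/my_settings.conf -url export.csv -k ks1 -t table1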
-c, --connector.name, --dsbulk.connector.name { csv | json }
The name of the connector to use.
Supported: dsbulk load and dsbulk unload operations.
Default: csv
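For example, a load that reads JSON instead of the default CSV (the file and schema names are placeholders):
dsbulk load -c json -url ~/export.json -k ks1 -t table1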
-b, --driver.basic.cloud.secure-connect-bundle secure-connect-database-name.zip
Specifies the path to a secure connect bundle used to connect with a DataStax Astra database. The specified location must be a path on the local filesystem or a valid URL. Download the secure connect bundle for a DataStax Astra database from the DataStax Cloud console.
The following examples show different methods of indicating the path to the secure connect bundle:
"/path/to/secure-connect-database-name.zip"        # Path on *nix systems
"./path/to/secure-connect-database-name.zip"       # Path on *nix relative to working directory
"~/path/to/secure-connect-database-name.zip"       # Path on *nix relative to home directory
"C:\\path\\to\\secure-connect-database-name.zip"   # Path on Microsoft Windows systems
                                                   # (you must escape backslashes in HOCON)
"file:/path/to/secure-connect-database-name.zip"   # URL with file protocol
"http://host.com/secure-connect-database-name.zip" # URL with HTTP protocol
If a secure connect bundle is specified using this parameter, the following options are ignored and a warning is logged:
- Contact points
- Consistency level other than LOCAL_QUORUM (only for loading operations)
- SSL configurations
Default: none
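For example, a hypothetical load into an Astra database (the bundle path and credentials shown are placeholders):
dsbulk load -b "/path/to/secure-connect-mydb.zip" -u myClientId -p myClientSecret -k ks1 -t table1 -url export.csv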
-k, --schema.keyspace, --dsbulk.schema.keyspace string
Keyspace used for loading or unloading data.
Do not quote keyspace names, and note that they are case sensitive. MyKeyspace will match a keyspace named MyKeyspace, but not mykeyspace. Either keyspace (this option) or, for graph data, the graph option is required if query is not specified or is not qualified with a keyspace name.
Default: null
-t, --schema.table, --dsbulk.schema.table string
Table used for loading or unloading data.
Do not quote table names, and note that they are case sensitive. MyTable will match a table named MyTable, but not mytable. Either table (this option) or, for graph data, the vertex or edge option is required if query is not specified.
Default: null
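For example, a load into the case-sensitive table MyTable in keyspace MyKeyspace (the names are placeholders):
dsbulk load -url export.csv -k MyKeyspace -t MyTable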
-m, --schema.mapping, --dsbulk.schema.mapping string
The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader for Apache Cassandra applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
- Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2 are the zero-based indices of fields in the source data, and col1, col2, col3 are bound variable names in the insert statement. A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
- Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data, and col1, col2, col3 are bound variable names in the insert statement. A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3.
To specify that a field should be used as the timestamp (write time) or as the ttl (time to live) of the inserted row, use the specially named fake columns writetime() and ttl(): fieldA = writetime(), fieldB = ttl().
Starting in DataStax Bulk Loader 1.8.0, the special tokens timestamp and ttl are deprecated (but still honored). If used, a warning message is logged. When you can, replace any timestamp and ttl tokens with writetime() and ttl(), respectively.
Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit with codec.epoch. The latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing a duration in seconds and must use the format specified in codec.number.
To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (for example, now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (for example, field1 = now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).
In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.
To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].
Any identifier, field, or column that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double quotes, just as you would in CQL: "Field ""A""" = "Column 2" (to escape a double quote, simply double it).
Unlike CQL grammar, unquoted identifiers are not lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 will match a column named MyColumn1, but will not match mycolumn1.
The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records, others can only produce mapped ones, and others are capable of producing both indexed and mapped records at the same time. Refer to the connector’s documentation to know which kinds of mapping it supports.
Default: null
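As an illustration, a hypothetical load that auto-maps all like-named fields, redirects the source field c5 to the column c4, and uses the source field loaded_at as the write timestamp (all names are placeholders):
dsbulk load -url export.csv -k ks1 -t table1 -m '* = *, c5 = c4, loaded_at = writetime()'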
-url, --connector.{csv|json}.url, --dsbulk.connector.{csv|json}.url string*
The URL or path of the resources to read from or write to. Possible options are - (representing stdin for reading and stdout for writing) and file (a file path). File URLs can also be expressed as simple paths without the file prefix. A directory of files can also be specified. The following examples show different ways of using this parameter:
Specify a few hosts (initial contact points) that belong to the desired cluster and load from a local file without headers, mapping field indices of the input to table columns with -m:
dsbulk load -url ~/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header false -m '0=col1,1=col3'
Specify port 9876 for the cluster hosts and load from an external source URL:
dsbulk load -url https://192.168.1.100/data/export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9876
Load all CSV files from a directory. The files do not have a header row (-header false). Map field indices of the input to table columns with -m:
dsbulk load -url ~/export-dir -k ks1 -t table1 -header false -m '0=col1,1=col3'
See Loading data examples for more examples.
Default: -
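Because - represents stdin or stdout, dsbulk can also be used in a shell pipeline; for example, a hypothetical unload compressed on the fly:
dsbulk unload -k ks1 -t table1 -url - | gzip > table1.csv.gz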
-delim, --connector.csv.delimiter, --dsbulk.connector.csv.delimiter string
The character or characters to use as the field delimiter. Field delimiters containing more than one character are accepted, such as '||'.
Default: , (a comma)
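For example, to load a pipe-delimited file (the file name is a placeholder):
dsbulk load -url ~/export.psv -k ks1 -t table1 -delim '|'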
-header, --connector.csv.header, --dsbulk.connector.csv.header { true | false }
Enable or disable whether the files to read or write begin with a header line. If enabled for loading, the first non-empty line in every file assigns field names for each record column, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3. If disabled for loading, records do not contain field names, only field indexes: 0 = col1, 1 = col2, 2 = col3. For unloading, if this setting is enabled, each file begins with a header line; if disabled, each file does not contain a header line. This option applies to all files loaded or unloaded.
Default: true
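For example, a hypothetical unload that writes files without header lines:
dsbulk unload -k ks1 -t table1 -url ~/out-dir -header false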
-h, --driver.basic.contact-points, --datastax-java-driver.basic.contact-points host_name(s)
The contact points to use for the initial connection to the cluster.
These are addresses of Cassandra nodes that the driver uses to discover the cluster topology. Only one contact point is required (the driver retrieves the addresses of the other nodes automatically), but it is good practice to provide more than one: if the only specified contact point is unavailable, the driver cannot initialize itself correctly.
This must be a list of strings with each contact point specified as host or host:port. If a host is specified without a port, the default port specified in basic.default-port is used. Apache Cassandra 3.0 and earlier and DataStax Enterprise (DSE) 6.7 and earlier require all nodes in a cluster to share the same port. Valid examples of contact points are:
- IPv4 addresses with ports: [ "192.168.0.1:9042", "192.168.0.2:9042" ]
- IPv4 addresses without ports: [ "192.168.0.1", "192.168.0.2" ]
- IPv6 addresses with ports: [ "fe80:0:0:0:f861:3eff:fe1d:9d7b:9042", "fe80:0:0:f861:3eff:fe1d:9d7b:9044:9042" ]
- IPv6 addresses without ports: [ "fe80:0:0:0:f861:3eff:fe1d:9d7b", "fe80:0:0:f861:3eff:fe1d:9d7b:9044" ]
- Host names with ports: [ "host1.com:9042", "host2.com:9042" ]
- Host names without ports: [ "host1.com", "host2.com" ]
If the host is a DNS name that resolves to multiple A-records, all the corresponding addresses are used. Do not use localhost as a host name, because it resolves to both IPv4 and IPv6 addresses on some platforms. The port for all hosts must be specified with driver.port.
Be sure to enclose address strings that contain special characters in quotes, as shown in these examples:
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7a"]' -query "SELECT * from foo.bar;"
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7b","fe80::f861:3eff:fe1d:9d7c"]' -query "SELECT * from foo1.bar1;"
The heuristic used to determine whether a contact point is in the form host or host:port is not 100% accurate for some IPv6 addresses. Avoid ambiguous IPv6 addresses such as fe80::f861:3eff:fe1d:1234, because such a string could be interpreted either as a combination of IP fe80::f861:3eff:fe1d with port 1234, or as IP fe80::f861:3eff:fe1d:1234 without a port. In such cases, DataStax Bulk Loader for Apache Cassandra does not change the contact point. To avoid this issue, provide IPv6 addresses in full form. For example, instead of fe80::f861:3eff:fe1d:1234, provide fe80:0:0:0:0:f861:3eff:fe1d:1234, so that the string is parsed as IP fe80:0:0:0:0:f861:3eff:fe1d with port 1234.
On cloud deployments, DataStax Bulk Loader for Apache Cassandra automatically sets this option to an empty list, because contact points are not allowed to be explicitly provided when connecting to DataStax Astra databases.
Default: 127.0.0.1
-port, --driver.basic.default-port, --datastax-java-driver.basic.default-port port_number
The port to use for basic.contact-points when a host is specified without a port. All nodes in a cluster must accept connections on the same port number.
Default: 9042
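For example, a hypothetical load against a cluster that listens on a non-default port:
dsbulk load -url export.csv -k ks1 -t table1 -h '10.200.1.3' -port 9876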