Common options

Common options for the dsbulk command

Some options are commonly required when using dsbulk. Required options are designated in the following list.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

--version

Show the program's version number and exit.

Default: unspecified

-f filename

Load options from the given file rather than from dsbulk_home/conf/application.conf.
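
For example, the following sketch loads data using settings read from a file named my-settings.conf (the file, keyspace, and table names are hypothetical):
dsbulk load -f my-settings.conf -url export.csv -k ks1 -t table1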

Default: unspecified

-c,--connector.name { csv | json }

The name of the connector to use.

Supported: dsbulk load and dsbulk unload operations.
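
For example, the following sketch unloads a table as JSON instead of the default CSV (keyspace, table, and directory names are hypothetical):
dsbulk unload -c json -k ks1 -t table1 -url ~/ks1-table1-export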

Default: csv

-b,--driver.basic.cloud.secureConnectBundle secure-connect-database-name.zip

Specifies the path to a secure connect bundle used to connect to a DataStax Apollo database. The specified location must be a path on the local filesystem or a valid URL. Download the secure connect bundle for a DataStax Apollo database from the DataStax Constellation console.

The following examples show different methods of indicating the path to the secure connect bundle:

"/path/to/secure-connectdatabase-name.zip"        # Path on *Nix systems
"./path/to/secure-connectdatabase-name.zip"       # Path on *Nix relative to working directory
"~/path/to/secure-connectdatabase-name.zip"       # Path on *Nix relative to home directory
"C:\\path\\to\\secure-connectdatabase-name.zip"   # Path on Microsoft Windows systems
                                                  # You must escape backslashes in HOCON
"file:/path/to/secure-connectdatabase-name.zip"   # URL with file protocol
"http://host.com/secure-connectdatabase-name.zip" # URL with HTTP protocol
Note: If a secure connect bundle is specified using this parameter, the following options are ignored and a warning is logged:
  • Contact points
  • Consistency level other than LOCAL_QUORUM (only for loading operations)
  • SSL configurations
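
For example, a load operation that connects using a downloaded bundle might look like the following sketch (the bundle path, keyspace, and table names are hypothetical):
dsbulk load -b "~/Downloads/secure-connect-mydb.zip" -url export.csv -k ks1 -t table1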

Default: none

-k,--schema.keyspace string

Keyspace used for loading, unloading, or counting data. Keyspace names should not be quoted and are case-sensitive. MyKeyspace will match a keyspace named MyKeyspace but not mykeyspace. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified

-t,--schema.table string

Table used for loading, unloading, or counting data. Table names should not be quoted and are case-sensitive. MyTable will match a table named MyTable but not mytable. Required option if schema.query is not specified; otherwise, optional.
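
For example, a count operation that names both the keyspace and the table might look like the following sketch (names are hypothetical):
dsbulk count -k MyKeyspace -t MyTable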

Default: unspecified

-m,--schema.mapping string

The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
  • Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2 are the zero-based indices of fields in the source data; and col1, col2, col3 are bound variable names in the insert statement.
  • A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
  • Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data; and col1, col2, col3 are bound variable names in the insert statement.
  • A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3.
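
For example, the indexed and mapped forms above could be passed on the command line as follows (field, column, keyspace, and table names are hypothetical):
dsbulk load -url export.csv -k ks1 -t table1 -m '0 = col1, 1 = col2, 2 = col3'
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = col1, fieldB = col2, fieldC = col3'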

To specify that a field should be used as the timestamp (write time) or as ttl (time to live) of the inserted row, use the specially named fake columns __ttl and __timestamp: fieldA = __timestamp, fieldB = __ttl. Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit + codec.epoch; the latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing duration in seconds and must use the format specified in codec.number.
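
For example, a load that uses fieldB as the row's write time and fieldC as its TTL might look like the following sketch (all names are hypothetical):
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = col1, fieldB = __timestamp, fieldC = __ttl'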

To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g. now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g. field1 = now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).
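
For example, a load that populates the column c4 with now() while mapping the other fields explicitly might look like the following sketch (names are hypothetical):
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = col1, fieldB = col2, now() = c4'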

In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.

To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].
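
For example, a load that maps all like-named fields except c2 and c3 might look like the following sketch (names are hypothetical):
dsbulk load -url export.csv -k ks1 -t table1 -m '* = [-c2, -c3]'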

Any identifier (field or column) that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just as you would do in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it).
Note: Unlike CQL grammar, unquoted identifiers will not be lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 will match a column named MyColumn1, but will not match mycolumn1.

The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records, others can only produce mapped records, and some can produce both at the same time. Refer to the connector's documentation to determine which kinds of mapping it supports.

Default: null

-url,--connector.{csv | json}.url string

The URL or path of the resources to read from or write to. Possible values are - (which represents stdin when loading and stdout when unloading) and file (a file path or URL).

File URLs can also be expressed as simple paths without the file prefix. A directory of files can also be specified. The following examples show different ways of using this parameter:

Specify a few hosts (initial contact points) that belong to the desired cluster and load from a local file, without headers. Map field indices of the input to table columns with -m:
dsbulk load -url ~/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header false -m '0=col1,1=col3'
Specify port 9876 for the cluster hosts and load from an external source URL:
dsbulk load -url https://192.168.1.100/data/export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9876
Load all CSV files from a directory. The files do not have a header row, so -header false is specified. Map field indices of the input to table columns with -m:
dsbulk load -url ~/export-dir -k ks1 -t table1 -header false -m '0=col1,1=col3'

See Loading data examples for more examples.

Default: -

-delim,--connector.csv.delimiter string

The character to use as field delimiter.
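
For example, to load a pipe-delimited file, the delimiter could be overridden as in the following sketch (file, keyspace, and table names are hypothetical):
dsbulk load -url export.psv -k ks1 -t table1 -delim '|'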

Default: , (a comma)

-header,--connector.csv.header { true | false }

Enable or disable the header line in the files to read or write. If enabled for loading, the first non-empty line in every file assigns field names for each record, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3. If disabled for loading, records contain field indexes instead of field names: 0 = col1, 1 = col2, 2 = col3. For unloading, if this setting is enabled, each file begins with a header line; if disabled, it does not.
Note: This option will apply to all files loaded or unloaded.
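
For example, an unload that writes a header line at the start of each output file might look like the following sketch (names are hypothetical):
dsbulk unload -k ks1 -t table1 -url ~/ks1-table1-export -header true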

Default: true

-h, --driver.basic.contact-points, --datastax-java-driver.basic.contact-points host_name(s)

The contact points to use for the initial connection to the cluster. This must be a list of strings with each contact point specified as host or host:port. If the host is specified without a port, the default port specified in basic.default-port will be used. Apache Cassandra 3.0 and earlier and DataStax Enterprise (DSE) 6.7 and earlier require all nodes in a cluster to share the same port.

If the host is a DNS name that resolves to multiple A-records, all the corresponding addresses will be used. Do not use localhost as a host name, because it resolves to both IPv4 and IPv6 addresses on some platforms. If hosts are specified without ports, set the shared port with --driver.basic.default-port (-port).
Note: Be sure to enclose address strings that contain special characters in quotes, as shown in these examples:
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7a"]' -query "SELECT * from foo.bar;" 
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7b","fe80::f861:3eff:fe1d:9d7c"]' 
              -query "SELECT * from foo1.bar1;"

Default: 127.0.0.1

-port, --driver.basic.default-port, --datastax-java-driver.basic.default-port port_number

The port to use for basic.contact-points, when a host is specified without a port. All nodes in a cluster must accept connections on the same port number.

Default: 9042