Common options

Common options for the dsbulk command

Some options are commonly required to use dsbulk. Required options are designated as such in the following list.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).
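
For example, the following two commands are equivalent; the file, keyspace, and table names are placeholders:
dsbulk load -url export.csv -k ks1 -t table1
dsbulk load -url export.csv --schema.keyspace ks1 --schema.table table1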

-f filename

Load options from the given file rather than from dsbulk_home/conf/application.conf.

Default: unspecified
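
For example, a sketch that points dsbulk at a hypothetical settings file named myconfig.conf instead of the default application.conf:
dsbulk load -f myconfig.conf -url export.csv -k ks1 -t table1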

-c,--connector.name csv | json

The name of the connector to use.

Default: csv
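
For example, to load a hypothetical JSON file with the json connector (the file, keyspace, and table names are placeholders):
dsbulk load -c json -url data.json -k ks1 -t table1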

-k,--schema.keyspace string

Keyspace used for loading, unloading, or counting data. Keyspace names should not be quoted and are case-sensitive. MyKeyspace will match a keyspace named MyKeyspace but not mykeyspace. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified

-t,--schema.table string

Table used for loading, unloading, or counting data. Table names should not be quoted and are case-sensitive. MyTable will match a table named MyTable but not mytable. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified
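
For example, a minimal load command using hypothetical keyspace and table names; note that the names are passed unquoted and case is significant:
dsbulk load -url export.csv -k MyKeyspace -t MyTable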

-m,--schema.mapping string

The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form (see the sketches after this list):
  • Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2 are the zero-based indices of fields in the source data, and col1, col2, col3 are bound variable names in the insert statement.
  • A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
  • Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data, and col1, col2, col3 are bound variable names in the insert statement.
  • A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3.
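
For example, sketches of an indexed mapping and a mapped mapping; the file, keyspace, table, field, and column names are placeholders:
dsbulk load -url export.csv -k ks1 -t table1 -m '0 = col1, 1 = col2, 2 = col3'
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = col1, fieldB = col2, fieldC = col3'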

To specify that a field should be used as the timestamp (write time) or as the TTL (time to live) of the inserted row, use the specially named fake columns __timestamp and __ttl: fieldA = __timestamp, fieldB = __ttl. Timestamp fields are parsed as CQL timestamps and must use the format specified in codec.timestamp, or alternatively the combination of codec.unit and codec.epoch, in which case the field value is an integer representing the number of codec.unit units elapsed since codec.epoch. TTL fields are parsed as integers representing durations in seconds and must use the format specified in codec.number.
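
For example, a sketch of a load command that maps fieldC to col1 and uses fieldA and fieldB as the row's write time and TTL (all names are placeholders):
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = __timestamp, fieldB = __ttl, fieldC = col1'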

To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g. now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g. field1 = now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).
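
For example, a hypothetical load command that populates column c4 with now() while mapping fieldA to c1:
dsbulk load -url export.csv -k ks1 -t table1 -m 'now() = c4, fieldA = c1'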

In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.

To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].
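
For example, hypothetical commands combining the wildcard with an explicit override, and with exclusions:
dsbulk load -url export.csv -k ks1 -t table1 -m '* = *, c5 = c4'
dsbulk load -url export.csv -k ks1 -t table1 -m '* = [-c2, -c3]'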

Any identifier (field or column) that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just as you would do in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it).
Note: Unlike the CQL grammar, unquoted identifiers will not be lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 will match a column named MyColumn1, but will not match mycolumn1.
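
For example, on a Unix shell the mapping can be wrapped in single quotes so that the inner double quotes reach dsbulk intact (all names are placeholders):
dsbulk load -url export.csv -k ks1 -t table1 -m '"Field ""A""" = "Column 2"'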

The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records, others can only produce mapped ones, and others can produce both indexed and mapped records at the same time. Refer to the connector's documentation to determine which kinds of mapping it supports.

Default: null

-url,--connector.(csv|json).url string

The URL or path of the resource(s) to read from or write to. Possible values are - (representing stdin for reading and stdout for writing) and file URLs (file://filepath). File URLs can also be expressed as simple paths without the file:// prefix. A directory of files can also be specified.

Default: -
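
For example, hypothetical commands reading from a single file, from a directory of files, and from stdin (generate_data stands in for any command producing CSV on stdout):
dsbulk load -url export.csv -k ks1 -t table1
dsbulk load -url /path/to/csv/dir -k ks1 -t table1
generate_data | dsbulk load -k ks1 -t table1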

-delim,--connector.csv.delimiter string

The character to use as field delimiter.

Default: , (a comma)
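
For example, a sketch that loads a hypothetical tab-separated file:
dsbulk load -url export.tsv -k ks1 -t table1 -delim '\t'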

-header,--connector.csv.header ( true | false )

Enable or disable the header line in the files to read or write. If enabled for loading, the first non-empty line in every file assigns the field names for each record, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3. If disabled for loading, records will not contain field names, only field indexes: 0 = col1, 1 = col2, 2 = col3. For unloading, if this setting is enabled, each file will begin with a header line; if disabled, it will not.
Note: This option will apply to all files loaded or unloaded.

Default: true
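
For example, a sketch that loads a headerless file, supplying the mapping explicitly since no header line is available to name the fields:
dsbulk load -url export.csv -k ks1 -t table1 -header false -m '0 = col1, 1 = col2'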

-h,--driver.hosts host_name(s)

The contact points to use for the initial connection to the cluster. This must be a comma-separated list of hosts, each specified by a hostname or IP address. If the host is a DNS name that resolves to multiple A-records, all the corresponding addresses will be used. Do not use localhost as a hostname (because it resolves to both IPv4 and IPv6 addresses on some platforms). The port for all hosts must be specified with driver.port.
Note: Be sure to enclose address strings that contain special characters in quotes, as shown in these examples:
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7a"]' -query "SELECT * from foo.bar;" 
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7b","fe80::f861:3eff:fe1d:9d7c"]' 
              -query "SELECT * from foo1.bar1;"

Default: 127.0.0.1

-port,--driver.port port_number

The port to connect to at initial contact points. Note that all nodes in a cluster must accept connections on the same port number.

Default: 9042
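
For example, a sketch that connects to hypothetical contact points listening on a non-default port:
dsbulk load -url export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9043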

--version

Show program's version number and exit.

Default: unspecified