Common options

The options in this list are commonly used when running dsbulk. Options that are required are designated as such.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

--version

Show the program’s version number and exit.

Default: unspecified

-f filename

Load options from the given file rather than from dsbulk_home/conf/application.conf.
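
For example, a configuration file can hold settings that would otherwise be passed on the command line. The following is a minimal sketch; the file name myoptions.conf and the settings shown are illustrative:

# myoptions.conf (illustrative)
dsbulk {
   connector.csv.delimiter = "|"
   schema.keyspace = "ks1"
}

dsbulk load -f myoptions.conf -url export.csv -t table1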

Default: unspecified

-c, --connector.name, --dsbulk.connector.name { csv | json }

The name of the connector to use.

Supported: dsbulk load and dsbulk unload operations.
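
For example, to unload a table in JSON format instead of the default CSV (the keyspace, table, and output directory names are illustrative):

dsbulk unload -c json -k ks1 -t table1 -url /tmp/unload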

Default: csv

-b, --driver.basic.cloud.secure-connect-bundle secure-connect-database-name.zip

Specifies the path to a secure connect bundle used to connect with a DataStax Astra database. The specified location must be a path on the local filesystem or a valid URL. Download the secure connect bundle for a DataStax Astra database from the DataStax Cloud console.

The following examples show different methods of indicating the path to the secure connect bundle:

"/path/to/secure-connectdatabase-name.zip"        # Path on *Nix systems
"./path/to/secure-connectdatabase-name.zip"       # Path on *Nix relative to working directory
"~/path/to/secure-connectdatabase-name.zip"       # Path on *Nix relative to home directory
"C:\\path\\to\\secure-connectdatabase-name.zip"   # Path on Microsoft Windows systems
                                                  # You must escape backslashes in HOCON
"file:/path/to/secure-connectdatabase-name.zip"   # URL with file protocol
"http://host.com/secure-connectdatabase-name.zip" # URL with HTTP protocol

If a secure connect bundle is specified using this parameter, the following options are ignored and a warning is logged:

  • Contact points

  • A consistency level other than LOCAL_QUORUM (loading operations only)

  • SSL configurations

Default: none

-k, --schema.keyspace, --dsbulk.schema.keyspace string

Keyspace used for loading or unloading data.

Do not quote keyspace names, and note that they are case sensitive. MyKeyspace will match a keyspace named MyKeyspace but not mykeyspace.

Either this option (keyspace) or, for graph data, the graph option is required if query is not specified, or if the query is not qualified with a keyspace name.

Default: null

-t, --schema.table, --dsbulk.schema.table string

Table used for loading or unloading data.

Do not quote table names, and note that they are case sensitive. MyTable will match a table named MyTable but not mytable.

Either this option (table) or, for graph data, the vertex or edge option is required if query is not specified.
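
For example, assuming illustrative names, the following two commands read the same table, the first through -k and -t, and the second through a query qualified with a keyspace name:

dsbulk unload -k ks1 -t table1 -url outdir
dsbulk unload -query "SELECT * FROM ks1.table1" -url outdir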

Default: null

-m, --schema.mapping, --dsbulk.schema.mapping string

The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:

  • Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2, are the zero-based indices of fields in the source data; and col1, col2, col3 are bound variable names in the insert statement.

  • A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.

  • Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data; and col1, col2, col3 are bound variable names in the insert statement.

  • A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3. To specify that a field should be used as the timestamp (write time) or as the TTL (time to live) of the inserted row, use the specially named fake columns writetime(*) and ttl(*): fieldA = writetime(*), fieldB = ttl(*).
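
For example, the following commands sketch the indexed and mapped forms described in this list; all field and column names are illustrative:

dsbulk load -url export.csv -k ks1 -t table1 -header false -m '0 = col1, 1 = col2, 2 = col3'
dsbulk load -url export.csv -k ks1 -t table1 -m 'fieldA = col1, fieldB = col2, fieldC = col3'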

Starting in DataStax Bulk Loader 1.8.0, the special tokens __timestamp and __ttl are deprecated (but still honored). If they are used, a warning message is logged. When you can, replace any __timestamp and __ttl tokens with writetime(*) and ttl(*), respectively.

Timestamp fields are parsed as CQL timestamp columns and must use the format specified by either codec.timestamp, or codec.unit together with codec.epoch; in the latter case, the field value is an integer representing the number of units (codec.unit) elapsed since the specified epoch. TTL fields are parsed as integers representing durations in seconds and must use the format specified in codec.number.

To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g. now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g. field1=now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).

In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.

To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].

Any identifier (field or column) that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just as in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it).

Unlike CQL grammar, unquoted identifiers will not be lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 will match a column named MyColumn1, but will not match mycolumn1.

The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records; others can only produce mapped ones, while others are capable of producing both indexed and mapped records at the same time. Refer to the connector’s documentation to know which kinds of mapping it supports.
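
As a sketch that combines these features (all field and column names are assumptions), the following command maps all like-named fields and uses two extra source fields as the write time and TTL of the inserted rows:

dsbulk load -url export.csv -k ks1 -t table1 -m '* = *, created_at = writetime(*), expires_in = ttl(*)'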

Default: null

-url, --connector.{csv|json}.url, --dsbulk.connector.{csv|json}.url string

The URL or path of the resources to read from or write to. Possible values are - (representing stdin for reading and stdout for writing) and file URLs (file:filepath).

File URLs can also be expressed as simple paths without the file: prefix. A directory of files can also be specified. The following examples show different ways to use this parameter:

Specify a few hosts (initial contact points) that belong to the desired cluster and load from a local file, without headers. Map field indices of the input to table columns with -m:

dsbulk load -url ~/export.csv -k ks1 -t table1 -h '10.200.1.3, 10.200.1.4' -header false -m '0=col1,1=col3'

Specify port 9876 for the cluster hosts and load from an external source URL:

dsbulk load -url https://192.168.1.100/data/export.csv -k ks1 -t table1 -h '10.200.1.3,10.200.1.4' -port 9876

Load all CSV files from a directory. The files do not have a header row, so -header false is specified. Map field indices of the input to table columns with -m:

dsbulk load -url ~/export-dir -k ks1 -t table1 -header false -m '0=col1,1=col3'
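
Because the default is -, data can also be piped through stdin and stdout without specifying a URL at all. The file and schema names below are illustrative:

# illustrative names; -url defaults to - (stdin/stdout)
cat export.csv | dsbulk load -k ks1 -t table1
dsbulk unload -k ks1 -t table1 > export.csv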

See Loading data examples for more examples.

Default: -

-delim, --connector.csv.delimiter, --dsbulk.connector.csv.delimiter string

The character or characters to use as the field delimiter. Field delimiters containing more than one character are accepted, such as '||'.
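
For example, to load a pipe-delimited file (the file and schema names are illustrative):

dsbulk load -url export.psv -k ks1 -t table1 -delim '|'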

Default: , (a comma)

-header, --connector.csv.header, --dsbulk.connector.csv.header { true|false }

Enable or disable whether the files to read or write begin with a header line. If enabled for loading, the first non-empty line in every file assigns the field names for each record column, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3. If disabled for loading, records contain field indexes instead of field names: 0 = col1, 1 = col2, 2 = col3. For unloading, if this setting is enabled, each file begins with a header line; if disabled, files are written without a header line.

This option will apply to all files loaded or unloaded.
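
For example, the first command below reads a file whose first line is a header and maps each field to the like-named column; the second reads a headerless file and maps fields by index instead (file and schema names are illustrative):

dsbulk load -url with-header.csv -k ks1 -t table1 -header true
dsbulk load -url no-header.csv -k ks1 -t table1 -header false -m '0 = col1, 1 = col2'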

Default: true

-h, --driver.basic.contact-points, --datastax-java-driver.basic.contact-points host_name(s)

The contact points to use for the initial connection to the cluster.

These are the addresses of the Cassandra nodes that the driver uses to discover the cluster topology. Only one contact point is required (the driver retrieves the addresses of the other nodes automatically), but it is good practice to provide more than one: if the only specified contact point is unavailable, the driver cannot initialize itself correctly.

This must be a list of strings with each contact point specified as host or host:port. If the host is specified without a port, the default port specified in basic.default-port will be used. Apache Cassandra 3.0 and earlier and DataStax Enterprise (DSE) 6.7 and earlier require all nodes in a cluster to share the same port.

Valid examples of contact points are:

  • IPv4 addresses with ports: [ "192.168.0.1:9042", "192.168.0.2:9042" ]

  • IPv4 addresses without ports: [ "192.168.0.1", "192.168.0.2" ]

  • IPv6 addresses with ports: [ "fe80:0:0:0:f861:3eff:fe1d:9d7b:9042", "fe80:0:0:f861:3eff:fe1d:9d7b:9044:9042" ]

  • IPv6 addresses without ports: [ "fe80:0:0:0:f861:3eff:fe1d:9d7b", "fe80:0:0:f861:3eff:fe1d:9d7b:9044" ]

  • Host names with ports: [ "host1.com:9042", "host2.com:9042" ]

  • Host names without ports: [ "host1.com", "host2.com" ]

If the host is a DNS name that resolves to multiple A-records, all the corresponding addresses will be used. Do not use localhost as a host name, because it resolves to both IPv4 and IPv6 addresses on some platforms. For hosts specified without a port, set the shared port with the -port option (driver.basic.default-port).

Be sure to enclose address strings that contain special characters in quotes, as shown in these examples:

dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7a"]' -query "SELECT * from foo.bar;"
dsbulk unload -h '["fe80::f861:3eff:fe1d:9d7b","fe80::f861:3eff:fe1d:9d7c"]'
              -query "SELECT * from foo1.bar1;"

The heuristic to determine whether a contact point is in the form host or host:port is not 100% accurate for some IPv6 addresses. Avoid ambiguous IPv6 addresses such as fe80::f861:3eff:fe1d:1234 because such a string could be interpreted as a combination of IP fe80::f861:3eff:fe1d with port 1234, or as IP fe80::f861:3eff:fe1d:1234 without port. In such cases, DataStax Bulk Loader does not change the contact point. To avoid this issue, provide IPv6 addresses in full form. For example, instead of fe80::f861:3eff:fe1d:1234, provide fe80:0:0:0:0:f861:3eff:fe1d:1234, so that the string is parsed as IP fe80:0:0:0:0:f861:3eff:fe1d with port 1234.

On cloud deployments, DataStax Bulk Loader automatically sets this option to an empty list, because contact points are not allowed to be explicitly provided when connecting to DataStax Astra databases.

Default: 127.0.0.1

-port, --driver.basic.default-port, --datastax-java-driver.basic.default-port port_number

The port to use for basic.contact-points, when a host is specified without a port. All nodes in a cluster must accept connections on the same port number.

Default: 9042
