Connector options

Connectors allow different types of data to be loaded and unloaded using dsbulk. The general format for connector options is:

--connector.name.option { string | number }

For example:

dsbulk load -k ks1 -t table1 --connector.json.urlfile "my/local/multiple-input-data-urls.txt"

The available URL protocols depend on which URL stream handlers have been installed. At a minimum, the file protocol is supported for reading and writing, and the HTTP/HTTPS protocols are supported for reading.

The file protocol can be used with all supported file systems, local or remote.

  • When reading: the URL can point to a single file or to an existing directory. For a directory, you can use the --connector.{csv | json}.fileNamePattern option to filter the files to be read, and the --connector.{csv | json}.recursive option to control whether the connector also looks for files in subdirectories (see the example after this list).

  • When writing: the URL is treated as a directory. If it does not exist, DataStax Bulk Loader attempts to create it. If successful, output files are created in this directory, and you can set the file names to be used with the --connector.{csv | json}.fileNameFormat option.

    If the value specified does not have a protocol, it is assumed to be a file protocol. Relative URLs are resolved against the current working directory. Also, for your convenience, if the path begins with a tilde (~), that symbol is expanded to the current user’s home directory.

  • If you have URLs of multiple CSV or JSON data files to load, you can create a file that contains the list of well-formed URLs, and specify the single file with --connector.{csv | json}.urlfile.

  • Another option is to use --connector.{csv | json}.compression (long form --dsbulk.connector.{csv | json}.compression) to load data from, or unload data to, a compressed file.
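
For example, the following command (the keyspace, table, and directory names are placeholders) reads every file matching the **/*.csv glob under the mydir directory, including files in its subdirectories:

dsbulk load -k ks1 -t table1 -url mydir --connector.csv.fileNamePattern '**/*.csv' --connector.csv.recursive true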

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-c, --connector.name, --dsbulk.connector.name { csv | json }

The name of the connector to use.

Supported: dsbulk load and dsbulk unload operations.

Default: csv

Common to both CSV and JSON connectors

--connector.{csv | json}.compression, --dsbulk.connector.{csv | json}.compression string

Specify the compression type to use when loading or unloading data from or to a compressed file. Two examples:

dsbulk unload -k test -t table1 --connector.json.compression gzip -url mydir
dsbulk load -k test -t table1 --connector.json.compression gzip -url mydir

With gzip compression, the default compression extension is appended to the default output file name, so the first file written by dsbulk unload is output-000001.csv.gz with the CSV connector, or output-000001.json.gz with the JSON connector.

Refer to the --connector.{csv | json}.fileNameFormat option for details on dsbulk unload output file naming.

Supported compressed file types for dsbulk load and dsbulk unload operations:

  • bzip2

  • deflate

  • gzip

  • lzma

  • lz4

  • snappy

  • xz

  • zstd

Supported compressed file types for only dsbulk load:

  • brotli

  • deflate64

  • z

DataStax Bulk Loader automatically adjusts the default file pattern while searching for files to load, by appending the default compression file extension (such as .gz for the gzip compression type) to the defined filename pattern. Also, when unloading data, the default file extension for a given compression is automatically appended to the value of the fileNameFormat option, if that value ends with .csv or .json.

Supported: dsbulk load and dsbulk unload operations.

Default: none

--connector.{csv | json}.fileNameFormat, --dsbulk.connector.{csv | json}.fileNameFormat string

The file name format to use when writing during unloading. This option is ignored when reading and for any URL that is not a file. The file name must comply with the formatting rules of String.format(), and must contain a %d format specifier that is used to increment file name counters.
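
For example, assuming the placeholder keyspace ks1, table table1, and output directory mydir, the following command writes files named export-000001.csv, export-000002.csv, and so on:

dsbulk unload -k ks1 -t table1 -url mydir --connector.csv.fileNameFormat 'export-%06d.csv'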

Supported: Only dsbulk unload operations.

Default: output-%06d.{csv | json}

--connector.{csv | json}.fileNamePattern, --dsbulk.connector.{csv | json}.fileNamePattern string

The glob pattern to use when searching for files to read. The syntax to use is the glob syntax, as described in java.nio.file.FileSystem.getPathMatcher(). This option is ignored when writing and for URLs that are not files. Only applicable when the --connector.{csv | json}.url option points to a directory on a known file system; ignored otherwise.
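
For example, with placeholder keyspace, table, and directory names, and a hypothetical part-*.csv naming convention, the following command loads only the matching files found under the mydir directory:

dsbulk load -k ks1 -t table1 -url mydir --connector.csv.fileNamePattern '**/part-*.csv'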

Supported: Only dsbulk load operations.

Default: **/*.{csv | json}

-encoding, --connector.{csv | json}.encoding, --dsbulk.connector.{csv | json}.encoding string

The file encoding to use for all read or written files.

Supported: dsbulk load and dsbulk unload operations.

Default: UTF-8

-maxConcurrentFiles, --connector.{csv | json}.maxConcurrentFiles, --dsbulk.connector.{csv | json}.maxConcurrentFiles string

The maximum number of files that can be read or written simultaneously. Applies to unload operations and (starting in 1.6.0) to load operations. The special syntax NC can be used to specify a number of threads as a multiple N of the number of available cores. For example, on a machine with 8 cores, 0.5C = 0.5 * 8 = 4 threads. With the default of AUTO, the connector estimates an optimal number of files.

When loading, it may be beneficial to lower the number of files being read in parallel if the disk is slow, especially SAN disks. Excessive disk IO may be worse than reading input data one file at a time. Setting maxConcurrentFiles to 1 achieves that effect. If diagnostic tools like iostat show too much time spent on disk IO, consider adjusting maxConcurrentFiles to a lower value or use AUTO (default). Very large rows (more than 10KB) may benefit from a lower maxConcurrentFiles value.
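
For example, to read input files one at a time on a slow disk (the keyspace, table, and directory names are placeholders):

dsbulk load -k ks1 -t table1 -url mydir -maxConcurrentFiles 1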

Default: AUTO

-maxRecords, --connector.{csv | json}.maxRecords, --dsbulk.connector.{csv | json}.maxRecords number

The maximum number of records to read from or write to each file. When reading, all records past this number are discarded. When writing, a file contains at most this number of records; if more records remain to be written, a new file is created using --connector.{csv | json}.fileNameFormat.

When writing to anything other than a directory, this option is ignored.

For CSV, --connector.csv.maxRecords takes into account --connector.csv.header. If a file begins with a header line, that line is not counted as a record. The default value of -1 disables this limit.
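
For example, the following command (placeholder names) splits the unloaded data into files of at most one million records each:

dsbulk unload -k ks1 -t table1 -url mydir -maxRecords 1000000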

Supported: dsbulk load and dsbulk unload operations.

Default: -1

--connector.{csv | json}.recursive, --dsbulk.connector.{csv | json}.recursive { true | false }

Enable or disable scanning for files in the root’s subdirectories. Only applicable when url is set to a directory on a known filesystem.

Supported: Only dsbulk load operations.

Default: false

-skipRecords, --connector.{csv | json}.skipRecords, --dsbulk.connector.{csv | json}.skipRecords number

The number of records to skip at the beginning of each input file before parsing begins. Note that if the file contains a header line (for CSV), that line is not counted as a record.
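
For example, the following command (placeholder names) ignores the first 100 records of each input file:

dsbulk load -k ks1 -t table1 -url mydir -skipRecords 100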

Supported: Only dsbulk load operations.

Default: 0

-url, --connector.{csv | json}.url, --dsbulk.connector.{csv | json}.url string

The URL or path of the resource(s) to read from or write to. Possible values are - (stdin when loading, stdout when unloading) and file URLs (file://filepath). File URLs can also be expressed as simple paths without the file:// prefix. A directory of files can also be specified.
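
Because the default value is -, data can be piped through standard input and output. For example, with placeholder keyspace and table names and a hypothetical export.csv file:

cat export.csv | dsbulk load -k ks1 -t table1
dsbulk unload -k ks1 -t table1 > export.csv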

Supported: dsbulk load and dsbulk unload operations.

Default: -

--connector.{csv | json}.urlfile, --dsbulk.connector.{csv | json}.urlfile string

The URL or path of the local file that contains a list of CSV or JSON data files from which to read during dsbulk load operations.

This connector.{csv | json}.urlfile option and the connector.{csv | json}.url option are mutually exclusive. If both are defined and not empty, the connector.{csv | json}.urlfile option takes precedence.

In the file with URLs:

  • Encode in UTF-8.

  • Each line should contain one path or one well-formed URL to load.

  • You do not need to escape characters in the path.

  • The format rules documented in this topic, including rules for fileNamePattern, recursive, and fileNameFormat, also apply to connector.{csv | json}.urlfile.

  • You can use the # character to comment out a line.

  • DataStax Bulk Loader removes any leading or trailing white space from each line.

Supported: Only dsbulk load operations.

Do not use connector.{csv | json}.urlfile with dsbulk unload; doing so results in a fatal error.
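
For example, a hypothetical urls.txt file (the file contents, keyspace, and table names are placeholders) might contain:

# example list of inputs to load
my/local/export1.csv
https://example.com/export2.csv

Load all the listed files with:

dsbulk load -k ks1 -t table1 --connector.csv.urlfile urls.txt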

Default: unspecified.

CSV connector options

-comment, --connector.csv.comment, --dsbulk.connector.csv.comment string

The character that represents a line comment when found at the beginning of a line of text. Only one character can be specified.
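
For example, the following command (placeholder names) skips any input line that begins with the # character:

dsbulk load -k ks1 -t table1 -url mydir -comment '#'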

Supported: dsbulk load and dsbulk unload operations.

Default: the null character value "\u0000", which means comments are disabled

-delim, --connector.csv.delimiter, --dsbulk.connector.csv.delimiter string

The character to use as field delimiter.
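
For example, to load a pipe-delimited file (placeholder names):

dsbulk load -k ks1 -t table1 -url mydir -delim '|'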

Supported: dsbulk load and dsbulk unload operations.

Default: , (a comma)

--connector.csv.emptyValue, --dsbulk.connector.csv.emptyValue string

Sets the String representation of an empty value. When reading, if the parser does not read any character from the input, and the input is within quotes, this value is used instead. When writing, if the writer has an empty string to write to the output, this value is used instead. The default value is AUTO, which means that, when reading, the parser emits an empty string, and when writing, the writer writes a quoted empty field to the output.

See also connector.csv.nullValue. When reading from CSV files, here is how a line such as a,,"" is parsed given the following scenarios:

nullValue = AUTO
emptyValue = AUTO
a,,"" => ["a", null, ""]

nullValue = NULL
emptyValue = EMPTY
a,,"" => ["a", "NULL", "EMPTY"]

nullValue = FOO
emptyValue = BAR
a,,"" => ["a", "FOO", "BAR"]

Supported: dsbulk load and dsbulk unload operations.

Default: AUTO (empty string)

-escape, --connector.csv.escape, --dsbulk.connector.csv.escape string

The character used for escaping quotes inside an already quoted value. Only one character can be specified. Note that this option applies to all files to be read or written.

Supported: dsbulk load and dsbulk unload operations.

Default: \

-header, --connector.csv.header, --dsbulk.connector.csv.header { true | false }

Whether the files to read or write begin with a header line. If enabled for loading, the first non-empty line in every file assigns field names to each record column, in lieu of schema.mapping (for example, fieldA = col1, fieldB = col2, fieldC = col3). If disabled for loading, records contain field indexes instead of field names (0 = col1, 1 = col2, 2 = col3). For unloading, if this option is enabled, each file begins with a header line; if disabled, it does not.
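
For example, the following command (placeholder keyspace, table, directory, and column names) loads headerless files by mapping field indexes to columns:

dsbulk load -k ks1 -t table1 -url mydir -header false --schema.mapping '0 = col1, 1 = col2, 2 = col3'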

Supported: dsbulk load and dsbulk unload operations.

Default: true

-newline, --connector.csv.newline, --dsbulk.connector.csv.newline { auto | string }

The string of one or two characters that represents a line ending, or the default value auto. With the default value, the system’s line separator (as determined by System.lineSeparator() in Java) is used when writing, and auto-detection of line endings is enabled when reading. Typical line separator characters need to be escaped. For example, the common line ending on Microsoft Windows is a carriage return followed by a newline, or \r\n.
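
For example, to force Windows-style line endings when unloading (placeholder names):

dsbulk unload -k ks1 -t table1 -url mydir -newline '\r\n'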

Supported: dsbulk load and dsbulk unload operations.

Default: auto

--connector.csv.nullValue, --dsbulk.connector.csv.nullValue string

Sets the String representation of a null value. When reading, if the parser does not read any character from the input, this value is used instead. When writing, if the writer has a null object to write to the output, this value is used instead. The default value is AUTO, which means that, when reading, the parser emits a null, and when writing, the writer does not write any character at all to the output.

See also connector.csv.emptyValue. When reading from CSV files, here is how a line such as a,,"" is parsed given the following scenarios:

nullValue = AUTO
emptyValue = AUTO
a,,"" => ["a", null, ""]

nullValue = NULL
emptyValue = EMPTY
a,,"" => ["a", "NULL", "EMPTY"]

nullValue = FOO
emptyValue = BAR
a,,"" => ["a", "FOO", "BAR"]

Supported: dsbulk load and dsbulk unload operations.

Default: AUTO

--connector.csv.normalizeLineEndingsInQuotes, --dsbulk.connector.csv.normalizeLineEndingsInQuotes boolean

Defines whether the system’s line separator (as determined by System.lineSeparator() in Java) is replaced by a normalized line separator \n in quoted values.

On Microsoft Windows, the detection mechanism for line endings may not function properly when this option is false, due to a defect in the CSV parsing library. If problems arise, set this value to true.

Supported: dsbulk load and dsbulk unload operations.

Default: false

--connector.csv.ignoreLeadingWhitespaces, --dsbulk.connector.csv.ignoreLeadingWhitespaces boolean

Defines whether to skip leading whitespaces from values being read or written in files. Used for both loading and unloading.

Supported: dsbulk load and dsbulk unload operations.

Default: false

--connector.csv.ignoreLeadingWhitespacesInQuotes, --dsbulk.connector.csv.ignoreLeadingWhitespacesInQuotes boolean

Defines whether to skip leading whitespaces in quoted values being read.

Supported: Only dsbulk load operations.

Default: false

--connector.csv.ignoreTrailingWhitespaces, --dsbulk.connector.csv.ignoreTrailingWhitespaces boolean

Defines whether to skip trailing whitespaces from values being read or written in files.

Supported: dsbulk load and dsbulk unload operations.

Default: false

--connector.csv.ignoreTrailingWhitespacesInQuotes, --dsbulk.connector.csv.ignoreTrailingWhitespacesInQuotes boolean

Defines whether to skip trailing whitespaces in quoted values being read.

Supported: Only dsbulk load operations.

Default: false

--connector.csv.maxCharsPerColumn, --dsbulk.connector.csv.maxCharsPerColumn number

The maximum number of characters that a field can contain. This option is used to size internal buffers and to avoid out-of-memory problems. If set to -1, internal buffers are resized dynamically. While convenient, this can lead to memory problems. It can also hurt throughput, if some large fields require constant resizing; if this is the case, set this value to a fixed positive number that is big enough to contain all field values.
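
For example, if some fields are known to hold up to 64 KB of text, a fixed buffer size avoids repeated resizing (placeholder names; 65536 is an illustrative value):

dsbulk load -k ks1 -t table1 -url mydir --connector.csv.maxCharsPerColumn 65536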

Supported: dsbulk load and dsbulk unload operations.

Default: 4096

--connector.csv.maxColumns, --dsbulk.connector.csv.maxColumns number

The maximum number of columns that a record can contain. This option is used to size internal buffers and to avoid out-of-memory (OOM) problems.

Supported: dsbulk load and dsbulk unload operations.

Default: 512

--connector.csv.quote, --dsbulk.connector.csv.quote character

The character used for quoting fields when the field delimiter is part of the field value. Only one character can be specified. Note that this option applies to all files to be read or written.

Supported: dsbulk load and dsbulk unload operations.

Default: " (a double quote)

JSON connector options

--connector.json.mode, --dsbulk.connector.json.mode { SINGLE_DOCUMENT | MULTI_DOCUMENT }

The mode for loading and unloading JSON documents. Valid values are:

  • MULTI_DOCUMENT: Each resource may contain an arbitrary number of successive JSON documents to be mapped to records. For example, each resource contains standalone documents, one after another: {doc1} {doc2} {doc3}. The root directory for the JSON documents can be specified with url, and the documents can be read from subdirectories by setting connector.json.recursive to true.

  • SINGLE_DOCUMENT: Each resource contains a root array whose elements are JSON documents to be mapped to records. For example, the format of the JSON document is an array with embedded JSON documents: [ {doc1}, {doc2}, {doc3} ].
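
For example, the following command (placeholder names) unloads a table as JSON files that each contain a single root array of documents:

dsbulk unload -k ks1 -t table1 -c json -url mydir --connector.json.mode SINGLE_DOCUMENT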

Supported: dsbulk load and dsbulk unload operations.

Default: MULTI_DOCUMENT

--connector.json.parserFeatures, --dsbulk.connector.json.parserFeatures map

JSON parser features to enable. Valid values are all the enum constants defined in com.fasterxml.jackson.core.JsonParser.Feature. For example, a value of { ALLOW_COMMENTS : true, ALLOW_SINGLE_QUOTES : true } configures the parser to allow the use of comments and single-quoted strings in JSON data.
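
For example, the following command (placeholder names) enables those two parser features when loading JSON data:

dsbulk load -k ks1 -t table1 -c json -url mydir --connector.json.parserFeatures '{ ALLOW_COMMENTS : true, ALLOW_SINGLE_QUOTES : true }'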

Supported: Only dsbulk load operations.

Default: { }

--connector.json.generatorFeatures, --dsbulk.connector.json.generatorFeatures map

JSON generator features to enable. Valid values are all the enum constants defined in com.fasterxml.jackson.core.JsonGenerator.Feature. For example, a value of { ESCAPE_NON_ASCII : true, QUOTE_FIELD_NAMES : true } configures the generator to escape all characters beyond 7-bit ASCII and quote field names when writing JSON output.

Supported: Only dsbulk unload operations.

Default: { }

--connector.json.serializationFeatures, --dsbulk.connector.json.serializationFeatures map

A map of JSON serialization features to set. Map keys should be enum constants defined in com.fasterxml.jackson.databind.SerializationFeature.

Supported: Only dsbulk unload operations.

Default: { }

--connector.json.deserializationFeatures, --dsbulk.connector.json.deserializationFeatures map

A map of JSON deserialization features to set. Map keys should be enum constants defined in com.fasterxml.jackson.databind.DeserializationFeature. The default value is the only way to guarantee that floating point numbers do not have their precision truncated when parsed, but can result in slightly slower parsing.

Supported: Only dsbulk load operations.

Default: { USE_BIG_DECIMAL_FOR_FLOATS : true }

--connector.json.serializationStrategy, --dsbulk.connector.json.serializationStrategy string

The strategy for filtering out entries when formatting output. Valid values are enum constants defined in com.fasterxml.jackson.annotation.JsonInclude.Include.

The CUSTOM strategy is not supported.
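
For example, the following command (placeholder names) omits null fields from the unloaded JSON documents:

dsbulk unload -k ks1 -t table1 -c json -url mydir --connector.json.serializationStrategy NON_NULL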

Supported: Only dsbulk unload operations.

Default: ALWAYS

--connector.json.prettyPrint, --dsbulk.connector.json.prettyPrint { true | false }

Enable or disable pretty printing. When enabled, JSON records are written with indents.

Using this option results in much bigger records.

Supported: Only dsbulk unload operations.

Default: false
