Connector options

Connector options for the dsbulk command

Connectors allow different types of data to be loaded and unloaded using dsbulk. The general format for connector options is:
--connector.{csv | json}.option { string | number }
For example:
dsbulk load -k ks1 -t table1 --connector.json.urlfile "my/local/multiple-input-data-urls.txt"

The available URL protocols depend on which URL stream handlers have been installed. At a minimum, the file protocol is supported for reading and writing, and the HTTP/HTTPS protocols are supported for reading.

The file protocol can be used with all supported file systems, local or remote.
  • When reading: the URL can point to a single file or to an existing directory. In the case of a directory, you can specify the --connector.{csv | json}.fileNamePattern option to filter the files to be read, and the --connector.{csv | json}.recursive option to control whether the connector should also look for files in subdirectories (see the example after this list).
  • When writing: the URL is treated as a directory. If it doesn't exist, DataStax Bulk Loader attempts to create it. If successful, output files are created in this directory, and you can set the file names to be used with the --connector.{csv | json}.fileNameFormat option.
    Note: If the value specified does not have a protocol, it is assumed to be a file protocol. Relative URLs will be resolved against the current working directory. Also, for your convenience, if the path begins with a tilde (~), that symbol will be expanded to the current user's home directory.
  • If you have URLs of multiple CSV or JSON data files to load, you can create a file that contains the list of well-formed URLs, and specify the single file with --connector.{csv | json}.urlfile.
  • Another option is to use --connector.{csv | json}.compressed to load or unload data from/to a compressed file.
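
For example, the following command (with illustrative keyspace, table, and directory names) reads every file matching a glob pattern under mydir and its subdirectories:
dsbulk load -k ks1 -t table1 -url mydir --connector.csv.fileNamePattern "**/*.csv" --connector.csv.recursive true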

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-c,--connector.name { csv | json }

The name of the connector to use.

Supported: dsbulk load and dsbulk unload operations.

Default: csv
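
For example, the short form -c selects the JSON connector (keyspace and table names are illustrative):
dsbulk load -c json -k ks1 -t table1 -url mydir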

Common to both CSV and JSON connectors

--connector.{csv | json}.compression string
Specify the compression type to use when loading or unloading data from/to a compressed file. Two examples:
dsbulk unload -k test -t table1 --connector.json.compression gzip -url mydir
dsbulk load -k test -t table1 --connector.json.compression gzip -url mydir
The default output from the dsbulk unload command, with compression and the first counter, is output-000001.csv.gz. Refer to --connector.{csv | json}.fileNameFormat string for details on dsbulk unload output file naming.
Supported compressed file types for dsbulk load and dsbulk unload operations:
  • bzip2
  • deflate
  • gzip
  • lzma
  • lz4
  • snappy
  • xz
  • zstd
Supported compressed file types for dsbulk load operations only:
  • brotli
  • deflate64
  • z
Note: DataStax Bulk Loader automatically adjusts the default file pattern while searching for the file to load, by appending the default compression file extension (such as .gz, for a gzip compression type) to the defined filename pattern. Also, when unloading data, the default file extension for a given compression is automatically appended to the value of the fileNameFormat option, if that value ends with .csv or .json.

Supported: dsbulk load and dsbulk unload operations.

Default: none

--connector.{csv | json}.fileNameFormat string

The file name format to use when writing during unloading. This option is ignored when reading and for any URL that is not a file. The file name must comply with the formatting rules of String.format(), and must contain a %d format specifier that will be used to increment file name counters.

Supported: Only dsbulk unload operations.

Default: output-%06d.{csv | json}
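
For example, to produce files named export-000001.json, export-000002.json, and so on (the export- prefix is illustrative):
dsbulk unload -c json -k ks1 -t table1 -url mydir --connector.json.fileNameFormat "export-%06d.json"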

--connector.{csv | json}.fileNamePattern string

The glob pattern to use when searching for files to read. The syntax to use is the glob syntax, as described in java.nio.file.FileSystem.getPathMatcher(). This option is ignored when writing and for URLs that are not files. Only applicable when the --connector.{csv | json}.url option points to a directory on a known file system; ignored otherwise.

Supported: Only dsbulk load operations.

Default: **/*.{csv | json}
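
For example, to read only files whose names match a given glob (the pattern shown is illustrative):
dsbulk load -k ks1 -t table1 -url mydir --connector.csv.fileNamePattern "**/orders-*.csv"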

-encoding,--connector.{csv | json}.encoding string

The file encoding to use for all read or written files.

Supported: dsbulk load and dsbulk unload operations.

Default: UTF-8

-maxConcurrentFiles,--connector.{csv | json}.maxConcurrentFiles string

The maximum number of files that can be written simultaneously. This option is ignored when loading and when the output URL is anything other than a directory on a filesystem. The special syntax NC can be used, where N is a multiplier of the number of available cores; for example, if the number of cores is 8, then 0.5C = 0.5 * 8 = 4 threads.

Supported: Only dsbulk unload operations.

Default: 0.25C
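
For example, on an 8-core machine, either form below limits the unload to four simultaneous output files:
dsbulk unload -k ks1 -t table1 -url mydir -maxConcurrentFiles 4
dsbulk unload -k ks1 -t table1 -url mydir -maxConcurrentFiles 0.5C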

-maxRecords,--connector.{csv | json}.maxRecords number

The maximum number of records to read from or write to each file. When reading, all records past this number will be discarded. When writing, a file will contain at most this number of records; if more records remain to be written, a new file will be created using --connector.{csv | json}.fileNameFormat. Note that when writing to anything other than a directory, this option is ignored. For CSV, --connector.csv.maxRecords takes into account --connector.csv.header. If a file begins with a header line, that line is not counted as a record. This feature is disabled by default.

Supported: dsbulk load and dsbulk unload operations.

Default: -1
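
For example, to split an unload into files of at most 100000 records each (the record count is illustrative):
dsbulk unload -k ks1 -t table1 -url mydir -maxRecords 100000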

--connector.{csv | json}.recursive { true | false }

Enable or disable scanning for files in the root's subdirectories. Only applicable when url is set to a directory on a known filesystem.

Supported: Only dsbulk load operations.

Default: false
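
For example, to also scan subdirectories of the input directory when loading JSON files:
dsbulk load -c json -k ks1 -t table1 -url mydir --connector.json.recursive true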

-skipRecords,--connector.{csv | json}.skipRecords number

The number of records to skip in each input file before parsing begins. Note that if the file contains a header line (for CSV), that line is not counted as a record.

Supported: Only dsbulk load operations.

Default: 0
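
For example, to skip the first three records of each input file (the count is illustrative):
dsbulk load -k ks1 -t table1 -url mydir -skipRecords 3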

-url,--connector.{csv | json}.url string
The URL or path of the resource(s) to read from or write to. Possible values are - (representing stdin for reading and stdout for writing) and file URLs. File URLs can also be expressed as simple paths without the file protocol prefix. A directory of files can also be specified.

Supported: dsbulk load and dsbulk unload operations.

Default: -
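
For example, because the default URL is -, you can pipe data through stdin and stdout with shell redirection (file names are illustrative):
dsbulk load -k ks1 -t table1 < export.csv
dsbulk unload -k ks1 -t table1 > export.csv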

--connector.{csv | json}.urlfile string
The URL or path of the local file that contains a list of CSV or JSON data files from which to read during dsbulk load operations.
This connector.{csv | json}.urlfile option and the connector.{csv | json}.url option are mutually exclusive. If both are defined and not empty, the connector.{csv | json}.urlfile option takes precedence.
In the file with URLs:
  • Encode in UTF-8.
  • Each line should contain one well-formed URL or path to load.
  • You do not need to escape characters in the path.
  • The format rules documented in this topic, including rules for fileNamePattern, recursive, and fileNameFormat, also apply to connector.{csv | json}.urlfile.
  • You can use the # character to comment out a line.
  • DataStax Bulk Loader removes any leading or trailing white space from each line.
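
For example, a URL file might contain lines such as the following (all paths and URLs are illustrative):
# orders exported from the staging environment
/data/exports/orders-2020.csv
~/exports/orders-2021.csv
https://example.com/exports/orders-2022.csv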

Supported: Only dsbulk load operations.

CAUTION: Do not use connector.{csv | json}.urlfile with dsbulk unload; doing so results in a fatal error.

Default: unspecified.

CSV connector options

-comment,--connector.csv.comment string

The character that represents a line comment when found at the beginning of a line of text. Only one character can be specified.

Supported: dsbulk load and dsbulk unload operations.

Default: \u0000 (the null character); that is, the comment feature is disabled by default.
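
For example, to treat input lines beginning with # as comments when loading:
dsbulk load -k ks1 -t table1 -url mydir -comment "#"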

-delim,--connector.csv.delimiter string

The character to use as field delimiter.

Supported: dsbulk load and dsbulk unload operations.

Default: , (a comma)
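
For example, to load tab-separated files, assuming the escaped form \t is interpreted as a tab character:
dsbulk load -k ks1 -t table1 -url mydir -delim "\t"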

--connector.csv.emptyValue string

Sets the string representation of an empty value. When loading, if the parser does not read any character from the quoted input, the value of this option is used.

Supported: Only dsbulk load operations.

Default: "" (empty string)

-escape,--connector.csv.escape string

The character used for escaping quotes inside an already quoted value. Only one character can be specified. Note that this option applies to all files to be read or written.

Supported: dsbulk load and dsbulk unload operations.

Default: \

-header,--connector.csv.header { true | false }

Whether the files to read or write begin with a header line. If enabled for loading, the first non-empty line in every file assigns field names to each record column, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3. If disabled for loading, records contain field indexes instead of field names: 0 = col1, 1 = col2, 2 = col3. For unloading, if this option is enabled, each file begins with a header line; if disabled, it does not.

Supported: dsbulk load and dsbulk unload operations.

Default: true
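
For example, to load files that have no header line, so that fields are referenced by index:
dsbulk load -k ks1 -t table1 -url mydir -header false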

-newline,--connector.csv.newline { auto | string }

The string of one or two characters that represents a line ending, or the default value auto. With auto, the system's line separator (as determined by System.lineSeparator() in Java) is used for writing, and auto-detection of line endings is enabled for reading. Typical line separator characters need to be escaped. For example, the common line ending on Microsoft Windows is a carriage return followed by a newline, or \r\n.

Supported: dsbulk load and dsbulk unload operations.

Default: auto
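
For example, to force Windows-style line endings when unloading, rather than relying on the system default:
dsbulk unload -k ks1 -t table1 -url mydir -newline "\r\n"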

--connector.csv.nullValue string

Sets the string representation of a null value. The value of this option is used when loading, where the parser does not read any character from the input, and when unloading, where the writer must write a null object to the output. With the default value, a null is emitted when loading, and no characters are written to the output when unloading.

Supported: dsbulk load and dsbulk unload operations.

Default: null

--connector.csv.normalizeLineEndingsInQuotes boolean
Defines whether the system's line separator (in Java, System.lineSeparator() determines the strings) is replaced by a normalized line separator \n in quoted values.
Important: On Microsoft Windows, the detection mechanism for line endings may not function properly when this option is false, due to a defect in the CSV parsing library. If problems arise, set this value to true.

Supported: dsbulk load and dsbulk unload operations.

Default: false

--connector.csv.ignoreLeadingWhitespaces boolean

Defines whether to skip leading whitespaces from values being read or written in files.

Supported: dsbulk load and dsbulk unload operations.

Default: false

--connector.csv.ignoreLeadingWhitespacesInQuotes boolean

Defines whether to skip leading whitespaces in quoted values being read.

Supported: Only dsbulk load operations.

Default: false

--connector.csv.ignoreTrailingWhitespaces boolean

Defines whether to skip trailing whitespaces from values being read or written in files.

Supported: dsbulk load and dsbulk unload operations.

Default: false

--connector.csv.ignoreTrailingWhitespacesInQuotes boolean

Defines whether to skip trailing whitespaces in quoted values being read.

Supported: Only dsbulk load operations.

Default: false

--connector.csv.maxCharsPerColumn number

The maximum number of characters that a field can contain. This option is used to size internal buffers and to avoid out-of-memory problems. If set to -1, internal buffers will be resized dynamically. While convenient, this can lead to memory problems. It could also hurt throughput, if some large fields require constant resizing; if this is the case, set this value to a fixed positive number that is big enough to contain all field values.

Supported: dsbulk load and dsbulk unload operations.

Default: 4096

--connector.csv.maxColumns number

The maximum number of columns that a record can contain. This option is used to size internal buffers and to avoid out-of-memory (OOM) problems.

Supported: dsbulk load and dsbulk unload operations.

Default: 512

--connector.csv.quote character

The character used for quoting fields when the field delimiter is part of the field value. Only one character can be specified. Note that this option applies to all files to be read or written.

Supported: dsbulk load and dsbulk unload operations.

Default: " (a double quotation mark)

JSON connector options

--connector.json.mode { SINGLE_DOCUMENT | MULTI_DOCUMENT }
The mode for loading and unloading JSON documents. Valid values are:
  • MULTI_DOCUMENT: Each resource may contain an arbitrary number of successive JSON documents to be mapped to records. For example, the format of each JSON document is a single document: {doc1}. The root directory for the JSON documents can be specified with url, and the documents can be read recursively by setting connector.json.recursive to true.
  • SINGLE_DOCUMENT: Each resource contains a root array whose elements are JSON documents to be mapped to records. For example, the format of the JSON document is an array with embedded JSON documents: [ {doc1}, {doc2}, {doc3} ].

Supported: dsbulk load and dsbulk unload operations.

Default: MULTI_DOCUMENT
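
For example, to load a single file whose root element is an array of JSON documents (the file name is illustrative):
dsbulk load -c json -k ks1 -t table1 --connector.json.mode SINGLE_DOCUMENT -url mydata.json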

--connector.json.parserFeatures map

JSON parser features to enable. Valid values are all the enum constants defined in com.fasterxml.jackson.core.JsonParser.Feature. For example, a value of { ALLOW_COMMENTS : true, ALLOW_SINGLE_QUOTES : true } will configure the parser to allow the use of comments and single-quoted strings in JSON data.

Supported: Only dsbulk load operations.

Default: { }
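
For example, to accept JSON input containing comments and single-quoted strings (the exact quoting required depends on your shell):
dsbulk load -c json -k ks1 -t table1 -url mydir --connector.json.parserFeatures "{ ALLOW_COMMENTS : true, ALLOW_SINGLE_QUOTES : true }"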

--connector.json.generatorFeatures map

JSON generator features to enable. Valid values are all the enum constants defined in com.fasterxml.jackson.core.JsonGenerator.Feature. For example, a value of { ESCAPE_NON_ASCII : true, QUOTE_FIELD_NAMES : true } will configure the generator to escape all characters beyond 7-bit ASCII and quote field names when writing JSON output.

Supported: Only dsbulk unload operations.

Default: { }

--connector.json.serializationFeatures map

A map of JSON serialization features to set. Map keys should be enum constants defined in com.fasterxml.jackson.databind.SerializationFeature.

Supported: Only dsbulk unload operations.

Default: { }

--connector.json.deserializationFeatures map

A map of JSON deserialization features to set. Map keys should be enum constants defined in com.fasterxml.jackson.databind.DeserializationFeature. The default value is the only way to guarantee that floating point numbers will not have their precision truncated when parsed, but can result in slightly slower parsing.

Supported: Only dsbulk load operations.

Default: { USE_BIG_DECIMAL_FOR_FLOATS : true }

--connector.json.serializationStrategy string
The strategy for filtering out entries when formatting output. Valid values are enum constants defined in com.fasterxml.jackson.annotation.JsonInclude.Include.
CAUTION: The CUSTOM strategy is not supported.

Supported: Only dsbulk unload operations.

Default: ALWAYS

--connector.json.prettyPrint { true | false }
Enable or disable pretty printing. When enabled, JSON records are written with indents.
Attention: Using this option results in much bigger records.

Supported: Only dsbulk unload operations.

Default: false
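
For example, to write indented JSON documents during an unload:
dsbulk unload -c json -k ks1 -t table1 -url mydir --connector.json.prettyPrint true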