Connector options

Connector options for the dsbulk command

Connectors allow different types of data to be loaded and unloaded using dsbulk. The general format for connector options is:
--connector.name.option ( string | number )
An example is --connector.csv.url.

Which URL protocols are available depends on which URL stream handlers have been installed. At a minimum, the file protocol is guaranteed to be supported for reading and writing, and the HTTP/HTTPS protocols are guaranteed to be supported for reading.

The file protocol can be used with all supported file systems, local or not.
  • When reading: the URL can point to a single file, or to an existing directory; in case of a directory, the fileNamePattern setting can be used to filter files to read, and the recursive setting can be used to control whether or not the connector should look for files in subdirectories as well.
  • When writing: the URL will be treated as a directory; if it doesn't exist, the loader will attempt to create it; CSV files will be created inside this directory, and their names can be controlled with the fileNameFormat setting.
Note that if the value specified for url does not have a protocol, the file protocol is assumed. Relative URLs are resolved against the current working directory. Also, for convenience, if the path begins with a tilde (~), that symbol is expanded to the current user's home directory.
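The resolution rules above can be sketched in Java as follows. This is only an illustration of the documented behavior, not dsbulk's actual code: the class name, the resolve helper, and the example.com URL are hypothetical, and the special value - (stdin/stdout) is not handled here.

```java
import java.net.URI;
import java.nio.file.Paths;

class UrlResolutionSketch {

    // Hypothetical helper that mirrors the rules described above; not dsbulk's code.
    static URI resolve(String value) {
        // A value that names a protocol (file:, http:, https:, ...) is taken as a URL as-is.
        // Requiring two or more characters before ':' keeps Windows drive letters out.
        if (value.matches("[a-zA-Z][a-zA-Z0-9+.-]+:.*")) {
            return URI.create(value);
        }
        // A leading tilde expands to the current user's home directory.
        String path = value;
        if (path.equals("~") || path.startsWith("~/")) {
            path = System.getProperty("user.home") + path.substring(1);
        }
        // No protocol: treat the value as a file path, resolved against the
        // current working directory, and expose it as a file URL.
        return Paths.get(path).toAbsolutePath().normalize().toUri();
    }

    public static void main(String[] args) {
        String[] values = {"https://example.com/data.csv", "~/exports", "data/in.csv"};
        for (String value : values) {
            System.out.println(value + " -> " + resolve(value));
        }
    }
}
```

The printed file URLs depend on the home directory and working directory of the machine the sketch runs on.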

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-c,--connector.name csv | json

The name of the connector to use.

Default: csv

Common to both CSV and JSON connectors

--connector.(csv|json).fileNameFormat string

The file name format to use when writing during unloading. This setting is ignored when reading and for non-file URLs. The file name must comply with the formatting rules of String.format(), and must contain a %d format specifier that will be used to increment file name counters.

Default: output-%0,6d.csv (CSV); output-%0,6d.json (JSON)
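To see what the default format produces, the following Java sketch applies String.format() to a few counter values. The class and method names are hypothetical, and Locale.US is an assumption made here so that the grouping separator is predictable; how dsbulk itself selects a locale is not stated in this section.

```java
import java.util.Locale;

class FileNameFormatSketch {

    // Applies a fileNameFormat-style pattern to a file counter.
    // Locale.US is an assumption of this sketch, not a documented dsbulk choice.
    static String fileName(String format, int counter) {
        return String.format(Locale.US, format, counter);
    }

    public static void main(String[] args) {
        String format = "output-%0,6d.csv"; // default CSV format from this section
        System.out.println(fileName(format, 1));    // output-000001.csv
        System.out.println(fileName(format, 42));   // output-000042.csv
        System.out.println(fileName(format, 1234)); // output-01,234.csv
    }
}
```

The 0 flag pads the counter with zeros to a width of 6, and the , flag inserts a grouping separator once the counter reaches 1,000.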

--connector.(csv|json).fileNamePattern string

The glob pattern to use when searching for files to read. The syntax is the glob syntax described in java.nio.file.FileSystem.getPathMatcher(). This setting applies only when the url setting points to a directory on a known filesystem; it is ignored when writing and for non-file URLs.

Default: **/*.csv (CSV); **/*.json (JSON)
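A pattern can be checked against sample paths with the same Java API before running a load. The sketch below is a minimal illustration; the class name, the matches helper, and the sample paths are hypothetical, and it does not show how dsbulk applies the pattern internally (for example, whether it matches absolute or relative paths).

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class FileNamePatternSketch {

    // Tests a path against a glob using java.nio.file.FileSystem.getPathMatcher().
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        String glob = "**/*.csv"; // default CSV pattern from this section
        // ** crosses directory boundaries, * stays within one path element.
        System.out.println(matches(glob, "data/2024/orders.csv")); // true
        System.out.println(matches(glob, "data/orders.json"));     // false
    }
}
```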

-encoding,--connector.(csv|json).encoding string

The file encoding to use for all read or written files.

Default: UTF-8

-maxConcurrentFiles,--connector.(csv|json).maxConcurrentFiles string

The maximum number of files that can be written simultaneously. This setting is ignored when loading and when the output URL is anything other than a directory on a filesystem. The special syntax NC can be used to specify a number of threads that is a multiple of the number of available cores, e.g. if the number of cores is 8, then 0.5C = 0.5 * 8 = 4 threads. Used for unloading only.

Default: 0.25C
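The arithmetic behind the NC syntax can be sketched as below. The parser is hypothetical and written only to illustrate the example above. In particular, truncating fractional results and never returning fewer than one thread are assumptions of this sketch, not documented behavior.

```java
class ThreadCountSketch {

    // Hypothetical parser for values such as "4" or "0.5C".
    static int resolve(String value, int availableCores) {
        String v = value.trim();
        if (v.endsWith("C")) {
            // "NC" means N times the number of available cores.
            double multiplier = Double.parseDouble(v.substring(0, v.length() - 1));
            // Assumption: truncate, and never go below one thread.
            return Math.max(1, (int) (multiplier * availableCores));
        }
        // A plain number is used as-is.
        return Integer.parseInt(v);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("0.25C on this machine = " + resolve("0.25C", cores));
        System.out.println("0.5C with 8 cores = " + resolve("0.5C", 8)); // 4
    }
}
```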

-maxRecords,--connector.(csv|json).maxRecords number

The maximum number of records to read from or write to each file. When reading, all records past this number will be discarded. When writing, a file will contain at most this number of records; if more records remain to be written, a new file will be created using the fileNameFormat setting. Note that when writing to anything other than a directory, this setting is ignored. For CSV, this setting takes into account the header setting: if a file begins with a header line, that line is not counted as a record. This feature is disabled by default (indicated by its -1 value).

Default: -1

--connector.(csv|json).recursive ( true | false )

Enable or disable scanning for files in the root's subdirectories. Only applicable when url is set to a directory on a known filesystem. Used for loading only.

Default: false

-skipRecords,--connector.(csv|json).skipRecords number

The number of records to skip from each input file before the parser can begin to execute. Note that if the file contains a header line (for CSV), that line is not counted as a valid record. Used for loading only.

Default: 0
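The following Java sketch illustrates how skipRecords and maxRecords might combine when reading a CSV file that begins with a header line. It is an illustration only: the class and method names are hypothetical, lines are treated as whole records without real CSV parsing, and applying the skip before the limit is an assumption of this sketch.

```java
import java.util.List;
import java.util.stream.Collectors;

class RecordWindowSketch {

    // Returns the records that would be kept from one input file.
    static List<String> select(List<String> lines, boolean header,
                               long skipRecords, long maxRecords) {
        return lines.stream()
                .skip(header ? 1 : 0)   // the header line is not counted as a record
                .skip(skipRecords)      // drop the first skipRecords records
                .limit(maxRecords < 0 ? Long.MAX_VALUE : maxRecords) // -1 disables the limit
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("id,name", "1,a", "2,b", "3,c", "4,d");
        System.out.println(select(lines, true, 1, 2));  // [2,b, 3,c]
        System.out.println(select(lines, true, 0, -1)); // [1,a, 2,b, 3,c, 4,d]
    }
}
```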

-url,--connector.(csv|json).url string
The URL or path of the resource(s) to read from or write to. Possible options are - (representing stdin for reading and stdout for writing) and file (filepath). File URLs can also be expressed as simple paths without the file prefix. A directory of files can also be specified.

Default: -

CSV connector options

-comment,--connector.csv.comment string

The character that represents a line comment when found in the beginning of a line of text. Only one character can be specified. Note that this setting applies to all files to be read or written.

Default: "\u0000" (the null character, which disables comments)

-delim,--connector.csv.delimiter string

The character to use as field delimiter.

Default: , (a comma)

--connector.csv.emptyValue string

Sets the string representation of an empty value. When loading, if the parser does not read any character from the quoted input, the value of this setting is used. Used only for loading.

Default: "" (empty string)

-escape,--connector.csv.escape string

The character used for escaping quotes inside an already quoted value. Only one character can be specified. Note that this setting applies to all files to be read or written.

Default: \

-header,--connector.csv.header ( true | false )
Enable or disable whether the files to read or write begin with a header line. If enabled for loading, the first non-empty line in every file assigns field names to each record column, in lieu of schema.mapping: fieldA = col1, fieldB = col2, fieldC = col3. If disabled for loading, records will not contain field names, only field indexes: 0 = col1, 1 = col2, 2 = col3. For unloading, each file begins with a header line if this setting is enabled, and contains no header line if it is disabled.
Note: This option will apply to all files loaded or unloaded.

Default: true

-newline,--connector.csv.newline ( auto | string )

The string of one or two characters that represents a line ending, or the default setting of auto. If set to the default value, the system's line separator (in Java, System.lineSeparator() determines the string) is used for writing, and auto-detection of line endings is enabled for reading. Typical line separator characters need to be escaped. For example, the common line ending on Microsoft Windows is a carriage return followed by a newline, or \r\n.

Default: auto

--connector.csv.nullValue string

Sets the string representation of a null value. When loading, the value of this setting is used if the parser does not read any character from the input. When unloading, it is used if the writer has a null object to write to the output. The default value will emit a null when loading, and will not write any character to the output when unloading.

Default: null

--connector.csv.normalizeLineEndingsInQuotes boolean
Defines whether the system's line separator (in Java, System.lineSeparator() determines the strings) is replaced by a normalized line separator \n in quoted values. Used for both loading and unloading.
Important: On Microsoft Windows, the detection mechanism for line endings may not function properly when this setting is false, due to a bug in the CSV parsing library. If problems arise, set this to true.

Default: false

--connector.csv.ignoreLeadingWhitespaces boolean

Defines whether to skip leading whitespaces from values being read or written in files. Used for both loading and unloading.

Default: false

--connector.csv.ignoreLeadingWhitespacesInQuotes boolean

Defines whether to skip leading whitespaces in quoted values being read. Used for loading, ignored for unloading.

Default: false

--connector.csv.ignoreTrailingWhitespaces boolean

Defines whether to skip trailing whitespaces from values being read or written in files. Used for both loading and unloading.

Default: false

--connector.csv.ignoreTrailingWhitespacesInQuotes boolean

Defines whether to skip trailing whitespaces in quoted values being read. Used for loading, ignored for unloading.

Default: false

--connector.csv.maxCharsPerColumn number

The maximum number of characters that a field can contain. This setting is used to size internal buffers and to avoid out-of-memory problems. If set to -1, internal buffers will be resized dynamically. While convenient, this can lead to memory problems. It could also hurt throughput, if some large fields require constant resizing; if this is the case, set this value to a fixed positive number that is big enough to contain all field values.

Default: 4096

--connector.csv.maxColumns number

The maximum number of columns that a record can contain. This setting is used to size internal buffers and to avoid out-of-memory (OOM) problems.

Default: 512

--connector.csv.quote character

The character used for quoting fields when the field delimiter is part of the field value. Only one character can be specified. Note that this setting applies to all files to be read or written.

Default: " (a double quote)

JSON connector options

--connector.json.mode ( SINGLE_DOCUMENT | MULTI_DOCUMENT )
The mode for loading and unloading JSON documents. Valid values are:
  • MULTI_DOCUMENT: Each resource may contain an arbitrary number of successive JSON documents to be mapped to records. For example, each JSON document in the resource stands alone: {doc1}. The root directory for the JSON documents can be specified with url, and the documents can be read recursively by setting connector.json.recursive to true.
  • SINGLE_DOCUMENT: Each resource contains a root array whose elements are JSON documents to be mapped to records. For example, the format of the JSON document is an array with embedded JSON documents: [ {doc1}, {doc2}, {doc3} ].

Default: MULTI_DOCUMENT

--connector.json.parserFeatures map

JSON parser features to enable. Valid values are all the enum constants defined in com.fasterxml.jackson.core.JsonParser.Feature. For example, a value of { ALLOW_COMMENTS : true, ALLOW_SINGLE_QUOTES : true } will configure the parser to allow the use of comments and single-quoted strings in JSON data. Used for loading only.

Default: { }

--connector.json.generatorFeatures map

JSON generator features to enable. Valid values are all the enum constants defined in com.fasterxml.jackson.core.JsonGenerator.Feature. For example, a value of { ESCAPE_NON_ASCII : true, QUOTE_FIELD_NAMES : true } will configure the generator to escape all characters beyond 7-bit ASCII and quote field names when writing JSON output. Used for unloading only.

Default: { }

--connector.json.serializationFeatures map

A map of JSON serialization features to set. Map keys should be enum constants defined in com.fasterxml.jackson.databind.SerializationFeature. Used for unloading only.

Default: { }

--connector.json.deserializationFeatures map

A map of JSON deserialization features to set. Map keys should be enum constants defined in com.fasterxml.jackson.databind.DeserializationFeature. The default value is the only way to guarantee that floating point numbers will not have their precision truncated when parsed, but can result in slightly slower parsing. Used for loading only.

Default: { USE_BIG_DECIMAL_FOR_FLOATS : true }

--connector.json.serializationStrategy string

The strategy for filtering out entries when formatting output. Valid values are the enum constants defined in com.fasterxml.jackson.annotation.JsonInclude.Include (but beware that the CUSTOM strategy cannot be honored). Used for unloading only.

Default: ALWAYS

--connector.json.prettyPrint ( true | false )
Enable or disable pretty printing. When enabled, JSON records are written with indents. Used for unloading only.
Note: Can result in much bigger records.

Default: false