Codec options
Specify codec options for the dsbulk command, which determine how record fields are parsed for loading or how row cells are formatted for unloading. When counting, these settings are ignored.
The options can be used in short form (`-locale string`) or in long form (`--codec.locale string`).
--codec.binary, --dsbulk.codec.binary string
Strategy to use when converting binary data to strings.
Only applicable when unloading columns of CQL type `blob`, or when unloading columns of a geometry type if `codec.geo` is `WKB`. For the latter, see the `codec.geo` section.
Valid `codec.binary` values are:
- `BASE64`: Encode the binary data as a Base64 string. This is the default strategy.
- `HEX`: Encode the binary data as CQL blob literals. CQL blob literals follow the general syntax `0[xX][0-9a-fA-F]+`, that is, `0x` followed by hexadecimal characters, for example `0xcafebabe`. This format produces lengthier strings than `BASE64`, but is also the only format compatible with `CQLSH`.
Default: BASE64
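The two encodings can be sketched with Python's standard library; this illustrates the string formats themselves, not dsbulk's internals:

```python
import base64

# A 4-byte blob value, such as a column of CQL type blob.
blob = bytes.fromhex("cafebabe")

# BASE64 strategy: compact Base64 string (the default).
b64 = base64.b64encode(blob).decode("ascii")

# HEX strategy: a CQL blob literal, 0x followed by hex digits,
# the only form cqlsh accepts.
hex_literal = "0x" + blob.hex()

print(b64)          # yv66vg==
print(hex_literal)  # 0xcafebabe
```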
--codec.booleanNumbers, --dsbulk.codec.booleanNumbers [ true_value, false_value ]
Specify the numbers to use as representations of true and false.
The representation is of the form `true_value,false_value`.
The mapping is reciprocal: numbers are mapped to booleans when loading, and booleans to numbers when unloading.
Numbers not specified in this setting are rejected.
Default: [1, 0]
--codec.booleanStrings, --dsbulk.codec.booleanStrings [ true_value:false_value, …]
Specify the strings that DataStax Bulk Loader accepts as representations of true and false.
Each representation is of the form `true_value:false_value`, case-insensitive.
For loading, all representations are honored.
For unloading, the first representation is used and all others are ignored.
Ensure that your list of representations is inside quotes as a string. For example:
dsbulk unload -k keyspace1 -t javatime --codec.booleanStrings '["TRUE:FALSE"]'
Default: ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]
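A minimal sketch of how such a list of representations could drive parsing (loading) and formatting (unloading); the variable names are illustrative, not dsbulk's actual implementation:

```python
# The default list of true_value:false_value representations.
representations = ["1:0", "Y:N", "T:F", "YES:NO", "TRUE:FALSE"]

# Loading: every representation is honored, case-insensitively.
parse_map = {}
for rep in representations:
    true_value, false_value = rep.split(":")
    parse_map[true_value.lower()] = True
    parse_map[false_value.lower()] = False

# Unloading: only the first representation is used.
first_true, first_false = representations[0].split(":")

print(parse_map["yes"], parse_map["f"])  # True False
print(first_true, first_false)           # 1 0
```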
--codec.date, --dsbulk.codec.date { formatter | string }
The temporal pattern to use for `String` to CQL `date` conversion.
Valid choices:
- A date-time pattern
- A pre-defined formatter such as `ISO_LOCAL_DATE`

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.
Default: ISO_LOCAL_DATE
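As a rough Python analogy (dsbulk itself uses `java.time` formatters): `date.fromisoformat` plays the role of `ISO_LOCAL_DATE`, and `strptime` directives stand in for a custom date-time pattern:

```python
from datetime import date, datetime

# ISO_LOCAL_DATE corresponds to the yyyy-MM-dd form.
d = date.fromisoformat("2020-01-01")
print(d)  # 2020-01-01

# A custom pattern such as dd/MM/yyyy maps to the equivalent
# strptime directives.
d2 = datetime.strptime("03/02/2020", "%d/%m/%Y").date()
print(d2)  # 2020-02-03
```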
--codec.epoch, --dsbulk.codec.epoch
If `codec.timestamp`, `codec.date`, or `codec.time` is set to `UNITS_SINCE_EPOCH`, the epoch specified here determines the relative point in time to use when converting numeric data to and from temporals in the following cases:
- Target column is of CQL `timestamp`, `date`, or `time` type
- Loading data with a `USING TIMESTAMP`, `USING DATE`, or `USING TIME` clause
- Unloading data with a `WRITETIME()` function call

For example, if the input is `123` and the epoch is `2000-01-01T00:00:00Z`, the input is interpreted as 123 `codec.unit`s since January 1st, 2000.
When loading, if the target CQL type is numeric but the input is alphanumeric and represents a temporal literal, the `codec.epoch` and `codec.unit` values are used to convert the parsed temporal into a numeric value. For example, if the input is `2020-02-03T19:32:45Z` and the epoch is `2000-01-01T00:00:00Z`, the parsed timestamp is converted to the number of `codec.unit`s since January 1st, 2000.
When parsing temporal literals, if the input does not contain a date part, the date part of the instant specified here is used. For example, if the input is `19:32:45` and the epoch is `2000-01-01T00:00:00Z`, the input is interpreted as `2000-01-01T19:32:45Z`.
The value must be expressed in `ISO_ZONED_DATE_TIME` format, as described in the Oracle Java documentation.
Default: "1970-01-01T00:00:00Z"
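The arithmetic described above can be sketched in Python, assuming `--codec.unit SECONDS` and the epoch `2000-01-01T00:00:00Z` (illustration only, not dsbulk code):

```python
from datetime import datetime, timedelta, timezone

# Custom epoch from --codec.epoch, and a parsed input timestamp.
epoch = datetime(2000, 1, 1, tzinfo=timezone.utc)
parsed = datetime(2020, 2, 3, 19, 32, 45, tzinfo=timezone.utc)

# With --codec.unit SECONDS, the numeric value written on load
# (or produced on unload) would be:
seconds_since_epoch = int((parsed - epoch).total_seconds())
print(seconds_since_epoch)  # 634073565

# Converting the number back yields the original temporal.
restored = epoch + timedelta(seconds=seconds_since_epoch)
assert restored == parsed
```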
--codec.formatNumbers, --dsbulk.codec.formatNumbers ( true | false )
Whether to use the `codec.number` pattern to format all numeric output.
When set to `true`, the numeric pattern defined by `codec.number` is applied. This allows for nicely formatted output, but may result in rounding (see `codec.roundingStrategy`) or in alteration of the original decimal's scale.
When set to `false`, numbers are stringified using the `toString` method, which never results in rounding or scale alteration.
Only applicable when unloading, and only if the connector in use requires stringification because it does not handle raw numeric data (for example, the CSV connector); ignored otherwise.
Default: false
--codec.geo, --dsbulk.codec.geo string
Strategy to use when converting geometry types to strings.
Geometry types are only available in DataStax Enterprise (DSE) 5.0 or higher.
Only applicable when unloading columns of CQL type `Point`, `LineString`, or `Polygon`, and only if the connector in use requires stringification.
Valid values are:
- `WKT`: Encode the data in Well-known Text format. This is the default strategy.
- `WKB`: Encode the data in Well-known Binary format. The actual encoding depends on the value chosen for the `codec.binary` setting (`HEX` or `BASE64`).
- `JSON`: Encode the data in GeoJSON format.

Default: "WKT"
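For a concrete point, the three strategies produce the following strings. The WKB layout here (byte-order flag, geometry type, coordinates) follows the Well-known Binary specification; this sketch is not dsbulk code:

```python
import base64
import json
import struct

# The point POINT (1 2) under each strategy.
x, y = 1.0, 2.0

wkt = f"POINT ({x:g} {y:g})"  # WKT strategy
# Little-endian WKB: byte-order flag 1, geometry type 1 (Point), x, y.
wkb = struct.pack("<BIdd", 1, 1, x, y)
geojson = json.dumps({"type": "Point", "coordinates": [x, y]})  # JSON

# codec.binary then decides how the WKB bytes become a string:
print(wkt)                             # POINT (1 2)
print("0x" + wkb.hex())                # HEX
print(base64.b64encode(wkb).decode())  # BASE64
```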
-locale, --codec.locale, --dsbulk.codec.locale string
The locale to use for locale-sensitive conversions.
Default: en_US
-nullStrings, --codec.nullStrings, --dsbulk.codec.nullStrings list
Comma-separated list of strings that should be mapped to `null`.
For loading, when a record field value exactly matches one of the specified strings, the value is replaced with `null` before writing to DSE.
For unloading, this setting is only applicable for string-based connectors, such as the CSV connector: the first string specified is used to replace a row cell containing `null` when written out.
By default, no strings are mapped to `null`.
Regardless of this setting, DataStax Bulk Loader always converts empty strings to `null`.
This setting is applied before `schema.nullToUnset`; hence any `null` produced by a null-string can still be left unset if required.
Default: `[]` (no strings mapped to `null`)
--codec.number, --dsbulk.codec.number string
The `DecimalFormat` pattern to use for conversion between `String` and CQL numeric types.
See `java.text.DecimalFormat` for details about the pattern syntax to use.
Most inputs are recognized: an optional localized thousands separator, a localized decimal separator, and an optional exponent. Using `-locale en_US`, the inputs `1234`, `1,234`, `1234.5678`, `1,234.5678`, and `1,234.5678E2` are all valid.
For unloading and formatting, rounding may occur and cause precision loss.
See --codec.formatNumbers and --codec.roundingStrategy.
Default: #,###.##
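Python's format mini-language can approximate the default `#,###.##` pattern (grouping separator plus two fraction digits); unlike `DecimalFormat`, this sketch always prints exactly two decimals:

```python
# Formatting 1234.5678 with grouping and two fraction digits,
# analogous to the default #,###.## pattern.
value = 1234.5678
formatted = f"{value:,.2f}"
print(formatted)  # 1,234.57 -- note the rounding from .5678 to .57
```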
--codec.overflowStrategy, --dsbulk.codec.overflowStrategy string
The strategy to apply when a value overflows the target CQL type. Overflow can mean one of three things:
- The value is outside the range of the target CQL type. For example, trying to convert 128 to a CQL `tinyint` (maximum value 127) results in overflow.
- The value is decimal, but the target CQL type is integral. For example, trying to convert 123.45 to a CQL `int` results in overflow.
- The value's precision is too large for the target CQL type. For example, trying to insert 0.1234567890123456789 into a CQL `double` results in overflow, because there are too many significant digits to fit in a 64-bit double.

Valid choices:
- `REJECT`: Overflows are considered errors and the data is rejected. This is the default value.
- `TRUNCATE`: The data is truncated to fit in the target CQL type. The truncation algorithm is similar to the narrowing primitive conversion defined in The Java Language Specification, Section 5.1.3, with the following exceptions:
  - If the value is too big or too small, it is rounded up or down to the maximum or minimum value allowed, rather than truncated at bit level. For example, 128 is rounded down to 127 to fit in a byte, whereas Java truncates the exceeding bits and converts the value to -128 instead.
  - If the value is decimal, but the target CQL type is integral, it is first rounded to an integral value using the defined rounding strategy, then narrowed to fit into the target type. This can result in precision loss and should be used with caution.

Only applicable for loading, when parsing numeric inputs; it does not apply when unloading, since formatting never results in overflow.
Default: REJECT
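The `TRUNCATE` behavior described above can be sketched for a CQL `tinyint` target; the helper name and the choice of half-up rounding are illustrative, not dsbulk's implementation:

```python
from decimal import Decimal, ROUND_HALF_UP

# Range of a CQL tinyint (signed 8-bit integer).
TINYINT_MIN, TINYINT_MAX = -128, 127

def truncate_to_tinyint(value):
    # Decimal input aimed at an integral type is rounded first
    # (half-up here, one possible rounding strategy) ...
    rounded = int(Decimal(str(value)).to_integral_value(ROUND_HALF_UP))
    # ... then clamped to the type's range rather than bit-truncated,
    # so 128 becomes 127 instead of Java's narrowing result of -128.
    return max(TINYINT_MIN, min(TINYINT_MAX, rounded))

print(truncate_to_tinyint(128))     # 127
print(truncate_to_tinyint(123.45))  # 123
print(truncate_to_tinyint(-200))    # -128
```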
--codec.roundingStrategy, --dsbulk.codec.roundingStrategy string
The rounding strategy to use for conversions from CQL numeric types to String.
Valid choices: any `java.math.RoundingMode` enum constant name, including `CEILING`, `FLOOR`, `UP`, `DOWN`, `HALF_UP`, `HALF_EVEN`, `HALF_DOWN`, and `UNNECESSARY`.
The precision used when rounding is inferred from the numeric pattern declared under `codec.number`. For example, the default `codec.number` pattern `#,###.##` has a rounding precision of 2, and the number 123.456 is rounded to 123.46 if `--codec.roundingStrategy` is set to `UP`.
The default value, `UNNECESSARY`, results in infinite precision and ignores the `--codec.number` setting.
Only applicable when unloading, if `--codec.formatNumbers` is `true`, and only if the connector in use requires stringification because it does not handle raw numeric data (for example, the CSV connector); ignored otherwise.
Default: UNNECESSARY
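The 123.456 rounding example can be reproduced with Python's `decimal` module, whose `ROUND_UP` mode matches `RoundingMode.UP` (rounding away from zero):

```python
from decimal import Decimal, ROUND_UP

# The #,###.## pattern implies a rounding precision of 2 fraction digits.
rounded = Decimal("123.456").quantize(Decimal("0.01"), rounding=ROUND_UP)
print(rounded)  # 123.46
```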
--codec.time, --dsbulk.codec.time { formatter | string }
The temporal pattern to use for `String` to CQL `time` conversion.
Valid choices:
- A date-time pattern, such as `HH:mm:ss`
- A pre-defined formatter such as `ISO_LOCAL_TIME`

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.
Default: ISO_LOCAL_TIME
--codec.timestamp, --dsbulk.codec.timestamp { formatter | string }
The temporal pattern to use for `String` to CQL `timestamp` conversion.
Valid choices:
- A date-time pattern
- A pre-defined formatter such as `ISO_ZONED_DATE_TIME` or `ISO_INSTANT`, or any other public static field in `java.time.format.DateTimeFormatter`
- The special formatter `CQL_TIMESTAMP`, a special parser that accepts all valid CQL literal formats for the `timestamp` type
- The special formatter `UNITS_SINCE_EPOCH`, which is required for datasets containing numeric data intended to be interpreted as units since a given epoch. When set, DataStax Bulk Loader uses the `--codec.unit` and `--codec.epoch` settings to determine which unit and epoch to use.

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.
When parsing, `CQL_TIMESTAMP` recognizes most CQL temporal literals:

| Type | Values |
|---|---|
| Local dates | `2020-01-01` |
| Local times | `12:34`, `12:34:56`, `12:34:56.123`, `12:34:56.123456`, `12:34:56.123456789` |
| Local date-times | `2020-01-01T12:34`, `2020-01-01T12:34:56`, `2020-01-01T12:34:56.123`, `2020-01-01T12:34:56.123456` |
| Zoned date-times | `2020-01-01T12:34+01:00`, `2020-01-01T12:34:56+01:00`, `2020-01-01T12:34:56.123+01:00`, `2020-01-01T12:34:56.123456+01:00`, `2020-01-01T12:34:56.123456789+01:00`, `2020-01-01T12:34:56.123456789+01:00[Europe/Paris]` |
When the input is a local date, the timestamp is resolved at midnight using the specified `timeZone`. When the input is a local time, the timestamp is resolved using the time zone specified under `timeZone`, and the date is inferred from the instant specified under `epoch` (by default, January 1st, 1970). When formatting, this formatter uses the `ISO_OFFSET_DATE_TIME` pattern, which is compliant with both CQL and ISO-8601.
Default: CQL_TIMESTAMP
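The resolution rules for local inputs can be sketched with plain `datetime` arithmetic, assuming `--timeZone UTC` and the default epoch (not dsbulk code):

```python
from datetime import date, datetime, time, timezone

tz = timezone.utc                        # --timeZone
epoch = datetime(1970, 1, 1, tzinfo=tz)  # --codec.epoch (default)

# A local date resolves to midnight in the configured time zone.
local_date = date(2020, 1, 1)
from_date = datetime.combine(local_date, time(0, 0), tzinfo=tz)
print(from_date.isoformat())  # 2020-01-01T00:00:00+00:00

# A local time takes its date part from the epoch.
local_time = time(12, 34, 56)
from_time = datetime.combine(epoch.date(), local_time, tzinfo=tz)
print(from_time.isoformat())  # 1970-01-01T12:34:56+00:00
```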
-timeZone, --codec.timeZone, --dsbulk.codec.timeZone string
The time zone to use for temporal conversions. When loading, the time zone is used to obtain a timestamp from inputs that do not convey any explicit time zone information. When unloading, the time zone is used to format all timestamps.
This option supports all ZoneId (Java Platform SE 8) formats.
Default: UTC
--codec.unit, --dsbulk.codec.unit
If `codec.timestamp`, `codec.date`, or `codec.time` is set to `UNITS_SINCE_EPOCH`, the time unit specified here is used to convert numeric data to and from temporals in the following cases:
- Target column is of CQL `timestamp`, `date`, or `time` type
- Loading data with a `USING TIMESTAMP`, `USING DATE`, or `USING TIME` clause
- Unloading data with a `WRITETIME()` function call

For example, if the input is `123` and the time unit is `SECONDS`, the input is interpreted as 123 seconds since a given `codec.epoch`.
When loading, if the target CQL type is numeric but the input is alphanumeric and represents a temporal literal, the time unit specified is used to convert the parsed temporal into a numeric value. For example, if the input is `2019-02-03T19:32:45Z` and the time unit is `SECONDS`, the parsed temporal is converted into the number of seconds since a given `--codec.epoch`.
All `TimeUnit` enum constants are valid choices.
Default: MILLISECONDS
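The conversion can be sketched in Python, assuming `--codec.unit SECONDS` and the default epoch (plain arithmetic, not dsbulk code):

```python
from datetime import datetime, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)  # default --codec.epoch
parsed = datetime(2019, 2, 3, 19, 32, 45, tzinfo=timezone.utc)

# With --codec.unit SECONDS the stored number is the elapsed seconds;
# with the default MILLISECONDS it would be 1000 times larger.
seconds = int((parsed - epoch).total_seconds())
print(seconds)  # 1549222365
```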
--codec.uuidStrategy, --dsbulk.codec.uuidStrategy { RANDOM | FIXED | MIN | MAX }
Strategy to use when generating time-based (version 1) UUIDs from timestamps. Clock sequence and node ID parts of generated UUIDs are determined on a best-effort basis and are not fully compliant with RFC 4122. Valid values are:
- `RANDOM`: Generates UUIDs using a random number in lieu of the local clock sequence and node ID. This strategy ensures that the generated UUIDs are unique, even if the original timestamps are not guaranteed to be unique.
- `FIXED`: Preferred strategy if original timestamps are guaranteed unique, since it is faster. Generates UUIDs using a fixed local clock sequence and node ID.
- `MIN`: Generates the smallest possible type 1 UUID for a given timestamp. This strategy does not guarantee unique UUIDs and should be used with caution.
- `MAX`: Generates the biggest possible type 1 UUID for a given timestamp. This strategy does not guarantee unique UUIDs and should be used with caution.
Default: RANDOM
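A sketch of how the `MIN` and `MAX` strategies could derive version 1 UUIDs from a fixed timestamp, using the RFC 4122 field layout (smallest versus largest possible clock sequence and node); the helper function is illustrative, not dsbulk's implementation:

```python
import uuid
from datetime import datetime, timezone

# 100-ns ticks between the Gregorian epoch (1582-10-15) and the Unix
# epoch, the standard offset used by version 1 UUIDs.
GREGORIAN_OFFSET = 0x01B21DD213814000

def uuid_v1_for(ts: datetime, clock_seq: int, node: int) -> uuid.UUID:
    ticks = int(ts.timestamp() * 10_000_000) + GREGORIAN_OFFSET
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | (1 << 12)  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             (clock_seq >> 8) | 0x80,  # RFC 4122 variant
                             clock_seq & 0xFF, node))

ts = datetime(2020, 1, 1, tzinfo=timezone.utc)
smallest = uuid_v1_for(ts, 0, 0)                   # MIN strategy
biggest = uuid_v1_for(ts, 0x3FFF, 0xFFFFFFFFFFFF)  # MAX strategy
print(smallest.version, biggest.version)  # 1 1
```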