Codec options

Codec options for the dsbulk command

Specify codec options for the dsbulk command, which determine how record fields are parsed for loading or how row cells are formatted for unloading.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-locale,--codec.locale string

The locale to use for locale-sensitive conversions.

Default: en_US

-timeZone,--codec.timeZone string

The time zone to use for temporal conversions. When loading, the time zone will be used to obtain a timestamp from inputs that do not convey any explicit time zone information. When unloading, the time zone will be used to format all timestamps.

Default: UTC

-nullStrings,--codec.nullStrings string
Comma-separated list of strings that should be mapped to null. For loading, when a record field value exactly matches one of the specified strings, the value is replaced with null before writing to DSE. For unloading, this setting is only applicable for string-based connectors, such as the CSV connector: the first string specified will be used to change a row cell containing null to the specified string when written out. By default, no strings are mapped to null.
Note: Regardless of this setting, DataStax Bulk Loader will always convert empty strings to null when the target CQL type is not textual; that is, when the target is not text, varchar, or ascii.
This setting is applied before schema.nullToUnset, hence any null produced by a null-string can still be left unset if required.

Default: [ ] (no strings mapped to null)
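
The mapping described above can be modeled in a few lines of Python. This is only an illustrative sketch of the documented behavior, not DataStax Bulk Loader's implementation; the function names load_field and unload_cell are hypothetical.

```python
def load_field(value, null_strings):
    """Loading: a field that exactly matches a null string becomes None."""
    return None if value in null_strings else value


def unload_cell(value, null_strings):
    """Unloading: a null cell is written as the first null string, if any."""
    if value is None and null_strings:
        return null_strings[0]
    return value
```

With null strings NULL and N/A, load_field maps either string to null, while unload_cell only ever writes NULL, because only the first string is used for unloading.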

--codec.booleanStrings [ true_value:false_value, ... ]

Specify how true and false representations can be used by DataStax Bulk Loader. Each representation is of the form true_value:false_value, case-insensitive. For loading, all representations are honored. For unloading, the first representation will be used and all others ignored.

Default: ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]
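
As a rough illustration of the rules above (all representations honored when loading, only the first one used when unloading), the behavior could be modeled as follows. This is a sketch in Python with hypothetical helper names, not the tool's own code.

```python
DEFAULT_PAIRS = ["1:0", "Y:N", "T:F", "YES:NO", "TRUE:FALSE"]


def build_lookup(pairs):
    """Map every true/false representation to a bool, ignoring case."""
    lookup = {}
    for pair in pairs:
        true_value, false_value = pair.split(":")
        lookup[true_value.lower()] = True
        lookup[false_value.lower()] = False
    return lookup


def parse_boolean(text, pairs=DEFAULT_PAIRS):
    """Loading: any configured representation is accepted."""
    try:
        return build_lookup(pairs)[text.lower()]
    except KeyError:
        raise ValueError(f"not a recognized boolean: {text!r}") from None


def format_boolean(value, pairs=DEFAULT_PAIRS):
    """Unloading: only the first configured pair is used."""
    true_value, false_value = pairs[0].split(":")
    return true_value if value else false_value
```

Under the default setting, this model accepts yes, T, or 0 as input, and always writes 1 or 0 as output.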

--codec.booleanNumbers [ true_value, false_value ]

Set how true and false representations of numbers are interpreted. The representation is of the form true_value,false_value. The mapping is reciprocal, so numbers are mapped to Booleans and vice versa. All numbers not specified in this setting are rejected.

Default: [1, 0]
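
The reciprocal mapping can be pictured with the following sketch. It is a simplified Python model with hypothetical function names, written only to illustrate the description above.

```python
def number_to_boolean(number, boolean_numbers=(1, 0)):
    """Map the configured numbers to booleans; reject everything else."""
    true_number, false_number = boolean_numbers
    if number == true_number:
        return True
    if number == false_number:
        return False
    raise ValueError(f"{number} is not a configured boolean number")


def boolean_to_number(value, boolean_numbers=(1, 0)):
    """The reverse direction of the same mapping."""
    true_number, false_number = boolean_numbers
    return true_number if value else false_number
```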

--codec.number string

The DecimalFormat pattern to use for conversion between String and CQL numeric types. See java.text.DecimalFormat for details about the pattern syntax to use. Most inputs are recognized: an optional localized thousands separator, a localized decimal separator, and an optional exponent. For example, with the en_US locale, the inputs 1234, 1,234, 1234.5678, 1,234.5678, and 1,234.5678E2 are all valid. For unloading and formatting, rounding may occur and cause precision loss. See codec.formatNumbers and codec.roundingStrategy.

Default: #,###.##
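
To make the list of accepted en_US inputs concrete, here is a deliberately simplified Python stand-in for the parsing step. It is not java.text.DecimalFormat and it ignores the pattern entirely; it only assumes that "," is the grouping separator and "." the decimal separator, as in en_US. The function name parse_en_us is hypothetical.

```python
from decimal import Decimal


def parse_en_us(text):
    """Drop the en_US grouping separator, then parse the rest as a decimal."""
    return Decimal(text.replace(",", ""))
```

In this model, 1,234.5678E2 parses to the same value as 123456.78, because the exponent shifts the decimal point by two places.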

--codec.formatNumbers ( true | false )

Whether to use the codec.number pattern to format all numeric output. When set to true, the numeric pattern defined by codec.number is applied. This allows for nicely formatted output, but may result in rounding (see codec.roundingStrategy) or alteration of the original decimal's scale. When set to false, numbers are stringified using the toString method, which never results in rounding or scale alteration. Only applicable when unloading, and only if the connector in use requires stringification because it does not handle raw numeric data (as is the case for the CSV connector); ignored otherwise.

Default: false

--codec.roundingStrategy string

The rounding strategy to use for conversions from CQL numeric types to String. Valid choices: any java.math.RoundingMode enum constant name, including CEILING, FLOOR, UP, DOWN, HALF_UP, HALF_EVEN, HALF_DOWN, and UNNECESSARY. The precision used when rounding is inferred from the numeric pattern declared under codec.number. For example, the default codec.number (#,###.##) has a rounding precision of 2, and the number 123.456 would be rounded to 123.46 if codec.roundingStrategy were set to UP. The default value results in infinite precision and ignores the codec.number setting. Only applicable when unloading, if codec.formatNumbers is true and if the connector in use requires stringification because it does not handle raw numeric data (as is the case for the CSV connector); ignored otherwise.

Default: UNNECESSARY
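
The worked example above (123.456 with a precision of 2) can be reproduced with Python's decimal module, whose ROUND_UP and ROUND_DOWN modes round away from and toward zero, like the Java constants of the same name. This is an illustrative sketch only: the helper names are hypothetical, and counting digit placeholders after the decimal point is an assumed simplification of how the precision is inferred. Grouping separators are not applied.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_UP


def fraction_digits(pattern):
    """Count the digit placeholders after the decimal point of a pattern."""
    _, _, fraction = pattern.partition(".")
    return sum(1 for ch in fraction if ch in "#0")


def round_for_unload(value, pattern="#,###.##", rounding=ROUND_UP):
    """Round a decimal string to the precision implied by the pattern."""
    digits = fraction_digits(pattern)
    quantum = Decimal(1).scaleb(-digits)  # Decimal("0.01") for two digits
    return Decimal(value).quantize(quantum, rounding=rounding)
```

Calling round_for_unload("123.456") yields 123.46, while passing rounding=ROUND_DOWN yields 123.45.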

--codec.overflowStrategy string
Determines how overflows are handled when converting numeric inputs. An overflow can mean one of three things:
  • The value is outside the range of the target CQL type. For example, trying to convert 128 to a CQL tinyint (max value of 127) results in overflow.
  • The value is decimal, but the target CQL type is integral. For example, trying to convert 123.45 to a CQL int results in overflow.
  • The value's precision is too large for the target CQL type. For example, trying to insert 0.1234567890123456789 into a CQL double results in overflow, because there are too many significant digits to fit in a 64-bit double.
Valid choices:
  • REJECT: overflows are considered errors and the data is rejected. This is the default value.
  • TRUNCATE: the data is truncated to fit in the target CQL type.
    Note: The truncation algorithm is similar to the narrowing primitive conversion defined in The Java Language Specification, Section 5.1.3, with the following exceptions: (1) If the value is too big or too small, it is rounded up or down to the maximum or minimum value allowed, rather than truncated at bit level. For example, 128 would be rounded down to 127 to fit in a byte, whereas Java would have truncated the exceeding bits and converted to -128 instead. (2) If the value is decimal, but the target CQL type is integral, it is first rounded to an integer using the defined rounding strategy, then narrowed to fit into the target type. This can result in precision loss and should be used with caution.
Only applicable for loading, when parsing numeric inputs; it does not apply for unloading, since formatting never results in overflow.

Default: REJECT
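
The difference between the two strategies, and between TRUNCATE and a bit-level Java cast, can be sketched as follows for a CQL tinyint. This is a minimal Python model with hypothetical function names; it covers only the out-of-range case for integers and leaves out the decimal and precision cases.

```python
TINYINT_MIN, TINYINT_MAX = -128, 127


def to_tinyint(value, strategy="REJECT"):
    """Fit an integer into the tinyint range according to the strategy."""
    if TINYINT_MIN <= value <= TINYINT_MAX:
        return value
    if strategy == "REJECT":
        raise OverflowError(f"{value} does not fit in a tinyint")
    if strategy == "TRUNCATE":
        # Clamp to the nearest bound instead of dropping high-order bits.
        return max(TINYINT_MIN, min(TINYINT_MAX, value))
    raise ValueError(f"unknown strategy: {strategy}")


def java_narrow_to_byte(value):
    """Bit-level narrowing, as a Java (byte) cast would do, for comparison."""
    return (value + 128) % 256 - 128
```

For the input 128, to_tinyint with TRUNCATE returns 127, whereas java_narrow_to_byte returns -128.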

--codec.timestamp { formatter | string }
The temporal pattern to use for String to CQL timestamp conversion. Valid choices:
  • A date-time pattern
  • A pre-defined formatter such as ISO_ZONED_DATE_TIME or ISO_INSTANT, or any other public static field in java.time.format.DateTimeFormatter
  • The special formatter CQL_TIMESTAMP, which is a special parser that accepts all valid CQL literal formats for the timestamp type.
  • The special formatter UNITS_SINCE_EPOCH, which is required for datasets containing numeric data that are intended to be interpreted as units since a given epoch. Once set, DataStax Bulk Loader uses the codec.unit and codec.epoch settings to determine which unit and epoch to use.
Note: For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in Oracle Java documentation. For more information about CQL date, time and timestamp literals, see Date, time, and timestamp format.
When parsing, CQL_TIMESTAMP recognizes most CQL temporal literals:
Type                Values
Local dates         2012-01-01
Local times         12:34
                    12:34:56
                    12:34:56.123
                    12:34:56.123456
                    12:34:56.123456789
Local date-times    2012-01-01T12:34
                    2012-01-01T12:34:56
                    2012-01-01T12:34:56.123
                    2012-01-01T12:34:56.123456
                    2012-01-01T12:34:56.123456789
Zoned date-times    2012-01-01T12:34+01:00
                    2012-01-01T12:34:56+01:00
                    2012-01-01T12:34:56.123+01:00
                    2012-01-01T12:34:56.123456+01:00
                    2012-01-01T12:34:56.123456789+01:00
                    2012-01-01T12:34:56.123456789+01:00[Europe/Paris]

When the input is a local date, the timestamp is resolved at midnight using the specified timeZone. When the input is a local time, the timestamp is resolved using the time zone specified under timeZone, and the date is inferred from the instant specified under epoch (by default, January 1st 1970). When formatting, this format uses the ISO_OFFSET_DATE_TIME pattern, which is compliant with both CQL and ISO-8601.

Default: CQL_TIMESTAMP
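
The two resolution rules for inputs without full date-time information can be sketched in Python as follows. This is an illustrative model, not the tool's parser: the function names are hypothetical, fixed-offset zones stand in for codec.timeZone, and taking the date from the epoch as expressed in its own zone is an assumption.

```python
from datetime import date, datetime, time, timedelta, timezone


def resolve_local_date(text, tz):
    """A local date resolves to midnight in the configured time zone."""
    return datetime.combine(date.fromisoformat(text), time.min, tzinfo=tz)


def resolve_local_time(text, tz, epoch):
    """A local time takes its date from the configured epoch instant."""
    return datetime.combine(epoch.date(), time.fromisoformat(text), tzinfo=tz)
```

For example, resolving 2012-01-01 in a +01:00 zone gives an instant that corresponds to 2011-12-31T23:00:00Z, and resolving 12:34:56 in UTC with the default epoch gives 1970-01-01T12:34:56Z.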

--codec.date { formatter | string }
The temporal pattern to use for String to CQL date conversion. Valid choices:
  • A date-time pattern
  • A pre-defined formatter such as ISO_LOCAL_DATE
Note: For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in Oracle Java documentation. For more information about CQL date, time and timestamp literals, see Date, time, and timestamp format.

Default: ISO_LOCAL_DATE

--codec.time { formatter | string }
The temporal pattern to use for String to CQL time conversion. Valid choices:
  • A date-time pattern, such as HH:mm:ss.
  • A pre-defined formatter such as ISO_LOCAL_TIME
Note: For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in Oracle Java documentation. For more information about CQL date, time and timestamp literals, see Date, time, and timestamp format.

Default: ISO_LOCAL_TIME

--codec.unit
If codec.timestamp, codec.date, or codec.time is set to UNITS_SINCE_EPOCH, the time unit specified here is used to convert numeric data to and from temporals for the following cases:
  • Target column is of CQL timestamp, date, or time type
  • Loading data with a USING TIMESTAMP, USING DATE, or USING TIME clause
  • Unloading data with a WRITETIME() function call
For example, if the input is 123 and the time unit is SECONDS, the input will be interpreted as 123 seconds since codec.epoch.

When loading, and the target CQL type is numeric, but the input is alphanumeric and represents a temporal literal, the time unit specified will be used to convert the parsed temporal into a numeric value. For example, if the input is 2019-02-03T19:32:45Z and the time unit specified is SECONDS, the parsed temporal will be converted into the number of seconds since codec.epoch.

All TimeUnit enum constants are valid choices.

Default: MILLISECONDS
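
The numeric-to-temporal direction can be sketched in Python as shown below. This is a simplified model with hypothetical names; NANOSECONDS is left out because Python's datetime only has microsecond resolution.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Length of each supported unit, in microseconds.
UNIT_IN_MICROSECONDS = {
    "MICROSECONDS": 1,
    "MILLISECONDS": 1_000,
    "SECONDS": 1_000_000,
    "MINUTES": 60_000_000,
    "HOURS": 3_600_000_000,
    "DAYS": 86_400_000_000,
}


def units_to_instant(amount, unit="MILLISECONDS", epoch=UNIX_EPOCH):
    """Interpret a number as `amount` units elapsed since the epoch."""
    return epoch + timedelta(microseconds=amount * UNIT_IN_MICROSECONDS[unit])
```

With the input 123 and the unit SECONDS, this model returns 1970-01-01T00:02:03Z.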

--codec.epoch
If codec.timestamp, codec.date, or codec.time is set to UNITS_SINCE_EPOCH, the epoch specified here determines the relative point in time to use when converting numeric data to and from temporals for the following cases:
  • Target column is of CQL timestamp, date, or time type
  • Loading data with a USING TIMESTAMP, USING DATE, or USING TIME clause
  • Unloading data with a WRITETIME() function call
For example, if the input is 123 and the epoch is 2000-01-01T00:00:00Z, the input will be interpreted as 123 units (as set by codec.unit) since January 1st 2000.

When loading, and the target CQL type is numeric, but the input is alphanumeric and represents a temporal literal, the codec.epoch and codec.unit values will be used to convert the parsed temporal into a numeric value. For example, if the input is 2019-02-03T19:32:45Z and the epoch specified is 2000-01-01T00:00:00Z, the parsed timestamp will be converted to the number of units (as set by codec.unit) elapsed since January 1st 2000.

When parsing temporal literals, if the input does not contain a date part, then the date part of the instant specified here will be used. For example, if the input is 19:32:45 and the epoch specified is 2000-01-01T00:00:00Z, then the input will be interpreted as 2000-01-01T19:32:45Z.

The value must be expressed in ISO_ZONED_DATE_TIME, as covered in the Oracle Java documentation.

Default: "1970-01-01T00:00:00Z"
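
The temporal-to-numeric direction with a custom epoch can be sketched in Python as follows. This is an illustrative model that assumes SECONDS as the unit; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_instant(text):
    """Parse an ISO instant; a trailing Z is rewritten as an explicit offset."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def instant_to_seconds(text, epoch_text="1970-01-01T00:00:00Z"):
    """Whole seconds elapsed between the epoch and the given instant."""
    delta = parse_instant(text) - parse_instant(epoch_text)
    return delta // timedelta(seconds=1)
```

With the epoch 2000-01-01T00:00:00Z, the input 2019-02-03T19:32:45Z converts to 602537565 seconds; with the default epoch, the same input converts to 1549222365.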

--codec.uuidStrategy { RANDOM | FIXED | MIN | MAX }
Strategy to use when generating time-based (version 1) UUIDs from timestamps. Clock sequence and node ID parts of generated UUIDs are determined on a best-effort basis and are not fully compliant with RFC 4122. Valid values are:
  • RANDOM: Generates UUIDs using a random number in lieu of the local clock sequence and node ID. This strategy will ensure that the generated UUIDs are unique, even if the original timestamps are not guaranteed to be unique.
  • FIXED: Preferred strategy if original timestamps are guaranteed unique, since it is faster. Generates UUIDs using a fixed local clock sequence and node ID.
  • MIN: Generates the smallest possible type 1 UUID for a given timestamp.
    Warning: This strategy doesn't guarantee uniquely generated UUIDs and should be used with caution.
  • MAX: Generates the biggest possible type 1 UUID for a given timestamp.
    Warning: This strategy doesn't guarantee uniquely generated UUIDs and should be used with caution.

Default: RANDOM
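
To illustrate the difference between RANDOM and FIXED, the following Python sketch builds a version 1 UUID from a timestamp using the standard uuid module. It is not the algorithm used by DataStax Bulk Loader: the function names are hypothetical, the fixed clock sequence and node ID of 0 are arbitrary placeholders, and the MIN and MAX strategies are not modeled.

```python
import random
import uuid
from datetime import datetime, timezone

# Version 1 UUID timestamps count 100-nanosecond ticks since this date.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def gregorian_ticks(ts):
    """Number of 100 ns intervals between 1582-10-15 and the timestamp."""
    delta = ts - GREGORIAN_EPOCH
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 10_000_000 + delta.microseconds * 10


def v1_uuid(ts, strategy="RANDOM"):
    """Build a version 1 UUID whose time fields encode the timestamp."""
    ticks = gregorian_ticks(ts)
    if strategy == "RANDOM":
        clock_seq = random.getrandbits(14)
        node = random.getrandbits(48)
    elif strategy == "FIXED":
        clock_seq, node = 0, 0  # placeholder constants for this sketch
    else:
        raise ValueError(f"strategy not modeled here: {strategy}")
    fields = (
        ticks & 0xFFFFFFFF,        # time_low
        (ticks >> 32) & 0xFFFF,    # time_mid
        (ticks >> 48) & 0x0FFF,    # time_hi (version bits are set below)
        clock_seq >> 8,            # clock_seq_hi (variant bits are set below)
        clock_seq & 0xFF,          # clock_seq_low
        node,
    )
    return uuid.UUID(fields=fields, version=1)
```

In this model, FIXED returns the same UUID every time for a given timestamp, which is why it is only safe when timestamps are unique, while RANDOM varies the clock sequence and node ID on each call.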