Codec options

Specify codec options for the dsbulk command, which determine how record fields are parsed for loading or how row cells are formatted for unloading. When counting, these settings are ignored.

The options can be used in short form (-locale string) or in long form (--codec.locale string).

--codec.binary, --dsbulk.codec.binary string

Strategy to use when converting binary data to strings. Only applicable when unloading columns of CQL type blob, or columns of a geometry type when codec.geo is set to WKB. For the latter, see the codec.geo entry below.

Valid codec.binary values are:

  • BASE64: Encode the binary data into a Base-64 string. This is the default strategy.

  • HEX: Encode the binary data as CQL blob literals. CQL blob literals follow the general syntax: 0[xX][0-9a-fA-F]+, that is, 0x followed by hexadecimal characters, for example: 0xcafebabe. This format produces lengthier strings than BASE64, but is also the only format compatible with CQLSH.

Default: BASE64
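
For example, to unload blob columns as CQL blob literals instead of Base-64 strings (the keyspace and table names here are placeholders):

dsbulk unload -k keyspace1 -t table1 --codec.binary HEX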

--codec.booleanNumbers, --dsbulk.codec.booleanNumbers [ true_value, false_value ]

Set how true and false representations of numbers are interpreted. The representation is of the form true_value,false_value. The mapping is bidirectional: the specified numbers are mapped to Booleans and vice versa. All numbers not listed in this setting are rejected.

Default: [1, 0]
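
For example, to map 1 to true and -1 to false, using the same quoted list syntax shown for codec.booleanStrings (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.booleanNumbers '[1, -1]'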

--codec.booleanStrings, --dsbulk.codec.booleanStrings [ true_value:false_value, …​]

Specify the string representations of true and false that DataStax Bulk Loader recognizes. Each representation is of the form true_value:false_value, case-insensitive. For loading, all representations are honored. For unloading, the first representation is used and all others are ignored.

Ensure that your list of representations is inside quotes as a string. For example:

dsbulk unload -k keyspace1 -t javatime --codec.booleanStrings '["TRUE:FALSE"]'

Default: ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]

--codec.date, --dsbulk.codec.date { formatter | string }

The temporal pattern to use for String to CQL date conversion. Valid choices:

  • A date-time pattern

  • A pre-defined formatter such as ISO_LOCAL_DATE

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.

Default: ISO_LOCAL_DATE
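
For example, to load dates written as 20240115 with a custom pattern (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.date 'yyyyMMdd'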

--codec.epoch, --dsbulk.codec.epoch

If codec.timestamp, codec.date, or codec.time is set to UNITS_SINCE_EPOCH, the epoch specified here determines the relative point in time to use when converting numeric data to and from temporals for the following cases:

  • Target column is of CQL timestamp, date, or time type

  • Loading data with a USING TIMESTAMP, USING DATE, or USING TIME clause

  • Unloading data with a WRITETIME() function call

For example, if the input is 123 and the epoch is 2000-01-01T00:00:00Z, the input is interpreted as 123 codec.units since January 1st 2000.

When loading and the target CQL type is numeric, but the input is alphanumeric and represents a temporal literal, the codec.epoch and codec.unit values are used to convert the parsed temporal into a numeric value. For example, if the input is 2020-02-03T19:32:45Z and the epoch specified is 2000-01-01T00:00:00Z, the parsed timestamp is converted to N codec.units since January 1st 2000.

When parsing temporal literals, if the input does not contain a date part, then the date part of the instant specified here is used. For example, if the input is 19:32:45 and the epoch specified is 2000-01-01T00:00:00Z, then the input is interpreted as 2000-01-01T19:32:45Z.

The value must be expressed in ISO_ZONED_DATE_TIME format, as described in the Oracle Java documentation.

Default: "1970-01-01T00:00:00Z"

--codec.formatNumbers, --dsbulk.codec.formatNumbers ( true | false )

Whether to use the codec.number pattern to format all numeric output. When set to true, the numeric pattern defined by codec.number is applied. This allows for nicely formatted output, but may result in rounding (see codec.roundingStrategy) or alteration of the original decimal’s scale. When set to false, numbers are stringified using the toString method, which never results in rounding or scale alteration. Only applicable when unloading, and only if the connector in use requires stringification because it does not handle raw numeric data (for example, the CSV connector); ignored otherwise.

Default: false
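
For example, to unload numbers formatted with the default #,###.## pattern (placeholder keyspace and table names):

dsbulk unload -k keyspace1 -t table1 --codec.formatNumbers true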

--codec.geo, --dsbulk.codec.geo string

Strategy to use when converting geometry types to strings. Geometry types are only available in DataStax Enterprise (DSE) 5.0 or higher. Only applicable when unloading columns of CQL type Point, LineString or Polygon, and only if the connector in use requires stringification. Valid values are:

  • WKT: Encode the data in Well-known Text format. This is the default strategy.

  • WKB: Encode the data in Well-known Binary format. The actual encoding depends on the value chosen for the codec.binary setting (HEX or BASE64).

  • JSON: Encode the data in GeoJSON format.

Default: WKT
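
For example, to unload geometry columns as GeoJSON (placeholder keyspace and table names):

dsbulk unload -k keyspace1 -t table1 --codec.geo JSON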

-locale, --codec.locale, --dsbulk.codec.locale string

The locale to use for locale-sensitive conversions.

Default: en_US
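
For example, to parse numbers that use French conventions (comma as the decimal separator), using the short form of the option (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 -locale fr_FR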

-nullStrings, --codec.nullStrings, --dsbulk.codec.nullStrings list

Comma-separated list of strings that should be mapped to null. For loading, when a record field value exactly matches one of the specified strings, the value is replaced with null before writing to DSE. For unloading, this setting is only applicable for string-based connectors, such as the CSV connector: the first string specified is used to change a row cell containing null to the specified string when written out. By default, no strings are mapped to null.

Regardless of this setting, DataStax Bulk Loader always converts empty strings to null when the target CQL type is not textual; that is, when the target is not text, varchar, or ascii.

This setting is applied before schema.nullToUnset, hence any null produced by a null-string can still be left unset if required.

Default: [ ] (no strings mapped to null)
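
For example, to map both NULL and N/A to null when loading, using the same quoted list syntax shown for codec.booleanStrings (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.nullStrings '["NULL", "N/A"]'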

--codec.number, --dsbulk.codec.number string

The DecimalFormat pattern to use for conversion between String and CQL numeric types. See java.text.DecimalFormat for details about the pattern syntax. Parsing is lenient: inputs may include a localized thousands separator, a localized decimal separator, and an optional exponent. With -locale en_US, the inputs 1234, 1,234, 1234.5678, 1,234.5678, and 1,234.5678E2 are all valid. For unloading and formatting, rounding may occur and cause precision loss. See --codec.formatNumbers and --codec.roundingStrategy.

Default: #,###.##
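
For example, to format unloaded numbers without grouping separators and with up to four fraction digits (placeholder keyspace and table names):

dsbulk unload -k keyspace1 -t table1 --codec.formatNumbers true --codec.number '#0.####'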

--codec.overflowStrategy, --dsbulk.codec.overflowStrategy string

This setting determines what to do when a value overflows the target CQL type. An overflow can mean one of three things:

  • The value is outside the range of the target CQL type. For example, trying to convert 128 to a CQL tinyint (max value of 127) results in overflow.

  • The value is decimal, but the target CQL type is integral. For example, trying to convert 123.45 to a CQL int results in overflow.

  • The value’s precision is too large for the target CQL type. For example, trying to insert 0.1234567890123456789 into a CQL double results in overflow, because there are too many significant digits to fit in a 64-bit double.

Valid choices:

  • REJECT: overflows are considered errors and the data is rejected. This is the default value.

  • TRUNCATE: the data is truncated to fit in the target CQL type.

    The truncation algorithm is similar to the narrowing primitive conversion defined in The Java Language Specification, Section 5.1.3, with the following exceptions:

    1. If the value is too big or too small, it is rounded up or down to the maximum or minimum value allowed, rather than truncated at bit level. For example, 128 is rounded down to 127 to fit in a byte, whereas Java's narrowing conversion truncates the excess bits and yields -128 instead.

    2. If the value is decimal, but the target CQL type is integral, it is first rounded to an integral using the defined rounding strategy, then narrowed to fit into the target type. This can result in precision loss and should be used with caution.

Only applicable for loading, when parsing numeric inputs; it does not apply for unloading, since formatting never results in overflow.

Default: REJECT
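
For example, to truncate out-of-range values instead of rejecting them when loading (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.overflowStrategy TRUNCATE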

--codec.roundingStrategy, --dsbulk.codec.roundingStrategy string

The rounding strategy to use for conversions from CQL numeric types to String.

Valid choices: any java.math.RoundingMode enum constant name, including: CEILING, FLOOR, UP, DOWN, HALF_UP, HALF_EVEN, HALF_DOWN, and UNNECESSARY.

The precision used when rounding is inferred from the numeric pattern declared under codec.number. For example, the default codec.number pattern #,###.## has a rounding precision of 2, and the number 123.456 is rounded to 123.46 if --codec.roundingStrategy is set to UP.

The default value results in infinite precision and ignores the --codec.number setting. Only applicable when unloading, if --codec.formatNumbers is true and the connector in use requires stringification because it does not handle raw numeric data (for example, the CSV connector); ignored otherwise.

Default: UNNECESSARY
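
For example, to round half-up to the precision implied by the default #,###.## pattern when unloading (placeholder keyspace and table names):

dsbulk unload -k keyspace1 -t table1 --codec.formatNumbers true --codec.roundingStrategy HALF_UP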

--codec.time, --dsbulk.codec.time { formatter | string }

The temporal pattern to use for String to CQL time conversion. Valid choices:

  • A date-time pattern, such as HH:mm:ss.

  • A pre-defined formatter such as ISO_LOCAL_TIME

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.

Default: ISO_LOCAL_TIME
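
For example, to load times written as 07.30.00 PM with a custom pattern (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.time 'hh.mm.ss a'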

--codec.timestamp, --dsbulk.codec.timestamp { formatter | string }

The temporal pattern to use for String to CQL timestamp conversion. Valid choices:

  • A date-time pattern

  • A pre-defined formatter such as ISO_ZONED_DATE_TIME or ISO_INSTANT, or any other public static field in java.time.format.DateTimeFormatter

  • The special formatter CQL_TIMESTAMP, a parser that accepts all valid CQL literal formats for the timestamp type.

  • The special formatter UNITS_SINCE_EPOCH, which is required for datasets containing numeric data that is intended to be interpreted as units since a given epoch. Once set, DataStax Bulk Loader uses the --codec.unit and --codec.epoch settings to determine which unit and epoch to use.

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.

When parsing, CQL_TIMESTAMP recognizes most CQL temporal literals:

  • Local dates: 2020-01-01

  • Local times: 12:34, 12:34:56, 12:34:56.123, 12:34:56.123456, 12:34:56.123456789

  • Local date-times: 2020-01-01T12:34, 2020-01-01T12:34:56, 2020-01-01T12:34:56.123, 2020-01-01T12:34:56.123456

  • Zoned date-times: 2020-01-01T12:34+01:00, 2020-01-01T12:34:56+01:00, 2020-01-01T12:34:56.123+01:00, 2020-01-01T12:34:56.123456+01:00, 2020-01-01T12:34:56.123456789+01:00, 2020-01-01T12:34:56.123456789+01:00[Europe/Paris]

When the input is a local date, the timestamp is resolved at midnight in the time zone specified by codec.timeZone. When the input is a local time, the timestamp is resolved using the time zone specified by codec.timeZone, and the date is inferred from the instant specified by codec.epoch (by default, January 1st 1970). When formatting, this format uses the ISO_OFFSET_DATE_TIME pattern, which is compliant with both CQL and ISO-8601.

Default: CQL_TIMESTAMP
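
For example, to load timestamps written as 03/02/2020 19:32:45 and interpreted in the Paris time zone, a custom pattern could be combined with --codec.timeZone (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.timestamp 'dd/MM/yyyy HH:mm:ss' --codec.timeZone 'Europe/Paris'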

-timeZone, --codec.timeZone, --dsbulk.codec.timeZone string

The time zone to use for temporal conversions. When loading, the time zone is used to obtain a timestamp from inputs that do not convey any explicit time zone information. When unloading, the time zone is used to format all timestamps.

This option supports all ZoneId (Java Platform SE 8) formats.

Default: UTC

--codec.unit, --dsbulk.codec.unit

If codec.timestamp, codec.date, or codec.time is set to UNITS_SINCE_EPOCH, the time unit specified here is used to convert numeric data to and from temporals for the following cases:

  • Target column is of CQL timestamp, date, or time type

  • Loading data with a USING TIMESTAMP, USING DATE, or USING TIME clause

  • Unloading data with a WRITETIME() function call

For example, if the input is 123 and the time unit is SECONDS, the input is interpreted as 123 seconds since the given codec.epoch.

When loading and the target CQL type is numeric, but the input is alphanumeric and represents a temporal literal, the time unit specified is used to convert the parsed temporal into a numeric value. For example, if the input is 2019-02-03T19:32:45Z and the time unit specified is SECONDS, the parsed temporal is converted into the number of seconds since a given --codec.epoch.

All TimeUnit enum constants are valid choices.

Default: MILLISECONDS

--codec.uuidStrategy, --dsbulk.codec.uuidStrategy { RANDOM | FIXED | MIN | MAX }

Strategy to use when generating time-based (version 1) UUIDs from timestamps. Clock sequence and node ID parts of generated UUIDs are determined on a best-effort basis and are not fully compliant with RFC 4122. Valid values are:

  • RANDOM: Generates UUIDs using a random number in lieu of the local clock sequence and node ID. This strategy ensures that the generated UUIDs are unique, even if the original timestamps are not guaranteed to be unique.

  • FIXED: Generates UUIDs using a fixed local clock sequence and node ID. This is the preferred strategy when the original timestamps are guaranteed unique, since it is faster.

  • MIN: Generates the smallest possible type 1 UUID for a given timestamp. This strategy does not guarantee uniquely generated UUIDs and should be used with caution.

  • MAX: Generates the biggest possible type 1 UUID for a given timestamp. This strategy does not guarantee uniquely generated UUIDs and should be used with caution.

Default: RANDOM
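
For example, if the original timestamps are guaranteed unique, the faster FIXED strategy could be selected (placeholder keyspace and table names):

dsbulk load -k keyspace1 -t table1 --codec.uuidStrategy FIXED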
