Codec options
Specify codec options for the dsbulk command, which determine how record fields are parsed for loading or how row cells are formatted for unloading. When counting, these settings are ignored.
The options can be used in short form (-locale string) or in long form (--codec.locale string).
--codec.binary, --dsbulk.codec.binary string

Strategy to use when converting binary data to strings. Only applicable when unloading columns of CQL type blob, or when unloading columns of a geometry type if codec.geo is WKB. For the latter, see the codec.geo section. Valid codec.binary values are:

- BASE64: Encode the binary data as a Base-64 string. This is the default strategy.
- HEX: Encode the binary data as CQL blob literals. CQL blob literals follow the general syntax 0[xX][0-9a-fA-F]+, that is, 0x followed by hexadecimal characters, for example 0xcafebabe. This format produces lengthier strings than BASE64, but is also the only format compatible with CQLSH.

Default: BASE64
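For instance, to unload a table containing a blob column and write the binary data as CQL-compatible hex literals, the strategy can be set on the command line. A minimal sketch; the keyspace, table, and output directory names are placeholders:

  # ks1, table_with_blobs, and /tmp/unload are illustrative placeholders
  dsbulk unload -k ks1 -t table_with_blobs -url /tmp/unload --codec.binary HEX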
--codec.booleanNumbers, --dsbulk.codec.booleanNumbers [ true_value, false_value ]

Set how true and false representations of numbers are interpreted. The representation is of the form true_value,false_value. The mapping is reciprocal, so that numbers are mapped to Booleans and vice versa. All numbers unspecified in this setting are rejected.

Default: [1, 0]
--codec.booleanStrings, --dsbulk.codec.booleanStrings [ true_value:false_value, … ]

Specify the true and false string representations recognized by DataStax Bulk Loader. Each representation is of the form true_value:false_value, case-insensitive. For loading, all representations are honored. For unloading, the first representation is used and all others are ignored.

Default: ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]
--codec.date, --dsbulk.codec.date { formatter | string }

The temporal pattern to use for String to CQL date conversion. Valid choices:

- A date-time pattern
- A pre-defined formatter such as ISO_LOCAL_DATE

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.

Default: ISO_LOCAL_DATE
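For example, if a CSV file stores dates as 25/12/2023, a matching pattern can be supplied when loading. A sketch with placeholder file, keyspace, and table names:

  # orders.csv, ks1, and orders are illustrative placeholders
  dsbulk load -url orders.csv -k ks1 -t orders --codec.date "dd/MM/yyyy"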
--codec.epoch, --dsbulk.codec.epoch

If codec.timestamp, codec.date, or codec.time is set to UNITS_SINCE_EPOCH, the epoch specified here determines the relative point in time to use when converting numeric data to and from temporals for the following cases:

- Target column is of CQL timestamp, date, or time type
- Loading data with a USING TIMESTAMP, USING DATE, or USING TIME clause
- Unloading data with a WRITETIME() function call

For example, if the input is 123 and the epoch is 2000-01-01T00:00:00Z, the input is interpreted as 123 units (as defined by codec.unit) since January 1st 2000.

When loading and the target CQL type is numeric, but the input is alphanumeric and represents a temporal literal, the codec.epoch and codec.unit values are used to convert the parsed temporal into a numeric value. For example, if the input is 2020-02-03T19:32:45Z and the epoch specified is 2000-01-01T00:00:00Z, the parsed timestamp is converted to the number of codec.unit units since January 1st 2000.

When parsing temporal literals, if the input does not contain a date part, then the date part of the instant specified here is used. For example, if the input is 19:32:45 and the epoch specified is 2000-01-01T00:00:00Z, then the input is interpreted as 2000-01-01T19:32:45Z.

The value must be expressed in ISO_ZONED_DATE_TIME format, as covered in the Oracle Java documentation.

Default: "1970-01-01T00:00:00Z"
--codec.formatNumbers, --dsbulk.codec.formatNumbers ( true | false )

Whether to use the codec.number pattern to format all numeric output. When set to true, the numeric pattern defined by codec.number is applied. This allows for nicely formatted output, but may result in rounding (see codec.roundingStrategy) or alteration of the original decimal's scale. When set to false, numbers are stringified using the toString method, which never results in rounding or scale alteration. Only applicable when unloading, and only if the connector in use requires stringification because it does not handle raw numeric data (such as the CSV connector); ignored otherwise.

Default: false
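For example, to unload numeric columns with thousands separators and at most two fraction digits, formatting can be enabled together with a pattern. A sketch with placeholder keyspace, table, and output directory names:

  # ks1, metrics, and /tmp/unload are illustrative placeholders
  dsbulk unload -k ks1 -t metrics -url /tmp/unload \
    --codec.formatNumbers true --codec.number "#,###.##"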
--codec.geo, --dsbulk.codec.geo string

Strategy to use when converting geometry types to strings. Geometry types are only available in DataStax Enterprise (DSE) 5.0 or higher. Only applicable when unloading columns of CQL type Point, LineString, or Polygon, and only if the connector in use requires stringification. Valid values are:

- WKT: Encode the data in Well-known Text format. This is the default strategy.
- WKB: Encode the data in Well-known Binary format. The actual encoding depends on the value chosen for the codec.binary setting (HEX or BASE64).
- JSON: Encode the data in GeoJSON format.

Default: "WKT"
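For instance, to unload a DSE table containing Point values as GeoJSON rather than the default Well-known Text, a sketch with placeholder names:

  # geo_ks, locations, and /tmp/unload are illustrative placeholders
  dsbulk unload -k geo_ks -t locations -url /tmp/unload --codec.geo JSON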
-locale, --codec.locale, --dsbulk.codec.locale string

The locale to use for locale-sensitive conversions.

Default: en_US
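For example, a French-formatted CSV that writes decimal numbers with a comma separator (such as 1234,56) could be loaded by switching the locale. The locale tag, file, keyspace, and table names below are illustrative:

  # ventes.csv, ks1, and sales are illustrative placeholders
  dsbulk load -url ventes.csv -k ks1 -t sales -locale fr_FR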
-nullStrings, --codec.nullStrings, --dsbulk.codec.nullStrings list

Comma-separated list of strings that should be mapped to null. For loading, when a record field value exactly matches one of the specified strings, the value is replaced with null before writing to DSE. For unloading, this setting is only applicable for string-based connectors, such as the CSV connector: the first string specified is used to change a row cell containing null to the specified string when written out. By default, no strings are mapped to null.

Regardless of this setting, DataStax Bulk Loader for Apache Cassandra always converts empty strings to null when the target CQL type is not textual; that is, when the target is not text, varchar, or ascii.

This setting is applied before schema.nullToUnset, hence any null produced by a null-string can still be left unset if required.

Default: [ ] (no strings mapped to null)
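As a sketch, to treat the literal strings NULL and N/A in a CSV as nulls when loading, the comma-separated list described above might be passed as a single quoted value; the exact quoting of list values can depend on your shell, and the file, keyspace, and table names are placeholders:

  # data.csv, ks1, and mytable are illustrative placeholders; list quoting is an assumption
  dsbulk load -url data.csv -k ks1 -t mytable --codec.nullStrings "NULL,N/A"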
--codec.number, --dsbulk.codec.number string

The DecimalFormat pattern to use for conversions between String and CQL numeric types. See java.text.DecimalFormat for details about the pattern syntax. Most inputs are recognized: an optional localized thousands separator, a localized decimal separator, and an optional exponent. With -locale en_US, the inputs 1234, 1,234, 1234.5678, 1,234.5678, and 1,234.5678E2 are all valid. For unloading and formatting, rounding may occur and cause precision loss; see --codec.formatNumbers and --codec.roundingStrategy.

Default: #,###.##
--codec.overflowStrategy, --dsbulk.codec.overflowStrategy string

The strategy to apply when a numeric value overflows the target CQL type. Overflow can occur in three situations:

- The value is outside the range of the target CQL type. For example, trying to convert 128 to a CQL tinyint (max value of 127) results in overflow.
- The value is decimal, but the target CQL type is integral. For example, trying to convert 123.45 to a CQL int results in overflow.
- The value's precision is too large for the target CQL type. For example, trying to insert 0.1234567890123456789 into a CQL double results in overflow, because there are too many significant digits to fit in a 64-bit double.

Valid choices:

- REJECT: Overflows are considered errors and the data is rejected. This is the default value.
- TRUNCATE: The data is truncated to fit in the target CQL type. The truncation algorithm is similar to the narrowing primitive conversion defined in The Java Language Specification, Section 5.1.3, with the following exceptions:
  - If the value is too big or too small, it is rounded up or down to the maximum or minimum value allowed, rather than truncated at bit level. For example, 128 is rounded down to 127 to fit in a byte, whereas Java truncates the exceeding bits and converts them to -127 instead.
  - If the value is decimal, but the target CQL type is integral, it is first rounded to an integral using the defined rounding strategy, then narrowed to fit into the target type. This can result in precision loss and should be used with caution.

Only applicable for loading, when parsing numeric inputs; it does not apply to unloading, since formatting never results in overflow.

Default: REJECT
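For example, to load a file whose numeric values occasionally exceed the range of the target columns and have them clamped rather than rejected, a sketch with placeholder names:

  # readings.csv, ks1, and readings are illustrative placeholders
  dsbulk load -url readings.csv -k ks1 -t readings --codec.overflowStrategy TRUNCATE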
--codec.roundingStrategy, --dsbulk.codec.roundingStrategy string

The rounding strategy to use for conversions from CQL numeric types to String. Valid choices: any java.math.RoundingMode enum constant name, including CEILING, FLOOR, UP, DOWN, HALF_UP, HALF_EVEN, HALF_DOWN, and UNNECESSARY.

The precision used when rounding is inferred from the numeric pattern declared under codec.number. For example, the default codec.number pattern `#,###.##` has a rounding precision of 2, and the number 123.456 is rounded to 123.46 if --codec.roundingStrategy is set to UP.

The default value results in infinite precision, and ignores the --codec.number setting. Only applicable when unloading, if --codec.formatNumbers is true and the connector in use requires stringification because it does not handle raw numeric data (such as the CSV connector); ignored otherwise.

Default: UNNECESSARY
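For instance, to unload formatted numbers rounded to the precision of the numeric pattern using conventional half-up rounding, a sketch with placeholder names:

  # ks1, accounts, and /tmp/unload are illustrative placeholders
  dsbulk unload -k ks1 -t accounts -url /tmp/unload \
    --codec.formatNumbers true --codec.roundingStrategy HALF_UP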
--codec.time, --dsbulk.codec.time { formatter | string }

The temporal pattern to use for String to CQL time conversion. Valid choices:

- A date-time pattern, such as HH:mm:ss
- A pre-defined formatter such as ISO_LOCAL_TIME

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.

Default: ISO_LOCAL_TIME
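For example, if a source file stores times without separators, as in 093015, a custom pattern can be supplied. A sketch with placeholder names:

  # shifts.csv, ks1, and shifts are illustrative placeholders
  dsbulk load -url shifts.csv -k ks1 -t shifts --codec.time "HHmmss"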
--codec.timestamp, --dsbulk.codec.timestamp { formatter | string }

The temporal pattern to use for String to CQL timestamp conversion. Valid choices:

- A date-time pattern
- A pre-defined formatter such as ISO_ZONED_DATE_TIME or ISO_INSTANT, or any other public static field in java.time.format.DateTimeFormatter
- The special formatter CQL_TIMESTAMP, which is a special parser that accepts all valid CQL literal formats for the timestamp type.
- The special formatter UNITS_SINCE_EPOCH, which is required for datasets containing numeric data intended to be interpreted as units since a given epoch. Once set, DataStax Bulk Loader uses the --codec.unit and --codec.epoch settings to determine which unit and epoch to use.

For more information on patterns and pre-defined formatters, see Patterns for Formatting and Parsing in the Oracle Java documentation. For more information about CQL date, time, and timestamp literals, see Date, time, and timestamp format.

When parsing, CQL_TIMESTAMP recognizes most CQL temporal literals:

- Local dates: 2020-01-01
- Local times: 12:34, 12:34:56, 12:34:56.123, 12:34:56.123456, 12:34:56.123456789
- Local date-times: 2020-01-01T12:34, 2020-01-01T12:34:56, 2020-01-01T12:34:56.123, 2020-01-01T12:34:56.123456
- Zoned date-times: 2020-01-01T12:34+01:00, 2020-01-01T12:34:56+01:00, 2020-01-01T12:34:56.123+01:00, 2020-01-01T12:34:56.123456+01:00, 2020-01-01T12:34:56.123456789+01:00, 2020-01-01T12:34:56.123456789+01:00[Europe/Paris]

When the input is a local date, the timestamp is resolved at midnight using the specified timeZone. When the input is a local time, the timestamp is resolved using the time zone specified under timeZone, and the date is inferred from the instant specified under epoch (by default, January 1st 1970). When formatting, this format uses the ISO_OFFSET_DATE_TIME pattern, which is compliant with both CQL and ISO-8601.

Default: CQL_TIMESTAMP
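For example, if timestamps in the source file look like 2023-06-15 14:30:00, a matching pattern can be supplied; since such values carry no zone information, they are resolved using --codec.timeZone. A sketch with placeholder names:

  # logs.csv, ks1, and logs are illustrative placeholders
  dsbulk load -url logs.csv -k ks1 -t logs --codec.timestamp "yyyy-MM-dd HH:mm:ss"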
-timeZone, --codec.timeZone, --dsbulk.codec.timeZone string

The time zone to use for temporal conversions. When loading, the time zone is used to obtain a timestamp from inputs that do not convey any explicit time zone information. When unloading, the time zone is used to format all timestamps.

Default: UTC
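For example, to unload timestamps rendered in Paris local time rather than UTC, a sketch with placeholder names:

  # ks1, events, and /tmp/unload are illustrative placeholders
  dsbulk unload -k ks1 -t events -url /tmp/unload -timeZone "Europe/Paris"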
--codec.unit, --dsbulk.codec.unit

If codec.timestamp, codec.date, or codec.time is set to UNITS_SINCE_EPOCH, the time unit specified here is used to convert numeric data to and from temporals for the following cases:

- Target column is of CQL timestamp, date, or time type
- Loading data with a USING TIMESTAMP, USING DATE, or USING TIME clause
- Unloading data with a WRITETIME() function call

For example, if the input is 123 and the time unit is SECONDS, the input is interpreted as 123 seconds since a given codec.epoch.

When loading and the target CQL type is numeric, but the input is alphanumeric and represents a temporal literal, the time unit specified is used to convert the parsed temporal into a numeric value. For example, if the input is 2019-02-03T19:32:45Z and the time unit specified is SECONDS, the parsed temporal is converted into the number of seconds since a given --codec.epoch.

All TimeUnit enum constants are valid choices.

Default: MILLISECONDS
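As another sketch, a dataset whose signup_date column stores whole days counted from the Unix epoch could be loaded into a CQL date column by pairing this setting with codec.date (keyspace, table, and file names are placeholders):

  # users.csv, ks1, and users are illustrative placeholders
  dsbulk load -url users.csv -k ks1 -t users \
    --codec.date UNITS_SINCE_EPOCH --codec.unit DAYS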
--codec.uuidStrategy, --dsbulk.codec.uuidStrategy { RANDOM | FIXED | MIN | MAX }

Strategy to use when generating time-based (version 1) UUIDs from timestamps. Clock sequence and node ID parts of generated UUIDs are determined on a best-effort basis and are not fully compliant with RFC 4122. Valid values are:

- RANDOM: Generates UUIDs using a random number in lieu of the local clock sequence and node ID. This strategy ensures that the generated UUIDs are unique, even if the original timestamps are not guaranteed to be unique.
- FIXED: Preferred strategy if original timestamps are guaranteed unique, since it is faster. Generates UUIDs using a fixed local clock sequence and node ID.
- MIN: Generates the smallest possible type 1 UUID for a given timestamp. This strategy does not guarantee uniquely generated UUIDs and should be used with caution.
- MAX: Generates the biggest possible type 1 UUID for a given timestamp. This strategy does not guarantee uniquely generated UUIDs and should be used with caution.

Default: RANDOM
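For example, when loading timestamps into a timeuuid column and the source timestamps are known to be unique, the faster FIXED strategy can be selected. A sketch with placeholder names:

  # events.csv, ks1, and events_by_id are illustrative placeholders
  dsbulk load -url events.csv -k ks1 -t events_by_id --codec.uuidStrategy FIXED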