Codec options
The codec options determine how DSBulk converts the values of individual fields or cells during load and unload operations.
For example, you can use these options to specify how the values in a CQL timestamp column are formatted when unloaded to a CSV or JSON file.
The same options also specify how values are loaded from a CSV or JSON file into a CQL timestamp column in your database.
Particularly for time/date, null, empty, and numeric values, make sure you are aware of all relevant options. Most options have a default behavior that might affect your data.
The codec options don’t apply to count operations.
Synopsis
The standard form for codec options is --codec.KEY VALUE:
- KEY: The specific option to configure, such as the locale option.
- VALUE: The value for the option, such as a string, number, or Boolean. HOCON syntax rules apply unless otherwise noted. For more information, see Escape and quote DSBulk command line arguments.
Short and long forms
On the command line, you can specify options in short form (if available), standard form, or long form.
For all codec options, the long form is the standard form with a dsbulk. prefix, such as --dsbulk.codec.locale.
The following examples show the same command with different forms of the locale option:
# Short form
dsbulk load -locale en_US -url filename.csv -k ks1 -t table1
# Standard form
dsbulk load --codec.locale en_US -url filename.csv -k ks1 -t table1
# Long form
dsbulk load --dsbulk.codec.locale en_US -url filename.csv -k ks1 -t table1
In configuration files, you must use the long form with the dsbulk. prefix.
For example:
dsbulk.codec.locale = "en_US"
--codec.binary
For dsbulk unload only, set the strategy to use when converting binary data to strings:
- BASE64 (default): Encode the binary data into a Base-64 string.
- HEX: Encode the binary data as CQL blob literals. CQL blob literals are formatted as 0[xX][0-9a-fA-F]+, which is 0x followed by hexadecimal characters. For example, 0xcafebabe. The HEX format produces longer strings than BASE64, but it is the only format compatible with CQL shell.
This option is used only in the following scenarios:
- Unloading columns of CQL type blob
- Unloading columns of a geometry type with --codec.geo "WKB"
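For example, the following command is a minimal sketch that unloads blob columns as hexadecimal CQL blob literals. It reuses the ks1 keyspace and table1 table from the earlier examples; the output directory is a placeholder.
# Unload blob values as CQL blob literals (HEX) instead of Base-64 strings; /path/to/unload_dir is a placeholder
dsbulk unload --codec.binary HEX -url /path/to/unload_dir -k ks1 -t table1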
--codec.booleanNumbers
A list of numbers that map to Boolean true and false values, formatted as [true_number,false_number], such as [1,0].
The mapping is reciprocal, so that numbers will map to Booleans, and Booleans will map to numbers.
If a Boolean field or cell contains a number that isn’t specified in this list, the value is rejected.
Default: [1,0] (1 maps to true, and 0 maps to false)
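For example, if your data follows a hypothetical convention where -1 means true and 0 means false, a load sketch could look like this, reusing the file, keyspace, and table names from the earlier examples:
# Map -1 to true and 0 to false (hypothetical numeric convention)
dsbulk load --codec.booleanNumbers '[-1,0]' -url filename.csv -k ks1 -t table1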
--codec.booleanStrings
A list of strings that map to Boolean true and false values.
Unlike --codec.booleanNumbers, this option allows multiple representations for true and false values.
Each representation is a pair of values separated by a colon, where the first value maps to true and the second value maps to false.
All values are case-insensitive.
Separate each pair with a comma, wrap each pair in double-quotes, and wrap the entire list in single quotes.
For example:
--codec.booleanStrings '["true_value1:false_value1","true_value2:false_value2"]'
For load operations, all representations are honored.
For unload operations, only the first representation is used, and all others are ignored.
Default: ["1:0","Y:N","T:F","YES:NO","TRUE:FALSE"]
--codec.date
The temporal pattern to use for String to CQL date conversion.
Accepts a string specifying one of the following:
- Date-time pattern: Must be compatible with DateTimeFormatter, such as yyyy-MM-dd.
- Pre-defined formatter: Must be compatible with DateTimeFormatter, such as ISO_LOCAL_DATE.
- UNITS_SINCE_EPOCH: A special parser that reads and writes local dates as numbers representing time units since a given epoch. Specify the unit and the epoch to use with --codec.unit and --codec.epoch. This pattern is required for datasets that contain numeric data intended to be interpreted as units since a given epoch.
Default: ISO_LOCAL_DATE
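For example, if a hypothetical CSV file stores dates as 12/25/2024, a load sketch with a matching pattern could look like this:
# Parse dates like 12/25/2024 with a custom DateTimeFormatter pattern
dsbulk load --codec.date "MM/dd/yyyy" -url filename.csv -k ks1 -t table1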
--codec.epoch
If --codec.timestamp, --codec.date, or --codec.time is set to UNITS_SINCE_EPOCH, then this epoch determines the relative point in time to use when converting numeric data to and from temporals.
This option is paired with --codec.unit, which determines the time unit for the conversion.
For example, if --codec.epoch "2000-01-01T00:00:00Z" and --codec.unit "SECONDS", the value 123 is interpreted as 123 seconds since January 1, 2000.
--codec.epoch applies to the following scenarios only:
- When a load or unload operation targets a CQL timestamp column and --codec.timestamp is set to UNITS_SINCE_EPOCH.
- When a load or unload operation targets a CQL date column and --codec.date is set to UNITS_SINCE_EPOCH.
- When a load or unload operation targets a CQL time column and --codec.time is set to UNITS_SINCE_EPOCH.
- When a load operation uses a USING TIMESTAMP clause and --codec.timestamp is set to UNITS_SINCE_EPOCH.
- When a load operation uses a USING DATE clause and --codec.date is set to UNITS_SINCE_EPOCH.
- When a load operation uses a USING TIME clause and --codec.time is set to UNITS_SINCE_EPOCH.
- When an unload operation uses a writetime() function call and --codec.timestamp is set to UNITS_SINCE_EPOCH.
If the target column for a load operation is a numeric CQL type but the input is alphanumeric data that represents a temporal literal, then --codec.epoch and --codec.unit are used to convert the parsed temporal into a numeric value.
For example, if the input is 2018-12-10T19:32:45Z with --codec.epoch "2000-01-01T00:00:00Z" and --codec.unit "SECONDS", then the parsed timestamp will be converted to a number of seconds since January 1, 2000.
When parsing temporal literals, if the input doesn’t contain a date part, then the date part of --codec.epoch is used.
For example, if the input is 19:32:45 with --codec.epoch "2000-01-01T00:00:00Z", then the input is interpreted as 2000-01-01T19:32:45Z.
The value of --codec.epoch must be expressed in ISO_ZONED_DATE_TIME format.
Default: 1970-01-01T00:00:00Z
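For example, the following sketch loads a hypothetical CSV file whose date column contains day counts relative to January 1, 2000:
# Interpret numeric date values as days elapsed since 2000-01-01
dsbulk load --codec.date UNITS_SINCE_EPOCH --codec.unit "DAYS" --codec.epoch "2000-01-01T00:00:00Z" -url filename.csv -k ks1 -t table1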
--codec.formatNumbers
Whether to use the --codec.number pattern to format all numeric output:
- false (default): Numbers are stringified using the toString method without rounding or decimal scale alteration.
- true: Numbers are formatted using the specified --codec.number pattern. This ensures consistent formatting for all numeric output, but it can result in rounding (see --codec.roundingStrategy) or alteration of decimal scales.
Only applicable for unload operations where the connector requires stringification because the connectors don’t handle raw numeric data.
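For example, the following unload sketch formats all numeric output with the default --codec.number pattern and rounds with the HALF_UP strategy; the output directory is a placeholder.
# Format all numeric output using the --codec.number pattern, rounding half-up; /path/to/unload_dir is a placeholder
dsbulk unload --codec.formatNumbers true --codec.roundingStrategy HALF_UP -url /path/to/unload_dir -k ks1 -t table1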
--codec.geo
A string containing the name of the strategy to use when converting geometry types to strings:
- WKT (default): Encode the data in Well-Known Text (WKT) format.
- WKB: Encode the data in Well-Known Binary (WKB) format. The actual encoding depends on the codec.binary option (HEX or BASE64).
- JSON: Encode the data in GeoJSON format.
This option is applicable only if all of the following conditions are met:
- Unloading data from DSE 5.0 or later
- Unloading CQL Point, LineString, or Polygon columns
- The connector requires stringification
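For example, the following sketch unloads geometry columns in WKB form, rendered as hexadecimal strings; the output directory is a placeholder.
# Unload geometry columns as WKB, encoded as hexadecimal blob literals; /path/to/unload_dir is a placeholder
dsbulk unload --codec.geo "WKB" --codec.binary HEX -url /path/to/unload_dir -k ks1 -t table1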
--codec.locale (-locale)
Set the locale to use for locale-sensitive conversions.
Default: en_US
--codec.nullStrings (-nullStrings)
A comma-separated list of strings to map to null, such as ["NULL","N/A","-"].
For load operations, when an input field’s value exactly matches one of the specified strings, the value is replaced with null before writing to the database.
The default value is an empty list ([]), which means no specific strings are mapped to null.
When loading to text, varchar, or ascii columns with the default mapping, input field values are unchanged by this option specifically.
For example, empty strings are written as "" (not null) in the absence of other modifying options, such as --connector.csv.emptyValue.
For unload operations, this option applies only if the connector requires stringification.
DSBulk uses only the first string in the list.
If a row cell contains null, it is converted to the first --codec.nullStrings entry in the output.
The default value is an empty list ([]), which means no specific strings are mapped to null.
With the default mapping, any unloaded cell containing a null value is unchanged, and the resulting output is null.
This assumes the absence of other modifying options, such as --connector.csv.emptyValue.
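For example, the following load sketch treats both NULL and N/A in a hypothetical CSV file as null values:
# Replace the strings NULL and N/A with null before writing to the database
dsbulk load --codec.nullStrings '["NULL","N/A"]' -url filename.csv -k ks1 -t table1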
--codec.number
The DecimalFormat pattern to use for conversion between String and CQL numeric types.
Most inputs are recognized, including an optional localized thousands separator, localized decimal separator, or optional exponent.
For example, using the default --codec.number pattern and -locale en_US, then all of the following are valid: 1234, 1,234, 1234.5678, 1,234.5678, and 1,234.5678E2.
For unload operations, rounding can occur and cause precision loss, depending on the combined behaviors of --codec.number, --codec.formatNumbers, and --codec.roundingStrategy.
Default: #,###.##
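For example, the following load sketch is one way to parse numbers such as 1.234,56 from a hypothetical CSV file by combining the German locale with the default pattern:
# Parse numbers that use a period as the thousands separator and a comma as the decimal separator
dsbulk load -locale de_DE --codec.number "#,###.##" -url filename.csv -k ks1 -t table1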
--codec.overflowStrategy
This option applies when converting an input field value to a CQL numeric type results in overflow:
- The value is outside the range of the target CQL type. For example, trying to convert 128 to a CQL tinyint results in overflow because tinyint has a maximum value of 127.
- The value is decimal, but the target CQL type is integral. For example, trying to convert 123.45 to a CQL int results in overflow.
- The value’s precision is too large for the target CQL type. For example, trying to insert 0.1234567890123456789 into a CQL double results in overflow because there are too many significant digits to fit in a 64-bit double.
This option applies only when parsing numeric inputs during load operations.
It doesn’t apply to unload operations because output formatting never results in overflow.
Allowed values for --codec.overflowStrategy are as follows:
- REJECT (default): Overflows are considered errors, and the data is rejected.
- TRUNCATE: The data is truncated to fit in the target CQL type. The truncation algorithm is similar to the narrowing primitive conversion defined in the Java Language Specification, section 5.1.3, with the following exceptions:
  - If the value is too big, it is rounded down to the maximum value allowed, rather than truncated at the bit level. For example, DSBulk rounds 128 down to 127 to fit in a byte, whereas Java truncates the exceeding bits and converts the value to -128 instead.
  - If the value is too small, it is rounded up to the minimum value allowed, rather than truncated at the bit level.
  - If the value is decimal but the target CQL type is integral, it is first rounded to an integral value using the defined rounding strategy, and then it is narrowed to fit into the target type.
  Especially for decimal values, TRUNCATE can result in precision loss. Make sure your data can tolerate such loss before using this strategy.
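For example, the following load sketch truncates out-of-range numeric values in a hypothetical CSV file instead of rejecting the records:
# Truncate values that overflow the target CQL type instead of rejecting them
dsbulk load --codec.overflowStrategy TRUNCATE -url filename.csv -k ks1 -t table1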
--codec.roundingStrategy
Set the rounding strategy to use when converting CQL numeric types to String.
Only applies to dsbulk unload operations where --codec.formatNumbers is set to true and the connector requires stringification because the connectors don’t handle raw numeric data.
Accepts the name of any java.math.RoundingMode enum constant, including CEILING, FLOOR, UP, DOWN, HALF_UP, HALF_EVEN, HALF_DOWN, and UNNECESSARY.
The precision used when rounding is inferred from the numeric pattern set in --codec.number.
For example, the default --codec.number pattern #,###.## has a rounding precision of two digits.
If --codec.roundingStrategy is set to UP, then the number 123.456 would be rounded to 123.46.
Default: UNNECESSARY (infinite precision, and --codec.number is ignored)
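For example, the following unload sketch reproduces the rounding behavior described above; the output directory is a placeholder.
# Round unloaded numbers to the two-digit precision of the default pattern using the UP strategy; /path/to/unload_dir is a placeholder
dsbulk unload --codec.formatNumbers true --codec.roundingStrategy UP --codec.number "#,###.##" -url /path/to/unload_dir -k ks1 -t table1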
--codec.time
The temporal pattern to use for String to CQL time conversion.
Accepts a string specifying one of the following:
- Date-time pattern: Must be compatible with DateTimeFormatter, such as HH:mm:ss.
- Pre-defined formatter: Must be compatible with DateTimeFormatter, such as ISO_LOCAL_TIME.
- UNITS_SINCE_EPOCH: A special parser that reads and writes local times as numbers representing time units since a given epoch. Specify the unit and the epoch to use with --codec.unit and --codec.epoch. This pattern is required for datasets that contain numeric data intended to be interpreted as units since a given epoch.
Default: ISO_LOCAL_TIME
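For example, if a hypothetical CSV file stores times as compact digits such as 193245, a load sketch with a matching pattern could look like this:
# Parse times like 193245 (HHmmss) with a custom DateTimeFormatter pattern
dsbulk load --codec.time "HHmmss" -url filename.csv -k ks1 -t table1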
--codec.timestamp
The temporal pattern to use for String to CQL timestamp conversion.
Accepts a string specifying one of the following:
- Date-time pattern: Must be compatible with DateTimeFormatter, such as yyyy-MM-dd HH:mm:ss.
- Pre-defined formatter: Any public static field in DateTimeFormatter, such as ISO_ZONED_DATE_TIME or ISO_INSTANT.
- UNITS_SINCE_EPOCH: A special parser that reads and writes timestamps as numbers representing time units since a given epoch. Specify the unit and the epoch to use with --codec.unit and --codec.epoch. This pattern is required for datasets that contain numeric data intended to be interpreted as units since a given epoch.
- CQL_TIMESTAMP: A special parser that accepts all valid CQL literal formats for the timestamp type. For load operations, if the input is a local date or date/time, then the timestamp is resolved using the time zone specified in --codec.timeZone. For unload operations, the formatter uses the ISO_OFFSET_DATE_TIME pattern, which is compliant with CQL and ISO-8601.
Default: CQL_TIMESTAMP
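For example, if a hypothetical CSV file stores timestamps as 25/12/2024 19:32:45 without timezone information, a load sketch could combine a custom pattern with an explicit timezone:
# Parse timestamps like 25/12/2024 19:32:45 and resolve them in the Europe/Paris timezone
dsbulk load --codec.timestamp "dd/MM/yyyy HH:mm:ss" -timeZone Europe/Paris -url filename.csv -k ks1 -t table1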
--codec.timeZone (-timeZone)
The timezone to use for temporal conversions.
For load operations, -timeZone is used to obtain a timestamp from inputs that don’t convey any explicit timezone information.
For unload operations, -timeZone is used to format all timestamps.
This option accepts any ZoneId format.
Default: UTC
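For example, the following unload sketch formats all unloaded timestamps in a specific timezone; the output directory is a placeholder.
# Format unloaded timestamps in the America/New_York timezone instead of UTC; /path/to/unload_dir is a placeholder
dsbulk unload -timeZone America/New_York -url /path/to/unload_dir -k ks1 -t table1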
--codec.unit
If --codec.timestamp, --codec.date, or --codec.time is set to UNITS_SINCE_EPOCH, then this unit is used with --codec.epoch to convert numeric data to and from temporals.
--codec.unit defines the time unit for the conversion.
It is paired with --codec.epoch, which determines the starting epoch for the conversion.
For example, if --codec.epoch "2000-01-01T00:00:00Z" and --codec.unit "SECONDS", the value 123 is interpreted as 123 seconds since January 1, 2000.
--codec.unit and --codec.epoch are used in specific scenarios only.
For more information about these scenarios and edge case handling, see --codec.epoch.
--codec.unit accepts any TimeUnit enum constant.
Default: MILLISECONDS
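For example, the following sketch loads a hypothetical CSV file whose timestamp column contains seconds since the default Unix epoch:
# Interpret numeric timestamp values as seconds since 1970-01-01T00:00:00Z
dsbulk load --codec.timestamp UNITS_SINCE_EPOCH --codec.unit "SECONDS" -url filename.csv -k ks1 -t table1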
--codec.uuidStrategy
Set the strategy to use when generating time-based version 1 UUIDs from timestamps.
- RANDOM (default): Generates UUIDs using a random number in place of the local clock sequence and node ID. This strategy ensures that the generated UUIDs are unique, even if the original timestamps aren’t guaranteed to be unique.
- FIXED (recommended if original timestamps are unique): Generates UUIDs using a fixed local clock sequence and node ID. This strategy is recommended if your original timestamps are guaranteed to be unique because it’s faster than other strategies.
- MIN: Generates the smallest possible type 1 UUID for a given timestamp.
- MAX: Generates the largest possible type 1 UUID for a given timestamp.
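For example, the following load sketch assumes the input timestamps are unique and uses the FIXED strategy when generating UUIDs from them:
# Generate time-based UUIDs from unique timestamps using a fixed clock sequence and node ID
dsbulk load --codec.uuidStrategy FIXED -url filename.csv -k ks1 -t table1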