DataStax Bulk Loader release notes

Release notes for DataStax Bulk Loader.

DataStax Bulk Loader can load and unload data in CSV or JSON format in or out of:
  • DataStax Enterprise (DSE) 3.2 or later databases
  • DataStax Distribution of Apache Cassandra™ (DDAC) databases
To migrate data, you can insert CSV or JSON files from relational database exports, as well as original data, into DSE or DDAC databases. The tool is supported on Linux and Windows platforms.

DataStax Bulk Loader 1.3.4 release notes

Release notes for DataStax Bulk Loader 1.3.4.

16 July 2019

DataStax Bulk Loader 1.3.4 release notes include:

1.3.4 Changes and enhancements

After upgrading to 1.3.4, be sure to review and adjust your scripts to use the changed settings.

  • The DataStax Bulk Loader Help provides an entry for --version. (DAT-383)
  • Improved error message provided when a row fails to decode. (DAT-411)

    In the DataStax Bulk Loader logging options, the format is: -maxErrors,--log.maxErrors ( number | "N%" )

    The explanation for this option has been updated:

    The maximum number of errors to allow before aborting the entire operation. This setting can be expressed as:
    • An absolute number of errors, in which case set this value to an integer greater than or equal to zero.
    • A percentage of the total rows processed so far, in which case set this value to a string of the form "N%", where N is a decimal number between 0 and 100 exclusive. Example: -maxErrors "20%"
    Setting this value to any negative integer disables the feature, which is not recommended.
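The threshold behavior described above can be sketched as follows (illustrative Python; the function and variable names are assumptions, not dsbulk internals):

```python
# Illustrative sketch of the -maxErrors threshold logic; names are
# hypothetical, not dsbulk internals.
def should_abort(max_errors, errors, total_rows):
    """Return True when the error budget is exhausted."""
    if isinstance(max_errors, str) and max_errors.endswith("%"):
        # Percentage form "N%": compare against rows processed so far.
        threshold = float(max_errors[:-1])
        return total_rows > 0 and (errors / total_rows) * 100 > threshold
    if int(max_errors) < 0:
        return False  # any negative integer disables the feature
    return errors > int(max_errors)
```

For example, should_abort("20%", 201, 1000) is true because 20.1% of the rows processed so far have failed.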
  • When a table contains static columns, it is possible that some partitions only contain static data. In this case, that data is exported as a pseudo row where all clustering columns and regular columns are null. Example:
    create table t1 (pk int, cs int static, cc int, v int, primary key (pk, cc));
    insert into t1 (pk, cs) values (1,1);
    select * from t1;
     pk | cc   | cs | v
    ----+------+----+------
      1 | null |  1 | null
    In prior DataStax Bulk Loader releases, you could not import this type of static data, even though the query was valid. For example, the following query was rejected:
    INSERT INTO t1 (pk, cs) values (:pk, :cs);
    Operation LOAD_20190412-134352-912437 failed: Missing required primary key column cc
    from schema.mapping or schema.query.
    DataStax Bulk Loader now allows this valid query. (DAT-414)
  • You can use the CQL date and time types with UNITS_SINCE_EPOCH, in addition to timestamp. Previously, you could only use the CQL timestamp type. On the dsbulk command, you can use codec.unit and codec.epoch to convert integers to, or from, these types. Refer to --codec.unit and --codec.epoch. (DAT-428)
  • You can use a new monitoring setting, monitoring.trackBytes, to enable or disable monitoring of DataStax Bulk Loader throughput in bytes per second. Because this type of monitoring can consume excessive allocation resources, and in some cases excessive CPU cycles, the setting is disabled by default. If you want monitoring in bytes per second, enable it with monitoring.trackBytes, and test and compare the setting in your development environment first. Disabling this setting may improve the allocation rate; if it also improves throughput, keep the setting disabled in production, or enable it only on an as-needed basis. (DAT-432)
  • The default output file name format, defined by the --connector.(csv|json).fileNameFormat string option, no longer includes the thousands separator. The prior default output file name format was:
    • output-%0,6d.csv
    • output-%0,6d.json
    The updated default format is:
    • output-%06d.csv
    • output-%06d.json

    Refer to --connector.(csv|json).fileNameFormat string (DAT-443)
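The pattern is a printf-style format string; Python shares this syntax, so the updated zero-padded default (no grouping flag) renders like this:

```python
# The updated default zero-pads the file index without a thousands separator.
name = "output-%06d.csv" % 42
print(name)  # output-000042.csv
```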

  • In load operations, you can pass the URL of a CSV or JSON data file. When you have multiple URLs, DataStax Bulk Loader 1.3.4 simplifies the task with the following command-line options, which point to a single file that lists the URLs of all the data files:
    • --connector.csv.urlfile string
    • --connector.json.urlfile string
    Refer to --connector.(csv|json).urlfile string. (DAT-445)
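Conceptually, the urlfile is a text file listing one URL per line; a minimal reader might look like the following sketch (whether dsbulk skips blank lines or "#" comment lines is an assumption here, not documented behavior):

```python
# Minimal reader for a urlfile: one URL per line. Whether dsbulk skips
# blank lines or "#" comments is an assumption in this sketch.
def read_urlfile(path):
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]
```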
  • When using REPLICA_SET batch mode, the server may issue query warnings if the number of statements in a single batch exceeds unlogged_batch_across_partitions_warn_threshold. To avoid reporting excessive warning messages in stdout, DataStax Bulk Loader logs only one warning at the beginning of the operation. (DAT-451)

1.3.4 Resolved issues

  • DataStax Bulk Loader should reject CSV files containing invalid headers, such as headers that are empty or contain duplicate fields. (DAT-427)
  • Logging option -maxErrors 0 does not abort the operation. (DAT-333)
  • DataStax Bulk Loader should reject invalid execution IDs. An execution ID is used to create MBean names. DataStax Bulk Loader now validates user-provided IDs to ensure, for example, that an ID does not contain a comma. (DAT-441)

DataStax Bulk Loader 1.3.3 release notes

Resolved issue for DataStax Bulk Loader 1.3.3.

13 March 2019

DataStax Bulk Loader 1.3.3 release notes summarize the resolved issue.

1.3.3 Resolved issue

Export of varchar column containing JSON may truncate data. (DAT-400)

Columns of type varchar that contain JSON are now exported “as is,” meaning DataStax Bulk Loader does not attempt to parse the JSON payload.

For example, assume you had a column col1 whose value was the JSON string {"foo":42}.

This was previously exported as shown below. That is, the contents of the column were parsed into a JSON node:
col1 = {"foo":42}
In DataStax Bulk Loader 1.3.3, the JSON {"foo":42} in a varchar column is exported as a string:
col1 = "{\"foo\":42}"
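The difference can be reproduced with Python's json module, which follows the same escaping rules:

```python
import json

raw = '{"foo":42}'        # value stored in the varchar column

# Pre-1.3.3 behavior: the payload was parsed into a JSON node.
parsed = json.loads(raw)  # {'foo': 42}

# 1.3.3 behavior: the value is exported as-is, serialized as a plain JSON
# string, so the embedded quotes are escaped.
exported = json.dumps(raw)
print(exported)  # "{\"foo\":42}"
```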

DataStax Bulk Loader 1.3.2 release notes

Changes, enhancements, and resolved issues for DataStax Bulk Loader 1.3.2.

20 February 2019

DataStax Bulk Loader 1.3.2 release notes include:

1.3.2 Changes and enhancements

After upgrading to 1.3.2, be sure to review and adjust scripts to use changed settings.

  • Print basic information about the cluster. (DAT-340)

    Refer to Printing cluster information.

  • Unload timestamps as units since an epoch. (DAT-364)
    Note: Datasets containing numeric data that are intended to be interpreted as units since a given epoch require the setting codec.timestamp=UNITS_SINCE_EPOCH. Failing to specify this special format will result in all records being rejected due to an invalid timestamp format. Refer to Codec options.
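The conversion can be sketched as follows (milliseconds are assumed as the unit here; in dsbulk, codec.unit selects the actual unit):

```python
from datetime import datetime, timezone

# Interpret a numeric value as units since the Unix epoch.
# Milliseconds are assumed here; dsbulk's codec.unit selects the actual unit.
def millis_to_timestamp(value):
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)

print(millis_to_timestamp(86_400_000))  # 1970-01-02 00:00:00+00:00
```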
  • Provide better documentation on how to choose the best batching strategy. (DAT-353)

    Refer to Batch options for the dsbulk command.

  • Implement unload and count for materialized views. (DAT-385)
  • Calculate batch size dynamically - Adaptive Batch Sizing. (DAT-352)
    The new setting, batch.maxSizeInBytes, defaults to -1 (unlimited).
    Note: batch.maxBatchSize is deprecated; instead, use batch.maxBatchStatements.
  • batch.bufferSize should be a multiple of batch.maxBatchStatements. (DAT-389)

    If batch.bufferSize is set to a value less than or equal to 0, it defaults to 4 times batch.maxBatchStatements.

  • Improve support for lightweight transactions. (DAT-384)
    DataStax Bulk Loader can detect write failures due to a failed LWT write. Records that could not be inserted will appear in two new files:
    1. paxos.bad is a new "bad file" devoted to LWT write failures.
    2. paxos-errors.log is a new debug file devoted to LWT write failures.
    Note: DataStax Bulk Loader also writes any records from failed writes to a .bad file in the operation's directory, depending on when the failure occurred. For details, refer to Detection of write failures.
  • Extend DataStax Bulk Loader rate limiting capability to reads. (DAT-336)

    Previously, the rate limiter used by DataStax Bulk Loader, adjustable via the --executor.maxPerSecond setting, applied only to writes. DataStax Bulk Loader 1.3.2 extends this functionality to reads by having the rate limiter consider the number of rows received instead of the number of requests sent.

  • Expose settings to control how to interpret empty fields in CSV files. (DAT-344)
    There are two new settings for the CSV connector:
    1. nullValue
    2. emptyValue
    Previously when reading a CSV file, the connector emitted an empty string when a field was empty and unquoted. Starting with DataStax Bulk Loader 1.3.2, the CSV connector returns a null value in such situations by default. In most cases this makes no difference; the only noticeable change is for columns of type VARCHAR or ASCII, where the stored value will be null instead of an empty string.
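The default behavior can be sketched as follows (illustrative code, not the connector's implementation):

```python
# Sketch of the 1.3.2 default: an empty, unquoted CSV field becomes null
# (None here), while an empty quoted field stays an empty string.
# Illustrative code, not the connector's implementation.
def interpret_field(raw, was_quoted):
    if raw == "" and not was_quoted:
        return None  # default nullValue behavior
    return raw
```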
  • Allow functions to appear in mapping variables. (DAT-327)
    Previously, only load operations allowed a mapping entry to contain a function. For example, loads may continue to use:
    now() = column1
    On load, the result of the now() function is inserted into column1 for every row.
    This functionality has been extended to unloads. For example, you can export the result of now() as fieldA for every row read:
    fieldA = now()
  • Detect writetime variable when unloading. (DAT-367)
    You can specify a writetime function in a mapping definition when unloading. For example:
    fieldA = column1, fieldB = writetime(column1)
    In this example, because the data type is detected, fieldB will be exported as a timestamp, not as an integer.
  • Relax constraints on queries for the Count workflow. (DAT-308)

    The schema.query setting can contain any SELECT clause when counting rows.

  • Automatically add token range restriction to WHERE clauses. (DAT-319)
    When a custom query is provided with --schema.query, to enable read parallelization, it is no longer necessary to provide a WHERE clause using the form:
    WHERE token(pk) > :start AND token(pk) <= :end
    If the query does not contain a WHERE clause, DataStax Bulk Loader will automatically generate that WHERE clause. However, if the query contains a WHERE clause, DataStax Bulk Loader will not be able to parallelize the read operations.
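The behavior can be sketched as follows (illustrative Python; the function name and the pk placeholder are assumptions):

```python
# Append the token-range restriction only when the custom query has no
# WHERE clause. Illustrative sketch; the function name and the pk
# placeholder are assumptions.
def add_token_restriction(query, pk="pk"):
    if " where " in query.lower():
        return query  # reads cannot be parallelized with a custom WHERE
    return f"{query} WHERE token({pk}) > :start AND token({pk}) <= :end"

print(add_token_restriction("SELECT * FROM t1"))
```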
  • Should allow JSON array mapping with UDTs. (DAT-316)

    Previously, when loading User Defined Types (UDTs) it was required that the input be a JSON object to allow for field-by-field mapping. Starting with DataStax Bulk Loader 1.3.2, a JSON array can also be mapped to UDTs, in which case the mapping is based on field order.
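The two mapping modes can be sketched as follows (the UDT field names here are hypothetical):

```python
import json

# Positional mapping of a JSON array onto a UDT, versus field-by-field
# mapping of a JSON object. The UDT field names below are hypothetical.
udt_fields = ["street", "city", "zip"]

def map_to_udt(payload):
    value = json.loads(payload)
    if isinstance(value, list):
        # Array form: mapping is based on field order.
        return dict(zip(udt_fields, value))
    # Object form: mapping is field-by-field, by name.
    return {k: value.get(k) for k in udt_fields}

print(map_to_udt('["1 Main St", "Springfield", "12345"]'))
```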

  • Improve WHERE clause token range restriction detection. (DAT-372)
    When you provide a custom query for unloading, the token range restriction variables can have any name, not only start and end. For example, the following is valid:
    SELECT * FROM table1 WHERE token(pk) > :foo AND token(pk) <= :bar
  • Remove record location URI. (DAT-370)

    DataStax Bulk Loader previously provided a record's URI to uniquely identify the record. However, the URI was very long and difficult to read. You can instead identify a failed record by looking into the record's source statement or row.

  • Allow columns and fields to be mapped more than once. (DAT-373)
    It is possible to map a field/column more than once. The following rules apply:
    • When loading, a field can be mapped to 2 or more columns, but a column cannot be mapped to 2 or more fields. Thus the following mapping is correct: fieldA = column1, fieldA = column2.
    • When unloading, a column can be mapped to 2 or more fields, but a field cannot be mapped to 2 or more columns. Thus the following mapping is correct: fieldA = column1, fieldB = column1.
  • UDT and tuple codecs should respect allowExtraFields and allowMissingFields. (DAT-315)

    The settings schema.allowMissingFields and schema.allowExtraFields apply to UDTs and tuples. For example, if a tuple has three elements, but the JSON input only has two elements, this scenario results in an error if schema.allowMissingFields is false. However, this scenario is accepted if schema.allowMissingFields is true. The missing element in this example is assigned as null.
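The tuple case can be sketched as follows (illustrative code, not dsbulk's implementation):

```python
# A 3-element tuple target receiving fewer input elements, mirroring
# schema.allowMissingFields. Illustrative code, not dsbulk's implementation.
def to_tuple(values, size, allow_missing):
    if len(values) < size:
        if not allow_missing:
            raise ValueError("missing tuple elements")
        values = values + [None] * (size - len(values))
    return tuple(values)

print(to_tuple([1, 2], 3, allow_missing=True))  # (1, 2, None)
```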

  • Add support for DataStax Enterprise 4.8 and lower. (DAT-312)

    DataStax Bulk Loader is compatible with C* 1.2 and later releases, and DataStax Enterprise 3.2 and later releases. All protocol versions are supported. Some features might not be available depending on the protocol version and server version.

    The schema.splits setting (default: 8C) was added to compensate for the absence of paging in C* 1.2; it controls how the token ring is split into small chunks.

    For example:
    bin/dsbulk unload -url myData.csv --driver.pooling.local.connections 8 \
      --driver.pooling.local.requests 128 --driver.pooling.remote.requests 128 \
      --schema.splits 0.5C -k test -t test
    On --schema.splits, you can optionally use special syntax, nC, to specify a number that is a multiple of the available cores, resulting in a calculated number of splits. If the number of cores is 8, --schema.splits 0.5C = 0.5 * 8, which results in 4 splits. Refer to --schema.splits number.
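The nC calculation can be sketched as follows (illustrative; dsbulk detects the core count itself):

```python
# Resolve the "nC" multiplier syntax of --schema.splits into a number of
# splits. Illustrative; dsbulk detects the core count itself.
def resolve_splits(setting, cores):
    if setting.upper().endswith("C"):
        return int(float(setting[:-1]) * cores)
    return int(setting)

print(resolve_splits("0.5C", 8))  # 4
```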
  • Add support for keyspace-qualified UDFs in mappings. (DAT-378)

    If needed, you can qualify a user-defined function (UDF) with a keyspace name. For example: fieldA = ks1.func1(column1, column2)

  • Allow fields to appear as function parameters on the left side of mapping entries. (DAT-379)
    When loading, a mapping entry can contain a function on the left side that references fields of the dataset. For example, consider the case where:
    • A dataset has two fields, fieldA and fieldB
    • A table with three columns: colA, colB and colSum
    • A user-defined function: sum(int, int)
    The following mapping works:
    fieldA = colA, fieldB = colB, sum(fieldA,fieldB)=colSum
    This will store the sum of fieldA and fieldB into colSum.
  • Improve handling of search queries. (DAT-309)
    You can supply a DataStax Enterprise search predicate using the solr_query mechanism. For example, assume you create a search index on the dsbulkblog.iris_with_id table:
    cqlsh -e "CREATE SEARCH INDEX IF NOT EXISTS ON dsbulkblog.iris_with_id"
    You can issue a query for just the Iris-setosa rows:
    dsbulk unload -query "SELECT id, petal_length, petal_width, \
     sepal_length, sepal_width, species FROM dsbulkblog.iris_with_id \
      WHERE solr_query = '{\\\"q\\\": \\\"species:Iris-setosa\\\"}'"
  • Ability to hard-limit the number of concurrent continuous paging sessions. (DAT-380)

    DataStax Bulk Loader adds a new setting: executor.continuousPaging.maxConcurrentQueries (default: 60). It sets the maximum number of continuous paging queries that can be carried out in parallel. Set this number to a value equal to or less than the value configured server-side for continuous_paging.max_concurrent_sessions in the cassandra.yaml configuration file, which is also 60 by default; otherwise, some requests may be rejected. You can disable executor.continuousPaging.maxConcurrentQueries by assigning any negative value or 0.

  • Ability to skip unloading or loading the solr_query column. (DAT-365)

    DataStax Bulk Loader will skip the solr_query column when loading and unloading.

1.3.2 Resolved issues

  • Setting executor.maxInFlight to a negative value triggers fatal error. (DAT-392)
  • Murmur3TokenRangeSplitter should allow long overflows when splitting ranges. (DAT-334)
  • CSV connector trims trailing white space when reading data. (DAT-339)
  • Avoid overflows in CodecUtils.numberToInstant. (DAT-368)
  • Call to ArrayBackedRow.toString() causes fatal NPE. (DAT-369)

DataStax Bulk Loader 1.2.0 release notes

Changes, enhancements, and resolved issues for DataStax Bulk Loader 1.2.0.

1 August 2018

DataStax Bulk Loader 1.2.0 release notes include:

1.2.0 Changes and enhancements

After upgrade to 1.2.0, be sure to review and adjust scripts to use changed settings.

  • Improve range split algorithm in multi-DC and vnodes environments. (DAT-252)
  • Support simplified notation for JSON arrays and objects in collection fields. (DAT-317)

1.2.0 Resolved issues

  • CSVWriter trims leading/trailing whitespace in values. (DAT-302)
  • CSV connector fails when the number of columns in a record is greater than 512. (DAT-311)
  • Bulk Loader fails when mapping contains a primary key column mapped to a function. (DAT-326)

DataStax Bulk Loader 1.1.0 release notes

Changes, enhancements, and resolved issues for DataStax Bulk Loader 1.1.0.

18 June 2018

DataStax Bulk Loader 1.1.0 release notes include:

1.1.0 Changes and enhancements

After upgrade to 1.1.0, be sure to review and adjust scripts to use changed settings.

  • Combine batch.mode and batch.enabled into a single setting: batch.mode. If you are using the batch.enabled setting in scripts, change to batch.mode with value DISABLED. (DAT-287)
  • Improve handling of Univocity exceptions. (DAT-286)
  • Logging improvements. (DAT-290)
    • Log messages are written only to operation.log and no longer print to stdout.
    • Configurable logging levels with the log.verbosity setting.
    • The setting log.ansiEnabled is changed to log.ansiMode.
  • New count workflow. (DAT-291, DAT-299)
    • Supports counting rows in a table.
    • Configurable counting mode.
    • When mode = partitions, configurable number of partitions to count. Support to count the number of rows for the N biggest partitions in a table.
  • Counter tables are supported for load and unload. (DAT-292)
  • Improve validation to include user-supplied queries and mappings. (DAT-294)
  • The codec.timestamp CQL_DATE_TIME setting is renamed to CQL_TIMESTAMP. Adjust scripts to use the new setting. (DAT-298)

1.1.0 Resolved issues

  • Generated query does not contain all token ranges when a range wraps around the ring. (DAT-295)
  • Empty map values do not work when loading using dsbulk. (DAT-297)
  • DSBulk cannot handle columns of type list<timestamp>. (DAT-288)
  • Generated queries do not respect indexed mapping order. (DAT-289)
  • DSBulk fails to start with Java 10+. (DAT-300)

DataStax Bulk Loader 1.0.2 release notes

Release notes for DataStax Bulk Loader 1.0.2.

5 June 2018

DataStax Bulk Loader 1.0.2 release notes include:

1.0.2 Changes and enhancements

  • DataStax Bulk Loader 1.0.2 is bundled with DSE 6.0.1. (DSP-16206)
  • Configure whether to use ANSI colors and other escape sequences in log messages printed to standard output and standard error. (DAT-249)

DataStax Bulk Loader 1.0.1 release notes

Release notes for DataStax Bulk Loader 1.0.1.

17 April 2018

DataStax Bulk Loader 1.0.1 release notes include:

1.0.1 Changes and enhancements

  • DataStax Bulk Loader (dsbulk) version 1.0.1 is automatically installed with DataStax Enterprise, and can also be installed as a standalone tool. DataStax Bulk Loader 1.0.1 is supported for use with DSE 5.0 and later. (DSP-13999, DSP-15623)
  • Support to manage special characters on the command line and in the configuration file. (DAT-229)
  • Improve error messages for incorrect mapping. (DAT-235)
  • Improved monitoring options. (DAT-238)
  • Detect console width on Windows. (DAT-240)
  • Null words are supported by all connectors. The schema.nullStrings setting is changed to codec.nullWords. The convertTo and convertFrom methods are renamed. See Codec options and Schema options. (DAT-241)
  • Use Logback to improve filtering to make stack traces more readable and useful. On ANSI-compatible terminals, the date prints in green, the hour in cyan, the level is blue (INFO) or red (WARN), and the message prints in black. (DAT-242)
  • Improved messaging for completion with errors. (DAT-243)
  • Settings schema.allowExtraFields and schema.allowMissingFields are added to reference.conf. (DAT-244)
  • Support is dropped for using :port to specify the port to connect to. Specify the port for all hosts only with driver.port. (DAT-245)

1.0.1 Resolved issues

  • Numeric overflows should display the original input that caused the overflow. (DAT-237)
  • Null words are not supported by all connectors. (DAT-241)
  • Addresses might not be properly translated when cluster has custom native port. (DAT-245)