DataStax Bulk Loader release notes

Release notes for DataStax Bulk Loader.

DataStax Bulk Loader can migrate data in CSV or JSON format into DSE from another DSE or Apache Cassandra™ cluster.
  • Can unload data from any Cassandra 2.1 or later data source
  • Can load data to DSE 5.0 or later

DataStax Bulk Loader 1.3.3 release notes

Resolved issue for DataStax Bulk Loader 1.3.3.

13 March 2019

DataStax Bulk Loader 1.3.3 release notes summarize the resolved issue.

1.3.3 Resolved issue

Export of varchar column containing JSON may truncate data. (DAT-400)

Columns of type varchar that contain JSON are now exported “as is,” meaning DataStax Bulk Loader does not attempt to parse the JSON payload.

For example, assume you had a column col1 whose value was:
'{"foo":42}'
This was previously exported as shown below. That is, the contents of the column were parsed into a JSON node:
col1 = {"foo":42}
In DataStax Bulk Loader 1.3.3, the JSON {"foo":42} in a varchar column is exported as a string:
col1 = "{\"foo\":42}"
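To reproduce this, a minimal unload sketch might look like the following (the keyspace, table, and output location are placeholders):
dsbulk unload -k myks -t mytable -url /tmp/export
In the resulting output, col1 holds the original JSON text, quoted and escaped as shown above.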

DataStax Bulk Loader 1.3.2 release notes

Changes, enhancements, and resolved issues for DataStax Bulk Loader 1.3.2.

20 February 2019

DataStax Bulk Loader 1.3.2 release notes include:

1.3.2 Changes and enhancements

After upgrading to 1.3.2, be sure to review and adjust scripts to use changed settings.

  • Print basic information about the cluster. (DAT-340)

    Refer to Printing cluster information.

  • Unload timestamps as units since an epoch. (DAT-364)
    Note: Datasets containing numeric data that are intended to be interpreted as units since a given epoch require the setting codec.timestamp=UNITS_SINCE_EPOCH. Failing to specify this special format will result in all records being rejected due to an invalid timestamp format. Refer to Codec options.
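    For example, a load of a CSV file whose timestamp column holds values in units since the epoch might look like the following sketch (the file, keyspace, and table names are placeholders):
    dsbulk load -url data.csv -k myks -t mytable \
      --codec.timestamp UNITS_SINCE_EPOCH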
  • Provide better documentation on how to choose the best batching strategy. (DAT-353)

    Refer to Batch options for the dsbulk command.

  • Implement unload and count for materialized views. (DAT-385)
  • Calculate batch size dynamically - Adaptive Batch Sizing. (DAT-352)
    The new setting, batch.maxSizeInBytes, defaults to -1 (unlimited).
    Note: batch.maxBatchSize is deprecated; instead, use batch.maxBatchStatements.
  • batch.bufferSize should be a multiple of batch.maxBatchStatements. (DAT-389)

    If batch.bufferSize is set to a value less than or equal to 0, it defaults to 4 times batch.maxBatchStatements.
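
    For example, a load with explicit batching limits might look like the following sketch (the file, keyspace, and table names are placeholders):
    dsbulk load -url data.csv -k myks -t mytable \
      --batch.maxBatchStatements 32 --batch.bufferSize 128 \
      --batch.maxSizeInBytes 5242880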

  • Improve support for lightweight transactions. (DAT-384)
    DataStax Bulk Loader can detect write failures due to a failed Compare and Set (CAS) write. Records that could not be inserted will appear in two new files:
    1. paxos.bad is a new "bad file" devoted to CAS write failures.
    2. paxos-errors.log is a new debug file devoted to CAS write failures.
  • Extend DataStax Bulk Loader rate limiting capability to reads. (DAT-336)

    Previously, the rate limiter used by DataStax Bulk Loader, adjustable with the --executor.maxPerSecond setting, applied only to writes. The rate limiter now also applies to reads: it counts the number of rows received rather than the number of requests sent.
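
    For example, the following sketch caps the read rate of an unload at roughly 50,000 rows per second (keyspace, table, and output location are placeholders):
    dsbulk unload -k myks -t mytable -url /tmp/export \
      --executor.maxPerSecond 50000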

  • Expose settings to control how to interpret empty fields in CSV files. (DAT-344)
    There are two new settings for the CSV connector:
    1. nullValue
    2. emptyValue
    Previously, when reading a CSV file, the connector emitted an empty string when a field was empty and unquoted. Starting with DataStax Bulk Loader 1.3.2, the CSV connector returns a null value in such situations by default. In most cases this makes no difference; the only noticeable change is for columns of type VARCHAR or ASCII, where the stored value is null instead of an empty string.
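    For example, the following sketch overrides the default handling of empty unquoted fields (the file, keyspace, and table names are placeholders, and the settings are assumed to live under the connector.csv prefix):
    dsbulk load -url data.csv -k myks -t mytable \
      --connector.csv.nullValue "N/A"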
  • Allow functions to appear in mapping variables. (DAT-327)
    Previously, for loads only, a mapping entry could contain a function in place of a field on the left side of the assignment. This functionality has been extended to unloads. For example, loads may continue to use:
    now() = column1
    On load, the result of the now() function is inserted into column1 for every row.
    For unloads, you can export the result of now() as fieldA for every row read. For example:
    fieldA = now()
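    A full unload command using such a mapping might look like the following sketch (keyspace, table, output location, and column names are placeholders):
    dsbulk unload -k myks -t mytable -url /tmp/export \
      --schema.mapping "fieldA = now(), fieldB = column1"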
  • Detect writetime variable when unloading. (DAT-367)
    You can specify a writetime function in a mapping definition when unloading. For example:
    fieldA = column1, fieldB = writetime(column1)
    In this example, because the data type is detected, fieldB will be exported as a timestamp, not as an integer.
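    A full unload command using this mapping might look like the following sketch (keyspace, table, and output location are placeholders):
    dsbulk unload -k myks -t mytable -url /tmp/export \
      --schema.mapping "fieldA = column1, fieldB = writetime(column1)"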
  • Relax constraints on queries for the Count workflow. (DAT-308)

    The schema.query setting can contain any SELECT clause when counting rows.

  • Automatically add token range restriction to WHERE clauses. (DAT-319)
    When a custom query is provided with --schema.query, to enable read parallelization, it is no longer necessary to provide a WHERE clause using the form:
    WHERE token(pk) > :start AND token(pk) <= :end
    If the query does not contain a WHERE clause, DataStax Bulk Loader will automatically generate that WHERE clause. However, if the query contains a WHERE clause, DataStax Bulk Loader will not be able to parallelize the read operations.
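    For example, the following sketch supplies a custom query with no WHERE clause; DataStax Bulk Loader adds the token range restriction itself so the unload can be parallelized (the keyspace, table, and column names are placeholders):
    dsbulk unload -url /tmp/export \
      -query "SELECT pk, col1 FROM myks.mytable"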
  • Should allow JSON array mapping with UDTs. (DAT-316)

    Previously, when loading User Defined Types (UDTs) it was required that the input be a JSON object to allow for field-by-field mapping. Starting with DataStax Bulk Loader 1.3.2, a JSON array can also be mapped to UDTs, in which case the mapping is based on field order.
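
    For example, assuming a UDT defined as address(street text, zip text), either of the following JSON values could be loaded into an address column; the array form is mapped by field position:
    {"street": "1 Main St", "zip": "12345"}
    ["1 Main St", "12345"]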

  • Improve WHERE clause token range restriction detection. (DAT-372)
    When you provide a custom query for unloading, the token range restriction variables can have any name, not only start and end. For example, the following is valid:
    SELECT * FROM table1 WHERE token(pk) > :foo AND token(pk) <= :bar
  • Remove record location URI. (DAT-370)

    DataStax Bulk Loader previously provided a record's URI to uniquely identify the record. However, the URI was very long and difficult to read. You can instead identify a failed record by looking into the record's source statement or row.

  • Allow columns and fields to be mapped more than once. (DAT-373)
    It is possible to map a field/column more than once. The following rules apply:
    • When loading, a field can be mapped to 2 or more columns, but a column cannot be mapped to 2 or more fields. Thus the following mapping is correct: fieldA = column1, fieldA = column2.
    • When unloading, a column can be mapped to 2 or more fields, but a field cannot be mapped to 2 or more columns. Thus the following mapping is correct: fieldA = column1, fieldB = column1.
  • UDT and tuple codecs should respect allowExtraFields and allowMissingFields. (DAT-315)

    The settings schema.allowMissingFields and schema.allowExtraFields apply to UDTs and tuples. For example, if a tuple has three elements, but the JSON input only has two elements, this scenario results in an error if schema.allowMissingFields is false. However, this scenario is accepted if schema.allowMissingFields is true. The missing element in this example is assigned as null.
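
    For example, assuming a column of type tuple<int, int, int> and a JSON input element [1, 2], the following sketch loads the record and stores null for the missing third element (the file, keyspace, and table names are placeholders):
    dsbulk load -url data.json -k myks -t mytable \
      --schema.allowMissingFields true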

  • Add support for DataStax Enterprise 4.8 and lower. (DAT-312)

    DataStax Bulk Loader is compatible with C* 1.2 and later releases, and DataStax Enterprise 3.2 and later releases. All protocol versions are supported. Some features might not be available depending on the protocol version and server version.

    The schema.tokenSplits setting (default: 1000) was added to compensate for the absence of paging in C* 1.2; it controls the number of chunks the token ring is split into.

    For example:
    bin/dsbulk unload -url myData.csv --driver.pooling.local.connections 8 \
      --driver.pooling.local.requests 128 --driver.pooling.remote.requests 128 \
      --schema.tokenSplits 50000 -k test -t test
  • Add support for keyspace-qualified UDFs in mappings. (DAT-378)

    If needed, you can qualify a user-defined function (UDF) with a keyspace name. For example: fieldA = ks1.func1(column1, column2)

  • Allow fields to appear as function parameters on the left side of mapping entries. (DAT-379)
    When loading, a mapping entry can contain a function on the left side that references fields of the dataset. For example, consider the case where:
    • A dataset with two fields: fieldA and fieldB
    • A table with three columns: colA, colB, and colSum
    • A user-defined function: sum(int, int)
    The following mapping works:
    fieldA = colA, fieldB = colB, sum(fieldA,fieldB)=colSum
    This will store the sum of fieldA and fieldB into colSum.
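    A full load command using this mapping might look like the following sketch (the file, keyspace, and table names are placeholders):
    dsbulk load -url data.csv -k myks -t mytable \
      --schema.mapping "fieldA = colA, fieldB = colB, sum(fieldA,fieldB) = colSum"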
  • Improve handling of search queries. (DAT-309)
    You can supply a DataStax Enterprise search predicate using the solr_query mechanism. For example, assume you create a search index on the dsbulkblog.iris_with_id table:
    cqlsh -e "CREATE SEARCH INDEX IF NOT EXISTS ON dsbulkblog.iris_with_id"
    You can issue a query for just the Iris-setosa rows:
    dsbulk unload -query "SELECT id, petal_length, petal_width, \
     sepal_length, sepal_width, species FROM dsbulkblog.iris_with_id \
      WHERE solr_query = '{\\\"q\\\": \\\"species:Iris-setosa\\\"}'"
  • Ability to hard-limit the number of concurrent continuous paging sessions. (DAT-380)

    DataStax Bulk Loader adds a new setting: executor.continuousPaging.maxConcurrentQueries (default: 60). It sets the maximum number of continuous paging queries that can run in parallel. Set it to a value less than or equal to the server-side continuous_paging.max_concurrent_sessions value in the cassandra.yaml configuration file (also 60 by default); otherwise, some requests may be rejected. To disable this limit, set the value to 0 or any negative number.
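
    For example, the following sketch limits an unload to 30 concurrent continuous paging queries (keyspace, table, and output location are placeholders):
    dsbulk unload -k myks -t mytable -url /tmp/export \
      --executor.continuousPaging.maxConcurrentQueries 30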

  • Ability to skip unloading or loading the solr_query column. (DAT-365)

    DataStax Bulk Loader will skip the solr_query column when loading and unloading.

1.3.2 Resolved issues

  • Setting executor.maxInFlight to a negative value triggers fatal error. (DAT-392)
  • Murmur3TokenRangeSplitter should allow long overflows when splitting ranges. (DAT-334)
  • CSV connector trims trailing white space when reading data. (DAT-339)
  • Avoid overflows in CodecUtils.numberToInstant. (DAT-368)
  • Call to ArrayBackedRow.toString() causes fatal NPE. (DAT-369)

DataStax Bulk Loader 1.2.0 release notes

Changes, enhancements, and resolved issues for DataStax Bulk Loader 1.2.0.

1 August 2018

DataStax Bulk Loader 1.2.0 release notes include:

1.2.0 Changes and enhancements

After upgrade to 1.2.0, be sure to review and adjust scripts to use changed settings.

  • Improve range split algorithm in multi-DC and vnodes environments. (DAT-252)
  • Support simplified notation for JSON arrays and objects in collection fields. (DAT-317)

1.2.0 Resolved issues

  • CSVWriter trims leading/trailing whitespace in values. (DAT-302)
  • CSV connector fails when the number of columns in a record is greater than 512. (DAT-311)
  • Bulk Loader fails when mapping contains a primary key column mapped to a function. (DAT-326)

DataStax Bulk Loader 1.1.0 release notes

Changes, enhancements, and resolved issues for DataStax Bulk Loader 1.1.0.

18 June 2018

DataStax Bulk Loader 1.1.0 release notes include:

1.1.0 Changes and enhancements

After upgrade to 1.1.0, be sure to review and adjust scripts to use changed settings.

  • Combine batch.mode and batch.enabled into a single setting: batch.mode. If your scripts set batch.enabled to false, use batch.mode with the value DISABLED instead. (DAT-287)
  • Improve handling of Univocity exceptions. (DAT-286)
  • Logging improvements. (DAT-290)
    • Log messages are written only to operation.log and no longer print to stdout.
    • Configurable logging levels with the log.verbosity setting.
    • The setting log.ansiEnabled is changed to log.ansiMode.
  • New count workflow. (DAT-291, DAT-299)
    • Supports counting rows in a table.
    • Configurable counting mode.
    • When mode = partitions, the number of partitions to count is configurable, so you can count the rows in the N biggest partitions of a table.
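    For example, a count of the rows in the 10 biggest partitions of a table might look like the following sketch (keyspace and table names are placeholders; the stats.modes and stats.numPartitions setting names are assumed here):
    dsbulk count -k myks -t mytable \
      --stats.modes partitions --stats.numPartitions 10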
  • Counter tables are supported for load and unload. (DAT-292)
  • Improve validation to include user-supplied queries and mappings. (DAT-294)
  • The codec.timestamp CQL_DATE_TIME setting is renamed to CQL_TIMESTAMP. Adjust scripts to use the new setting. (DAT-298)

1.1.0 Resolved issues

  • Generated query does not contain all token ranges when a range wraps around the ring. (DAT-295)
  • Empty map values do not work when loading using dsbulk. (DAT-297)
  • DSBulk cannot handle columns of type list<timestamp>. (DAT-288)
  • Generated queries do not respect indexed mapping order. (DAT-289)
  • DSBulk fails to start with Java 10+. (DAT-300)

DataStax Bulk Loader 1.0.2 release notes

Release notes for DataStax Bulk Loader 1.0.2.

5 June 2018

DataStax Bulk Loader 1.0.2 release notes include:

1.0.2 Changes and enhancements

  • DataStax Bulk Loader 1.0.2 is bundled with DSE 6.0.1. (DSP-16206)
  • Configure whether to use ANSI colors and other escape sequences in log messages printed to standard output and standard error. (DAT-249)

DataStax Bulk Loader 1.0.1 release notes

Release notes for DataStax Bulk Loader 1.0.1.

17 April 2018

DataStax Bulk Loader 1.0.1 release notes include:

1.0.1 Changes and enhancements

  • DataStax Bulk Loader (dsbulk) version 1.0.1 is automatically installed with DataStax Enterprise, and can also be installed as a standalone tool. DataStax Bulk Loader 1.0.1 is supported for use with DSE 5.0 and later. (DSP-13999, DSP-15623)
  • Support to manage special characters on the command line and in the configuration file. (DAT-229)
  • Improve error messages for incorrect mapping. (DAT-235)
  • Improved monitoring options. (DAT-238)
  • Detect console width on Windows. (DAT-240)
  • Null words are supported by all connectors. The schema.nullStrings setting is changed to codec.nullWords, and the convertTo and convertFrom methods are renamed. See Codec options and Schema options. (DAT-241)
  • Use Logback to improve filtering to make stack traces more readable and useful. On ANSI-compatible terminals, the date prints in green, the hour in cyan, the level is blue (INFO) or red (WARN), and the message prints in black. (DAT-242)
  • Improved messaging for completion with errors. (DAT-243)
  • Settings schema.allowExtraFields and schema.allowMissingFields are added to reference.conf. (DAT-244)
  • Support is dropped for using :port to specify the port to connect to. Specify the port for all hosts only with driver.port. (DAT-245)

1.0.1 Resolved issues

  • Numeric overflows should display the original input that caused the overflow. (DAT-237)
  • Null words are not supported by all connectors. (DAT-241)
  • Addresses might not be properly translated when cluster has custom native port. (DAT-245)