DataStax Bulk Loader Release Notes

Welcome to the DataStax Bulk Loader release notes.

DataStax Bulk Loader can load and unload data in CSV or JSON format (or via CSV/JSON compressed files) into or out of:

  • DataStax Astra DB

  • Hyper-Converged Database (HCD) 1.0 databases

  • DataStax Enterprise (DSE) 5.1, 6.8, and 6.9 databases

  • Open source Apache Cassandra® 2.1 and later databases

DataStax Bulk Loader is supported on Linux, macOS, and Windows platforms.

DataStax Bulk Loader 1.11 release notes

13 July 2023

DataStax Bulk Loader 1.11 adds support for the vector<type, dimension> data type when used with Astra DB databases created with the Vector Search feature.

For implementation details, see the DataStax Bulk Loader open-source GitHub repo.
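
As a brief illustration, the following sketch loads rows into a table with a vector column. The keyspace, table, and CSV layout are hypothetical, and it assumes that vector values appear in the CSV as quoted array literals:

    cqlsh -e "CREATE TABLE ks1.products (id int PRIMARY KEY, embedding vector<float, 3>)"

    # data.csv (assumed layout):
    #   id,embedding
    #   1,"[0.12, 0.34, 0.56]"

    dsbulk load -url data.csv -k ks1 -t products -header true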

DataStax Bulk Loader 1.10 release notes

12 August 2022

Enhancement

DataStax Bulk Loader 1.9.1 release notes

10 August 2022

Enhancement

  • Adjusted the default throughput from DataStax Bulk Loader to Astra DB to a more conservative setting to avoid triggering a rate limit exception.

As a reminder, you can adjust the client-side rate limit through either a configuration file or via the --engine.maxConcurrentQueries command line interface (CLI) option.
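
For example, either form below caps the concurrency; the keyspace, table, and value of 8 are illustrative:

    # Command-line option:
    dsbulk load -url data.csv -k ks1 -t table1 --engine.maxConcurrentQueries 8

    # Or the equivalent setting in a configuration file passed with -f:
    # application.conf
    dsbulk.engine.maxConcurrentQueries = 8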

DataStax Bulk Loader 1.9.0 release notes

05 April 2022

Changes and enhancements

  • At the conclusion of a dsbulk run, results are printed from the Count workflow even if failures occur. In general, when querying billions of rows, some failures are expected. However, reporting these failures helps indicate whether the operation needs to be retried. BULK-18

  • Upgraded driver to 4.14.0.

  • When issuing a BATCH query to load data, DSBulk unwraps the statements of this query and incorporates them into its own batching function, creating a consolidated BATCH message. This avoids nesting BATCH statements inside protocol-level BATCH messages, which is forbidden. A consolidated BATCH message also greatly improves performance when loading with timestamp and TTL preservation enabled. BULK-23

  • Added support for Prometheus. BULK-26

  • Previously, when unloading data using the -timestamp or -ttl options (to automatically preserve cell timestamps and time-to-live (TTL)), the operation failed if the table being unloaded contained collections. Unsupported types are now excluded from an automatic timestamp and TTL unload, and a warning is logged explaining that some timestamps and TTLs may be lost. BULK-25

  • DSBulk distribution archives are now uploaded to Maven Central, allowing DSBulk distros to be downloaded from there as well as from the DataStax downloads site. BULK-24

  • Removed the check for an empty primary key column in a record. This allows the server to handle BLOB types with empty buffers in any primary key column, as well as a composite partition key that accepts empty blobs or strings. BULK-28

  • Added support for nested functions in DSBulk mappings. BULK-29

  • Added support for literal strings in mappings outside of function arguments. BULK-3
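
To illustrate the last two items, here is a sketch of a load mapping that combines a literal string with a nested function call. The table and column names and the literal 'web' are hypothetical, and toUnixTimestamp(now()) stands in for any supported nested function call:

    # 'web' is a literal string mapped to a text column named source;
    # toUnixTimestamp(now()) is a nested function call mapped to a bigint column named created_ms.
    dsbulk load -url data.csv -k ks1 -t events \
      -m "fieldA = id, fieldB = payload, 'web' = source, toUnixTimestamp(now()) = created_ms"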

DataStax Bulk Loader 1.8.0 release notes

11 March 2021

Changes and enhancements

  • New support in DataStax Bulk Loader 1.8.0 for accepting Well-known Binary (WKB) geometry data, along with a new Codec setting.

    • When loading Geo data, in addition to existing support for Well-known Text (WKT) and GeoJson (JSON) data formats, DataStax Bulk Loader now also accepts WKB data.

    • Use the existing --codec.binary setting to encode any WKB data in HEX or BASE64.

    • A new setting, --codec.geo, has been added to declare the strategy used when converting geometry types to strings.

      There is an upgrade impact with Geo data. Starting in this 1.8.0 release, when unloading data with Geo types to JSON files, all Geo data is now encoded in WKT format by default, instead of GeoJson. To restore the pre-1.8.0 behavior, set --codec.geo to JSON.

    • When unloading Geo data, use the new codec.geo setting to configure the desired output format.

          # Strategy to use when converting geometry types to strings. Geometry types are only available
          # in DataStax Enterprise (DSE) 5.0 or higher. Only applicable when unloading columns of CQL type
          # `Point`, `LineString` or `Polygon`, and only if the connector in use requires stringification.
          # Valid values are:
          #
          # - WKT: Encode the data in Well-known text format. This is the default strategy.
          # - WKB: Encode the data in Well-known binary format. The actual encoding will depend on the
          #   value chosen for the `codec.binary` setting (HEX or BASE64).
          # - JSON: Encode the data in GeoJson format.
          # Type: string
          # Default value: "WKT"
          # codec.geo = "WKT"
    • For details, see the Codec Options topic.

  • Added options to automatically preserve Time-To-Live (TTL) and timestamps.

    • Query generation

      Two new settings that allow for the transparent handling of cell timestamps and TTL:

      • schema.preserveTimestamp: when true, timestamps are preserved when loading and unloading. Default is false. See schema.preserveTimestamp

      • schema.preserveTtl: when true, TTLs are preserved when loading and unloading. Default is false. See schema.preserveTtl.

      • These settings work best when DataStax Bulk Loader is responsible for generating the queries. DataStax Bulk Loader will generate special queries that export and import all the required data. Overall, the new feature allows a table to be exported, then imported, while preserving all timestamps and TTLs; the heavy-lifting of generating the appropriate queries is performed entirely by DataStax Bulk Loader.

    • Mappings

      Some changes were also made to the schema.mapping grammar, to allow individual cell timestamps and TTLs to be easily mapped in user-supplied mappings:

      • When unloading, nothing changes: the usual way to export timestamps and TTLs is still to apply the writetime and ttl functions to a given column. For example, to export one column, its timestamp, and its TTL to three different fields, you could use the following mapping:

        field1 = col1, field2 = writetime(col1), field3 = ttl(col1)
      • When loading, however, it is now also possible to use the same writetime and ttl functions to map a field’s value to the timestamp or TTL of one or more columns in the table.
        For example, the following mapping would use field3’s value as the writetime of columns col1 and col2, and field4’s value as the TTL of those columns:

        field1 = col1, field2 = col2, field3 = writetime(col1,col2), field4 = ttl(col1,col2)
      • As a shortcut, when loading, you can also use the special syntax writetime(*) and ttl(*): they mean the timestamp and TTL of all columns in the row, except those already mapped elsewhere.
        For example, consider the following mapping:

        field1 = col1, field2 = col2, field3 = col3, field4 = writetime(col1), field5 = writetime(*)

        This mapping would use field4’s value as the timestamp of column col1, and field5’s value as the timestamp of all remaining columns: that is, columns col2 and col3.


        Starting in DataStax Bulk Loader 1.8.0, the special tokens __timestamp and __ttl are deprecated (but still honored). If used, a warning message is logged. When you can, replace any __timestamp and __ttl tokens with writetime(*) and ttl(*), respectively.
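
For example, a minimal export-and-reimport sketch that uses the new preservation settings; the keyspace ks1, table table1, and export directory are illustrative:

    dsbulk unload -k ks1 -t table1 -url ./table1_export \
      --schema.preserveTimestamp true --schema.preserveTtl true

    dsbulk load -k ks1 -t table1 -url ./table1_export \
      --schema.preserveTimestamp true --schema.preserveTtl true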

1.8.0 Resolved issues

  • Fixed issues identified with Geo data. For example, "Could not deserialize column s_geo of type Set(Custom(org.apache.cassandra.db.marshal.PointType), not frozen) as java.lang.String" was returned while unloading or loading a table with PointType. The new --codec.geo setting introduced in 1.8.0, summarized above, addresses this issue.

  • Corrected several occurrences of a documentation typo that previously showed --stats.mode. The correct option for the dsbulk count command is --stats.modes or --dsbulk.stats.modes; that is, the plural form is correct. See Count options.

DataStax Bulk Loader 1.7.0 release notes

09 September 2020

Changes and enhancements

  • A new setting: --log.sources, --dsbulk.log.sources boolean

    Whether to print record sources in debug files, and enable "bad files" with load operations. See --dsbulk.log.sources.

  • A new setting: --monitoring.console, --dsbulk.monitoring.console boolean

    Enable or disable console reporting. See --dsbulk.monitoring.console.

  • A clarification regarding writing empty strings as quoted empty fields. To insert an empty string, with the intention to override a value that existed previously in the given column, you can insert an empty quoted field in the data file, as in this CSV example:

    foo,"",bar

    This example inserts an empty string into the column mapped to the second field, using default settings. For this scenario, do not set --connector.csv.nullValue '""' because that setting applies to empty non-quoted fields.

1.7.0 Resolved issue

  • Fixed null pointer exception when reading from Stdin.

DataStax Bulk Loader 1.6.0 release notes

21 July 2020

Changes and enhancements

  • Starting with version 1.6.0, DataStax Bulk Loader is available under the Apache-2.0 license as open-source software (OSS). This change makes it possible for the open-source community of developers to contribute features that enable loading and unloading CSV/JSON data to and from Apache Cassandra®, DataStax Enterprise (DSE), and DataStax Astra databases.

  • The product’s official name is now DataStax Bulk Loader®.

  • The public GitHub repo is https://github.com/datastax/dsbulk.

  • Some features are specific to DSE environments, including parameters associated with DataStax Graph. In topics such as Schema options, Graph-only features are highlighted with an icon.

  • Ability to specify a list of allowed or denied hosts. Use the following settings:

  • A new setting in Engine Options: dsbulk.engine.maxConcurrentQueries.

    This setting regulates the number of queries executed in parallel. It applies to all types of dsbulk operations (load, unload, count). The setting requires a valid integer value > 0; the NC notation is also possible — for example, 2C means "twice the number of cores".

    See engine.maxConcurrentQueries, including important throughput considerations.

  • The setting executor.continuousPaging.maxConcurrentQueries is deprecated. Instead use --dsbulk.engine.maxConcurrentQueries.

  • In prior releases, the following settings were for dsbulk unload only:

    • connector.csv.maxConcurrentFiles

    • connector.json.maxConcurrentFiles

    In 1.6.0, you can also use these settings with dsbulk load, where they set the maximum number of files that can be read in parallel. For important considerations, see connector.{csv|json}.maxConcurrentFiles.

  • DataStax Bulk Loader accepts binary input in the following formats: BASE64 or HEX. For dsbulk unload only, you can choose the format when converting binary data to strings. See codec.binary in Codec options.

  • Raised the driver default timeouts to 5 minutes for the following:

    • datastax-java-driver.basic.request.timeout="5 minutes"

    • datastax-java-driver.advanced.continuous-paging.timeout.first-page="5 minutes"

    • datastax-java-driver.advanced.continuous-paging.timeout.other-pages="5 minutes"

  • Added support for multi-character delimiters.

    For example, to load a CSV file that uses '||' as the delimiter, add a -delim '\|\|' parameter to the dsbulk load command.

    Example CSV format:

    Foo||bar||qix
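
    A complete command for this example might look like the following; the keyspace and table names are illustrative:

      dsbulk load -url myData.csv -k ks1 -t table1 -delim '\|\|'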

1.6.0 Resolved issues

  • When providing a custom query to dsbulk count, only the global stats mode (stats.modes) is now accepted. The query is executed "as is." See Count options.

  • The ORDER BY, GROUP BY and LIMIT clauses now cause the query to be executed "as is," without parallelization. See Schema options.

  • Fixed an issue where a zero-length array was exported as "" (empty string). The issue was that an empty string, when reimported to a blob column, was interpreted as null, instead of as a zero-length array.

  • Per-file limits, maxRecords and skipRecords for CSV and JSON data, were not applied when there was more than one file to read.

DataStax Bulk Loader 1.5.0 release notes

26 March 2020

Changes and enhancements

DataStax Bulk Loader 1.5.0 added support for the following features.

  • Previous DataStax Bulk Loader releases included support for loading and unloading graph data using prior settings and workflows. Starting with this DataStax Bulk Loader 1.5.0 release, the product provides an improved user experience to set related DataStax Graph (DSG) properties, plus enhanced validation of requested DSG operations. The changes include new DataStax Bulk Loader schema settings that are specific to DSG operations. The new options are:

    • -g, --schema.graph, --dsbulk.schema.graph string

    • -e, --schema.edge, --dsbulk.schema.edge string

    • -v, --schema.vertex, --dsbulk.schema.vertex string

    • -from, --schema.from, --dsbulk.schema.from string

    • -to, --schema.to, --dsbulk.schema.to string

    For details, refer to Schema options.

  • With the enhanced support for DSG features, DataStax Bulk Loader displays metrics for graph data in vertices per second or edges per second. For non-graph data, the metrics continue to be displayed in rows per second. The metrics display type is determined automatically based on how the table DDL was created and whether the DataStax Bulk Loader schema options include graph settings; that is, you do not need to configure graph-specific monitoring options for the metrics.

  • DataStax Bulk Loader 1.5.0 adds support for release 4.5.0 of the DataStax Java driver.

1.5.0 Resolved issue

DataStax addressed an issue with the deserialization of untrusted data by upgrading to the latest 2.9.10 release of the FasterXML jackson-databind library, although DataStax Bulk Loader was not directly affected.

DataStax Bulk Loader 1.4.1 release notes

16 December 2019

Changes and enhancements

DataStax Bulk Loader 1.4.1 adds support for using the dsbulk load command to write CSV/JSON data to open source Apache Cassandra® 2.1 and later database tables. Previously, you could only use dsbulk unload and dsbulk count commands with Apache Cassandra.

The new support in 1.4.1 is in addition to the existing functionality to use all dsbulk commands with DataStax Enterprise (DSE) and DataStax Astra databases.

For information about downloading the secure connect bundle ZIP via the Astra Portal, in advance of entering the dsbulk command, refer to Working with Secure Connect Bundle in the Astra DB documentation.

1.4.1 Resolved issue

When exporting data, the \u0000 null character is now enclosed in quotes, so that the exported data can be loaded subsequently with the same DataStax Bulk Loader settings. By default, the null character is used as the comment character.

DataStax Bulk Loader 1.4.0 release notes

12 November 2019

Changes and enhancements

DataStax Bulk Loader 1.4.0 has been upgraded to use the latest 2.x version of the DataStax Java driver.

Before upgrading to DataStax Bulk Loader 1.4.0, note that as a result of the driver enhancements, this release supports DSE 4.7 and later, and Apache Cassandra® 2.1 and later. Prior releases of DSE and Apache Cassandra are not supported. If you are using earlier releases of DSE or Apache Cassandra, you must remain on DataStax Bulk Loader 1.3.4.

  • Many new driver options are available directly with dsbulk commands via the datastax-java-driver prefix.

    A number of previously available options have been deprecated, as indicated in the DataStax Bulk Loader reference topics. Those prior options are still supported, but may be removed in a subsequent release. When you can, review and adjust your command scripts and configuration files to take advantage of the new options that use the datastax-java-driver prefix.

    For details about the DataStax Java driver enhancements, start with the Driver options topic. Also refer to the Executor options topic. Several of the Driver and Executor options have been deprecated and replaced by settings that use the datastax-java-driver prefix. In addition, the Security options topic has been removed, and its options moved into the SSL section of the Driver options topic.

  • You can connect DataStax Bulk Loader to a cloud-based DataStax Astra database by including the path to the secure connect bundle, and by specifying the username and password entered when the database was created.

    For information about downloading the secure connect bundle ZIP via the Astra Portal, in advance of entering the dsbulk command, refer to Working with Secure Connect Bundle in the Astra DB documentation.

    Also see the examples in the dsbulk load topic.

  • You can use DataStax Bulk Loader to load/unload your table data from/to compressed CSV or JSON files.

    For details, refer to the --connector.{csv | json}.compression parameter.
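
    For example, the following sketch loads a gzip-compressed CSV file. The file, keyspace, and table names are illustrative, and gzip is assumed to be among the accepted compression values:

      dsbulk load -url data.csv.gz -k ks1 -t table1 --connector.csv.compression gzip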

DataStax Bulk Loader 1.3.4 release notes

16 July 2019

1.3.4 Changes and enhancements

After upgrading to 1.3.4, be sure to review and adjust your scripts to use the changed settings.

  • The DataStax Bulk Loader Help provides an entry for --version.

  • Improved error message provided when a row fails to decode.

    In the DataStax Bulk Loader logging options, the format is: -maxErrors, --log.maxErrors ( number | "N%" )

    An updated explanation is provided:

    The maximum number of errors to allow before aborting the entire operation. This setting may be expressed as:

    • An absolute number of errors; in which case, set this value to an integer greater than or equal to zero.

    • Or a percentage of the total rows processed so far; in which case, set this value to a string of the form "N%", where N is a decimal number between 0 and 100 exclusive.

      Example: -maxErrors "20%"

      Setting this value to any negative integer disables the feature, which is not recommended.

  • When a table contains static columns, it is possible that some partitions only contain static data. In this case, that data is exported as a pseudo row where all clustering columns and regular columns are null. Example:

    create table t1 (pk int, cs int static, cc int, v int, primary key (pk, cc));
    insert into t1 (pk, cs) values (1,1);
    select * from t1;
     pk | cc   | cs | v
    ----+------+----+------
      1 | null |  1 | null

    In prior DataStax Bulk Loader releases, you could not import this type of static data, even though the query was valid. For example, the following query was rejected:

    INSERT INTO t1 (pk, cs) values (:pk, :cs);
    Operation LOAD_20190412-134352-912437 failed: Missing required primary key column conversation_id
    from schema.mapping or schema.query.

    DataStax Bulk Loader now allows this valid query.

  • You can use the CQL date and time types with UNITS_SINCE_EPOCH, in addition to timestamp. Previously, you could only use the CQL timestamp type. On the dsbulk command, you can use codec.unit and codec.epoch to convert integers to, or from, these types. Refer to --codec.unit, --dsbulk.codec.unit and --codec.epoch, --dsbulk.codec.epoch. An example follows this list.

  • You can use a new monitoring setting, monitoring.trackBytes, to enable or disable monitoring of DataStax Bulk Loader operations in bytes per second. Because this type of monitoring can consume excessive allocation resources, and in some cases excessive CPU cycles, the setting is disabled by default. If you want monitoring in bytes per second, you must enable it with monitoring.trackBytes. Test and compare the setting in your development environment; if leaving it disabled improves throughput, keep it disabled in production, or enable bytes-per-second monitoring only on an as-needed basis.

  • The default output file name format, defined by the --connector.(csv|json).fileNameFormat string option, no longer includes the thousands separator that was present in the prior default format.

  • In load operations, you can pass the URL of a CSV or JSON data file. In cases where you have multiple URLs, DataStax Bulk Loader 1.3.4 makes this task easier by providing command-line options that let you point to a single file containing all the URLs for the data files.

  • When using REPLICA_SET batch mode, the server may issue query warnings if the number of statements in a single batch exceeds unlogged_batch_across_partitions_warn_threshold. To avoid reporting excessive warning messages in stdout, DataStax Bulk Loader logs only one warning at the beginning of the operation.
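
The following sketch illustrates the codec.unit and codec.epoch options described in the list above. The file, keyspace, table, and column layout are hypothetical, and the example assumes the input stores timestamps as whole seconds elapsed since 2010-01-01:

    dsbulk load -url data.csv -k ks1 -t events \
      --codec.unit SECONDS --codec.epoch "2010-01-01T00:00:00Z"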

1.3.4 Resolved issues

  • DataStax Bulk Loader should reject CSV files containing invalid headers, such as headers that are empty or contain duplicate fields.

  • Logging option -maxErrors 0 does not abort the operation.

  • DataStax Bulk Loader should reject invalid execution IDs. An execution ID is used to create MBean names. DataStax Bulk Loader now validates user-provided IDs to ensure, for example, that an ID does not contain a comma.

DataStax Bulk Loader 1.3.3 release notes

13 March 2019

1.3.3 Resolved issue

Export of varchar column containing JSON may truncate data.

Columns of type varchar that contain JSON are now exported “as is,” meaning DataStax Bulk Loader does not attempt to parse the JSON payload.

For example, assume you had a column col1 whose value was:

'{"foo":42}'

This was previously exported as shown below. That is, the contents of the column were parsed into a JSON node:

col1 = {"foo":42}

In DataStax Bulk Loader 1.3.3, the JSON {"foo":42} in a varchar column is exported as a string:

col1 = "{\"foo\":42}"

DataStax Bulk Loader 1.3.2 release notes

20 February 2019

1.3.2 Changes and enhancements

After upgrading to 1.3.2, be sure to review and adjust scripts to use changed settings.

  • Print basic information about the cluster.

  • Unload timestamps as units since an epoch.

    Datasets containing numeric data that are intended to be interpreted as units since a given epoch require the setting codec.timestamp=UNITS_SINCE_EPOCH. Failing to specify this special format will result in all records being rejected due to an invalid timestamp format. Refer to Codec options.

  • Provide better documentation on how to choose the best batching strategy.

    Refer to Batch options for the dsbulk command.

  • Implement unload and count for materialized views.

  • Calculate batch size dynamically - Adaptive Batch Sizing.

    The new setting, batch.maxSizeInBytes, defaults to -1 (unlimited).

    batch.maxBatchSize is deprecated; instead, use batch.maxBatchStatements.

  • batch.bufferSize should be a multiple of batch.maxBatchStatements.

    By default batch.bufferSize is set to 4 times batch.maxBatchStatements if its value is less than or equal to 0.

  • Improve support for lightweight transactions.

    DataStax Bulk Loader can detect write failures due to a failed LWT write. Records that could not be inserted will appear in two new files:

    1. paxos.bad is a new "bad file" devoted to LWT write failures.

    2. paxos-errors.log is a new debug file devoted to LWT write failures.

      DataStax Bulk Loader also writes any records from failed writes to a .bad file in the operation’s directory, depending on when the failure occurred. For details, refer to Detection of write failures.

  • Extend DataStax Bulk Loader rate limiting capability to reads.

    Previously, the rate limiter used by DataStax Bulk Loader, adjustable via the --executor.maxPerSecond setting, only applied to writes. DataStax Bulk Loader now extends this functionality to reads by making the rate limiter consider the number of rows received instead of the number of requests sent. See the example at the end of this list.

  • Expose settings to control how to interpret empty fields in CSV files.

    There are two new settings for the CSV connector:

    1. nullValue

    2. emptyValue

    Previously, when reading a CSV file, the connector would emit an empty string when a field was empty and non-quoted. By default, starting with DataStax Bulk Loader 1.3.2, the CSV connector returns a null value in such situations, which may not make a difference in most cases. The only noticeable difference is for columns of type VARCHAR or ASCII: the resulting stored value will be null instead of an empty string.

  • Allow functions to appear in mapping variables.

    Previously, for loads only, a mapping entry could contain a function on the left side of the assignment. This functionality has been extended to unloads, where a function can now appear on the right side. For example, loads may continue to use:

    now() = column1

    On load, the result of the now() function is inserted into column1 for every row.

    For unloads, you can export the result of now() as fieldA for every row read. For example:

    fieldA = now()
  • Detect writetime variable when unloading.

    You can specify a writetime function in a mapping definition when unloading. For example:

    fieldA = column1, fieldB = writetime(column1)

    In this example, because the data type is detected, fieldB is exported as a timestamp, not as an integer.

  • Relax constraints on queries for the Count workflow.

    The schema.query setting can contain any SELECT clause when counting rows.

  • Automatically add token range restriction to WHERE clauses.

    When a custom query is provided with --schema.query, to enable read parallelization, it is no longer necessary to provide a WHERE clause using the form:

    WHERE token(pk) > :start AND token(pk) <= :end

    If the query does not contain a WHERE clause, DataStax Bulk Loader automatically generates that WHERE clause. However, if the query already contains a WHERE clause, DataStax Bulk Loader is not able to parallelize the read operations.

  • Should allow JSON array mapping with UDTs.

    Previously, when loading User Defined Types (UDTs) it was required that the input be a JSON object to allow for field-by-field mapping. Starting with DataStax Bulk Loader 1.3.2, a JSON array can also be mapped to UDTs, in which case the mapping is based on field order.

  • Improve WHERE clause token range restriction detection.

    When you provide a custom query for unloading, the token range restriction variables can have any name, not only start and end. For example, the following is valid:

    SELECT * FROM table1 WHERE token(pk) > :foo AND token(pk) <= :bar
  • Remove record location URI.

    DataStax Bulk Loader previously provided a record’s URI to uniquely identify the record. However, the URI was very long and difficult to read. You can instead identify a failed record by looking into the record’s source statement or row.

  • Allow columns and fields to be mapped more than once.

    It is possible to map a field/column more than once. The following rules apply:

    • When loading, a field can be mapped to 2 or more columns, but a column cannot be mapped to 2 or more fields. Thus the following mapping is correct: fieldA = column1, fieldA = column2.

    • When unloading, a column can be mapped to 2 or more fields, but a field cannot be mapped to 2 or more columns. Thus the following mapping is correct: fieldA = column1, fieldB = column1.

  • UDT and tuple codecs should respect allowExtraFields and allowMissingFields.

    The settings schema.allowMissingFields and schema.allowExtraFields apply to UDTs and tuples. For example, if a tuple has three elements, but the JSON input only has two elements, this scenario results in an error if schema.allowMissingFields is false. However, this scenario is accepted if schema.allowMissingFields is true. The missing element in this example is assigned as null.

  • Add support for DataStax Enterprise 4.8 and lower.

    DataStax Bulk Loader is compatible with C* 1.2 and later releases, and DataStax Enterprise 3.2 and later releases. All protocol versions are supported. Some features might not be available depending on the protocol version and server version.

    The schema.splits setting (default: 8C) was added to compensate for the absence of paging in C* 1.2. The token ring is split into small chunks, and the number of chunks is controlled by this setting.

    For example:

    bin/dsbulk unload -url myData.csv --driver.pooling.local.connections 8 \
      --driver.pooling.local.requests 128 --driver.pooling.remote.requests 128 \
      --schema.splits 0.5C -k test -t test

    On --schema.splits, you can optionally use special syntax, nC, to specify a number that is a multiple of the available cores, resulting in a calculated number of splits. If the number of cores is 8, --schema.splits 0.5C = 0.5 * 8, which results in 4 splits. Refer to --schema.splits, --dsbulk.schema.splits number.

  • Add support for keyspace-qualified UDFs in mappings.

    If needed, you can qualify a user-defined function (UDF) with a keyspace name. For example: fieldA = ks1.func1(column1, column2)

  • Allow fields to appear as function parameters on the left side of mapping entries.

    When loading, a mapping entry can contain a function on the left side that references fields of the dataset. For example, consider the case where:

    • A dataset has two fields, fieldA and fieldB

    • A table with three columns: colA, colB and colSum

    • A user-defined function: sum(int, int)

    The following mapping works:

    fieldA = colA, fieldB = colB, sum(fieldA,fieldB)=colSum

    This will store the sum of fieldA and fieldB into colSum.

  • Improve handling of search queries.

    You can supply a DataStax Enterprise search predicate using the solr_query mechanism. For example, assume you create a search index on the dsbulkblog.iris_with_id table:

    cqlsh -e "CREATE SEARCH INDEX IF NOT EXISTS ON dsbulkblog.iris_with_id"

    You can issue a query for just the Iris-setosa rows:

    dsbulk unload -query "SELECT id, petal_length, petal_width, \
     sepal_length, sepal_width, species FROM dsbulkblog.iris_with_id \
      WHERE solr_query = '{\\\"q\\\": \\\"species:Iris-setosa\\\"}'"
  • Ability to hard-limit the number of concurrent continuous paging sessions.

    DataStax Bulk Loader adds a new setting: executor.continuousPaging.maxConcurrentQueries (default: 60). It sets the maximum number of concurrent continuous paging queries that can be executed in parallel. Set this number to a value equal to or less than the value configured server-side for continuous_paging.max_concurrent_sessions in the cassandra.yaml configuration file, which is also 60 by default; otherwise, some requests may be rejected. You can disable executor.continuousPaging.maxConcurrentQueries by assigning any negative value or 0.

  • Ability to skip unloading or loading the solr_query column.

    DataStax Bulk Loader will skip the solr_query column when loading and unloading.
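
As an illustration of the extended rate limiting described earlier in this list, the following sketch caps an unload at a given number of rows read per second; the keyspace, table, and limit of 50000 are illustrative:

    dsbulk unload -url ./export -k ks1 -t table1 --executor.maxPerSecond 50000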

1.3.2 Resolved issues

  • Setting executor.maxInFlight to a negative value triggers fatal error.

  • Murmur3TokenRangeSplitter should allow long overflows when splitting ranges.

  • CSV connector trims trailing white space when reading data.

  • Avoid overflows in CodecUtils.numberToInstant.

  • Call to ArrayBackedRow.toString() causes fatal NPE.

DataStax Bulk Loader 1.2.0 release notes

1 August 2018

1.2.0 Changes and enhancements

After upgrading to 1.2.0, be sure to review and adjust scripts to use changed settings.

  • Improve range split algorithm in multi-DC and vnodes environments.

  • Support simplified notation for JSON arrays and objects in collection fields.

1.2.0 Resolved issues

  • CSVWriter trims leading/trailing whitespace in values.

  • CSV connector fails when the number of columns in a record is greater than 512.

  • Bulk Loader fails when mapping contains a primary key column mapped to a function.

DataStax Bulk Loader 1.1.0 release notes

18 June 2018

1.1.0 Changes and enhancements

After upgrading to 1.1.0, be sure to review and adjust scripts to use changed settings.

  • Combine batch.mode and batch.enabled into a single setting: batch.mode. If you are using the batch.enabled setting in scripts, change to batch.mode with value DISABLED.

  • Improve handling of Univocity exceptions.

  • Logging improvements.

    • Log messages are logged only to operation.log. Logging does not print to stdout.

    • Configurable logging levels with the log.verbosity setting.

    • The setting log.ansiEnabled is changed to log.ansiMode.

  • New count workflow.

    • Supports counting rows in a table.

    • Configurable counting mode.

    • When mode = partitions, the number of partitions to count is configurable, with support for counting the number of rows in the N biggest partitions of a table.

  • Counter tables are supported for load and unload.

  • Improve validation to include user-supplied queries and mappings.

  • The codec.timestamp CQL_DATE_TIME setting is renamed to CQL_TIMESTAMP. Adjust scripts to use the new setting.

1.1.0 Resolved issues

  • Generated query does not contain all token ranges when a range wraps around the ring.

  • Empty map values do not work when loading using dsbulk.

  • DSBulk cannot handle columns of type list<timestamp>.

  • Generated queries do not respect indexed mapping order.

  • DSBulk fails to start with Java 10+.

DataStax Bulk Loader 1.0.2 release notes

5 June 2018

1.0.2 Changes and enhancements

  • DataStax Bulk Loader 1.0.2 is bundled with DSE 6.0.1.

  • Configure whether to use ANSI colors and other escape sequences in log messages printed to standard output and standard error.

1.0.1 Changes and enhancements

  • DataStax Bulk Loader (dsbulk) version 1.0.1 is automatically installed with DataStax Enterprise, and can also be installed as a standalone tool. DataStax Bulk Loader 1.0.1 is supported for use with DSE 5.0 and later.

  • Support to manage special characters on the command line and in the configuration file.

  • Improve error messages for incorrect mapping.

  • Improved monitoring options.

  • Detect console width on Windows.

  • Null words are supported by all connectors. The schema.nullStrings setting is changed to codec.nullWords. Renamed the convertTo and convertFrom methods. See Codec options and Schema options.

  • Use Logback to improve filtering to make stack traces more readable and useful. On ANSI-compatible terminals, the date prints in green, the hour in cyan, the level is blue (INFO) or red (WARN), and the message prints in black.

  • Improved messaging for completion with errors.

  • Settings schema.allowExtraFields and schema.allowMissingFields are added to reference.conf.

  • Support is dropped for using :port to specify the port to connect to. Specify the port for all hosts only with driver.port.

1.0.1 Resolved issues

  • Numeric overflows should display the original input that caused the overflow.

  • Null words are not supported by all connectors.

  • Addresses might not be properly translated when the cluster has a custom native port.
