Engine options
Use the engine options to tune general performance and throughput for dsbulk operations.
For query-level tuning, results paging, and other driver performance settings, see Driver options.
Synopsis
The standard form for engine options is --engine.KEY VALUE:
- KEY: The specific option to configure, such as the dryRun option.
- VALUE: The value for the option, such as a string, number, or Boolean. HOCON syntax rules apply unless otherwise noted. For more information, see Escape and quote DSBulk command line arguments.
Short and long forms
On the command line, you can specify options in short form (if available), standard form, or long form.
For all engine options, the long form is the standard form with a dsbulk. prefix, such as --dsbulk.engine.dryRun.
The following examples show the same command with different forms of the dryRun option:
# Short form
dsbulk load -dryRun -url filename.csv -k ks1 -t table1
# Standard form
dsbulk load --engine.dryRun -url filename.csv -k ks1 -t table1
# Long form
dsbulk load --dsbulk.engine.dryRun -url filename.csv -k ks1 -t table1
In configuration files, you must use the long form with the dsbulk. prefix.
For example:
dsbulk.engine.dryRun = true
--engine.dataSizeSamplingEnabled
Whether to use data size sampling to optimize the DSBulk execution engine.
- true (default): Enable data size sampling for applicable dsbulk load commands. This option only applies to dsbulk load commands that read from a file, directory, or HTTP/HTTPS URL. Data size sampling never occurs for dsbulk unload, dsbulk count, or dsbulk load commands that read from standard input (-url - or -url stdin:/).
- false: Disable data size sampling. Set --engine.dataSizeSamplingEnabled false if your data source cannot be read multiple times.
To perform data size sampling, DSBulk invokes the connector to read a few sample records and estimate the total data size. Then, DSBulk invokes the connector again to read the entire body of data and perform the load operation. Because DSBulk must invoke the connector twice, data size sampling is only possible if the data source can be read multiple times. DSBulk can’t complete the load operation if it cannot reread the entire data source, such as short-lived ephemeral files and single-use URLs.
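For example, the following illustrative command disables data size sampling when loading from a source that can be read only once; the URL, keyspace, and table names are placeholders:
# Disable sampling for a source that cannot be reread
dsbulk load --engine.dataSizeSamplingEnabled false -url https://example.com/export.csv -k ks1 -t table1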
--engine.dryRun (-dryRun)
Whether to run a dsbulk load command as a dry run (simulation) without actually modifying the database.
- false (default): Disable dry-run mode. The load operation runs normally, and the data is written to the database.
- true: Enable dry-run mode so you can test a load operation without writing data to the database.
--engine.executionId
Set a unique identifier for each dsbulk load, dsbulk unload, and dsbulk count operation.
- null (default), unspecified, or empty: The DSBulk engine automatically generates identifiers in the format WORKFLOW_TIMESTAMP, such as LOAD_20240315-142530-123456:
  - WORKFLOW: The operation type, either LOAD, UNLOAD, or COUNT.
  - TIMESTAMP: The UTC time that the workflow started, formatted as uuuuMMdd-HHmmss-SSSSSS, with microsecond precision, if available, or millisecond precision otherwise.
- User-defined string: A user-supplied identifier, such as an environment variable or template. If the execution ID is user-defined, you must guarantee its uniqueness. Duplicate execution IDs can cause commands to fail. To specify a template, the format must comply with formatting rules for String.format(), and it can use the following variables:
  - %1$s: A variable representing the workflow type. At runtime, this resolves to LOAD, UNLOAD, or COUNT.
  - %2$t: A variable representing the UTC time that the workflow started, formatted as uuuuMMdd-HHmmss-SSSSSS, with microsecond precision, if available, or millisecond precision otherwise.
  - %3$s: A variable representing the JVM process PID for the operation. This parameter isn’t available on all operating systems. If the PID cannot be determined, DSBulk inserts a random integer instead.
--engine.executionId "%1$s-%2$t" is the same as the default identifier format (WORKFLOW_TIMESTAMP). You can use this template to amend the default identifier format. For example, to add a test- prefix to the default identifier format, set --engine.executionId "test-%1$s-%2$t".
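The following illustrative commands set a literal execution ID and a template-based execution ID; the file, keyspace, and table names are placeholders, and you may need to adjust quoting for your shell (see Escape and quote DSBulk command line arguments):
# Literal execution ID; you must guarantee uniqueness
dsbulk load --engine.executionId "nightly-load-001" -url filename.csv -k ks1 -t table1
# Template that adds a test- prefix to the default WORKFLOW_TIMESTAMP format
dsbulk load --engine.executionId "test-%1$s-%2$t" -url filename.csv -k ks1 -t table1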
--engine.maxConcurrentQueries (-maxConcurrentQueries)
The maximum number of concurrent queries to run in parallel.
This option is a safeguard that can prevent a command from starting more concurrent queries than the cluster can handle. You can also use this option to regulate throughput if latency is too high.
Allowed values for --engine.maxConcurrentQueries include the following:
- AUTO (default): DSBulk calculates an optimal number of concurrent queries according to the number of available cores and the operation being executed. Typically, the resulting value is no more than 8 times the number of available cores.
- NC: A special syntax that you can use to set the value as a multiple of the number of available cores for a given operation. For example, if you set -maxConcurrentQueries 10C and there are 8 cores, then there can be 80 parallel queries (10 * 8 = 80).
- Positive integer: Specify an exact number of concurrent queries allowed to run in parallel, such as 100.
To check the --engine.maxConcurrentQueries setting, set -verbosity 2, run a load, unload, or count command, and then check the operation.log file for a line starting with Using read concurrency: or Using write concurrency:.
For the purposes of this option, batch statements are considered one query.
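For example, the following illustrative command caps a load at 100 concurrent queries and raises verbosity so that the effective concurrency is recorded in operation.log; the file, keyspace, and table names are placeholders:
# Limit concurrency and log the effective read/write concurrency
dsbulk load -url filename.csv -k ks1 -t table1 -maxConcurrentQueries 100 -verbosity 2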
--s3.clientCacheSize
When using the urlfile option with AWS S3 URLs, DSBulk creates an S3 client for each bucket specified in the S3 URLs.
DSBulk caches the S3 clients to prevent them from being recreated unnecessarily when processing many S3 URLs that target the same buckets.
If all of your S3 URLs target the same bucket, then the same S3 client is used for each URL, and the cache contains only one entry.
The size of the S3 client cache is controlled by the --s3.clientCacheSize (--dsbulk.s3.clientCacheSize) option, and the default is 20 entries.
The default value is arbitrary, and it only needs to be changed when loading from many different S3 buckets in a single command.
For more information, see Load from AWS S3 and --connector.csv.urlfile or --connector.json.urlfile.
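For example, if a urlfile references more than 20 distinct S3 buckets, you might raise the cache size accordingly. The urlfile name, cache size, keyspace, and table names below are illustrative:
# Increase the S3 client cache when loading from many buckets
dsbulk load --connector.csv.urlfile urls.txt --s3.clientCacheSize 40 -k ks1 -t table1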