Schema options

Specify schema options for the dsbulk command.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-k,--schema.keyspace string

Keyspace used for loading, unloading, or counting data. Keyspace names should not be quoted and are case-sensitive. MyKeyspace will match a keyspace named MyKeyspace but not mykeyspace. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified

-t,--schema.table string

Table used for loading, unloading, or counting data. Table names should not be quoted and are case-sensitive. MyTable will match a table named MyTable but not mytable. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified
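For example, a minimal load invocation can name the target keyspace and table with the short or long forms (file path, keyspace, and table names below are illustrative):

```shell
# Load a CSV file into MyKeyspace.MyTable using the short forms.
# Names are unquoted and case-sensitive.
dsbulk load -url /path/to/export.csv -k MyKeyspace -t MyTable

# Equivalent invocation using the long forms:
dsbulk load -url /path/to/export.csv \
  --schema.keyspace MyKeyspace \
  --schema.table MyTable
```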

-m,--schema.mapping string

The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
  • Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2, are the zero-based indices of fields in the source data; and col1, col2, col3 are bound variable names in the insert statement.
  • A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
  • Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data; and col1, col2, col3 are bound variable names in the insert statement.
  • A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3.
To specify that a field should be used as the timestamp (write time) or as ttl (time to live) of the inserted row, use the specially named fake columns __ttl and __timestamp: fieldA = __timestamp, fieldB = __ttl. Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit + codec.epoch; the latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing duration in seconds and must use the format specified in codec.number.
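As a sketch, a mapped data source whose extra fields carry the write time and TTL might be loaded as follows (field and column names are illustrative):

```shell
# Map two data fields to columns, and use two additional fields as
# the write timestamp and TTL of each inserted row:
dsbulk load -url data.csv -k ks1 -t table1 \
  -m "fieldA = col1, fieldB = col2, created = __timestamp, expires = __ttl"
```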

To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g. now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g. field1 = now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).

In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.

To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].
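For example, the wildcard forms above could be combined in a load command like this (file and identifier names are illustrative):

```shell
# Map all like-named fields, and additionally map source field c5
# to table column c4:
dsbulk load -url data.csv -k ks1 -t table1 -m "* = *, c5 = c4"

# Map all like-named fields except c2 and c3:
dsbulk load -url data.csv -k ks1 -t table1 -m "* = [-c2, -c3]"
```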

Any identifier, field, or column, that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just like you would do in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it).
Note: Unlike CQL grammar, unquoted identifiers will not be lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 will match a column named MyColumn1, but will not match mycolumn1.

The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records; others can only produce mapped ones, while others are capable of producing both indexed and mapped records at the same time. Refer to the connector's documentation to know which kinds of mapping it supports.

Default: null

--schema.nullToUnset ( true | false )
Specify whether to map null input values to "unset" in the database; that is, do not modify a potentially pre-existing value of this column for this row. Only applicable for loading; ignored otherwise, including for counting. Note that setting this option to false creates tombstones to represent null. When the protocol version in use does not support unset values (all protocol versions less than 4), this setting is forced to false and a warning is logged.
CAUTION: This setting is applied after the codec.nullStrings setting, and may intercept nulls produced by that setting.

Default: true

--schema.allowExtraFields ( true | false )

Specify whether to accept records that contain extra fields that are not declared in the mapping. For example, assume a record contains three fields A, B, and C, but the mapping declares only fields A and B. When this option is true, C is silently ignored and the record is considered valid. When set to false, the record is rejected. Only applicable for loading; ignored otherwise.

Default: true

--schema.allowMissingFields ( true | false )

Specify whether to accept records that are missing fields declared in the mapping. For example, assume the mapping declares three fields A, B, and C, but a record contains only fields A and B. When this option is true, C is silently assigned null and the record is considered valid. When set to false, the record is rejected. Also applies to user-defined types (UDTs) and tuples, with element replacing field in explanation. If the missing field is mapped to a primary key column, the record is always rejected because the database would reject the record. Only applicable for loading; ignored otherwise.

Default: false
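The two options above can be combined to control how strictly records must match the mapping; for example (file and identifier names are illustrative):

```shell
# Reject records that contain undeclared extra fields, but accept
# records that are missing declared non-primary-key fields:
dsbulk load -url data.csv -k ks1 -t table1 \
  --schema.allowExtraFields false \
  --schema.allowMissingFields true
```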

-query,--schema.query string

The query to use. If not specified, then schema.keyspace and schema.table must be specified, and dsbulk will infer the appropriate statement based on the table's metadata, using all available columns. If schema.keyspace is provided, the query need not include the keyspace to qualify the table reference.

For loading, the statement can be any INSERT, UPDATE, or DELETE statement. INSERT statements are preferred for most load operations, and bound variables should correspond to mapped fields; for example, INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC). UPDATE statements are required if the target table is a counter table; the columns are then updated with incremental operations (SET col1 = col1 + :fieldA), where :fieldA is a field in the input data. A DELETE statement removes existing data during the load operation.
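As illustrative sketches (table, column, and field names are hypothetical), a custom load query might look like:

```shell
# Custom INSERT with named bound variables matching mapped fields:
dsbulk load -url data.csv -k ks1 \
  -query "INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC)"

# Counter table: an UPDATE with an incremental operation:
dsbulk load -url counts.csv -k ks1 \
  -query "UPDATE counters SET total = total + :delta WHERE id = :id"
```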

For unloading and counting, the statement can be any regular SELECT statement. If the statement does not contain a WHERE clause, the engine will generate a token range restriction clause of the form: token(...) > :start and token(...) <= :end, and the engine will generate as many statements as there are token ranges in the cluster, thus allowing parallelization of reads while at the same time targeting coordinators that are also replicas. However, if the SELECT statement does contain a WHERE clause, the engine will only be able to parallelize the operation if that WHERE clause also includes a token(...) > :start and token(...) <= :end relation (the bound variables can have any name).
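For example, an unload query that keeps reads parallelizable by including the token range relation explicitly might look like this (names are illustrative; the bound variable names are arbitrary):

```shell
# Unload with an explicit token range restriction so the engine can
# generate one statement per token range:
dsbulk unload -k ks1 -url /path/to/output \
  -query "SELECT c1, c2 FROM table1 WHERE token(pk) > :start AND token(pk) <= :end"
```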

Statements can use both positional and named bound variables. Named bound variables are preferred. Named bound variables usually have names matching those of the columns in the destination table, but this is not a strict requirement; it is, however, required that their names match those of fields specified in the mapping. Positional variables will be named after their corresponding column in the destination table.
Important: Positional variables cannot be used in SELECT statements that have WHERE clause restrictions using token(), a USING TIMESTAMP clause, or a USING TTL clause.

See schema.mapping setting for more information.

Default: null

--schema.queryTimestamp string

The timestamp of inserted/updated cells during load. If not specified, the current time of the system running the tool is used. Only applicable to loading; ignored otherwise. Express the value in ISO_ZONED_DATE_TIME format. DSE sets query timestamps to the nearest microsecond and truncates sub-microsecond information; any sub-microsecond precision specified is lost. For more information, see the CQL Reference.

Default: unspecified
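For example, to stamp every cell written by a load with a fixed write time (file, keyspace, and table names are illustrative):

```shell
# Apply a fixed write timestamp, in ISO_ZONED_DATE_TIME format,
# to all cells inserted by this load:
dsbulk load -url data.csv -k ks1 -t table1 \
  --schema.queryTimestamp "2022-01-15T10:30:00+01:00[Europe/Paris]"
```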

--schema.queryTtl number

The Time-To-Live (TTL) of inserted/updated cells during load (seconds); a value of -1 means there is no TTL. Not applicable to unloading. For more information, see the CQL Reference, Setting the time-to-live (TTL) for value, and Expiring data with time-to-live.

Default: -1
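For example, to expire loaded data after one day (names are illustrative):

```shell
# Give every inserted/updated cell a TTL of 86400 seconds (one day):
dsbulk load -url data.csv -k ks1 -t table1 --schema.queryTtl 86400
```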

--schema.splits string

The number of token range splits in which to divide the token ring. This setting determines how many read requests are generated when reading an entire table. Only used when unloading and counting; ignored otherwise.
Note: The actual number of splits may be slightly greater or less than the number specified, depending on the actual cluster topology and token ownership. Also, it is not possible to generate fewer splits than the total number of primary token ranges in the cluster; the actual number of splits is therefore always equal to or greater than that number.
Set --schema.splits to a higher value if you experience timeouts when reading from DSE or DDAC tables, especially if paging is disabled. You can optionally use the special syntax nC to specify a number that is a multiple of the available cores. For example, if the number of cores is 8, --schema.splits 0.5C resolves to 0.5 * 8, which results in 4 splits.

Default: 8C
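For example, an unload that generates more, smaller read requests to avoid timeouts might look like this (keyspace, table, and output path are illustrative):

```shell
# Split the token ring into 2 splits per available core when unloading;
# on an 8-core machine this yields roughly 16 read requests:
dsbulk unload -k ks1 -t table1 -url /path/to/output --schema.splits 2C
```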