Schema options

Specify schema options for the dsbulk command.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).
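
For example, the following two invocations are equivalent (a minimal sketch; the keyspace, table, and file names are hypothetical, and -url designates the connector's input location):

    dsbulk load -k ks1 -t table1 -url export.csv
    dsbulk load --schema.keyspace ks1 --schema.table table1 -url export.csv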

-k,--schema.keyspace string

Keyspace used for loading, unloading, or counting data. Keyspace names should not be quoted and are case-sensitive. MyKeyspace will match a keyspace named MyKeyspace but not mykeyspace. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified

-t,--schema.table string

Table used for loading, unloading, or counting data. Table names should not be quoted and are case-sensitive. MyTable will match a table named MyTable but not mytable. Required option if schema.query is not specified; otherwise, optional.

Default: unspecified
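
For example, to count the rows in a single table (keyspace and table names are hypothetical):

    dsbulk count -k ks1 -t table1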

-m,--schema.mapping string

The field-to-column mapping to use; it applies to both loading and unloading. If not specified, the loader applies a strict one-to-one mapping between the source fields and the columns of the database table. If that is not what you want, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
  • Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, and 2 are the zero-based indices of fields in the source data, and col1, col2, and col3 are bound variable names in the insert statement.
    • A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
  • Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, and fieldC are field names in the source data, and col1, col2, and col3 are bound variable names in the insert statement.
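
For example, an explicit mapping for a three-column load could be passed in either form as follows (a sketch; the keyspace, table, file, field, and column names are hypothetical):

    dsbulk load -k ks1 -t table1 -url export.csv -m "0 = col1, 1 = col2, 2 = col3"
    dsbulk load -k ks1 -t table1 -url export.csv -m "fieldA = col1, fieldB = col2, fieldC = col3"
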
To specify that a field should be used for the query timestamp or TTL, use the specially named fake columns __timestamp and __ttl: fieldA = __timestamp, fieldB = __ttl. Timestamp fields are parsed as CQL timestamp columns and must use the format specified in either codec.timestamp, or codec.unit + codec.epoch; in the latter case, the value is an integer representing the number of codec.unit units elapsed since codec.epoch. TTL fields are parsed as integers representing a duration in seconds and must use the format specified in codec.number.
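
For instance, the following sketch (hypothetical names) loads fieldC as the cell timestamp and fieldD as the TTL:

    dsbulk load -k ks1 -t table1 -url export.csv -m "fieldA = col1, fieldB = col2, fieldC = __timestamp, fieldD = __ttl"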

To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g. now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g. field1 = now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).

In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.

One can also specify that all like-named fields be mapped, except for c2: * = -c2. To skip c2 and c3: * = [-c2, -c3]. Use quotes around any non-alphanumeric identifiers, whether fields or columns.
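
As a sketch (names hypothetical), the partly auto-generated mapping from the example above could be passed as:

    dsbulk load -k ks1 -t table1 -url export.csv -m "* = *, c5 = c4"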

The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records; others can only produce mapped ones; still others are capable of producing both indexed and mapped records at the same time. Refer to the connector's documentation to determine which kinds of mapping it supports.

Default: null

--schema.nullToUnset ( true | false )

Specify whether to map null input values to "unset" in the database, that is, to leave any potentially pre-existing value of the target column untouched for that row. Valid only for load operations; the setting is ignored for unloading and counting. Note that setting this option to false creates tombstones to represent null values. When the protocol version in use does not support unset values (all protocol versions less than 4), this setting is forced to false and a warning is logged.
CAUTION: This setting is applied after the codec.nullStrings setting, and may intercept nulls produced by that setting.
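
For example, to write nulls explicitly (and thus create tombstones) when loading (a sketch with hypothetical names):

    dsbulk load -k ks1 -t table1 -url export.csv --schema.nullToUnset false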

Default: true

--schema.allowExtraFields ( true | false )

Specify whether or not to accept records that contain extra fields not declared in the mapping. For example, suppose a record contains three fields A, B, and C, but the mapping declares only fields A and B: if this option is true, C is silently ignored and the record is considered valid; if false, the record is rejected. Only applicable for loading; ignored otherwise.

Default: true

--schema.allowMissingFields ( true | false )

Specify whether or not to accept records that are missing fields declared in the mapping. For example, suppose the mapping declares three fields A, B, and C, but a record contains only fields A and B: if this option is true, C is silently assigned null and the record is considered valid; if false, the record is rejected. This also applies to user-defined types (UDTs) and tuples, with element replacing field in the explanation above. If the missing field is mapped to a primary key column, the record is always rejected, since the database would reject it. Only applicable for loading; ignored otherwise.

Default: false
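
For example, to reject records carrying undeclared extra fields while tolerating missing ones (a sketch; the two options are combined here purely for illustration):

    dsbulk load -k ks1 -t table1 -url export.csv --schema.allowExtraFields false --schema.allowMissingFields true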

-query,--schema.query string

The query to use. If not specified, then schema.keyspace and schema.table must be specified, and dsbulk will infer the appropriate statement based on the table's metadata, using all available columns. If schema.keyspace is provided, the query need not include the keyspace to qualify the table reference.

For loading, the statement can be any INSERT, UPDATE, or DELETE statement. INSERT statements are preferred for most load operations, and bound variables should correspond to mapped fields; for example, INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC). UPDATE statements are required if the target table is a counter table; in that case, the columns are updated with incremental operations (SET col1 = col1 + :fieldA), where :fieldA is a field in the input data. A DELETE statement removes existing data during the load operation.
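
A load with a custom query might look like the following sketch (keyspace, table, file, and field names are hypothetical):

    dsbulk load -url export.csv -query "INSERT INTO ks1.table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC)"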

For unloading and counting, the statement can be any regular SELECT statement. If the statement does not contain a WHERE clause, the engine generates a token range restriction clause of the form token(...) > :start and token(...) <= :end, and produces as many statements as there are token ranges in the cluster. This allows reads to be parallelized while at the same time targeting coordinators that are also replicas. However, if the SELECT statement does contain a WHERE clause, the engine can only parallelize the operation if that WHERE clause also includes a token(...) > :start and token(...) <= :end relation (the bound variables can have any name).
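
For example, an unload whose WHERE clause preserves parallelization might look like this sketch (names hypothetical; -url here designates an output directory):

    dsbulk unload -url outdir -query "SELECT c1, c2 FROM ks1.table1 WHERE token(c1) > :start AND token(c1) <= :end"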

Statements can use both positional and named bound variables; named bound variables are preferred. They usually have names matching those of the columns in the destination table, but this is not a strict requirement; it is, however, required that their names match those of fields specified in the mapping. Positional variables are named after their corresponding column in the destination table.
Important: Positional variables cannot be used in SELECT statements with WHERE clause restrictions using token(), nor in statements with a USING TIMESTAMP or USING TTL clause.

See the schema.mapping setting for more information.

Default: null

--schema.queryTimestamp string

The timestamp to apply to inserted/updated cells during load; if unspecified, the current time of the system running the tool is used. Not applicable to unloading or counting. Express the value in ISO_ZONED_DATE_TIME format. DSE stores query timestamps with microsecond precision; any sub-microsecond information specified is truncated and lost. For more information, see the CQL Reference.

Default: unspecified
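
For example (a sketch; the timestamp value is arbitrary, expressed in ISO_ZONED_DATE_TIME format):

    dsbulk load -k ks1 -t table1 -url export.csv --schema.queryTimestamp "2018-05-30T15:00:00+01:00"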

--schema.queryTtl number

The Time-To-Live (TTL) of inserted/updated cells during load (seconds); a value of -1 means there is no TTL. Not applicable to unloading. For more information, see the CQL Reference, Setting the time-to-live (TTL) for value, and Expiring data with time-to-live.

Default: -1
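
For example, to give loaded rows a one-day TTL of 86400 seconds (a sketch with hypothetical names):

    dsbulk load -k ks1 -t table1 -url export.csv --schema.queryTtl 86400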