Schema options
Specify schema options for the dsbulk command.
Options highlighted with an icon () are available only for DataStax Graph 6.8.x data. Not available for Apache Cassandra® or DataStax Astra DB. |
The options can be used in short form (-k keyspace_name
) or long form (--schema.keyspace keyspace_name
).
-k, --schema.keyspace, --dsbulk.schema.keyspace string
Keyspace used for loading or unloading data.
Do not quote keyspace names, and note that they are case sensitive.
|
Either keyspace (this option) or for graph data the graph option is required if query is not specified or is not qualified with a keyspace name.
Default: null
-m, --schema.mapping, --dsbulk.schema.mapping string
The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
-
Indexed data sources:
0 = col1, 1 = col2, 2 = col3
, where0
,1
,2
, are the zero-based indices of fields in the source data; andcol1
,col2
,col3
are bound variable names in the insert statement. -
A shortcut to map the first
n
fields is to simply specify the destination columns:col1, col2, col3
. -
Mapped data sources:
fieldA = col1, fieldB = col2, fieldC = col3
, wherefieldA
,fieldB
,fieldC
are field names in the source data; andcol1
,col2
,col3
are bound variable names in the insert statement. -
A shortcut to map fields named like columns is to simply specify the destination columns:
col1, col2, col3
. To specify that a field should be used as the timestamp (write time) or as ttl (time to live) of the inserted row, use the specially named fake columnsttl()
andwritetime()
:fieldA = writetime(), fieldB = ttl()
.
Starting in DataStax Bulk Loader 1.8.0, the special tokens |
Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit + codec.epoch; the latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing duration in seconds and must use the format specified in codec.number.
To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g.
now() = c4
).
Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g.
field1=now()
).
Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2)
.
In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified.
For example, if a source row has fields c1
, c2
, c3
, and c5
, and the table has columns c1
, c2
, c3
, c4
, one can map all like-named columns and specify that c5
in the source maps to c4
in the table as follows: * = *, c5 = c4
.
To specify that all like-named fields be mapped, except for c2
, use: * = -c2
.
To skip c2
and c3
, use: * = [-c2, -c3]
.
Any identifier, field, or column, that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just like you would do in CQL: "Field ""A""" = "Column 2"
(to escape a double-quote, simply double it).
Unlike CQL grammar, unquoted identifiers are not lowercased by DataStax Bulk Loader.
An identifier such as |
The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records; others can only produce mapped ones, while others are capable of producing both indexed and mapped records at the same time. Refer to the connector’s documentation to know which kinds of mapping it supports.
Default: null
--schema.allowExtraFields, --dsbulk.allowExtraFields { true | false }
Specify whether to accept records that contain extra fields that are not declared in the mapping.
For example, assume a record contains three fields A, B, and C, but the mapping declares only fields A and B.
When this option is true
, C is silently ignored and the record is considered valid.
When set to false
, the record is rejected.
Only applicable for loading;
ignored otherwise.
Default: true
--schema.allowMissingFields, --dsbulk.allowMissingFields { true | false }
Specify whether to accept records that are missing fields declared in the mapping.
For example, assume the mapping declares three fields A, B, and C, but a record contains only fields A and B.
When this option is true
, C is silently assigned null and the record is considered valid.
When set to false
, the record is rejected.
Also applies to user-defined types (UDTs) and tuples, with element replacing field in explanation.
If the missing field is mapped to a primary key column, the record is always rejected because the database would reject the record.
Only applicable for loading; ignored otherwise.
Default: false
-g, --schema.graph, --dsbulk.schema.graph string
The graph name used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later.
Do not quote graph names, and note that they are case sensitive.
For example, |
The keyspace or graph (this option) is required if the query option is not specified, or is not qualified with a keyspace name.
Default: null
-v, --schema.vertex, --dsbulk.schema.vertex string
The vertex label used for loading or unloading DataStax Graph data.
You can use this option only for graphs created with DataStax Graph 6.8.0 and later.
The vertex label must correspond to an existing table created with the VERTEX LABEL
option of CREATE TABLE
.
+
Do not quote vertex labels, and note that they are case sensitive.
For example, |
For related CREATE TABLE
information, see Creating a table with a vertex label or an edge label in the CQL commands reference.
Default: null
-e, --schema.edge, --dsbulk.schema.edge string
The edge label used for loading or unloading DataStax Graph data.
You can use this option only for graphs created with DataStax Graph 6.8.0 and later.
The edge label must correspond to an existing table created with the EDGE LABEL
option of CREATE TABLE
.
Do not quote edge labels, and note that they are case sensitive.
For example, |
For related CREATE TABLE
information, see Creating a table with a vertex label or an edge label in the CQL commands reference.
Default: null
-from, --schema.from, --dsbulk.schema.from string
The name of the edge’s incoming vertex label, for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. This option is mandatory when edge is specified; ignored otherwise.
Do not quote vertex labels, and note that they are case sensitive.
For example, |
Default: null
-to, --schema.to, --dsbulk.schema.to string
The name of the edge’s outgoing vertex label, for loading or unloading graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. This option is mandatory when edge is specified; ignored otherwise.
Do not quote vertex labels, and note that they are case sensitive.
|
Default: null
--schema.nullToUnset, --dsbulk.schema.nullToUnset { true | false }
Specify whether to map null input values to "unset" in the database.
That is, the effect is to not modify a potentially pre-existing value of this field for this row.
Valid for dsbulk load
scenarios, otherwise ignored.
Setting to false
creates tombstones to represent null
.
The setting is applied after the codec.nullStrings setting, and may intercept nulls produced by that setting.
The setting is ignored by dsbulk count
.
When set to true
, but the active protocol version does not support unset values (that is, all protocol versions less than 4), DataStax Bulk Loader forces this setting to false
and a warning is logged.
Default: true
-query,--schema.query, --dsbulk.schema.query string
The query to use.
If not specified, then schema.keyspace and schema.table must be specified, and dsbulk
infers the appropriate statement based on the table’s metadata, using all available columns.
If schema.keyspace
is provided, the query need not include the keyspace to qualify the table reference.
Using |
For loading, the statement can be any INSERT
, UPDATE
or DELETE
statement.
INSERT
statements are preferred for most load operations, and bound variables should correspond to mapped fields;
for example, INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC)
.
UPDATE
statements are required if the target table is a counter table, and the columns are updated with incremental operations (SET col1 = col1 + :fieldA
) where :fieldA
is a column in the input data.
A DELETE
statement removes existing data during the load operation.
For unloading and counting, the statement can be any regular SELECT
statement.
If the statement does not contain a WHERE
clause, the engine generates a token range restriction clause of the form: token(...) > :start and token(...) <= :end
, and the engine generates as many statements as there are token ranges in the cluster, thus allowing parallelization of reads while at the same time targeting coordinators that are also replicas.
However, if the SELECT
statement does contain a WHERE
clause, the engine is only be able to parallelize the operation if that WHERE
clause also includes atoken(...) > :start and token(...) <= :end
relation (the bound variables can have any name).
Starting in DataStax Bulk Loader 1.6.0, |
Statements can use both positional and named bound variables. Named bound variables are preferred. Named bound variables usually have names matching those of the columns in the destination table, but this is not a strict requirement; it is, however, required that their names match those of fields specified in the mapping. Positional variables are named after their corresponding column in the destination table.
Positional variable cannot be used in SELECT statements that use |
When loading and unloading DataStax Graph data, the query must be provided in plain CQL; Gremlin queries are not supported.
See schema mapping setting for more information.
Default: null
--schema.queryTimestamp, --dsbulk.queryTimestamp string
The timestamp of inserted/updated cells during load; otherwise, the current time of the system running the tool is used. Only applicable to loading; ignored otherwise. Express the value in ISO_ZONED_DATE_TIME format. DSE sets the query timestamps to the nearest microsecond and truncates sub-microseconds; any sub-microsecond information specified is lost. For more information, see the CQL Reference.
Default: unspecified
--schema.queryTtl, --dsbulk.schema.queryTtl number
The Time-To-Live (TTL) of inserted/updated cells during load (seconds); a value of -1 means there is no TTL. Not applicable to unloading. For more information, see the CQL Reference, Setting the time-to-live (TTL) for value, and Expiring data with time-to-live.
Default: -1
--schema.splits, --dsbulk.schema.splits number
The number of token range splits in which to divide the token ring. This setting determines how many read requests are generated when reading an entire table. Only used when unloading and counting; ignored otherwise.
The actual number of splits may be slightly greater or less than the number specified, depending on the actual cluster topology and token ownership. Also, it is not possible to generate fewer splits than the total number of primary token ranges in the cluster. Thus, the actual number of splits is always equal to or greater than that number. |
Set the --schema.splits
value to higher values if you experience timeouts when reading from DSE, DataStax Astra, or Apache Cassandra tables, especially if paging is disabled.
You can optionally use the special syntax nC
to specify a number that is a multiple of the available cores, resulting in a calculated number of splits.
For example, if the number of cores is 8, --schema.splits 0.5C
= 0.5 * 8, which results in 4 splits.
Default: 8C
--timestamp, --schema.preserveTimestamp, --dsbulk.schema.preserveTimestamp { true | false }
Whether to preserve cell timestamps when loading and unloading.
Ignored when schema.query
is provided, or when the target table is a counter table.
If true
, the following rules are applied to generated queries:
-
When loading, instead of a single
INSERT
statement, the generated query is aBATCH
query. This is required in order to preserve individual column timestamps for each row. -
When unloading, the generated
SELECT
statement exports each column along with its individual timestamp. -
For both loading and unloading, DataStax Bulk Loader imports and exports timestamps using field names such as
writetime(<column>)
, where<column>
is the column’s internal CQL name. For example, if the table has a column namedMyCol
, its corresponding timestamp would be exported aswritetime(MyCol)
in the generated query and in the resulting connector record. If you intend to use this feature to export and import tables letting DataStax Bulk Loader generate the appropriate queries, these names are fine and need not be changed. If, however, you would like to export or import data to or from external sources that use different field names, you could do so by using the function writetime in a schema.mapping entry. For example, the following mapping would mapcol1
along with its timestamp to two distinct fields,field1
andfield1_writetime
:field1 = col1, field1_writetime = writetime(col1)
Default: false
--ttl, --schema.preserveTtl, --dsbulk.schema.preserveTtl { true | false }
Whether to preserve cell TTLs when loading and unloading.
Ignored when schema.query
is provided, or when the target table is a counter table.
If true
, the following rules are applied to generated queries:
-
When loading, instead of a single
INSERT
statement, the generated query is aBATCH
query; this is required in order to preserve individual column TTLs for each row. -
When unloading, the generated
SELECT
statement exports each column along with its individual TTL. -
For both loading and unloading, DataStax Bulk Loader imports and exports TTLs using field names such as
ttl(<column>)
, where<column>
is the column’s internal CQL name. For example, if the table has a column namedMyCol
, its corresponding TTL would be exported asttl(MyCol)
in the generated query and in the resulting connector record. If you intend to use this feature to export and import tables letting DataStax Bulk Loader generate the appropriate queries, these names are fine and need not be changed. If, however, you would like to export or import data to or from external sources that use different field names, you could do so by using the functionttl
in aschema.mapping
entry. For example, the following mapping would map col1 along with its TTL to two distinct fields,field1
andfield1_ttl
:
field1 = col1, field1_ttl = ttl(col1)
Default: false