Schema options
Schema options for the dsbulk command.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).
- -k, --schema.keyspace, --dsbulk.schema.keyspace string
- Keyspace used for loading or unloading data. Either keyspace (this option) or, for graph data, the graph option is required if query is not specified or is not qualified with a keyspace name. Note: Do not quote keyspace names, and note that they are case sensitive. MyKeyspace will match a keyspace named MyKeyspace but not mykeyspace. Default: null
- -t, --schema.table, --dsbulk.schema.table string
- Table used for loading or unloading data. Note: Do not quote table names, and note that they are case sensitive. MyTable will match a table named MyTable but not mytable. Default: null
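For example, a minimal sketch of a load that combines the keyspace and table options; the file export.csv, keyspace MyKeyspace, and table users are hypothetical, and -url is assumed to be the connector's URL option:

```
dsbulk load -url export.csv -k MyKeyspace -t users
```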
- -m, --schema.mapping, --dsbulk.schema.mapping string
- The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader for Apache Cassandra applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:
  - Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2 are the zero-based indices of fields in the source data, and col1, col2, col3 are bound variable names in the insert statement. A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.
  - Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data, and col1, col2, col3 are bound variable names in the insert statement. A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3.

  To specify that a field should be used as the timestamp (write time) or as the TTL (time to live) of the inserted row, use the specially named fake columns ttl(*) and writetime(*): fieldA = writetime(*), fieldB = ttl(*). Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit + codec.epoch; the latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing a duration in seconds and must use the format specified in codec.number. Attention: Starting in DataStax Bulk Loader 1.8.0, the special tokens __timestamp and __ttl are deprecated (but still honored). If used, a warning message is logged. When you can, replace any __timestamp and __ttl tokens with writetime(*) and ttl(*), respectively.
  To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (for example, now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (for example, field1 = now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).
  In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, you can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.
  To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].
  Any identifier, field, or column that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just as you would in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it). Note: Unlike the CQL grammar, unquoted identifiers are not lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 will match a column named MyColumn1, but will not match mycolumn1.
  The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records, others can only produce mapped ones, while others are capable of producing both indexed and mapped records at the same time. Refer to the connector's documentation to know which kinds of mapping it supports. Default: null
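As an illustrative sketch, the following load combines several of these mapping features; the file export.csv, keyspace ks1, table table1, and the field and column names are hypothetical:

```
dsbulk load -url export.csv -k ks1 -t table1 \
  -m '* = *, fieldA = writetime(*), c5 = c4'
```

Here all like-named fields map to their columns, fieldA supplies the row's write time, and the source field c5 maps to column c4.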
- --schema.allowExtraFields, --dsbulk.schema.allowExtraFields { true | false }
- Specify whether to accept records that contain extra fields that are not declared in the mapping. For example, assume a record contains three fields A, B, and C, but the mapping declares only fields A and B. When this option is true, C is silently ignored and the record is considered valid. When set to false, the record is rejected. Only applicable for loading; ignored otherwise. Default: true
- --schema.allowMissingFields, --dsbulk.schema.allowMissingFields { true | false }
- Specify whether to accept records that are missing fields declared in the mapping. For example, assume the mapping declares three fields A, B, and C, but a record contains only fields A and B. When this option is true, C is silently assigned null and the record is considered valid. When set to false, the record is rejected. Also applies to user-defined types (UDTs) and tuples, with element replacing field in the explanation. If the missing field is mapped to a primary key column, the record is always rejected, because the database would reject the record. Only applicable for loading; ignored otherwise. Default: false
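For instance, a sketch of a load that rejects records with extra fields but tolerates missing ones; the file, keyspace, and table names are hypothetical:

```
dsbulk load -url export.csv -k ks1 -t table1 \
  --schema.allowExtraFields false \
  --schema.allowMissingFields true
```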
- -g, --schema.graph, --dsbulk.schema.graph string
- The graph name used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. The keyspace or graph (this option) is required if the query option is not specified, or is not qualified with a keyspace name. Note: Do not quote graph names, and note that they are case sensitive. For example, MyGraph will match a graph named MyGraph, but not mygraph. Default: null
- -v, --schema.vertex, --dsbulk.schema.vertex string
- The vertex label used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. The vertex label must correspond to an existing table created with the VERTEX LABEL option of CREATE TABLE. A value for table, vertex (this option), or edge is required if the query option is not specified. Note: Do not quote vertex labels, and note that they are case sensitive. For example, MyVertex will match a label named MyVertex, but will not match myvertex. Default: null
- -e, --schema.edge, --dsbulk.schema.edge string
- The edge label used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. The edge label must correspond to an existing table created with the EDGE LABEL option of CREATE TABLE. Note: Do not quote edge labels, and note that they are case sensitive. For example, MyEdge will match a label named MyEdge but not myedge. Default: null
- -from, --schema.from, --dsbulk.schema.from string
- The name of the edge's incoming vertex label, for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. This option is mandatory when edge is specified; ignored otherwise. Note: Do not quote vertex labels, and note that they are case sensitive. For example, MyVertex will match a label named MyVertex but not myvertex. Default: null
- -to, --schema.to, --dsbulk.schema.to string
- The name of the edge's outgoing vertex label, for loading or unloading graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. This option is mandatory when edge is specified; ignored otherwise. Note: Do not quote vertex labels, and note that they are case sensitive. MyVertex will match a label named MyVertex but not myvertex. Default: null
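As a sketch, an edge load might combine these graph options as follows; the graph MyGraph, edge label Follows, vertex label Person, and file edges.csv are hypothetical:

```
dsbulk load -url edges.csv -g MyGraph -e Follows -from Person -to Person
```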
- --schema.nullToUnset, --dsbulk.schema.nullToUnset { true | false }
- Specify whether to map null input values to "unset" in the database; that is, do not modify a potentially pre-existing value of this field for this row. Valid for dsbulk load scenarios; otherwise ignored. Setting to false creates tombstones to represent null. The setting is applied after the codec.nullStrings setting, and may intercept nulls produced by that setting. The setting is ignored by dsbulk count. When set to true but the active protocol version does not support unset values (that is, all protocol versions less than 4), DataStax Bulk Loader for Apache Cassandra forces this setting to false and a warning is logged. Default: true
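For example, a sketch of a load that writes explicit nulls (tombstones) instead of leaving pre-existing values untouched; the file, keyspace, and table names are hypothetical:

```
dsbulk load -url export.csv -k ks1 -t table1 --schema.nullToUnset false
```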
- -query, --schema.query, --dsbulk.schema.query string
- The query to use. If not specified, then schema.keyspace and schema.table must be specified, and dsbulk will infer the appropriate statement based on the table's metadata, using all available columns. If schema.keyspace is provided, the query need not include the keyspace to qualify the table reference.
  Important: Using SELECT * queries is not recommended and can have non-deterministic results; that is, non-deterministic as to which data is exported, and in which order. The result may change between different runs of the same SELECT * query. Also, if a database table has DSE Search enabled, the pseudo-column solr_query could be accidentally exported by a SELECT * query. DataStax Bulk Loader transparently filters out the solr_query pseudo-column for all automatically generated queries. However, if you use the schema.query option, you may need to filter out the solr_query column, whenever applicable.
  For loading, the statement can be any INSERT, UPDATE, or DELETE statement. INSERT statements are preferred for most load operations, and bound variables should correspond to mapped fields; for example, INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC). UPDATE statements are required if the target table is a counter table, and the columns are updated with incremental operations (SET col1 = col1 + :fieldA), where :fieldA is a column in the input data. A DELETE statement will remove existing data during the load operation.
  For unloading and counting, the statement can be any regular SELECT statement. If the statement does not contain a WHERE clause, the engine will generate a token range restriction clause of the form token(...) > :start and token(...) <= :end, and will generate as many statements as there are token ranges in the cluster, thus allowing parallelization of reads while at the same time targeting coordinators that are also replicas. However, if the SELECT statement does contain a WHERE clause, the engine will only be able to parallelize the operation if that WHERE clause also includes a token(...) > :start and token(...) <= :end relation (the bound variables can have any name). Tip: Starting in DataStax Bulk Loader 1.6.0, ORDER BY, GROUP BY, and LIMIT clauses cause the query to be executed "as is," without parallelization.
  Statements can use both positional and named bound variables. Named bound variables are preferred. Named bound variables usually have names matching those of the columns in the destination table, but this is not a strict requirement; it is, however, required that their names match those of fields specified in the mapping. Positional variables will be named after their corresponding column in the destination table. Important: Positional variables cannot be used in SELECT statements that use WHERE clause restrictions with token(), a USING TIMESTAMP clause, or a USING TTL clause.
  When loading and unloading DataStax Graph data, the query must be provided in plain CQL; Gremlin queries are not supported.
  See the schema.mapping setting for more information. Default: null
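For illustration, hedged sketches of a load and an unload with custom queries; the keyspace ks1, table table1, the field names, and the assumption that c1 is the partition key are all hypothetical:

```
# Load with an explicit INSERT and named bound variables.
dsbulk load -url export.csv -k ks1 \
  -query "INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC)"

# Unload with a token range restriction so reads can be parallelized.
dsbulk unload -url /tmp/unload -k ks1 \
  -query "SELECT c1, c2 FROM table1 WHERE token(c1) > :start AND token(c1) <= :end"
```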
- --schema.queryTimestamp, --dsbulk.schema.queryTimestamp string
- The timestamp of inserted/updated cells during load; otherwise, the current time of the system running the tool is used. Only applicable to loading; ignored otherwise. Express the value in ISO_ZONED_DATE_TIME format. DSE sets query timestamps to the nearest microsecond and truncates sub-microseconds; any sub-microsecond information specified is lost. For more information, see the CQL Reference. Default: unspecified
- --schema.queryTtl, --dsbulk.schema.queryTtl number
- The time-to-live (TTL) of inserted/updated cells during load, in seconds; a value of -1 means there is no TTL. Not applicable to unloading. For more information, see the CQL Reference, Setting the time-to-live (TTL) for value, and Expiring data with time-to-live. Default: -1
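For example, a sketch of a load that pins the write timestamp and sets a one-day TTL; the timestamp value and the file, keyspace, and table names are hypothetical:

```
dsbulk load -url export.csv -k ks1 -t table1 \
  --schema.queryTimestamp "2022-02-01T12:00:00+01:00" \
  --schema.queryTtl 86400
```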
- --schema.splits, --dsbulk.schema.splits number
- The number of token range splits in which to divide the token ring. This setting determines how many read requests are generated when reading an entire table. Only used when unloading and counting; ignored otherwise. Note: The actual number of splits may be slightly greater or less than the number specified, depending on the actual cluster topology and token ownership. Also, it is not possible to generate fewer splits than the total number of primary token ranges in the cluster; thus, the actual number of splits is always equal to or greater than that number. Set the --schema.splits value higher if you experience timeouts when reading from DSE, DataStax Astra, or Apache Cassandra tables, especially if paging is disabled. You can optionally use the special syntax nC to specify a number that is a multiple of the available cores; for example, if the number of cores is 8, --schema.splits 0.5C = 0.5 * 8, which results in 4 splits. Default: 8C
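For instance, a sketch of an unload that doubles the default number of splits per core; the path, keyspace, and table names are hypothetical:

```
dsbulk unload -url /tmp/unload -k ks1 -t table1 --schema.splits 16C
```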
- --timestamp, --schema.preserveTimestamp, --dsbulk.schema.preserveTimestamp { true | false }
- Whether to preserve cell timestamps when loading and unloading. Ignored when schema.query is provided, or when the target table is a counter table. If true, the following rules will be applied to generated queries:
  - When loading, instead of a single INSERT statement, the generated query will be a BATCH query; this is required in order to preserve individual column timestamps for each row.
  - When unloading, the generated SELECT statement will export each column along with its individual timestamp.
  - For both loading and unloading, DataStax Bulk Loader will import and export timestamps using field names such as writetime(<column>), where <column> is the column's internal CQL name. For example, if the table has a column named MyCol, its corresponding timestamp would be exported as writetime(MyCol) in the generated query and in the resulting connector record. If you intend to use this feature to export and import tables letting DataStax Bulk Loader generate the appropriate queries, these names are fine and need not be changed. If, however, you would like to export or import data to or from external sources that use different field names, you can do so by using the function writetime in a schema.mapping entry, as shown in the sketch after this list. For example, the following mapping would map col1 along with its timestamp to two distinct fields, field1 and field1_writetime: field1 = col1, field1_writetime = writetime(col1)

  Default: false
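A sketch of an unload using such a mapping; the path, keyspace, table, field, and column names are hypothetical:

```
dsbulk unload -url /tmp/unload -k ks1 -t table1 \
  -m 'field1 = col1, field1_writetime = writetime(col1)'
```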
- --ttl, --schema.preserveTtl, --dsbulk.schema.preserveTtl { true | false }
- Whether to preserve cell TTLs when loading and unloading. Ignored when schema.query is provided, or when the target table is a counter table. If true, the following rules will be applied to generated queries:
  - When loading, instead of a single INSERT statement, the generated query will be a BATCH query; this is required in order to preserve individual column TTLs for each row.
  - When unloading, the generated SELECT statement will export each column along with its individual TTL.
  - For both loading and unloading, DataStax Bulk Loader will import and export TTLs using field names such as ttl(<column>), where <column> is the column's internal CQL name. For example, if the table has a column named MyCol, its corresponding TTL would be exported as ttl(MyCol) in the generated query and in the resulting connector record. If you intend to use this feature to export and import tables letting DataStax Bulk Loader generate the appropriate queries, these names are fine and need not be changed. If, however, you would like to export or import data to or from external sources that use different field names, you can do so by using the function ttl in a schema.mapping entry, as shown in the sketch after this list. For example, the following mapping would map col1 along with its TTL to two distinct fields, field1 and field1_ttl: field1 = col1, field1_ttl = ttl(col1)

  Default: false
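Similarly, a sketch of an unload that exports a column's TTL under a custom field name; all names are hypothetical:

```
dsbulk unload -url /tmp/unload -k ks1 -t table1 \
  -m 'field1 = col1, field1_ttl = ttl(col1)'
```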