Schema options

Specify schema options for the dsbulk command.

Options highlighted with an icon (DataStax Graph icon) are available only for DataStax Graph 6.8.x data. Not available for Apache Cassandra® or DataStax Astra DB.

The options can be used in short form (-k keyspace_name) or long form (--schema.keyspace keyspace_name).

-k, --schema.keyspace, --dsbulk.schema.keyspace string

Keyspace used for loading or unloading data.

Do not quote keyspace names, and note that they are case sensitive. MyKeyspace matches a keyspace named MyKeyspace but not mykeyspace.

Either keyspace (this option) or for graph data the graph option is required if query is not specified or is not qualified with a keyspace name.

Default: null

-t, --schema.table, --dsbulk.schema.table string

Table used for loading or unloading data.

Do not quote table names, and note that they are case sensitive. MyTable matches a table named MyTable but not mytable.

This table option, the vertex option, or, for graph data, the edge option is required if query is not specified.

Default: null

-m, --schema.mapping, --dsbulk.schema.mapping string

The field-to-column mapping to use. Applies to loading and unloading. If not specified, DataStax Bulk Loader applies a strict one-to-one mapping between the source fields and the database table. If that is not your intention, you must supply an explicit mapping. Mappings should be specified as a map of the following form:

  • Indexed data sources: 0 = col1, 1 = col2, 2 = col3, where 0, 1, 2, are the zero-based indices of fields in the source data; and col1, col2, col3 are bound variable names in the insert statement.

  • A shortcut to map the first n fields is to simply specify the destination columns: col1, col2, col3.

  • Mapped data sources: fieldA = col1, fieldB = col2, fieldC = col3, where fieldA, fieldB, fieldC are field names in the source data; and col1, col2, col3 are bound variable names in the insert statement.

  • A shortcut to map fields named like columns is to simply specify the destination columns: col1, col2, col3. To specify that a field should be used as the timestamp (write time) or as ttl (time to live) of the inserted row, use the specially named fake columns ttl() and writetime(): fieldA = writetime(), fieldB = ttl().

Starting in DataStax Bulk Loader 1.8.0, the special tokens \\__timestamp and \\__ttl are deprecated (but still honored) . If used, a warning message is logged. When you can, replace any \\__timestamp and \\__ttl tokens with writetime (*) and ttl(*), respectively.

Timestamp fields can be parsed as CQL timestamp columns and must use the format specified in either codec.timestamp or codec.unit + codec.epoch; the latter is an integer representing the number of units specified by codec.unit since the specified epoch. TTL fields are parsed as integers representing duration in seconds and must use the format specified in codec.number.

To specify that a column should be populated with the result of a function call for loading operations, specify the function call as the input field (e.g. now() = c4). Similarly, to specify that a field should be populated with the result of a function call for unloading operations, specify the function call as the input column (e.g. field1=now()). Function calls can also be qualified by a keyspace name: field1 = keyspace1.max(c1, c2).

In addition, for mapped data sources, it is also possible to specify that the mapping be partly auto-generated and partly explicitly specified. For example, if a source row has fields c1, c2, c3, and c5, and the table has columns c1, c2, c3, c4, one can map all like-named columns and specify that c5 in the source maps to c4 in the table as follows: * = *, c5 = c4.

To specify that all like-named fields be mapped, except for c2, use: * = -c2. To skip c2 and c3, use: * = [-c2, -c3].

Any identifier, field, or column, that is not strictly alphanumeric (that is, not matching [a-zA-Z0-9_]+) must be surrounded by double-quotes, just like you would do in CQL: "Field ""A""" = "Column 2" (to escape a double-quote, simply double it).

Unlike CQL grammar, unquoted identifiers are not lowercased by DataStax Bulk Loader. An identifier such as MyColumn1 matches a column named MyColumn1, but does not match mycolumn1.

The exact type of mapping to use depends on the connector being used. Some connectors can only produce indexed records; others can only produce mapped ones, while others are capable of producing both indexed and mapped records at the same time. Refer to the connector’s documentation to know which kinds of mapping it supports.

Default: null

--schema.allowExtraFields, --dsbulk.allowExtraFields { true | false }

Specify whether to accept records that contain extra fields that are not declared in the mapping. For example, assume a record contains three fields A, B, and C, but the mapping declares only fields A and B. When this option is true, C is silently ignored and the record is considered valid. When set to false, the record is rejected. Only applicable for loading; ignored otherwise.

Default: true

--schema.allowMissingFields, --dsbulk.allowMissingFields { true | false }

Specify whether to accept records that are missing fields declared in the mapping. For example, assume the mapping declares three fields A, B, and C, but a record contains only fields A and B. When this option is true, C is silently assigned null and the record is considered valid. When set to false, the record is rejected. Also applies to user-defined types (UDTs) and tuples, with element replacing field in explanation. If the missing field is mapped to a primary key column, the record is always rejected because the database would reject the record. Only applicable for loading; ignored otherwise.

Default: false

DataStax Graph icon -g, --schema.graph, --dsbulk.schema.graph string

The graph name used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later.

Do not quote graph names, and note that they are case sensitive. For example, MyGraph matches a graph named MyGraph, but not mygraph.

The keyspace or graph (this option) is required if the query option is not specified, or is not qualified with a keyspace name.

Default: null

DataStax Graph icon -v, --schema.vertex, --dsbulk.schema.vertex string

The vertex label used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. The vertex label must correspond to an existing table created with the VERTEX LABEL option of CREATE TABLE.

+

Do not quote vertex labels, and note that they are case sensitive. For example, MyVertex matches a label named MyVertex, but does not match myvertex.

A value for vertex (this option), or edge is required if the query option is not specified.

For related CREATE TABLE information, see Creating a table with a vertex label or an edge label in the CQL commands reference.

Default: null

DataStax Graph icon -e, --schema.edge, --dsbulk.schema.edge string

The edge label used for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. The edge label must correspond to an existing table created with the EDGE LABEL option of CREATE TABLE.

Do not quote edge labels, and note that they are case sensitive. For example, MyEdge matches a label named MyEdge but does not match myedge.

Also, when this edge option is specified, the from and the to options must be specified.

One of table, vertex, or edge is required if query is not specified.

For related CREATE TABLE information, see Creating a table with a vertex label or an edge label in the CQL commands reference.

Default: null

DataStax Graph icon -from, --schema.from, --dsbulk.schema.from string

The name of the edge’s incoming vertex label, for loading or unloading DataStax Graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. This option is mandatory when edge is specified; ignored otherwise.

Do not quote vertex labels, and note that they are case sensitive. For example, MyVertex matches a label named MyVertex but does not match myvertex.

Default: null

DataStax Graph icon -to, --schema.to, --dsbulk.schema.to string

The name of the edge’s outgoing vertex label, for loading or unloading graph data. You can use this option only for graphs created with DataStax Graph 6.8.0 and later. This option is mandatory when edge is specified; ignored otherwise.

Do not quote vertex labels, and note that they are case sensitive. MyVertex matches a label named MyVertex but does not match myvertex.

Default: null

--schema.nullToUnset, --dsbulk.schema.nullToUnset { true | false }

Specify whether to map null input values to "unset" in the database. That is, the effect is to not modify a potentially pre-existing value of this field for this row. Valid for dsbulk load scenarios, otherwise ignored. Setting to false creates tombstones to represent null.

The setting is applied after the codec.nullStrings setting, and may intercept nulls produced by that setting.

The setting is ignored by dsbulk count. When set to true, but the active protocol version does not support unset values (that is, all protocol versions less than 4), DataStax Bulk Loader forces this setting to false and a warning is logged.

Default: true

-query,--schema.query, --dsbulk.schema.query string

The query to use. If not specified, then schema.keyspace and schema.table must be specified, and dsbulk infers the appropriate statement based on the table’s metadata, using all available columns. If schema.keyspace is provided, the query need not include the keyspace to qualify the table reference.

Using SELECT * queries is not recommended and can have non-deterministic results. That is, non-deterministic as to which data is exported, and in which order. The result may change between different runs of the same SELECT * query. Also, if a database table has DSE Search enabled, the pseudo-column solr_query could be accidentally exported by a SELECT * query. DataStax Bulk Loader transparently filters out the solr_query pseudo-column for all automatically-generated queries. However, if you use the schema.query option, you may need to filter out that solr_query column, whenever applicable.

For loading, the statement can be any INSERT, UPDATE or DELETE statement. INSERT statements are preferred for most load operations, and bound variables should correspond to mapped fields; for example, INSERT INTO table1 (c1, c2, c3) VALUES (:fieldA, :fieldB, :fieldC). UPDATE statements are required if the target table is a counter table, and the columns are updated with incremental operations (SET col1 = col1 + :fieldA) where :fieldA is a column in the input data. A DELETE statement removes existing data during the load operation.

For unloading and counting, the statement can be any regular SELECT statement. If the statement does not contain a WHERE clause, the engine generates a token range restriction clause of the form: token(...) > :start and token(...) <= :end, and the engine generates as many statements as there are token ranges in the cluster, thus allowing parallelization of reads while at the same time targeting coordinators that are also replicas. However, if the SELECT statement does contain a WHERE clause, the engine is only be able to parallelize the operation if that WHERE clause also includes atoken(...) > :start and token(...) <= :end relation (the bound variables can have any name).

Starting in DataStax Bulk Loader 1.6.0, ORDER BY, GROUP BY and LIMIT clauses now cause the query to be executed "as is," without parallelization.

Statements can use both positional and named bound variables. Named bound variables are preferred. Named bound variables usually have names matching those of the columns in the destination table, but this is not a strict requirement; it is, however, required that their names match those of fields specified in the mapping. Positional variables are named after their corresponding column in the destination table.

Positional variable cannot be used in SELECT statements that use WHERE clause restrictions using token(), USING TIMESTAMP clause, or USING TTL clause.

When loading and unloading DataStax Graph data, the query must be provided in plain CQL; Gremlin queries are not supported.

See schema mapping setting for more information.

Default: null

--schema.queryTimestamp, --dsbulk.queryTimestamp string

The timestamp of inserted/updated cells during load; otherwise, the current time of the system running the tool is used. Only applicable to loading; ignored otherwise. Express the value in ISO_ZONED_DATE_TIME format. DSE sets the query timestamps to the nearest microsecond and truncates sub-microseconds; any sub-microsecond information specified is lost. For more information, see the CQL Reference.

Default: unspecified

--schema.queryTtl, --dsbulk.schema.queryTtl number

The Time-To-Live (TTL) of inserted/updated cells during load (seconds); a value of -1 means there is no TTL. Not applicable to unloading. For more information, see the CQL Reference, Setting the time-to-live (TTL) for value, and Expiring data with time-to-live.

Default: -1

--schema.splits, --dsbulk.schema.splits number

The number of token range splits in which to divide the token ring. This setting determines how many read requests are generated when reading an entire table. Only used when unloading and counting; ignored otherwise.

The actual number of splits may be slightly greater or less than the number specified, depending on the actual cluster topology and token ownership. Also, it is not possible to generate fewer splits than the total number of primary token ranges in the cluster. Thus, the actual number of splits is always equal to or greater than that number.

Set the --schema.splits value to higher values if you experience timeouts when reading from DSE, DataStax Astra, or Apache Cassandra tables, especially if paging is disabled. You can optionally use the special syntax nC to specify a number that is a multiple of the available cores, resulting in a calculated number of splits. For example, if the number of cores is 8, --schema.splits 0.5C = 0.5 * 8, which results in 4 splits.

Default: 8C

--timestamp, --schema.preserveTimestamp, --dsbulk.schema.preserveTimestamp { true | false }

Whether to preserve cell timestamps when loading and unloading. Ignored when schema.query is provided, or when the target table is a counter table. If true, the following rules are applied to generated queries:

  • When loading, instead of a single INSERT statement, the generated query is a BATCH query. This is required in order to preserve individual column timestamps for each row.

  • When unloading, the generated SELECT statement exports each column along with its individual timestamp.

  • For both loading and unlaoding, DataStax Bulk Loader imports and exports timestamps using field names such as writetime(<column>), where <column> is the column’s internal CQL name. For example, if the table has a column named MyCol, its corresponding timestamp would be exported as writetime(MyCol) in the generated query and in the resulting connector record. If you intend to use this feature to export and import tables letting DataStax Bulk Loader generate the appropriate queries, these names are fine and need not be changed. If, however, you would like to export or import data to or from external sources that use different field names, you could do so by using the function writetime in a schema.mapping entry. For example, the following mapping would map col1 along with its timestamp to two distinct fields, field1 and field1_writetime:

    field1 = col1, field1_writetime = writetime(col1)

Default: false

--ttl, --schema.preserveTtl, --dsbulk.schema.preserveTtl { true | false }

Whether to preserve cell TTLs when loading and unloading. Ignored when schema.query is provided, or when the target table is a counter table. If true, the following rules are applied to generated queries:

  • When loading, instead of a single INSERT statement, the generated query is a BATCH query; this is required in order to preserve individual column TTLs for each row.

  • When unloading, the generated SELECT statement exports each column along with its individual TTL.

  • For both loading and unloading, DataStax Bulk Loader imports and exports TTLs using field names such as ttl(<column>), where <column> is the column’s internal CQL name. For example, if the table has a column named MyCol, its corresponding TTL would be exported as ttl(MyCol) in the generated query and in the resulting connector record. If you intend to use this feature to export and import tables letting DataStax Bulk Loader generate the appropriate queries, these names are fine and need not be changed. If, however, you would like to export or import data to or from external sources that use different field names, you could do so by using the function ttl in a schema.mapping entry. For example, the following mapping would map col1 along with its TTL to two distinct fields, field1 and field1_ttl:

field1 = col1, field1_ttl = ttl(col1)

Default: false

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com