CREATE TABLE
Define a new table.
Synopsis
CREATE TABLE [IF NOT EXISTS] [keyspace_name.]table_name (
column_definition [, ...]
PRIMARY KEY (column_name [, column_name ...]))
[WITH table_options
| CLUSTERING ORDER BY (clustering_column_name order)
| ID = 'table_hash_tag'
| COMPACT STORAGE]
Syntax conventions | Description |
---|---|
UPPERCASE | Literal keyword. |
Lowercase | Not literal. |
Italics | Variable value. Replace with a user-defined value. |
[] | Optional. Square brackets ( [] ) surround optional command arguments. Do not type the square brackets. |
( ) | Group. Parentheses ( ( ) ) identify a group to choose from. Do not type the parentheses. |
\| | Or. A vertical bar ( \| ) separates alternative elements. Type any one of the elements. Do not type the vertical bar. |
... | Repeatable. An ellipsis ( ... ) indicates that you can repeat the syntax element as often as required. |
'Literal string' | Single quotation ( ' ) marks must surround literal strings in CQL statements. Use single quotation marks to preserve upper case. |
{ key : value } | Map collection. Braces ( { } ) enclose map collections or key-value pairs. A colon separates the key and the value. |
<datatype1,datatype2> | Set, list, map, or tuple. Angle brackets ( < > ) enclose data types in a set, list, map, or tuple. Separate the data types with a comma. |
cql_statement; | End CQL statement. A semicolon ( ; ) terminates all CQL statements. |
[--] | Separate the command line options from the command arguments with two hyphens ( -- ). This syntax is useful when arguments might be mistaken for command line options. |
' <schema> ... </schema> ' | Search CQL only: Single quotation marks ( ' ) surround an entire XML schema declaration. |
@xml_entity='xml_entity_type' | Search CQL only: Identify the entity and literal value to overwrite the XML element in the schema and solrConfig files. |
Description
The CREATE TABLE command creates a new table in the current keyspace. Attempting to create a table that already exists returns an error unless the IF NOT EXISTS option is used; with that option, the statement is a no-op if the table already exists.
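For example, the following statement succeeds whether or not the table exists (the table and column names are illustrative, not from this reference):
CREATE TABLE IF NOT EXISTS cycling.cyclist_alt_stats ( id UUID PRIMARY KEY, lastname text, birthday date );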
A static column can store the same data in multiple clustered rows of a partition, and then retrieve that data with a single SELECT statement.
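As a minimal sketch (the table and column names are illustrative), the flag column below holds a single value shared by every row in a country partition:
CREATE TABLE cycling.country_flag (
  country text,
  cyclist_name text,
  flag int STATIC,
  PRIMARY KEY (country, cyclist_name));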
You can add a counter column to a counter table.
column_definition
Names the column, defines the data type, and optionally sets the column to static or primary key.
column_name cql_type_definition [STATIC | PRIMARY KEY] [, ...]
Parameter | Description |
---|---|
column_name | Use a unique name for each column in the table. To preserve case or use special characters, enclose the name in double quotes. |
cql_type_definition | Defines the type of data allowed in the column; see CQL data type or a user-defined type. |
STATIC | Optional. The column has a single value shared by all rows of a partition. |
PRIMARY KEY | Optional. Indicates that the column is the only primary key for the table; to define a key that uses multiple columns, see PRIMARY KEY. |
- When PRIMARY KEY appears at the end of a column definition, that column is the only primary key for the table.
- A table must have at least one PRIMARY KEY.
- A static column cannot be a primary key.
- Primary keys can include frozen collections.
Create a table that has a frozen user-defined type:
CREATE TABLE cycling.race_winners (
race_name text,
race_position int,
cyclist_name FROZEN<fullname>,
PRIMARY KEY (race_name, race_position));
See "Creating a user-defined type" for information on creating UDTs. In Cassandra 3.6 and later, UDTs can be created unfrozen if only non-collection fields are used in the user-defined type creation. If the table is created with an unfrozen UDT, then individual field values can be updated and deleted.
PRIMARY KEY
Uniquely identifies rows, determines storage partitions, and data ordering (clustering) within the partition.
Single key
When the PRIMARY KEY is one column, append PRIMARY KEY to the end of the column definition. This is the only schema information required to create a table. When there is one primary key, it is the partition key; the data is divided and stored by the unique values in this column.
column_name cql_type_definition PRIMARY KEY
Alternatively, you can declare a primary key consisting of only one column in the same way as you declare a compound primary key.
CREATE TABLE cycling.cyclist_name ( id UUID PRIMARY KEY, lastname text, firstname text );
Primary key columns cannot use the counter data type or a non-frozen collection, and cannot be static.
Compound key
PRIMARY KEY (partition_column_name, clustering_column_name [, ...])
CREATE TABLE cycling.cyclist_category ( category text, points int, id UUID, lastname text, PRIMARY KEY (category, points)) WITH CLUSTERING ORDER BY (points DESC);
Composite partition key
PRIMARY KEY (
(partition_column_name[, ...]),
clustering_column_name [, ...])
CREATE TABLE cycling.rank_by_year_and_name ( race_year int, race_name text, cyclist_name text, rank int, PRIMARY KEY ((race_year, race_name), rank) );
table_options
CQL table properties and descriptions of the syntax.
Table properties tune data handling, including I/O operations, compression, and compaction. Set table properties in CREATE TABLE or ALTER TABLE requests using one of the following formats:
option_name = 'value' [AND ...]
option_name = { 'subproperty' : 'value' [, ...] } [AND ...]
Use the WITH clause to define table property options, separating multiple values with AND, for example:
ALTER TABLE [keyspace_name.]table_name
WITH option_name = 'value'
AND option_name = option_map;
- bloom_filter_fp_chance
-
False-positive probability for the SSTable bloom filter. When a client requests data, the bloom filter checks if the row exists before executing disk I/O.
bloom_filter_fp_chance = N
Value ranges from 0 to 1.0, where:
0: (min value) enables the largest possible bloom filter and uses the most memory.
1.0: (max value) disables the bloom filter.
Tip: Recommended setting: 0.1. A higher value yields diminishing returns.
Default: bloom_filter_fp_chance = 0.01
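For example, a sketch that applies the recommended setting to an existing table (the table name is illustrative):
ALTER TABLE cycling.cyclist_name WITH bloom_filter_fp_chance = 0.1;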
- caching
-
Caching optimizes the use of cache memory of a table without manual tuning. Cassandra weighs the cached data by size and access frequency. Coordinate this setting with the global caching properties in the cassandra.yaml file. See the Cassandra 3.0 documentation.
caching = { 'keys' : 'ALL' | 'NONE', 'rows_per_partition' : 'ALL' | 'NONE' | N }
Valid values:
ALL – all primary keys or rows
NONE – no primary keys or rows
N: (rows_per_partition only) number of rows; specify a whole number. Cassandra caches only the first N rows in a partition, as determined by the clustering order.
For example, to cache all riders in each age partition:
ALTER MATERIALIZED VIEW cycling.cyclist_by_age WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : 'ALL' };
Default: { 'keys' : 'NONE', 'rows_per_partition' : 'NONE' }
- cdc
-
Create a Change Data Capture (CDC) log on the table.
cdc = TRUE | FALSE
Valid values:
TRUE – create a CDC log
FALSE – do not create a CDC log
This table option can be set with CREATE TABLE and ALTER TABLE, but not with CREATE MATERIALIZED VIEW or ALTER MATERIALIZED VIEW.
For example, enable a CDC log on an existing table:
ALTER TABLE cycling.cyclist_name WITH cdc = TRUE;
- comment
-
Provide documentation on the table.
comment = 'text'
Tip: Enter a description of the types of queries the table was designed to satisfy. For example, note the base table for the materialized view:
ALTER MATERIALIZED VIEW cycling.cyclist_by_age WITH comment = 'Basetable: cyclist_mv';
- dclocal_read_repair_chance
- Probability that a successful read operation triggers a read repair, between 0 and 1; default value: 0.01. Unlike the repair controlled by read_repair_chance, this repair is limited to replicas in the same DC as the coordinator.
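For example, a sketch that sets the local read repair probability on an existing table (the table name is illustrative):
ALTER TABLE cycling.cyclist_name WITH dclocal_read_repair_chance = 0.1;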
- default_time_to_live
- TTL (Time To Live) in seconds, where zero is disabled. When specified, the value is set as the TTL marker on each column in the table; default value: 0. When the TTL expires, the data is tombstoned.
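For example, a sketch that expires every row one day (86400 seconds) after it is written (the table name is illustrative):
ALTER TABLE cycling.calendar WITH default_time_to_live = 86400;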
- gc_grace_seconds
- Seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage collection. Default value: 864000 (10 days). The default value allows time for Cassandra to maximize consistency prior to deletion. In a single-node cluster, this property can safely be set to zero. You can also reduce this value for tables whose data is not explicitly deleted, for example, tables containing only data with TTL set, or tables with default_time_to_live set. However, if you lower the gc_grace_seconds value, consider its interaction with these operations:
- hint replays: When a node goes down and then comes back up, other nodes replay the write operations (called hints) that are queued for that node while it was unresponsive. Cassandra does not replay hints older than gc_grace_seconds after creation. The max_hint_window_in_ms setting in the cassandra.yaml file sets the time limit (3 hours by default) for collecting hints for the unresponsive node.
- batch replays: Like hint queues, batch operations store database mutations that are replayed in sequence. As with hints, Cassandra does not replay a batched mutation until gc_grace_seconds after it was created. If your application uses batch operations, consider the possibility that decreasing gc_grace_seconds increases the chance that a batched write operation may restore deleted data. The batchlog_replay_throttle_in_kb and concurrent_batchlog_writes properties in the cassandra.yaml file give some control of the batch replay process. The most important factors, however, are the size and scope of the batches you use.
- memtable_flush_period_in_ms
- Milliseconds before memtables associated with the table are flushed. Default value: 0.
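For example, a sketch that flushes the table's memtables at least once an hour (3600000 ms; the table name is illustrative):
ALTER TABLE cycling.cyclist_name WITH memtable_flush_period_in_ms = 3600000;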
- min_index_interval
- Minimum gap between index entries in the index summary. A lower min_index_interval means the index summary contains more entries from the index, which allows Cassandra to search fewer index entries to execute a read. A larger index summary may also use more memory. The value for min_index_interval is the densest possible sampling of the index.
- max_index_interval
- If the total memory usage of all index summaries reaches this value, Cassandra decreases the index summaries for the coldest SSTables to the maximum set by max_index_interval. The max_index_interval is the sparsest possible sampling in relation to memory pressure.
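For example, a sketch that tunes both min_index_interval and max_index_interval on an existing table (the values are illustrative, not recommendations):
ALTER TABLE cycling.cyclist_name WITH min_index_interval = 128 AND max_index_interval = 2048;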
- read_repair_chance
- The probability that a successful read operation triggers a read repair. Unlike the repair controlled by dclocal_read_repair_chance, this repair is not limited to replicas in the same DC as the coordinator. The value must be between 0 and 1; default value: 0.0.
- speculative_retry
-
Overrides the normal read timeout when read_repair_chance is not 1.0, sending another request to read. Specify the value as a number followed by a type, ms (milliseconds) or percentile. For example, speculative_retry = '3ms'.
Use the speculative retry property to configure rapid read protection. In a normal read, Cassandra sends data requests to just enough replica nodes to satisfy the consistency level. In rapid read protection, Cassandra sends out extra read requests to other replicas, even after the consistency level has been met. The speculative retry property specifies the trigger for these extra read requests.
ALWAYS: The coordinator node sends extra read requests to all other replicas after every read of that table.
Xpercentile: Cassandra constantly tracks each table's typical read latency (in milliseconds). Set speculative retry to Xpercentile to tell the coordinator node to retrieve the typical latency time of the table being read and calculate X percent of that figure. The coordinator sends redundant read requests if the number of milliseconds it waits without responses exceeds that calculated figure. (For example, if the speculative_retry property for Table_A is set to 80percentile, and that table's typical latency is 60 milliseconds, the coordinator node handling a read of Table_A would send a normal read request first, and send out redundant read requests if it received no responses within 48 ms, which is 80% of 60 ms.)
Nms: The coordinator node sends extra read requests to all other replicas if the coordinator node has not received any responses within N milliseconds.
NONE: The coordinator node does not send extra read requests after any read of that table.
For example:
ALTER TABLE users WITH speculative_retry = '10ms';
ALTER TABLE users WITH speculative_retry = '99percentile';
compression
Configuring compression for a table.
Configure compression by specifying the compression algorithm class followed by the subproperties in simple JSON format. Choosing the right compressor depends on your requirements for space savings over read performance. LZ4 is fastest to decompress, followed by Snappy, then by Deflate. Compression effectiveness is inversely correlated with decompression speed. The extra compression from Deflate or Snappy is not enough to make up for the decreased performance for general-purpose workloads, but for archival data they may be worth considering. Developers can also implement custom compression classes using the org.apache.cassandra.io.compress.ICompressor interface.
compression = {
['class' : 'compression_algorithm_name',
'chunk_length_kb' : 'value',
'crc_check_chance' : 'value']
| ['sstable_compression' : '']
}
Subproperty | Description |
---|---|
class | The class name of the compression algorithm. Cassandra provides the built-in classes LZ4Compressor, SnappyCompressor, and DeflateCompressor. Default value: LZ4Compressor. |
chunk_length_kb | Size (in KB) of the block. On disk, SSTables are compressed by block to allow random reads. Values larger than the default might improve the compression rate, but increase the minimum size of data to be read from disk when a read occurs. The default value is a good middle ground for compressing tables. Adjust compression size to account for read/write access patterns (how much data is typically requested at once) and the average size of rows in the table. Default value: 64KB. |
crc_check_chance | When compression is enabled, each compressed block includes a checksum of that block for the purpose of detecting disk bitrot and avoiding the propagation of corruption to other replicas. This option defines the probability with which those checksums are checked during read. By default they are always checked. Set to 0 to disable checksum checking and to 0.5, for instance, to check them on every other read. Default value: 1.0. |
sstable_compression | Disables compression. Specify a null value (''). |
To disable compression, specify the sstable_compression option with a value of empty string (''):
ALTER TABLE cycling.cyclist_name WITH COMPRESSION = {'sstable_compression': ''};
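Conversely, a sketch that enables LZ4 compression with a larger block size (the 256 KB value is illustrative, not a recommendation):
ALTER TABLE cycling.cyclist_name WITH compression = { 'class' : 'LZ4Compressor', 'chunk_length_kb' : 256 };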
compaction
Constructing a map of the compaction option and its subproperties.
The compaction option in the WITH clause defines the strategy for cleaning up data after writes. Define a compaction class and properties in simple JSON format:
compaction = {
'class' : 'compaction_strategy_name'
[, 'subproperty_name' : 'value',...]
}
For more guidance, see the When to Use Leveled Compaction, Leveled Compaction in Apache Cassandra blog, and How data is maintained.
compaction properties
- SizeTieredCompactionStrategy (STCS)
-
Triggers a minor compaction when the table meets the min_threshold. Minor compactions do not involve all the tables in a keyspace. See SizeTieredCompactionStrategy in the Cassandra documentation for more details. This is the default compaction strategy.
Table 3. STCS subproperties
Subproperty | Description |
---|---|
bucket_high | Size-tiered compaction merges sets of SSTables that are approximately the same size. Cassandra compares each SSTable size to the average of all SSTable sizes on the node. It merges SSTables whose size in KB is within [average-size × bucket_low] and [average-size × bucket_high]. Default value: 1.5. |
bucket_low | See bucket_high. Default value: 0.5. |
enabled | Enables background compaction. Default value: true. See Enabling and disabling background compaction. |
log_all | Activates advanced logging for the entire cluster. Default value: false. |
max_threshold | The maximum number of SSTables to allow in a minor compaction. Default value: 32. |
min_threshold | The minimum number of SSTables to trigger a minor compaction. Default value: 4. |
min_sstable_size | STCS groups SSTables into buckets. The bucketing process groups SSTables that differ in size by less than 50%. This bucketing process is too fine-grained for small SSTables. If your SSTables are small, use min_sstable_size to define a size threshold (in bytes) below which all SSTables belong to one unique bucket. Default value: 50MB. |
only_purge_repaired_tombstones | In Cassandra 3.0 and later: true allows purging tombstones only from repaired SSTables. The purpose is to prevent data from resurrecting if repair is not run within gc_grace_seconds. If you do not run repair for a long time, Cassandra keeps all tombstones, which may cause problems. Default value: false. |
tombstone_compaction_interval | The minimum number of seconds after an SSTable is created before Cassandra considers the SSTable for tombstone compaction. An SSTable is eligible for tombstone compaction if the table exceeds the tombstone_threshold ratio. Default value: 86400. |
tombstone_threshold | The ratio of garbage-collectable tombstones to all contained columns. If the ratio exceeds this limit, Cassandra starts compaction on that table alone, to purge the tombstones. Default value: 0.2. |
unchecked_tombstone_compaction | True allows Cassandra to run tombstone compaction without pre-checking which tables are eligible for this operation. Even without this pre-check, Cassandra checks an SSTable to make sure it is safe to drop tombstones. Default value: false. |
Note: Cassandra 3.0 and later no longer support the cold_reads_to_omit property for SizeTieredCompactionStrategy.
Note: For more details, see How data is maintained > Compaction Strategy > STCS.
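For example, a sketch that raises the number of similarly sized SSTables required before a minor compaction runs (the table name and value are illustrative):
ALTER TABLE cycling.cyclist_name WITH compaction = { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 6 };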
- DateTieredCompactionStrategy
-
Stores data written within a certain period of time in the same SSTable. Also see DateTieredCompactionStrategy in the Cassandra documentation.
Table 4. DTCS subproperties
Subproperty | Description |
---|---|
base_time_seconds | The size of the first time window. Default value: 3600. |
enabled | Enables background compaction. Default value: true. See Enabling and disabling background compaction. |
log_all | True activates advanced logging for the entire cluster. Default value: false. |
max_sstable_age_days | Cassandra does not compact an SSTable whose most recent data is older than this property. Fractional days can be set. Default value: 1000. Attention: This parameter is deprecated. |
max_window_size_seconds | The maximum window size in seconds. Default value: 86400. |
max_threshold | The maximum number of SSTables allowed in a minor compaction. Default value: 32. |
min_threshold | The minimum number of SSTables that trigger a minor compaction. Default value: 4. |
timestamp_resolution | Units, MICROSECONDS or MILLISECONDS, to match the timestamp of inserted data. Default value: MICROSECONDS. |
tombstone_compaction_interval | The minimum number of seconds after an SSTable is created before Cassandra considers the SSTable for tombstone compaction. Tombstone compaction is triggered if the number of garbage-collectable tombstones in the SSTable is greater than tombstone_threshold. Default value: 86400. |
tombstone_threshold | The ratio of garbage-collectable tombstones to all contained columns. If the ratio exceeds this limit, Cassandra starts compaction on that table alone, to purge the tombstones. Default value: 0.2. |
unchecked_tombstone_compaction | True allows Cassandra to run tombstone compaction without pre-checking which tables are eligible for this operation. Even without this pre-check, Cassandra checks an SSTable to make sure it is safe to drop tombstones. Default value: false. |
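For example, a sketch that widens the first time window to two hours (the table name and value are illustrative):
ALTER TABLE cycling.cyclist_points WITH compaction = { 'class' : 'DateTieredCompactionStrategy', 'base_time_seconds' : 7200 };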
- TimeWindowCompactionStrategy (TWCS)
-
Compacts SSTables using a series of time windows or buckets. TWCS creates a new time window within each successive time period. During the active time window, TWCS compacts all SSTables flushed from memory into larger SSTables using STCS. At the end of the time period, all of these SSTables are compacted into a single SSTable. Then the next time window starts and the process repeats. See TimeWindowCompactionStrategy in the Cassandra documentation for more details.
Table 5. TWCS subproperties
Subproperty | Description |
---|---|
compaction_window_unit | Time unit used to define the bucket size: milliseconds, seconds, hours, and so on. Default value: milliseconds. |
compaction_window_size | Units per bucket. |
log_all | True activates advanced logging for the entire cluster. Default value: false. |
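For example, a sketch of a time-series table that buckets SSTables into one-day windows (the cycling.sensor_data table is hypothetical):
CREATE TABLE cycling.sensor_data (
  sensor_id UUID,
  reading_time timestamp,
  value double,
  PRIMARY KEY (sensor_id, reading_time))
WITH compaction = { 'class' : 'TimeWindowCompactionStrategy',
  'compaction_window_unit' : 'DAYS',
  'compaction_window_size' : 1 };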
- LeveledCompactionStrategy (LCS)
-
Creates SSTables of a fixed, relatively small size (160 MB by default) that are grouped into levels. Within each level, SSTables are guaranteed to be non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the previous. Disk I/O is more uniform and predictable on higher than on lower levels as SSTables are continuously being compacted into progressively larger levels. At each level, row keys are merged into non-overlapping SSTables in the next level. See LeveledCompactionStrategy (LCS) in the Cassandra documentation.
Table 6. LCS subproperties
Subproperty | Default | Description |
---|---|---|
enabled | true | Enables background compaction. See Enabling and disabling background compaction below. |
log_all | false | True activates advanced logging for the entire cluster. |
sstable_size_in_mb | 160MB | The target size for SSTables that use the Leveled Compaction Strategy. Although SSTable sizes should be less than or equal to sstable_size_in_mb, it is possible that compaction may produce a larger SSTable. This occurs when data for a given partition key is exceptionally large; Cassandra does not split the data into two SSTables. |
tombstone_compaction_interval | 86400 (one day) | The minimum number of seconds after an SSTable is created before Cassandra considers the SSTable for tombstone compaction. Cassandra begins tombstone compaction when the SSTable's tombstone ratio exceeds the value of tombstone_threshold. |
tombstone_threshold | 0.2 | The ratio of garbage-collectable tombstones to all contained columns. If the ratio exceeds this limit, Cassandra starts compaction on that table alone, to purge the tombstones. |
unchecked_tombstone_compaction | false | True allows Cassandra to run tombstone compaction without pre-checking which tables are eligible for this operation. Even without this pre-check, Cassandra checks an SSTable to make sure it is safe to drop tombstones. |
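For example, a sketch that switches an existing table to leveled compaction with the default SSTable target size (the table name is illustrative):
ALTER TABLE cycling.cyclist_name WITH compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };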
Enabling and disabling background compaction
ALTER TABLE mytable
WITH COMPACTION = {
'class': 'SizeTieredCompactionStrategy',
'enabled': 'false' }
Disabling background compaction can be harmful: without it, Cassandra does not regain disk space, and may allow zombies to propagate. Although compaction uses I/O, it is better to leave it enabled in most cases.
Reading extended compaction logs
Set the log_all
subproperty to true
to
collect in-depth information about compaction activity on a node in a dedicated log file.
When extended
compaction is enabled, Cassandra creates a file named compaction-%d.log
(where %d
is a sequential number) in
home/logs.
type: enable
Lists SSTables that have been flushed previously.
{"type":"enable","keyspace":"test","table":"t","time":1470071098866,"strategies": [ {"strategyId":"0","type":"LeveledCompactionStrategy","tables":[],"repaired":true,"folders": ["/home/carl/oss/cassandra/bin/../data/data"]}, {"strategyId":"1","type":"LeveledCompactionStrategy","tables":[],"repaired":false,"folders": ["/home/carl/oss/cassandra/bin/../data/data"] } ] }
type: flush
Logs a flush event from a memtable to an SSTable on disk, including the CompactionStrategy for each table.
{"type":"flush","keyspace":"test","table":"t","time":1470083335639,"tables": [ {"strategyId":"1","table": {"generation":1,"version":"mb","size":106846362,"details": {"level":0,"min_token":"-9221834874718566760","max_token":"9221396997139245178"} } } ] }
type: compaction
Logs a compaction event.
{"type":"compaction","keyspace":"test","table":"t","time":1470083660267,"start":"1470083660188","end":"1470083660267","input": [ {"strategyId":"1","table": {"generation":1372,"version":"mb","size":1064979,"details": {"level":1,"min_token":"7199305267944662291","max_token":"7323434447996777057"} } } ],"output": [ {"strategyId":"1","table": {"generation":1404,"version":"mb","size":1064306,"details": {"level":2,"min_token":"7199305267944662291","max_token":"7323434447996777057"} } } ] }
type: pending
Lists the number of pending tasks for a compaction strategy.
{"type":"pending","keyspace":"test","table":"t","time":1470083447967,"strategyId":"1","pending":200}
Table keywords
- CLUSTERING ORDER BY ( column_name ASC | DESC )
-
Orders row storage to make use of the on-disk sorting of columns. Specifying order can make query results more efficient. Options are:
ASC: ascending (default order)
DESC: descending, reverse order
The following example shows a table definition that changes the clustering order to store the categories with the highest points first:
CREATE TABLE cycling.cyclist_category ( category text, points int, id UUID, lastname text, PRIMARY KEY (category, points)) WITH CLUSTERING ORDER BY (points DESC);
- COMPACT STORAGE
-
Use COMPACT STORAGE to store data in the legacy (Thrift) storage engine format to conserve disk space. Use compact storage for the category table:
CREATE TABLE cycling.cyclist_category ( category text, points int, id UUID, lastname text, PRIMARY KEY (category, points)) WITH CLUSTERING ORDER BY (points DESC) AND COMPACT STORAGE;
Important: In Cassandra 3.0 and later, the storage engine is much more efficient at storing data, and compact storage is not necessary.
- ID
-
If a table is accidentally dropped with DROP TABLE, use this option to recreate the table and run a commitlog replayer to retrieve the data.
CREATE TABLE users ( userid text PRIMARY KEY, emails set<text> ) WITH ID='5a1c395e-b41f-11e5-9f22-ba0be0483c18';
Configuring read repairs
Cassandra performs read repair whenever a read reveals inconsistencies among replicas. You can also configure Cassandra to perform read repair after a completely consistent read. Cassandra compares and coordinates all replicas, even those that were not accessed in the successful read. The dclocal_read_repair_chance and read_repair_chance set the probability that a consistent read of a table triggers a read repair. The first of these properties sets the probability for a read repair that is confined to the same datacenter as the coordinator node. The second property sets the probability for a read repair across all datacenters that contain matching replicas. This cross-datacenter operation is much more resource-intensive than the local operation.
Recommendations: if the table is for time series data, both properties can be set to 0 (zero). For other tables, the more performant strategy is to set dclocal_read_repair_chance to 0.1 and read_repair_chance to 0. If you want to use read_repair_chance, set this property to 0.1.
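For example, a sketch that applies the recommended settings to a non-time-series table (the table name is illustrative):
ALTER TABLE cycling.cyclist_name WITH dclocal_read_repair_chance = 0.1 AND read_repair_chance = 0;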