Data model and schema configuration checks
In most of the projects seen by the DataStax field team, data modeling is one of the major factors that define a project’s success. Systems that did data modeling right are scalable and usually have fewer problems. Systems where the data model was done incorrectly are usually unstable and can fail even with relatively small amounts of data. That is why the field team pays attention to the data model when performing a cluster review. If you need additional information on data modeling for Cassandra, see the Cassandra or the DataStax CQL data modeling documentation.
Data model checks
This section lists a general set of checks that the field team performs when analyzing an existing data model. (You can use them as well against data models that are in development.) The team gets the existing schema as a part of the diagnostic tarball generated by OpsCenter, or from diagnostic collection scripts. You can get it by executing cqlsh -e 'describe schema;' on one of the cluster nodes and outputting the results into a file, such as schema.cql. The schema.cql name is used throughout this topic. In diagnostic tarballs this file is located in driver/schema for each node. In addition to the schema, you can use the output of nodetool commands, which you execute on every node of the cluster (or retrieve from the diagnostic tarball), to cover situations when only some nodes are affected.
Keyspace replication settings
Check that all keyspaces have correct replication settings. Look for the following:
- Incorrect replication strategy: Using SimpleStrategy instead of NetworkTopologyStrategy when you have a cluster with multiple datacenters. With SimpleStrategy, replicas are not guaranteed to be placed correctly across datacenters. It is better to use NetworkTopologyStrategy even if you have only one datacenter, as it simplifies adding more datacenters in the future.
- Keyspaces are under-replicated inside datacenters or not replicated to all datacenters: Either situation is especially critical for system keyspaces such as system_auth. For example, if you lose replicas with data from system_auth, you or your applications may lose the ability to log in to your cluster.
- Keyspaces are over-replicated: Sometimes in large clusters, some keyspaces have a replication factor much higher than the usual 3. In some cases it is a valid number, such as 5. Much higher values usually increase latencies for both read and write operations, especially when using consistency levels such as QUORUM or LOCAL_QUORUM. If you want to additionally protect your data and make sure your cluster is available, consider adding new datacenters, backups, and so on.
- An even number is used for the replication factor (RF): Usually, an even number of replicas does not play well with consistency levels such as QUORUM or LOCAL_QUORUM, as it makes the cluster less resilient to failure. QUORUM is calculated as floor(N/2) + 1, where N is the number of replicas in the cluster. LOCAL_QUORUM is calculated with the same formula, but N is the number of replicas in a specific datacenter. For example, with RF=2 the number of replicas required for QUORUM is 2, so operations fail when one node is down. If you increase RF to 3, operations do not fail, because the number of replicas required for QUORUM remains 2. RF=3 means that one node going down does not affect operations, as shown in the following table:
| Replication factor | Number of nodes down without losing the ability to achieve consistency level |
|---|---|
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
| 5 | 2 |
| 6 | 2 |
| 7 | 3 |
To fix problems with replication, you can execute the `ALTER KEYSPACE` command manually, or use the adjust-keyspaces.sh script or something similar to perform these operations automatically. System keyspaces that use LocalStrategy or EverywhereStrategy must be left unchanged.
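For example, a minimal sketch of the manual approach (the datacenter names DC1 and DC2 and the replication factor of 3 are assumptions; adjust them to match your topology):

ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

After changing replication settings, run a repair on the affected keyspace so that existing data is replicated according to the new settings.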
Number of tables
A large number of tables in the Cassandra can directly impact the performance of the cluster. Typically, you should have no more than 200 actively used tables in a cluster. Having 500 actively used tables is regarded as a failure level even if the cluster works, as it is likely that there are inefficiencies and failures.
The problem arises because every table uses approximately 1 MB
of memory for metadata. For each table acted on, a memtable representation is allocated. Tables with large amounts of data also increase pressure on memory by storing more data for the bloom filter and other auxiliary data structures. Also, each keyspace causes additional overhead in JVM memory. All of these factors impact the performance of Cassandra. The following benchmarks show a significant drop in throughput that occurs with the increase in the number of tables:
To check how many tables and keyspaces are in the cluster:
grep 'CREATE TABLE' schema.cql |wc -l
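Similarly for keyspaces (assuming the schema dump follows the standard DESCRIBE output format):

grep 'CREATE KEYSPACE' schema.cql | wc -l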
Check table structure
Several things should be checked in table definitions to avoid an adverse impact on cluster operations.
Check structure of primary and partition keys
The structure of the primary key, especially the partition key, can have a major impact on the performance and stability of the cluster. When analyzing the table structure, take the following into account:
- When the primary key consists of only the partition key, there is a chance that the row size is too small. An insufficient row size can lead to inefficiencies when accessing or storing data, because the metadata associated with the partition might be larger than the row data itself.
- Check the data types of the partition key when it consists of only one column. Some data types by definition have low cardinality, such as boolean or tinyint, which can lead to uneven distribution of data between nodes. For example, if you define a partition key column with the boolean type, you will have only two partitions in the table. You may also get large partitions when there are many rows inside a partition.
  Another potential problem can arise from using the date type for the partition key column. In many cases, when the date type is used to organize written data by days, it leads to hotspots because applications write and read large amounts of data (hundreds or thousands of requests per second) for a specific day. See the bucketing sketch after this list.
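A common mitigation for per-day hotspots is to add an artificial bucket to the partition key so that a single day is spread across several partitions. A minimal sketch, assuming a hypothetical table of sensor readings and a fixed number of buckets chosen by the application:

-- Hypothetical table: data for a day is split across 16 buckets.
-- The application computes bucket = hash(sensor_id) % 16 on write and
-- queries all buckets in parallel on read.
CREATE TABLE IF NOT EXISTS test.readings_by_day (
    day date,
    bucket int,
    ts timestamp,
    sensor_id int,
    value double,
    PRIMARY KEY ((day, bucket), ts, sensor_id)
);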
Check number of columns in a table
It is not recommended to define hundreds or thousands of columns for a single table for the following reasons:
- The recommended maximum number of cells per partition and columns per row is easy to exceed. See Number of cells per partition.
- There is overhead for storing individual values: every cell has a timestamp associated with it, which adds at least 8 bytes. If a TTL is present, that adds additional overhead.
- It can affect the performance of range scans.
If a table has too many columns, start by analyzing the data access patterns. Solutions include:
- Combining multiple columns that are read together into a frozen user-defined type (UDT), where all data in the UDT is written as one cell. See the sketch after this list.
- Performing serialization and deserialization of data inside the application.
- Storing the data as a blob.
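As an illustration of the first option, a minimal sketch (the type, table, and field names are hypothetical):

-- Columns that are always read together are grouped into a frozen UDT,
-- which is stored and rewritten as a single cell.
CREATE TYPE IF NOT EXISTS test.address (
    street text,
    city text,
    zip text
);

CREATE TABLE IF NOT EXISTS test.users_by_id (
    id int PRIMARY KEY,
    name text,
    home_address frozen<address>
);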
Check applicability of used data types
Cassandra provides a rich set of data types that can be used for table columns. Because so many data types exist, users often choose an incorrect data type. An example of incorrect usage is using the text type for storing timestamps. Another example is using numeric types with value ranges much larger than required, such as using a bigint (long) type for columns when an int type is sufficient. Such inappropriate usage leads to the following problems:
- Unnecessary use of disk space. For example, a timestamp encoded as text in ISO-8601 notation occupies 28 bytes, while the timestamp type uses only 8 bytes. Similarly for numeric types, the bigint type occupies 8 bytes, while int uses only 4. The situation is even worse when decimal and varint types are used, as they do not have a fixed size and are sized depending on the actual value.
- The inability to search data correctly when using DSE Search. For example, if you use the text data type for storage of numbers or timestamps, you may not be able to perform range queries.
- The inability to perform correct sorting of data when the data are incorrectly encoded.
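For example, a minimal sketch of a hypothetical table that prefers native types over text encodings:

-- Native types keep storage compact and allow range queries and correct sorting.
CREATE TABLE IF NOT EXISTS test.events (
    id int PRIMARY KEY,
    created_at timestamp,   -- 8 bytes per value instead of a 28-byte ISO-8601 string
    attempts int            -- 4 bytes; use bigint only if the value range requires it
);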
Check use of collection types
Cassandra provides several data types to store multiple values inside a single column: list, set, and map. Each of these types requires that the type of the elements in a collection is defined when the table is created. Collection types are:
- frozen: The whole content of the collection is serialized and stored as one value. This type solves some of the problems described below, but does not allow updating individual elements of the collection.
- non-frozen: The collection is stored as a set of individual elements in separate cells.
Collection types are handy for development. When using them, take these factors into account:
- There is additional overhead for keeping metadata of individual elements when using non-frozen collections. Such overhead includes a write timestamp and optional TTL. For the list type, there is additional overhead for storing the index of the elements, for which a UUID is used (16 bytes per element).
- When an insert or full update of a non-frozen collection occurs, such as replacing the value of the column with another value (UPDATE table SET field = new_value …), Cassandra inserts a tombstone marker to prevent possible overlap with previous data, even if no data existed previously. A large number of tombstones can significantly affect read performance.
- There is an upper bound on the number of elements in a collection. For Cassandra 3.0.1, 3.1 and later: 2 billion. For earlier versions: 65,535. Higher numbers of elements can result in either performance problems when accessing data in non-frozen collections or, when using frozen collections, exceeding the maximum mutation size limit. In addition, when reading a column with a collection type, its whole content is returned, and the transfer of a large amount of data may harm performance.
- For non-frozen collections where individual elements were updated after an insert, performance can degrade, as the data could be spread between multiple SSTables that must be read to reconstruct the actual column value.
- Because read repair does not propagate tombstones, the content of collections where elements were deleted can be affected. This happens because the custom tombstone used as a delete marker is not propagated.
The problems listed above can be mitigated by following several rules:
- Use frozen collections until there is a need to update individual elements.
- Keep the number of elements in all collection types on the order of dozens, with a maximum of several hundred elements. The content of a collection column is read as a whole, so if there are too many elements, read problems occur, because the maximum possible page size is 256 MB.
  When a query returns many rows, it is inefficient to return them as a single response message. Instead, the driver breaks the results into pages that are returned as needed. Applications can control how many rows are included in a single page, but there is a maximum page size defined by the native protocol.
- When you know that no previous data exists, you can use the append operation on set and map columns to prevent the creation of tombstones when inserting data into a set or map (or when performing a full update of a set or map). For example, given the table:
CREATE TABLE test.m1 ( id int PRIMARY KEY, m map<int, text> );
instead of using:
INSERT INTO test.m1(id, m) VALUES (1, {1:'t1', 2:'t2'});
or
UPDATE test.m1 SET m = {1:'t1', 2:'t2'} WHERE id = 1;
which generate tombstones, execute:
UPDATE test.m1 SET m = m + {1:'t1', 2:'t2'} WHERE id = 1;
which has the same result, but without tombstone generation.
If there is only one column with a collection type in the table, you can model it as an additional clustering column. For example, this table with a map column:
CREATE TABLE test.m1 ( id int PRIMARY KEY, m map<int, text> );
can be created without the map column (use the same approach for sets and lists):
CREATE TABLE test.m1 ( id int, m_key int, m_value text, PRIMARY KEY(id, m_key) );
You can select either all values for a specific partition by omitting the condition on m_key, or only a specific element by providing the full primary key. This is more advantageous than a column with a collection type, which is returned as a whole.
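For example, a quick sketch of both access patterns against the second table definition:

-- All entries for partition 1:
SELECT m_key, m_value FROM test.m1 WHERE id = 1;
-- Only a single entry, addressed by the full primary key:
SELECT m_value FROM test.m1 WHERE id = 1 AND m_key = 2;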
Check the use of list type
Everything described in the previous section also applies to the list type. However, the list type has additional limitations:
- Setting and removing an element by position, and removing occurrences of particular values, incur an internal read-before-write.
- Prepend and append operations are not idempotent. In case of failure, you cannot simply retry the operation because its completion state is unknown. Retrying can result in duplicate elements; not retrying may lose data. For more information, see the List fields documentation.
If you do not need to keep elements in a specific order or have elements with duplicate values, use the set type instead of the list type. If you still need to use a column with the list type, consider using a frozen version of it.
Check the use of user-defined types
Cassandra allows creation of user-defined types (UDTs), enabling you to group related information together and to use the group as a single entity. From a data model analysis point of view, you can apply the same rules as for collections:
- Use frozen UDTs where possible.
- For non-frozen UDTs, do not specify too many fields.
However, UDTs have problems related to serialization and deserialization of the UDT. Another problem arises from a schema evolution standpoint: while it is possible to add fields to a UDT, it is impossible to remove them. Therefore, use UDTs only in limited situations and only when strongly necessary. Otherwise, prefer to define the data as regular columns in the table. Another alternative is to perform serialization and deserialization of UDT data inside the application and store the data as a blob.
Check the use of deeply-nested UDTs and collections of UDTs
Although UDTs can be nested inside other UDTs or used as elements in collections, you need to be very careful with them. If too many elements exist in a collection or there are too many nested UDTs, the maximum mutation size can be reached, causing operations to fail.
Check the use of tuple type
CQL provides a tuple data type that can group several elements of different data types into a single entity. Limitations for tuples are:
- Their value is always frozen, meaning the column is rewritten for each update.
- You have to access elements by position, which makes it harder to develop code because you need to remember which type is used at each position and what each position means.
Because of these limitations, DataStax recommends not using the tuple data type and using UDTs instead.
Check the use of counter data type
The counter data type allows you to perform increment and decrement operations, which is useful for some applications. Starting with Cassandra 2.1, the counter implementation is more robust, but limitations remain:
- The value of a counter may not be precise when nodes go down, mutations are dropped, and in similar situations, because counter operations are not idempotent and cannot be retried. Retrying may result in an overcount; not retrying, an undercount.
- Tables with counters may contain only counter columns as regular (non-primary key) columns; it is not possible to mix them with other data types.
Check the use of the blob data type
Cassandra supports storing binary data in the database by providing a blob type. When using blobs, make sure that you do not store objects larger than a couple of hundred kilobytes in Cassandra; otherwise, problems can occur when fetching data from the database. For example, a query could fail when the size of the fetched page is larger than the limit set by the native protocol (256 MB).
Definition of the sorting order for clustering columns
When defining a table, you can define sorting direction for clustering columns. Applications can reverse the defined sorting direction when performing a query, but it is less efficient than reading the data in the same sorting direction as defined on the table level. DataStax recommends defining the correct sorting direction when creating a table. Also, when sorting is reversed during query time, it affects all columns and not just a single column, as Cassandra just reads data in the opposite direction.
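For example, a minimal sketch of a hypothetical time-series table that is mostly read newest-first:

-- Rows within a partition are stored newest-first, so queries for the latest
-- events read in the table's natural order.
CREATE TABLE IF NOT EXISTS test.events_by_sensor (
    sensor_id int,
    ts timestamp,
    value double,
    PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);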
Number of cells per partition
Cassandra documentation often uses the term cell to describe a stored value of a regular column (a non-primary key column). Besides the actual value, each cell has associated metadata, such as a timestamp, optional TTL, and additional data for complex cells. Collection and user-defined types are even more complex.
Cassandra has a hard limit of 2 billion (2^31) cells per partition. To ensure predictable performance for read operations, DataStax recommends limiting the number of cells in a partition so that the partition size stays under 100 MB.
You can check the number of cells per partition using the nodetool tablehistograms command (cfhistograms in older versions of Cassandra). Newer versions of Cassandra and DSE can output data for all tables in the system, while older versions require specifying the keyspace and table names.
Look at the Cell Count column of the output, and check the values in the 99% and Max rows. When values are greater than 100,000, consider changing your data model. Such a value can be a sign of the presence of large partitions (described in the next section), too many columns, or too many elements in non-frozen collections.
nodetool tablehistograms test widerows
Output:
test/widerows histograms

| Percentile | SSTables | Write Latency (micros) | Read Latency (micros) | Partition Size (bytes) | Cell Count |
|---|---|---|---|---|---|
| 50% | 1.00 | 0.00 | 1310.72 | 545791 | 103 |
| 75% | 1.00 | 0.00 | 1310.72 | 545791 | 124 |
| 95% | 1.00 | 0.00 | 1310.72 | 545791 | 124 |
| 98% | 1.00 | 0.00 | 1310.72 | 545791 | 124 |
| 99% | 1.00 | 0.00 | 1310.72 | 545791 | 124 |
| Min | 1.00 | 0.00 | 1310.72 | 454827 | 87 |
| Max | 1.00 | 0.00 | 1572.86 | 545791 | 124 |
Large partitions
For Cassandra, it is recommended to keep the size of partitions under 100 MB. The presence of large partitions is usually a sign of an incorrect data model, triggered by these factors:
- Low cardinality of partition keys, when too few possible values exist for the partition key.
- Non-uniform spread of data between partitions. For example, if a customer ID is used as the partition key, applications of large customers write much more data than those of smaller customers. As a result, some nodes may have much more data than others. More data increases the load on these nodes because they handle more requests, require more compaction, and so on.
- Too many columns and rows in the table, especially when every row contains data for all or most columns.
- Storing large blobs or long texts in the table.
Large partitions have several adverse effects:
- They create additional load on Cassandra, such as allocation of additional memory to hold the partition index. Before Cassandra version 3.6, reading large partitions put much more pressure on the Java heap and often led to nodes crashing.
- Uneven data distribution between nodes can lead to hotspots when some nodes handle many more requests than others.
- Large partitions require transferring more data when the whole partition is read.
- Cassandra partition size can affect external systems, such as Spark, because a Cassandra partition is the minimal object mapped into a Spark partition. Any imbalance in Cassandra may lead to an imbalance when processing the data with Spark.
Finding information about the partitions
Use the following tools to find the size of partitions:
- Use the nodetool tablehistograms command (cfhistograms in older versions of Cassandra) to find the partition sizes for the 75th, 95th, 99th, and 100th percentiles. If you see a large difference between these values, it is likely that partition key values are spread non-uniformly. Similar information can be obtained from the sstablepartitions command.
- Information about the maximum partition size is available via nodetool tablestats (cfstats in older Cassandra versions). Check whether the values in the Compacted partition maximum bytes lines are larger than the recommended 100 MB.
- If there is a non-uniform spread of partition key values, you can identify the partition key values that have the largest number of rows using the DataStax Bulk Loader (dsbulk). The main advantage of dsbulk is that it works with the whole cluster. For example, to find the largest partitions in the test table:
  dsbulk count -k test -t widerows --log.verbosity 0 --stats.modes partitions
  '29' 106 51.71
  '96' 99 48.29
- You can use the sstablemetadata utility with the -s command-line parameter to identify the largest partitions in specific SSTables. The -s flag is available in Cassandra 4.0 and in DSE 6.x. For Cassandra 3.x, use the sstable-tools project (which was an inspiration for the sstablemetadata utility). One advantage of sstablemetadata is that it reports the largest partitions both by row count and by size in bytes. A disadvantage is that it works with individual SSTable files, while a partition could be split between several of them. For example:
  sstablemetadata -u -s path_to_file/mc-1-big-Data.db
Output:
SSTable: /Users/ott/var/dse-5.1/data/cassandra/data/datastax/vehicle-8311d7d14eb211e8b416e79c15bfa204/mc-1-big
Size: 58378400
Partitions: 800
Tombstones: 0
Cells: 2982161
WidestPartitions:
[266:20180425] 534
[202:20180425] 534
[232:20180425] 534
[187:20180425] 534
[228:20180425] 534
LargestPartitions:
[266:20180425] 134568 (131.4 kB)
[202:20180425] 134568 (131.4 kB)
[232:20180425] 134568 (131.4 kB)
[187:20180425] 134568 (131.4 kB)
[228:20180425] 134568 (131.4 kB) …
Check for low cardinality of partition key values either by looking at the Number of partitions (estimate) lines in the output of tablestats/cfstats, or by performing SELECT DISTINCT partition_key_list, count(*) FROM table and checking the values in the count column of the output.
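For example, a quick single-node check using nodetool (the table name comes from the earlier examples; the grep patterns assume the standard tablestats output labels):

nodetool tablestats test.widerows | grep -e 'Number of partitions' -e 'Compacted partition maximum bytes'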
For any of the problems described in this section, the only solution is to change the data model to select correct partition key and clustering columns. In some cases, you might be able to promote a clustering column to partition key or introduce an artificial bucketing column into a partition key.
Check compaction strategy
Prefer to use the default compaction strategy (SizeTieredCompactionStrategy, STCS) unless it causes issues or there is a clear gain with another strategy. More information on tuning compaction strategies can be found in a separate document.
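As a quick check of which strategies the schema uses, you can scan the schema dump (this grep assumes the standard DESCRIBE output format for table options):

grep -o "'class': '[^']*CompactionStrategy'" schema.cql | sort | uniq -c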
Check the use of auxiliary indexing
Cassandra and DSE provide a number of ways to perform search in the tables using columns that are not partition key columns, including:
- Secondary indexes
- Materialized views
- SASI indexes
- DSE Search indexes
DSE 6.8 includes a beta version of Storage-Attached Indexing (SAI).
By using these technologies, users may avoid having to denormalize data into additional tables. However, each of the listed implementations has limitations, which are described next.
Check the use of secondary indexes
The native secondary indexes in Cassandra act as reverse indexes and index the data on each node by creating an internal table that maps specific values of the column into a full primary key. The aim is to support data access based on a condition that would not be allowed with the structure of the key as defined in the base table.
Secondary indexes have some limitations:
- They can only index a single regular column per index.
- They do not support non-equality or range conditions.
- They can be heavily impacted by the cardinality of the indexed column. Low cardinality can lead to the creation of wide partitions, while high cardinality may create many very small partitions. In either case, performance may suffer, as described in Check table structure.
- They do not deal well with deletions. A high number of tombstones in a secondary index severely degrades its performance.
- Because secondary indexes index data locally to the content of the base table on each node, they cannot follow the normal placement by partition key. Consequently, if they are the only access criteria and there is no restriction on the partition key, they result in a scatter-gather request that queries all nodes in a datacenter, causing suboptimal performance.
For these reasons, secondary indexes must be used with great caution and designed out by denormalizing where possible. They might be a reasonable option when used to filter results limited to a single partition, on relatively small partitions, in tables where deletions are infrequent; in that case, the base partition is located by partition key and the index is applied afterwards. Even under these conditions, it is strongly recommended to thoroughly test queries that use secondary indexes with representative data and load.
Because Cassandra needs resources to build and maintain secondary indexes in a consistent state, DataStax recommends keeping their number relatively low and removing all unused secondary indexes. You can check the number of defined secondary indexes using:
grep 'CREATE INDEX' schema.cql|wc -l
Check the use of materialized views
Cassandra 3.0 and DSE 5.0 introduced support for materialized views to make it easier for client applications to denormalize data automatically and transparently. Materialized views are defined at schema level as views on a specified base table. These views contain the same information as the base table (or a subset of it) but with a different primary key, thus allowing different read patterns that are not possible with the original key structure.
When data is written to a table, all its materialized views are automatically updated accordingly so that they can be read at any point just as regular tables, according to their key. Note that the data is actually stored in each view, so the total footprint increases depending on the number of views and the information they contain.
When materialized views are used on the tables, take the following into account:
- Constraints on the structure of the primary key of a materialized view:
  - The key of the materialized view must contain all the columns that form the key of the base table, because the definition of uniqueness of the row must be the same.
  - The key of the materialized view can contain at most one regular column from the base table, on the condition that this column can never be null.
- Use of materialized views on a table puts additional work on the database, so plan resources accordingly.
- To build rows in the materialized view, Cassandra needs to read the corresponding row from the base table, which puts additional load on the IO system and increases latencies.
- If the materialized view has a different partition key, inserting data requires network communication with the other nodes that are responsible for the corresponding token range.
- In some cases, a materialized view can get out of sync with the base table. To fix this, rebuild the view using nodetool rebuild_view (a regular repair does not work for materialized views).
- When a materialized view is created on a table with existing data, the view needs to be built, which may take some time depending on the amount of data. Check the status of the build job with the nodetool viewbuildstatus command.
- In Cassandra, materialized views are still marked as experimental and not recommended for production use.
Although materialized views are handy from a development point of view, you can achieve better performance by creating one or more auxiliary tables and writing to all of them. Although this adds complexity to the application code, it also has benefits, such as more flexibility in defining the primary keys of the auxiliary tables and avoiding reading data from disk before writing an entry into a materialized view. See the sketch below.
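A minimal sketch of this manual denormalization, using hypothetical user tables; the application writes to both tables, here wrapped in a logged batch so that both writes are eventually applied together:

-- Base table plus a hypothetical lookup table maintained by the application.
CREATE TABLE IF NOT EXISTS test.users (
    id int PRIMARY KEY,
    email text,
    name text
);
CREATE TABLE IF NOT EXISTS test.users_by_email (
    email text PRIMARY KEY,
    id int,
    name text
);

BEGIN BATCH
  INSERT INTO test.users (id, email, name) VALUES (1, 'a@example.com', 'Alice');
  INSERT INTO test.users_by_email (email, id, name) VALUES ('a@example.com', 1, 'Alice');
APPLY BATCH;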
If materialized views are still required, keep their number low. Check the number of materialized views with the following command:
grep 'CREATE MATERIALIZED VIEW' schema.cql|wc -l
Check the use of SASI indexes
SASI (SSTable-attached Secondary Indexes) are an alternative implementation of secondary indexes aimed at allowing more flexibility in the querying criteria and improved performance. SASI were contributed to Apache Cassandra by an external contributor, but the initial implementation was developed for a very specific use case using an old version of Cassandra and a deprecated API. Moreover, it was tested to a very limited extent. Further testing and initial experimentation showed that SASI indexes are affected by numerous bugs. Despite fixes and improvements, they are still not considered production-ready as they are unreliable and they risk returning inconsistent results.
DataStax recommends avoiding SASI indexes for any query on a production system.
You can check usage of SASI indexes with the following command:
grep -e 'CREATE CUSTOM INDEX.*SASIIndex' schema.cql|wc -l
Check the use of DSE Search indexes
DSE includes its own implementation of a search index, called DSE Search, based on Apache Solr. DSE Search is transparently integrated with core Cassandra and allows indexing of stored data. It can perform different types of searches on arbitrary columns of a table or their combinations, such as full-text search, range search, and exact search.
Although it is very flexible, several things need to be taken into consideration:
There are limitations imposed by Apache Lucene and Solr, as well as by DSE Search itself. Many of these limitations are fixed in Storage-Attached Indexing (SAI), available in DSE 6.8.
- To build an object inside the DSE Search index, DSE needs to read the corresponding row from the base table, which increases IO.
- The number of tables with a DSE Search index. The recommended maximum number of indexes depends on the version of DSE and the hardware. See Capacity planning for DSE Search.
- The use and number of virtual nodes. Because DSE Search performs a scatter-gather query against all token ranges, the number of queries sent is directly proportional to the number of token ranges. For DSE Search, use the single-token architecture or keep the number of vnodes at 8 or fewer.
- The size of the search index. The recommendation is to keep a single index under 250 GB, with 500 GB for all search indexes combined.
- Which columns are indexed and their types. The size of the DSE Search index can be significantly larger than the size of the data in Cassandra, depending on the types of the indexed columns and the type of indexing. To keep the index size under control, index only the columns required for searching. Different indexing types can also affect performance, for example, text columns indexed for full-text search versus substring search.
- Some data types are not supported, such as counters and frozen maps.
- Static columns are not indexed.
- The number of objects (documents) inside an individual search index on a single node (maximum 2 billion documents). This limit can be reached quickly when the indexed table uses columns with user-defined types, as such columns are indexed as separate documents.
- DSE Search executes queries with consistency level ONE, which means there could be discrepancies in the returned results if the data is not repaired.
- Row-level access control is not supported.
You can check usage of DSE Search indexes with the following command:
grep -e 'CREATE CUSTOM INDEX.*Cql3SolrSecondaryIndex' schema.cql|wc -l
Access the schema and configuration of individual indexes using this command:
DESCRIBE ACTIVE SEARCH INDEX