Data model and schema configuration checks

General data modeling and schema configuration recommendations for DataStax Enterprise (DSE) and Apache Cassandra.

DISCLAIMER

This document gives general data modeling and schema configuration recommendations for DataStax Enterprise (DSE) and Apache Cassandra™. This document requires basic DSE/Cassandra knowledge. It doesn’t replace the official documentation.

In most of the projects seen by the DataStax field team, data modeling is one of the major factors that define a project's success. Systems that did data modeling right are scalable and usually have fewer problems. Systems where the data model was done incorrectly are usually unstable and can fail even with relatively small amounts of data. That’s why the field team pays attention to the data model when performing a cluster review. If you need more information on data modeling for Cassandra than is presented here, see the Cassandra or the DataStax CQL data modeling documentation.

Data model checks

This section lists a general set of checks that the field team performs when analyzing an existing data model. (You can use them as well against data models that are in development.) The team gets the existing schema as part of the diagnostic tarball generated by OpsCenter, or from diagnostic collection scripts. You can get it by executing cqlsh -e 'describe schema;' on one of the cluster nodes and outputting the results into a file, such as schema.cql. This name is used throughout this topic. In diagnostic tarballs, it’s located in driver/schema for each node. Besides the schema information, use nodetool commands executed on every node of the cluster (or retrieved from diagnostic tarball files), as there can be situations when only some nodes are affected.

When performing analysis of the data model, take into account the hard limits in Cassandra and DSE (depending on the version) related to CQL, as well as the recommendations in this topic.

Keyspace replication settings

Check that all keyspaces have correct replication settings. Look for the following:
Incorrect replication strategy
Using SimpleStrategy instead of NetworkTopologyStrategy when the cluster has multiple datacenters. In this case, replicas aren't guaranteed to be placed correctly between datacenters.
Tip: It’s better to use NetworkTopologyStrategy even if you have only one datacenter, as it simplifies things if you later decide to add more datacenters.
Keyspaces are under replicated inside datacenters or not replicated to all datacenters
This is especially critical for system keyspaces, such as, system_auth. For example, if you lose replicas with data from system_auth, you or your applications may lose the ability to login to your cluster.
Keyspaces are over replicated
Sometimes in big clusters, some keyspaces have a replication factor much higher than the usual 3. In some cases, it’s a valid number, such as 5. Much higher values usually increase latencies for both read and write operations, especially when using consistency levels such as QUORUM or LOCAL_QUORUM. If you want to additionally protect your data and make sure your cluster is available, consider adding new datacenters, backups, and so on.
Even number is used for replication factor (RF)
Usually, even numbers of replicas don’t play well with consistency levels such as QUORUM or LOCAL_QUORUM, as they make the cluster less resilient to failure. QUORUM is calculated as N/2+1, where N is the number of replicas in a cluster. LOCAL_QUORUM is calculated using the same formula, but N is the number of replicas in a specific datacenter.
For example, for RF=2, the number of replicas required for QUORUM is 2, so operations fail when one node is down. This doesn’t happen if you increase RF to 3, as the number of replicas required for QUORUM is still 2. This means that if one node goes down, it won’t affect operations, as shown in the following table:
Replication factor   Nodes that can be down while still achieving QUORUM in a cluster or LOCAL_QUORUM in one datacenter
2                    0
3                    1
4                    1
5                    2
6                    2
7                    3
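The table above follows directly from the quorum formula; a quick Python sketch:

```python
def quorum(replicas: int) -> int:
    # QUORUM / LOCAL_QUORUM requires floor(N/2) + 1 replicas,
    # where N is the replica count (RF) in the cluster or datacenter
    return replicas // 2 + 1

def tolerable_failures(replicas: int) -> int:
    # number of replicas that can be down while QUORUM is still achievable
    return replicas - quorum(replicas)

for rf in range(2, 8):
    print(rf, tolerable_failures(rf))
```

Note that going from RF=2 to RF=3 adds fault tolerance, while going from RF=3 to RF=4 does not; that is why odd replication factors are preferred.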
To fix replication problems, execute the ALTER KEYSPACE command manually, or use the adjust-keyspaces.sh script or something similar to perform these operations automatically. System keyspaces that use LocalStrategy or EverywhereStrategy must be left unchanged.

Number of tables

A large number of tables in Cassandra can directly impact the performance of the cluster. Typically, you should have no more than 200 actively used tables in a cluster. Having 500 actively used tables is considered a failure level even if the cluster works, as it's likely that there are inefficiencies and failures.

The problem arises because every table uses approximately 1 MB of memory for metadata. For each table acted on, a memtable representation is allocated. Tables with large amounts of data also increase pressure on memory by storing more data for the bloom filter and other auxiliary data structures. Also, each keyspace causes additional overhead in JVM memory. All of this together impacts the performance of Cassandra. Benchmarks show a significant drop in throughput as the number of tables increases.

To check how many tables and keyspaces are in the cluster:
grep 'CREATE TABLE' schema.cql | wc -l
grep 'CREATE KEYSPACE' schema.cql | wc -l

Check table structure

Several things should be checked in the table’s definitions that can impact cluster operations.

Check structure of primary and partition keys

The structure of the primary key, especially the partition key, can have a major impact on the performance and stability of the cluster. When analyzing the table structure, take the following into account:
  • When the primary key consists only of the partition key, rows may be too small. This can lead to inefficiencies when accessing or storing data, because the metadata associated with the partition may be bigger than the row itself.
  • Check the data types of the partition key when it consists of a single column. Some data types have low cardinality by definition, such as boolean or tinyint, which can lead to uneven distribution of data between nodes. For example, a partition key column of the boolean type produces only 2 partitions in the table. You may also get big partitions when there are many rows inside a partition.

    Another potential problem can arise from using the date type for a partition key column. When the date type is used to organize data by days, it often leads to hotspots because applications write and read large amounts of data (hundreds or thousands of requests per second) for a specific day.

Check number of columns in a table

It’s not recommended to define hundreds or thousands of columns for a single table because:
  • It is easy to exceed the recommended maximum number of cells per partition. See Number of cells per partition below.
  • There is overhead for storing individual values: every cell has a timestamp associated with it, which adds at least 8 bytes. If a TTL is present, it adds even more overhead.
  • It can affect the performance of range scans.
If a table has too many columns, start by analyzing the data access patterns. Solutions include:
  • Combining multiple columns that are read together into a frozen user defined type (UDT), where all data in the UDT is written as one cell.
  • Performing serialization and deserialization of data inside the application.
  • Storing the data as a blob.
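As a sketch of the last two options (Python with JSON serialization; the field names and format are illustrative assumptions, not a prescribed API), related columns can be packed in the application and written as a single cell:

```python
import json

def pack_attributes(attrs: dict) -> bytes:
    # Serialize related columns into one value; the database then stores
    # a single blob cell instead of hundreds of individual column cells.
    return json.dumps(attrs, sort_keys=True).encode("utf-8")

def unpack_attributes(blob: bytes) -> dict:
    # Reverse operation performed on read, in the application
    return json.loads(blob.decode("utf-8"))

payload = pack_attributes({"color": "red", "size": 42})
assert unpack_attributes(payload) == {"color": "red", "size": 42}
```

The trade-off is that the whole group is rewritten on every update, so this fits columns that are always read and written together.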

Check applicability of used data types

Cassandra provides a rich set of data types that can be used for table columns. Because so many data types exist, users often choose an incorrect one. Examples include using the text type for storing timestamps, and using numeric types whose value ranges are much bigger than required, such as using the bigint type for columns where int is more than enough. Such inappropriate usage leads to the following problems:
  • Unnecessary use of disk space. For example, a timestamp encoded as text using ISO-8601 notation occupies 28 bytes, while the timestamp type uses only 8 bytes. Similarly for numeric types: the bigint type occupies 8 bytes, while int uses only 4. The situation is even worse when decimal and varint types are used, as they don’t have a fixed size and are sized depending on the actual value.
  • If using DSE Search, you may not be able to search data correctly. For example, if using the text data type for storage of numbers or timestamps, you may not be able to perform range queries.
  • You may not be able to sort data correctly when the data is incorrectly encoded.
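The size difference is easy to verify with a quick sketch (Python for illustration; actual on-disk sizes in Cassandra also include per-cell metadata):

```python
import struct
from datetime import datetime, timezone

ts = datetime(2018, 4, 25, 12, 34, 56, 789000, tzinfo=timezone.utc)

# timestamp stored as ISO-8601 text, e.g. '2018-04-25T12:34:56.789+0000'
as_text = ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}" + "+0000"

# timestamp stored natively: an 8-byte signed integer (milliseconds since epoch)
as_binary = struct.pack(">q", int(ts.timestamp() * 1000))

print(len(as_text))    # 28 bytes as text
print(len(as_binary))  # 8 bytes as a native timestamp
```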

Check use of collection types

Cassandra provides several data types to store multiple values inside a single column: list, set, and map. Each of these types requires that the type of the elements in a collection be defined when the table is created. Collections come in two forms:
frozen
The whole content of the collection is serialized and stored as one value. This form avoids some of the problems described below, but doesn’t allow updating individual elements of the collection.
non-frozen
The collection is stored as a set of individual elements in separate cells.
Collection types are handy for development. When using them, take these factors into account:
  • Additional overhead for keeping metadata of individual elements when using non-frozen collections. This includes a write timestamp and optional TTL. For list type, there is an additional overhead to store the index of the elements for which the UUID is used (16 bytes per element).
  • When an insert or full update of a non-frozen collection occurs, such as replacing the value of the column with another value (UPDATE table SET field = new_value …), Cassandra inserts a tombstone marker to prevent possible overlap with previous data, even if no data existed before. A large number of tombstones can significantly affect read performance.
  • There is an upper bound on the number of elements in a collection: 2 billion for Cassandra 3.0.1/3.1 and later, and 65,535 for earlier versions. Higher numbers of elements can cause performance problems when accessing data in non-frozen collections, or exceed the maximum mutation size limit when using frozen collections. In addition, when reading a column with a collection type, its whole content is returned, and transferring a large amount of data might harm performance.
  • For non-frozen collections where individual elements were updated after insert, performance can degrade as data could be spread between multiple SSTables that need to be read to reconstruct actual column value.
  • Because read repair doesn’t propagate tombstones, the content of collections where elements were deleted can be affected. This happens because the custom tombstone used as a delete marker isn’t propagated.
The problems listed above can be mitigated by following several rules:
  • Use frozen collections until there is a necessity to update individual elements.
  • Keep the number of elements in all collection types on the order of dozens, with a maximum of several hundred elements. The content of a collection column is read as a whole, so too many elements cause read problems, because the maximum possible size of a page is 256 MB.
    Note: When a query returns many rows, it is inefficient to return them as a single response message. Instead, the driver breaks the results into pages which get returned as needed. Applications can control how many rows are included in a single page, but there is a maximal size of the page defined by the native protocol.
  • When you know that no previous data exists, you can prevent the creation of tombstones when inserting data into a set or map (or performing a full update of a set or map) by using the append operation on the column. For example, for the table:
    CREATE TABLE test.m1 (
        id int PRIMARY KEY,
        m map<int, text>
    );
    instead of using:
    INSERT INTO test.m1(id, m) VALUES (1, {1:'t1', 2:'t2'});
    or
    UPDATE test.m1 SET m = {1:'t1', 2:'t2'} WHERE id = 1;
    which generate tombstones, execute:
    UPDATE test.m1 SET m = m + {1:'t1', 2:'t2'} WHERE id = 1;

    which has the same result, but without tombstone generation.

    If there is only one column with a collection type in the table, you can model it as an additional clustering column instead. For example, this table:

    CREATE TABLE test.m1 (
        id int PRIMARY KEY,
        m map<int, text>
    );
    can be created without the map column (the same approach works for sets and lists):
    CREATE TABLE test.m1 (
        id int,
        m_key int,
        m_value text,
        PRIMARY KEY(id, m_key)
    );

    You can select either all values for a specific partition by omitting the condition on m_key, or only a specific element by providing the full primary key. This is a big advantage over a column with a collection type, which is always returned as a whole.

Check use of list type

Everything described in the previous section applies to list type as well. However, list type has additional limitations:
  • Setting and removing an element by position and removing occurrences of particular values incur an internal read-before-write.
  • Prepend or append operations aren’t idempotent.

    In case of failure, you can't simply retry the operation because it isn't known whether it completed or not. Retrying can result in duplicate elements; not retrying may lose data. For more information, see the List fields documentation.

If you don’t need to keep elements in a specific order or to have elements with duplicate values, use the set type instead of the list type. If you still need a column with the list type, consider using the frozen version of it.

Check use of user defined types

Cassandra allows the creation of user-defined types (UDTs). A UDT groups related information together so it can be used as a single entity. From a data model analysis point of view, you can apply the same rules as for collections:
  • Use frozen UDTs where possible.
  • For non-frozen UDTs don’t specify too many fields.
But UDTs still have problems related to the serialization and deserialization of the UDT. Another problem arises from a schema evolution standpoint: while it’s possible to add fields to a UDT, it’s impossible to remove them. This means UDTs should be used only in limited situations, when strongly necessary. Otherwise, prefer to define the data as regular columns in the table. Another alternative is to perform serialization and deserialization of UDT data inside the application and store the data as a blob.

Check the use of deeply nested UDTs and collections of UDTs

Although UDTs can be nested inside other UDTs or used as elements in collections, you need to be very careful. If too many elements exist in a collection, or there are too many nested UDTs, the maximum mutation size is reached, and operations fail.

Check use of tuple type

CQL provides a tuple data type that can group several elements of different data types into a single entity. Limitations for this type are:
  • Its value is always frozen, which means the column is rewritten in full for each update.
  • You have to access elements by position, which makes it harder to develop code because you need to remember which type is used at which position and what that position means.

Because of these limitations, DataStax recommends avoiding this data type and using UDTs instead.

Check the use of counter data type

The counter data type allows you to perform increment and decrement operations, which is useful for some applications. Starting with Cassandra 2.1, the counter implementation is more robust, but limitations remain:
  • The value of counters may not be precise when nodes go down, mutations are dropped, and so on, because counter operations aren’t idempotent and can’t be retried: retrying may result in an overcount; not retrying, in an undercount.
  • In a table with counters, all regular (non-primary-key) columns must be of the counter type; counters can’t be mixed with other data types.

Check use of the blob data type

Cassandra supports storing binary data in the database through the blob type. When using blobs, make sure that you don’t store objects larger than a couple of hundred kilobytes in Cassandra; otherwise, problems can happen when fetching data from the database. For example, a query can fail when the size of the fetched page is bigger than the limit set by the native protocol (256 MB).
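If larger objects must be stored anyway, a common workaround is to split them into chunks on the application side so that no single cell grows too large. A minimal sketch (the chunk size and the implied key layout are illustrative assumptions):

```python
CHUNK_SIZE = 512 * 1024  # keep each stored value well under the page-size limit

def split_blob(data: bytes, chunk_size: int = CHUNK_SIZE):
    # Each (chunk_id, chunk) pair would be written as its own clustering row,
    # e.g. PRIMARY KEY (object_id, chunk_id), so no single cell is too big.
    return [(i, data[off:off + chunk_size])
            for i, off in enumerate(range(0, len(data), chunk_size))]

chunks = split_blob(b"x" * (3 * CHUNK_SIZE + 10))
print(len(chunks))         # 4 chunks
print(len(chunks[-1][1]))  # 10 bytes in the last chunk
```

Reassembly on read is a simple concatenation of the chunks ordered by chunk_id.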

Definition of the sorting order for clustering columns

When defining a table, you can define the sorting direction for clustering columns. Applications can reverse the defined sorting direction when performing a query, but this is less efficient than reading the data in the sorting direction defined at the table level. DataStax recommends defining the correct sorting direction when creating a table. Also, when sorting is reversed at query time, it affects all columns, not just a single column; Cassandra simply reads the data in the opposite direction.

Number of cells per partition

Cassandra documentation often uses the term cell to describe a stored value of a regular column (a non-primary-key column). Besides the actual value, each cell has associated metadata, such as a timestamp, an optional TTL, and additional data for complex cells. Collection and user-defined types are even more complex.

Cassandra has a hard limit of 2 billion (2^31) cells per partition. To ensure predictable performance for read operations, DataStax recommends limiting the number of cells in partitions so that the partition size stays under 100 MB.
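As a rough back-of-the-envelope check (a sketch; it ignores null values and collection elements, each of which would count as extra cells), the cell count can be estimated from the row count and the number of regular columns:

```python
def cells_per_partition(rows: int, regular_columns: int) -> int:
    # Each non-null regular (non-primary-key) column value in a row is one cell
    return rows * regular_columns

# e.g. 10,000 rows per partition, 12 regular columns:
print(cells_per_partition(10_000, 12))  # 120000 cells, above the 100,000 guideline
```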

You can check the number of cells per partition using the nodetool tablehistograms command (cfhistograms in older versions of Cassandra). Newer versions of Cassandra and DSE can output data for all tables in the system, while older versions require specification of the keyspace and table names.

Look at the Cell Count column of the output, and check the values in the 99% and Max rows. Values greater than 100,000 suggest changing your data model; they can be a sign of big partitions (described in the next section), too many columns, or too many elements in non-frozen collections.
nodetool tablehistograms test widerows
test/widerows histograms
Percentile     SSTables     Write Latency     Read Latency     Partition Size     Cell Count
                            (micros)          (micros)         (bytes)
50%           1.00          0.00              1310.72          545791             103
75%           1.00          0.00              1310.72          545791             124
95%           1.00          0.00              1310.72          545791             124
98%           1.00          0.00              1310.72          545791             124
99%           1.00          0.00              1310.72          545791             124
Min           1.00          0.00              1310.72          454827             87
Max           1.00          0.00              1572.86          545791             124

Big partitions

For Cassandra, it’s recommended to keep the size of partitions under 100 MB. The presence of big partitions is usually a sign of an incorrect data model, triggered by these factors:

  • Low cardinality of partition keys, when too few possible values exist for the partition key.
  • Non-uniform spread of data between partitions.

    For example, if a customer ID is used as a partition key, applications of large customers write much more data than those of smaller customers. As a result, some nodes may have much more data than others. More data increases the load on these nodes because they handle more requests, require more compaction, and so on.

  • Too many columns and rows in the table, especially when every row contains data for all or most columns.
  • Storing big blobs or long texts in the table.
Big partitions cause additional problems:
  • Additional load on Cassandra, such as the allocation of additional memory to hold the partition index.
    Note: Before Cassandra version 3.6, reading big partitions put much more pressure on the Java heap and often led to node crashes.
  • Uneven data distribution between nodes, which can lead to hotspots when some nodes handle many more requests than others.
  • More data to transfer when reading a whole partition.
  • Effects on external systems, such as Spark, because a Cassandra partition is the minimal object mapped into a Spark partition; any imbalance in Cassandra may lead to imbalance when processing the data with Spark.

Finding information about the partitions

Use the following tools to find the size of the partition:
  • Use the nodetool tablehistograms command (cfhistograms in older versions of Cassandra) to find the partition sizes at the 75th, 95th, 99th, and 100th percentiles. If you see a big difference between these values, it's likely you have a non-uniform spread of partition key values. Similar information can be obtained from the sstablepartitions command.
  • Information about the maximum partition size is available via nodetool tablestats (cfstats in older Cassandra versions). Check whether the values in the Compacted partition maximum bytes lines are bigger than the recommended 100 MB.
  • If there is a non-uniform spread of partition key values, you can identify the partition key values that have the biggest number of rows using the DataStax Bulk Loader (dsbulk). The main advantage of dsbulk is that it works with the whole cluster. For example, to find the largest partitions in the widerows table of the test keyspace:
    dsbulk count -k test -t widerows --log.verbosity 0 --stats.mode partitions
    '29' 106 51.71
    '96' 99 48.29
  • You can use the sstablemetadata utility with the -s command line parameter to identify the largest partitions in specific SSTables. The -s flag is available in Cassandra 4.0 and in DSE 6.x. For Cassandra 3.x, use the sstable-tools project (which was an inspiration for the sstablemetadata utility.) One advantage of sstablemetadata is that it provides information about the biggest partitions as both row count and size in bytes. A disadvantage is that it works with individual SSTable files, and partition could be split between them. For example:
    sstablemetadata -u -s path_to_file/mc-1-big-Data.db
    SSTable: /Users/ott/var/dse-5.1/data/cassandra/data/datastax/vehicle-8311d7d14eb211e8b416e79c15bfa204/mc-1-big
    Size: 58378400
    Partitions: 800
    Tombstones: 0
    Cells: 2982161
    WidestPartitions:
      [266:20180425] 534
      [202:20180425] 534
      [232:20180425] 534
      [187:20180425] 534
      [228:20180425] 534
    LargestPartitions:
      [266:20180425] 134568 (131.4 kB)
      [202:20180425] 134568 (131.4 kB)
      [232:20180425] 134568 (131.4 kB)
      [187:20180425] 134568 (131.4 kB)
      [228:20180425] 134568 (131.4 kB)
    ...

    Check for low cardinality of partition key values either by looking at the Number of partitions (estimate) line in the output of tablestats/cfstats, or by performing SELECT DISTINCT partition_key_list, count(*) FROM table and checking the value in the count column of the output.

For any of the problems described in this section, the only solution is to change the data model to select correct partition key and clustering columns. In some cases, you might be able to promote a clustering column to be part of the partition key, or introduce an artificial bucketing column into the partition key.
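The bucketing technique mentioned above can be sketched as follows (the bucket count, key names, and use of CRC32 are illustrative assumptions):

```python
import zlib

BUCKETS = 16  # bucket count is an assumption; size it for your data volume

def partition_key(customer_id: str, day: str) -> tuple:
    # A deterministic bucket spreads one hot logical partition (e.g. one day
    # of data) across BUCKETS physical partitions, so the table can use a
    # composite partition key such as PRIMARY KEY ((day, bucket), ...).
    # CRC32 is used because it is stable across processes, unlike Python's
    # built-in hash().
    bucket = zlib.crc32(customer_id.encode("utf-8")) % BUCKETS
    return (day, bucket)

k1 = partition_key("customer-42", "2018-04-25")
k2 = partition_key("customer-42", "2018-04-25")
assert k1 == k2  # the same logical row always maps to the same partition
```

The cost of bucketing is that reading a whole logical partition now requires querying all BUCKETS physical partitions.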

Check compaction strategy

Prefer to use the default compaction strategy (SizeTieredCompactionStrategy, STCS) unless it causes issues or there is a clear gain with another strategy. More information on tuning compaction strategies can be found in a separate document.

Check use of auxiliary indexing

Cassandra and DSE provide a number of ways to search tables using columns that aren’t part of the partition key, including:

  • Secondary indexes
  • Materialized views
  • SASI indexes
  • DSE Search indexes
Note: DSE 6.8 includes a beta version of Storage-Attached Indexing (SAI).

With these technologies, users may avoid denormalizing data into additional tables. But each of the listed implementations has limitations, which are described below.

Check use of secondary indexes

The native secondary indexes in Cassandra act as reverse indexes and index the data on each node by creating an internal table that maps specific values of the column into a full primary key. The aim is to support data access based on a condition that would not be allowed with the structure of the key as defined in the base table.

Secondary indexes have some limitations:
  • They can only index a single regular column per index.
  • They do not support non-equality or range conditions.
  • They can be heavily impacted by the cardinality of the indexed column. Low cardinality can lead to the creation of wide partitions; high cardinality might create many very small partitions. In either case, performance may suffer, as described above.
  • They do not deal with deletions well. A high number of tombstones in a secondary index severely degrades its performance.
  • Because secondary indexes index data locally on each node relative to the content of the base table, they cannot follow normal placement by partition key. Consequently, if they are the only access criteria, without a restriction on the partition key, the query becomes a scatter-gather request involving all nodes in a datacenter, which causes suboptimal performance.

For these reasons, secondary indexes must be used with great caution and designed out by denormalizing where possible. They might be a reasonable option when used to filter out results limited to a single partition on relatively small partitions in tables where deletions are infrequent and the base partition is located on the node and the index can be applied afterwards. Even under these conditions, it is strongly recommended to thoroughly test queries that use secondary indexes with representative data and load.

Because Cassandra needs resources to build and maintain secondary indexes in a consistent state, DataStax recommends keeping their number relatively low and removing all unused secondary indexes. You can check the number of defined secondary indexes using:
grep 'CREATE INDEX' schema.cql|wc -l

Check use of materialized views

Cassandra 3.0 and DSE 5.0 introduced support for Materialized Views to make it easier for client applications to denormalize data automatically and transparently. Materialized views are defined at schema level as views on a specified base table. These views contain the same information of the base table (or a subset of it) but with a different primary key, thus allowing different read patterns that are not possible with the original key structure.

When data is written to a table, all its materialized views are automatically updated accordingly so that they can be read at any point just as regular tables, according to their key. Note that the data is actually stored in each view, so the total footprint increases depending on the number of views and the information they contain.

When materialized views are used on the tables, take the following into account:
  • Constraints on the structure of primary key for materialized view:
    • The key of the materialized view must contain all the columns that form the key of the base table. This is because the definition of uniqueness of the row must be the same.
    • The key of the materialized view can contain at most one regular column from the base table on condition that this column can never be null.
  • Use of materialized views on a table puts additional load on the database, so plan resources accordingly.
  • To build rows in the materialized view, Cassandra needs to read the corresponding row from the base table, which puts additional load on the IO subsystem and increases latencies.
  • If the materialized view has a different partition key, inserting data requires network communication with the other nodes that are responsible for the corresponding token range.
  • In some cases, a materialized view can go out of sync with the base table. If this happens, rebuild it using nodetool rebuild_view (regular repair doesn’t work for materialized views).
  • When a materialized view is created on a table with existing data, the view needs to be built, which may take time depending on the amount of data. Check the status of the build job with the nodetool viewbuildstatus command.
  • In Cassandra, materialized views are still marked as experimental and not recommended for production use.

Although materialized views are handy from a development point of view, you can achieve better performance by creating one or more auxiliary tables and writing to all of them. Although this adds complexity to the application code, it also has benefits, such as more flexibility in defining the primary key of the auxiliary tables and avoiding reading data from disk before writing an entry into a materialized view.

If materialized views are still required, keep their number low. Check the number of materialized views with following command:
grep 'CREATE MATERIALIZED VIEW' schema.cql|wc -l

Check use of SASI indexes

SASI (SSTable-attached Secondary Indexes) are an alternative implementation of secondary indexes aimed at allowing more flexibility in the querying criteria and improved performance. SASI were contributed to Apache Cassandra by an external contributor, but the initial implementation was developed for a very specific use case using an old version of Cassandra and a deprecated API. Moreover, it was tested to a very limited extent. Further testing and initial experimentation showed that SASI indexes are affected by numerous bugs. Despite fixes and improvements, they are still not considered production-ready as they are unreliable and risk returning inconsistent results.

DataStax recommends avoiding SASI indexes for any query on a production system.

You can check usage of SASI indexes with following command:
grep -e 'CREATE CUSTOM INDEX.*SASIIndex' schema.cql|wc -l

Check use of DSE Search indexes

DSE includes its own search index implementation, DSE Search, based on Apache Solr. This implementation is transparently integrated with core Cassandra and allows indexing of stored data. It can perform different types of searches on arbitrary columns of a table or combinations of them, such as full-text search, range search, and exact search.

Although it is very flexible, several things need to be taken into consideration:
Note: There are limitations imposed by Apache Lucene and Solr, as well as by DSE Search itself. Many of these limitations are fixed by Storage-Attached Indexing (SAI), available in DSE 6.8.
  • To build an object inside the DSE Search index, DSE needs to read the corresponding row from the base table, which increases IO.
  • The number of tables with a DSE Search index.

    The recommended maximum number of indexes depends on the version of the DSE and hardware. See Capacity planning for DSE Search.

  • Use and number of virtual nodes.

    Because DSE Search performs a scatter-gather query against all token ranges, the number of queries sent is directly proportional to the number of token ranges. For DSE Search, use the single-token architecture or keep the number of vnodes at 8 or fewer.

  • Size of the search index.

    The recommendation is to keep the size of each individual index under the 250 GB limit, with 500 GB for all search indexes combined.

  • What columns are indexed and their types.

    The size of the DSE Search index can be significantly bigger than the size of the data in Cassandra, depending on the types of the indexed columns and the type of indexing. To keep the index size under control, index only the columns required for searching. Different indexing types can also affect performance; for example, text columns indexed for full-text search versus substring search.

  • Some data types are not supported, such as counters and frozen maps.
  • Static columns are not indexed.
  • Number of objects (documents) inside the individual search index on the single node (maximum 2 billion documents).

    This limit can be reached quickly when the indexed table uses columns with user-defined types, as such columns are indexed as separate documents.

  • DSE Search executes queries with consistency level ONE. This means there can be discrepancies in returned results if the data isn’t repaired.
  • Row-level access control isn’t supported.
You can check usage of DSE Search index with following command:
grep -e 'CREATE CUSTOM INDEX.*Cql3SolrSecondaryIndex' schema.cql|wc -l

Access schema and configuration of individual indexes using the DESCRIBE ACTIVE SEARCH INDEX command.