Calculating user data size

Accounting for storage overhead in determining user data size.

The size of your raw data may be larger or smaller once it is loaded into Cassandra, due to storage overhead. How much larger or smaller depends on the characteristics of your data and tables and on how well the data compresses. The following calculations account for data persisted to disk, not for data stored in memory. A worked example that combines all of these calculations appears after the procedure.

Procedure

  • Determine column overhead:
    regular_total_column_size = column_name_size + column_value_size + 15
    counter_or_expiring_total_column_size = column_name_size + column_value_size + 23

    Every column in Cassandra incurs 15 bytes of overhead. Because each row in a table can have different column names and a different number of columns, metadata is stored for each column. Counter columns and expiring columns incur an additional 8 bytes of overhead (23 bytes total).

  • Account for row overhead.
    Every row in Cassandra incurs 23 bytes of overhead.
  • Estimate primary key index size:
    primary_key_index = number_of_rows * ( 32 + average_key_size )

    Every table also maintains a partition index; this estimate gives its size in bytes.

  • Determine replication overhead:
    replication_overhead = total_data_size * ( replication_factor - 1 )

    The replication factor determines how much disk capacity replication consumes. For a replication factor of 1, there is no overhead for replicas, because only one copy of the data is stored in the cluster. If the replication factor is greater than 1, your total data storage requirement includes this replication overhead.
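
The following Python sketch combines the calculations above into a single estimate. The table statistics (row count, column name and value sizes, average key size, replication factor) are hypothetical values; replace them with numbers from your own schema and data. Compression is ignored, so treat the result as a rough upper bound.

    # Rough on-disk size estimate built from the formulas above.
    # All quantities are in bytes; the inputs are illustrative.

    REGULAR_COLUMN_OVERHEAD = 15   # metadata stored for every column
    COUNTER_EXPIRING_EXTRA = 8     # extra overhead for counter/expiring columns
    ROW_OVERHEAD = 23              # metadata stored for every row
    INDEX_OVERHEAD_PER_ROW = 32    # partition index overhead per row

    def total_column_size(name_size, value_size, counter_or_expiring=False):
        """regular_total_column_size or counter_or_expiring_total_column_size."""
        extra = COUNTER_EXPIRING_EXTRA if counter_or_expiring else 0
        return name_size + value_size + REGULAR_COLUMN_OVERHEAD + extra

    def primary_key_index_size(number_of_rows, average_key_size):
        """primary_key_index = number_of_rows * (32 + average_key_size)."""
        return number_of_rows * (INDEX_OVERHEAD_PER_ROW + average_key_size)

    def replication_overhead(total_data_size, replication_factor):
        """replication_overhead = total_data_size * (replication_factor - 1)."""
        return total_data_size * (replication_factor - 1)

    # Hypothetical table: 1 million rows, each with 5 regular columns
    # (10-byte names, 40-byte values) and an 8-byte partition key.
    rows, columns_per_row = 1_000_000, 5
    row_size = columns_per_row * total_column_size(10, 40) + ROW_OVERHEAD
    data_size = rows * row_size
    total = (data_size
             + primary_key_index_size(rows, average_key_size=8)
             + replication_overhead(data_size, replication_factor=3))
    print(f"estimated on-disk size: {total / 1e9:.2f} GB")

For this hypothetical table the sketch prints roughly 1.08 GB, about two thirds of which is replication overhead at a replication factor of 3.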