Best practices for Cassandra drivers

Session and cluster handling

All drivers use root objects to connect to your cluster, which can be an Astra DB database or a DSE, HCD, or Apache Cassandra cluster.

These root objects are expensive to create because they initialize and maintain connection pools to every node in a cluster. However, a single root object can handle thousands of queries concurrently, so you can use the same session instance to execute all the queries for an application.

Follow these best practices to ensure that you instantiate and reuse these objects effectively:

Create only one long-lived root object for each physical cluster.
Create one session instance for each application, and then reuse that session for the entire lifetime of the application. For example, don’t open and close the session object for each individual request or batch of requests.
The Java and Node.js drivers combine the cluster and session concepts into one object. All other drivers provide separate cluster and session objects.

For more information about session and cluster configuration, see your driver’s documentation:

C/C++ driver sessions and clusters

Create one CassCluster for each physical cluster.

Create and reuse one CassSession for each application.

C# driver sessions and clusters

Create one Cassandra.Cluster for each physical cluster.

Create and reuse one Cassandra.Session for each application.

GoCQL driver sessions and clusters

Create one ClusterConfig, using NewCluster, for each physical cluster.

Create and reuse one Session for each application.

Java driver CQL sessions

The Java driver defines clusters and sessions together in CqlSession.

Create one CqlSession instance for each target physical Cassandra cluster, and then reuse those instances throughout your application.

For more information and examples, see the documentation for your version of the Java driver.

If you want the driver to retry the connection if it fails to establish a CQL session, set advanced.reconnect-on-init to true. For more information, see Cassandra driver reconnection policies.

Node.js driver client interface

The Node.js driver combines the concepts of session and cluster into a single Client interface.

Create one Client instance for a given cluster, and then use that instance across your application. For an example, see Getting started with the Node.js driver.

PHP driver sessions and clusters

Create one Cluster for each physical cluster.

Create and reuse one Session for each application.

Python driver sessions and clusters

Create one Cluster for each physical cluster.

Create and reuse one Session for each application.
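
As a minimal sketch of this pattern with the Python driver, the following module creates the session lazily and returns the same instance to every caller. The contact point `127.0.0.1`, the `get_session` helper, and its injectable `connect` parameter are assumptions made for illustration; they are not part of the driver.

```python
import threading

_lock = threading.Lock()
_session = None  # the single, application-wide session


def _connect():
    # Imported here so the module can be loaded without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # assumed contact point
    return cluster.connect()


def get_session(connect=_connect):
    """Return the shared session, creating it on first use only."""
    global _session
    if _session is None:
        with _lock:
            if _session is None:  # re-check after acquiring the lock
                _session = connect()
    return _session
```

Request handlers would call `get_session()` instead of opening their own connection, and the application would shut the cluster down once, when it exits.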

Ruby driver sessions and clusters

Create one Cassandra::Cluster for each physical cluster.

Create and reuse one Cassandra::Session for each application.

For more information and best practices to help optimize driver connections, see Performance tuning for Cassandra drivers.

Use asynchronous queries for bulk data access

DataStax recommends executing queries asynchronously when processing large amounts of data, including large numbers of queries or long-running queries. For more information and recommended usage patterns, see Asynchronous query execution with Cassandra drivers.
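
One possible shape for this, sketched for the Python driver, is to submit requests with `execute_async` while capping how many are pending at once. The helper name `execute_bounded` and the `max_in_flight` default are hypothetical; only `execute_async` and the future's `result()` come from the driver.

```python
from collections import deque


def execute_bounded(session, statement, params_iter, max_in_flight=32):
    """Run `statement` once per parameter set with a cap on pending requests.

    Results are returned in submission order.
    """
    pending = deque()
    results = []
    for params in params_iter:
        if len(pending) >= max_in_flight:
            # Wait for the oldest request before submitting another one.
            results.append(pending.popleft().result())
        pending.append(session.execute_async(statement, params))
    while pending:
        results.append(pending.popleft().result())
    return results
```

The cap keeps a bulk job from flooding the connection pool. The Python driver also ships helpers in `cassandra.concurrent` that serve a similar purpose.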

In large partitions, fetch rows in batches

When dealing with large partitions, don’t attempt to read the complete partition at once. Loading every row into memory can exceed the memory available to your application process.

Example: Pagination with the Java driver

....
// Avoid
List<Row> rows =
   session.execute("SELECT * FROM employees WHERE company = 'DS'").all();

// Do
Iterator<Row> iterator =
   session.execute("SELECT * FROM employees WHERE company = 'DS'").iterator();
while (iterator.hasNext()) {
   Row row = iterator.next();
   // process
}
....

Example: Pagination with the Python driver

from cassandra.query import SimpleStatement
query = "SELECT * FROM users"  # users contains 100 rows
statement = SimpleStatement(query, fetch_size=10)
for user_row in session.execute(statement):
    process_user(user_row)

By default, drivers fetch results in pages of 5000 rows, and the page size is configurable. For more information, see Result paging with Cassandra drivers.
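
When an application needs to work page by page rather than row by row, for example to write each page out as one chunk, a generator along the following lines might help. This is a sketch for the Python driver: `iter_pages` is a hypothetical helper, while `current_rows`, `has_more_pages`, and `fetch_next_page()` are members of the driver's result set.

```python
def iter_pages(result_set):
    """Yield one list of rows per page, fetching the next page on demand."""
    while True:
        yield list(result_set.current_rows)
        if not result_set.has_more_pages:
            break
        # Synchronously replaces current_rows with the next page.
        result_set.fetch_next_page()
```

Only one page is held in memory at a time, so memory use is bounded by the page size rather than the partition size.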

Use prepared statements for frequently run queries

Prepared statements are queries that you can run multiple times with different parameters. They can make your applications more efficient by reducing resource requirements and eliminating redundancies in code. For more information, see Prepared statements with Cassandra drivers.
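
The benefit depends on preparing each query once and reusing the result. A minimal sketch of that idea for the Python driver follows; the `StatementCache` class is hypothetical, and it relies only on the driver's `session.prepare()` call.

```python
class StatementCache:
    """Prepare each CQL string at most once and reuse the prepared statement."""

    def __init__(self, session):
        self._session = session
        self._prepared = {}

    def get(self, cql):
        statement = self._prepared.get(cql)
        if statement is None:
            # The first use pays the preparation cost; later uses hit the cache.
            statement = self._session.prepare(cql)
            self._prepared[cql] = statement
        return statement
```

A caller would then bind different parameters to the same statement, for example `session.execute(cache.get("SELECT * FROM users WHERE id = ?"), (user_id,))`, where the `users` table is an assumed example.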

Use lightweight transactions (LWTs) judiciously

Lightweight transaction (LWT) statements aren’t intended as general purpose optimistic locking mechanisms.

With LWTs, Cassandra performs three round trips between nodes to propose, agree, and publish the new row state, and the row must be read from disk.

Compared to traditional CQL statements, LWTs usually have higher resource requirements, higher response latency, and lower throughput.
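
A case where the extra cost is usually justified is enforcing uniqueness, where two clients must not both succeed. The sketch below uses the Python driver's `was_applied` result property; the `usernames` table, its columns, and the `register_username` helper are assumptions for illustration.

```python
def register_username(session, username, user_id):
    """Claim a username only if it is still free; return True on success."""
    result = session.execute(
        # IF NOT EXISTS turns this insert into a lightweight transaction.
        "INSERT INTO usernames (username, user_id) VALUES (%s, %s) IF NOT EXISTS",
        (username, user_id),
    )
    return result.was_applied
```

Writes that do not need this guarantee can omit the condition and avoid the additional round trips.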

For additional considerations for LWTs, see the following:

Data modeling recommendations

When designing a data model, consider the queries that you plan to execute. You must identify the patterns used to access data and the types of queries to be performed.

Avoid allow filtering

Don’t use the ALLOW FILTERING CQL clause unless you can guarantee that the given table is small and performing a full scan will have acceptable response latency.

Even if the clause is limited to a single partition, ALLOW FILTERING cannot guarantee the same performance as searching on clustering columns.

When searching on clustering columns, Cassandra can perform a binary search. When searching on non-clustering columns, it must read and compare all rows.

Use idempotent statements

A statement is idempotent if executing it multiple times leaves the database in the same state as executing it only once. This also means that the request is safe to retry, and your driver can automatically retry the request in the event of a failure.
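
The definition can be illustrated with a plain Python model that involves no driver or database. The two functions below are hypothetical stand-ins for an update that sets a column to a fixed value and an update that increments it.

```python
def apply_set(row, value):
    """Model of `SET score = <value>`: repeating it changes nothing further."""
    return {**row, "score": value}


def apply_increment(row, delta):
    """Model of `SET score = score + <delta>`: every repeat changes the row."""
    return {**row, "score": row["score"] + delta}


row = {"id": 1, "score": 10}

# A retried "set" leaves the same state as a single execution.
set_is_idempotent = apply_set(apply_set(row, 42), 42) == apply_set(row, 42)

# A retried "increment" does not, so it is unsafe to retry automatically.
increment_is_idempotent = (
    apply_increment(apply_increment(row, 1), 1) == apply_increment(row, 1)
)
```

In the Python driver, statements carry an `is_idempotent` attribute that tells the driver whether a statement is safe to retry.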

For more information, see Query idempotence in Cassandra drivers and Cassandra driver retry policies.

Manage tombstones

A tombstone is a marker that indicates deleted table data.

DSE, HCD, and Cassandra store updates to tables in immutable SSTable files to maintain throughput and avoid reading stale data. Deleted data, time-to-live (TTL) data, and null values create tombstones that allow the database to reconcile the logically deleted data with new queries across the cluster.

Example: Tombstone creation from writing nulls

Assume a table has the following schema:

CREATE TABLE test_ks.my_table_compound_key (
        primary_key text,
        clustering_key text,
        regular_col text,
        PRIMARY KEY (primary_key, clustering_key)
);

The following query does not create any tombstones:

INSERT INTO my_table_compound_key (primary_key, clustering_key)
        VALUES ('pk1', 'ck1');

However, this query does create a tombstone because it explicitly writes null to regular_col:

INSERT INTO my_table_compound_key (primary_key, clustering_key, regular_col)
        VALUES ('pk1', 'ck1', null);
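
In application code, one way to avoid the second form is to leave out columns that have no value when building the statement. The following sketch assumes the Python driver's `%s` placeholder style for simple statements; `build_insert` is a hypothetical helper, and the table and column names must come from trusted code, not user input, because they are interpolated into the CQL text.

```python
def build_insert(table, values):
    """Build an INSERT that skips columns whose value is None.

    Returns the CQL string and the matching parameter tuple.
    """
    present = {col: val for col, val in values.items() if val is not None}
    columns = ", ".join(present)
    placeholders = ", ".join(["%s"] * len(present))
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, tuple(present.values())
```

The returned pair can be passed directly to the session, for example `session.execute(*build_insert(table, values))`.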

While tombstones are a necessary byproduct of a distributed database, they can cause warnings and log errors. Additionally, heavy deletes and nulls use extra disk space and decrease performance on reads.

Culling tombstones and avoiding tombstone creation increase database and application performance:

Design your data model to avoid unnecessary deletes.

For example, minimize explicit data removal and leverage TTL. With TTL, make sure garbage collection occurs promptly, and be aware that each map key has a separate TTL value.
When writing, construct queries properly to avoid writing nulls.
When reading, use efficient filters in your queries to avoid fetching unnecessary rows, which can contain tombstones.

For example, when reading all rows from a partition, tombstones are scanned. You can use ordering on clustering columns to avoid reading lots of tombstones.
Optimize garbage collection and compaction strategies to remove tombstones efficiently.
Use tracing to monitor how tombstones impact query performance.

For more information, see What are tombstones and Garbage collection of tombstones.

Learn more about data modeling

Performance tuning for Cassandra drivers

For information about performance tuning, see Cassandra driver metrics, Connection pooling in Cassandra drivers, Load balancing with Cassandra drivers, and your driver’s documentation:

C/C++ driver tuning

See C/C++ driver performance tips and C/C++ driver tuning.

C# driver tuning

See C# driver tuning policies.

GoCQL driver tuning

See GoCQL driver performance.

Java driver tuning

DataStax recommends the following configurations for applications that send or store large payloads with the Java driver. For all performance tuning options and best practices, see the documentation for your version of the Java driver.

Adjust thread pooling when storing large payloads

If your application stores large payloads, you can potentially improve overall throughput and latency by adjusting the Netty IO thread count.

Make sure that you test your application’s performance before and after the change to verify any performance improvements. You might need to adjust the value multiple times to find an ideal configuration.

By default, IO thread pool size is twice the value of Runtime.getRuntime().availableProcessors():

x = Runtime.getRuntime().availableProcessors()

Default Netty IO thread count = x * 2

Enable compression when sending large payloads

If your application sends large text values that are likely to compress well, and your driver connects to a Cassandra cluster running in a remote location, consider enabling compression of protocol frames. This configuration saves network bandwidth but requires slightly more CPU resources.

datastax-java-driver {
  advanced.protocol {
    compression = lz4
  }
}

For more information, see the documentation for your version of the Java driver.

Node.js driver tuning

See Node.js driver tuning policies.

Python driver tuning

See Python driver performance notes.

Ruby driver tuning

See Ruby driver performance tips.
