Best practices for DataStax drivers
These rules and recommendations improve performance and minimize resource utilization in applications that use DataStax drivers.
Create and reuse a single session for the entire lifetime of an application. Sessions are expensive to create because they initialize and maintain connection pools to every node in a cluster. A single driver session can handle thousands of queries concurrently. Use a single driver session to execute all the queries in an application. Using a single session per cluster allows the drivers to coalesce queries destined for the same node, which can significantly reduce system call overhead.
DataStax recommends that you not create one session per keyspace. Applications should use fully qualified keyspaces in their query strings or explicitly set the keyspace on statement objects.
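The single-session rule can be sketched in Python as follows. This is a minimal illustration, not a driver API: `SessionHolder` and the `connect` callable are hypothetical names. In a real application `connect` would wrap the driver's own session creation (with the DataStax Python driver, something like `Cluster(...).connect()`). The query string qualifies the table with its keyspace, so one session can serve every keyspace.

```python
import threading


class SessionHolder:
    """Creates one driver session lazily and hands the same object to every caller.

    `connect` is a hypothetical zero-argument callable supplied by the
    application; it is expected to return a driver session.
    """

    def __init__(self, connect):
        self._connect = connect
        self._session = None
        self._lock = threading.Lock()

    def get(self):
        # The lock ensures that concurrent first callers create only one session.
        with self._lock:
            if self._session is None:
                self._session = self._connect()
            return self._session


# Fully qualified table name, so no per-keyspace session is needed.
# %s is the placeholder style the Python driver uses for simple statements.
SELECT_ROW = (
    "SELECT regular_col FROM test_ks.my_table_compound_key "
    "WHERE primary_key = %s AND clustering_key = %s"
)
```

An application would build one `SessionHolder` at startup and call `holder.get()` wherever it needs to run a query, rather than connecting again.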
Create a single cluster object per physical cluster. Cluster objects are relatively expensive to create because they maintain a control connection to a given cluster. Creating more than one cluster object per physical cluster duplicates these resources unnecessarily.
This rule does not apply to the C/C++, OSS Java 4.x, and DSE Java 2.x drivers, because in those drivers each session maintains its own cluster state. However, the single session per application rule still applies.
The Node.js driver combines the concepts of session and cluster into a single Client interface.
Use the driver’s asynchronous APIs to achieve maximum throughput. The asynchronous APIs provide execution methods that return immediately without blocking the application’s progress, allowing a single application thread to run many queries concurrently. Asynchronous execution methods return future objects that can be used by the application to obtain query results and errors if they occur. Running many queries concurrently allows applications to optimize their query processing, improves the driver’s ability to coalesce query requests, and maximizes use of server-side resources.
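As a rough sketch of this pattern in Python, the helper below keeps a bounded number of requests in flight from a single thread. It assumes a session object whose `execute_async(statement, parameters)` returns a future with a blocking `result()` method, which is how the DataStax Python driver's `Session.execute_async` behaves. The helper name `execute_concurrently` and the `max_in_flight` default are illustrative choices, not driver settings.

```python
from collections import deque


def execute_concurrently(session, statement, parameter_sets, max_in_flight=64):
    """Run one statement for many parameter sets with bounded concurrency.

    Results are returned in submission order. If any request failed, the
    corresponding `result()` call raises and the error propagates.
    """
    results = []
    in_flight = deque()
    for parameters in parameter_sets:
        # execute_async returns immediately; the request proceeds in the background.
        in_flight.append(session.execute_async(statement, parameters))
        if len(in_flight) >= max_in_flight:
            # Wait for the oldest request before sending more.
            results.append(in_flight.popleft().result())
    # Collect the responses that are still outstanding.
    while in_flight:
        results.append(in_flight.popleft().result())
    return results
```

Bounding the number of outstanding requests keeps the application from flooding the connection pools. The Python driver also ships a ready-made helper for this purpose, `cassandra.concurrent.execute_concurrent_with_args`.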
Prepare queries that are used more than once. Preparing queries allows the server and driver to reduce the amount of processing and network data required to run a query. For prepared statements, the server parses the query once and then caches it for the lifetime of the application. The server also avoids sending response metadata after the initial prepare step, which reduces the data sent over the network and the corresponding client-side processing.
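One way to make sure each query string is prepared only once is a small application-side cache, sketched here in Python. `PreparedStatementCache` is a hypothetical helper. It assumes a session object with a `prepare(query)` method, as the DataStax Python driver's `Session` provides. Some drivers already cache prepared statements internally, in which case this wrapper is unnecessary.

```python
import threading


class PreparedStatementCache:
    """Prepares each distinct query string once and reuses the result."""

    def __init__(self, session):
        self._session = session
        self._statements = {}
        self._lock = threading.Lock()

    def get(self, query):
        # The lock prevents two threads from preparing the same query at once.
        with self._lock:
            prepared = self._statements.get(query)
            if prepared is None:
                prepared = self._session.prepare(query)
                self._statements[query] = prepared
            return prepared
```

A caller would then run something like `session.execute(cache.get("SELECT regular_col FROM test_ks.my_table_compound_key WHERE primary_key = ?"), ("pk1",))`. Prepared statements use `?` bind markers.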
When using a datacenter-aware load balancing policy, your application should explicitly set the local datacenter instead of allowing the drivers to infer the local datacenter from the contact points. If the driver chooses the wrong local datacenter, cross-datacenter traffic increases, and cross-datacenter traffic often has higher latency and a higher monetary cost than traffic within a datacenter. Setting the local datacenter explicitly eliminates the chance that the driver will choose the wrong local datacenter.
When configuring a driver connection, it is easy to include contact points in remote datacenters or invalid datacenters. For example, an application might include contact points for an internal datacenter used during testing. Explicitly setting the local datacenter avoids these types of errors.
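With the DataStax Python driver, setting the local datacenter explicitly might look like the sketch below. It assumes the `cassandra-driver` package is installed. `build_cluster` is a hypothetical helper, and wrapping the policy in `TokenAwarePolicy` is a choice made for this example rather than a requirement. The helper rejects an empty datacenter name so that the application never falls back to inference from the contact points.

```python
def build_cluster(contact_points, local_dc):
    """Return a Cluster whose default execution profile is pinned to `local_dc`."""
    if not local_dc:
        # Refuse to let the driver infer the datacenter from the contact points.
        raise ValueError("local_dc must be set explicitly")

    # Imported here so the check above runs even where the driver is not installed.
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    profile = ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc=local_dc)
        )
    )
    return Cluster(
        contact_points=list(contact_points),
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    )
```

For example, `build_cluster(["10.0.0.1"], "dc1").connect()` would open a session that routes queries to nodes in `dc1`. Both the address and the datacenter name are placeholders.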
In DSE and Cassandra, a tombstone is a marker that indicates that table data is logically deleted. DSE and Cassandra store updates to tables in immutable SSTable files to maintain throughput and avoid reading stale data. Deleted data, time-to-live (TTL) data, and null values create tombstones, which allow the database to reconcile the logically deleted data with new queries across the cluster. While tombstones are a necessary byproduct of a distributed database, limiting the number of tombstones and avoiding tombstone creation increase database and application performance.
Deletes can often be avoided through data modeling techniques. Nulls can be avoided with proper query construction. For more details about tombstones, see https://docs.datastax.com/en/dse/6.8/dse-arch/datastax_enterprise/dbInternals/archTombstones.html.
Heavy use of deletes and nulls consumes extra disk space and decreases read performance. Large numbers of tombstones can also cause warnings and errors in the logs.
For example, in the following schema:
CREATE TABLE test_ks.my_table_compound_key ( primary_key text, clustering_key text, regular_col text, PRIMARY KEY (primary_key, clustering_key) );
This query results in no tombstones for regular_col:
INSERT INTO test_ks.my_table_compound_key (primary_key, clustering_key) VALUES ('pk1', 'ck1');
However, this query results in a tombstone for regular_col:
INSERT INTO test_ks.my_table_compound_key (primary_key, clustering_key, regular_col) VALUES ('pk1', 'ck1', null);
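The difference between the two inserts suggests one way to construct queries that avoid null tombstones: omit any column that has no value. The Python helper below is a minimal sketch of that idea. `build_insert` is a hypothetical function, and it assumes that table and column names come from trusted application code, because identifiers cannot be bound as parameters.

```python
def build_insert(table, values):
    """Return (cql, params) for an INSERT that skips columns whose value is None.

    `table` and the keys of `values` are interpolated into the statement, so
    they must be trusted identifiers. The values themselves are bound as
    parameters using the %s placeholder style of the Python driver.
    """
    present = {column: value for column, value in values.items() if value is not None}
    if not present:
        raise ValueError("no non-null values to insert")
    columns = ", ".join(present)
    markers = ", ".join("%s" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({markers})"
    return cql, tuple(present.values())
```

Calling `build_insert("test_ks.my_table_compound_key", {"primary_key": "pk1", "clustering_key": "ck1", "regular_col": None})` produces a statement equivalent to the first insert above, so no tombstone is written for regular_col. For prepared statements, the Python driver offers `cassandra.query.UNSET_VALUE` as an alternative that leaves a bound column untouched.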