Controlling automatic direct join optimizations in queries

DSE can optimize join queries to directly lookup data in the database without performing a Spark shuffle, which uses a full table scan.

By default, this optimization is turned on. Direct joins are used when:

(table size * directJoinSizeRatio) > size of keys

The value of directJoinSizeRatio should be between 0 and 1. By default, this value is 0.9. The directJoinSizeRatio setting can be set when creating the reference to the database table or in the Spark Session.

spark.conf.set("directJoinSizeRatio", 0.2)

You can permanently enable or disable this optimization by setting the directJoinSetting option. Valid settings for directJoinSetting are:

  • on to permanently enable the optimization

  • off to permanently disable the optimization

  • auto (the default value) to let DSE determine when to enable it according to the criteria from the directJoinSizeRatio setting

You can programmatically enable or disable directJoinSetting by calling the directJoin function.

import org.apache.spark.sql.cassandra.CassandraSourceRelation._
import org.apache.spark.sql.cassandra._
val table = spark.read.cassandraFormat("tab", "ks").load
spark
  .range(1L,100000L)
  .withColumn("id", concat(lit("Store "), 'id))
  .join(table.directJoin(AlwaysOff), 'id === 'store)

The directJoin function can be set to AlwaysOn to permanently enable the optimization, AlwaysOff to permanently disable the optimization, or Automatic to let DSE determine when to use the optimization according to the formula for directJoinSizeRatio described earlier.

Most users should not change the directJoinSetting option. In most cases the direct join should be faster than a full table scan. If the calculation is producing less than optimal results adjust the threshold for automatic joins, or turn the optimization off.

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com