Controlling automatic direct join optimizations in queries
DSE can optimize join queries to directly lookup data in the database without performing a Spark shuffle, which uses a full table scan.
By default, this optimization is turned on. Direct joins are used when:
(table size * directJoinSizeRatio) > size of keys
The value of
directJoinSizeRatio should be between 0 and 1.
By default, this value is 0.9.
directJoinSizeRatio setting can be set when creating the reference to the database table or in the Spark Session.
You can permanently enable or disable this optimization by setting the
Valid settings for
onto permanently enable the optimization
offto permanently disable the optimization
auto(the default value) to let DSE determine when to enable it according to the criteria from the
You can programmatically enable or disable
directJoinSetting by calling the
import org.apache.spark.sql.cassandra.CassandraSourceRelation._ import org.apache.spark.sql.cassandra._ val table = spark.read.cassandraFormat("tab", "ks").load spark .range(1L,100000L) .withColumn("id", concat(lit("Store "), 'id)) .join(table.directJoin(AlwaysOff), 'id === 'store)
directJoin function can be set to
AlwaysOn to permanently enable the optimization,
AlwaysOff to permanently disable the optimization, or
Automatic to let DSE determine when to use the optimization according to the formula for
directJoinSizeRatio described earlier.
Most users should not change the
In most cases the direct join should be faster than a full table scan.
If the calculation is producing less than optimal results adjust the threshold for automatic joins, or turn the optimization off.