Controlling automatic direct join optimizations in queries
DSE can optimize join queries to directly lookup data in the database without performing a Spark shuffle, which uses a full table scan.
DSE can optimize join queries to directly lookup data in the database without performing a Spark shuffle, which uses a full table scan.
By default, this optimization is turned on. Direct joins are used when:
(table size * directJoinSizeRatio) > size of keys
The value of directJoinSizeRatio
should be between 0 and 1. By default,
this value is 0.9. The directJoinSizeRatio
setting can be set when
creating the reference to the database table or in the Spark Session.
spark.conf.set("directJoinSizeRatio", 0.2)
You can permanently enable or disable this optimization by setting the
directJoinSetting
option. Valid settings for
directJoinSetting
are:
on
to permanently enable the optimizationoff
to permanently disable the optimizationauto
(the default value) to let DSE determine when to enable it according to the criteria from thedirectJoinSizeRatio
setting
You can programmatically enable or disable directJoinSetting
by calling
the directJoin
function.
import org.apache.spark.sql.cassandra.CassandraSourceRelation._ import org.apache.spark.sql.cassandra._ val table = spark.read.cassandraFormat("tab", "ks").load spark .range(1L,100000L) .withColumn("id", concat(lit("Store "), 'id)) .join(table.directJoin(AlwaysOff), 'id === 'store)
The directJoin
function can be set to AlwaysOn
to
permanently enable the optimization, AlwaysOff
to permanently disable
the optimization, or Automatic
to let DSE determine when to use the
optimization according to the formula for directJoinSizeRatio
described
earlier.
Most users should not change the directJoinSetting
option. In most cases
the direct join should be faster than a full table scan. If the calculation is producing
less than optimal results adjust the threshold for automatic joins, or turn the
optimization off.