Controlling automatic direct join optimizations in queries
DSE can optimize join queries to directly lookup data in the database without performing a Spark shuffle, which uses a full table scan.
DSE can optimize join queries to directly lookup data in the database without performing a Spark shuffle, which uses a full table scan.
By default, this optimization is turned on. Direct joins are used when:
(table size * directJoinSizeRatio) > size of keys
The value of directJoinSizeRatio should be between 0 and 1. By default,
            this value is 0.9. The directJoinSizeRatio setting can be set when
            creating the reference to the database table or in the Spark Session.
spark.conf.set("directJoinSizeRatio", 0.2)
        You can permanently enable or disable this optimization by setting the
                directJoinSetting option. Valid settings for
                directJoinSetting are:
- onto permanently enable the optimization
- offto permanently disable the optimization
- auto(the default value) to let DSE determine when to enable it according to the criteria from the- directJoinSizeRatiosetting
You can programmatically enable or disable directJoinSetting by calling
            the directJoin function.
import org.apache.spark.sql.cassandra.CassandraSourceRelation._
import org.apache.spark.sql.cassandra._
val table = spark.read.cassandraFormat("tab", "ks").load
spark
  .range(1L,100000L)
  .withColumn("id", concat(lit("Store "), 'id))
  .join(table.directJoin(AlwaysOff), 'id === 'store)
        The directJoin function can be set to AlwaysOn to
            permanently enable the optimization, AlwaysOff to permanently disable
            the optimization, or Automatic to let DSE determine when to use the
            optimization according to the formula for directJoinSizeRatio described
            earlier.
Most users should not change the directJoinSetting option. In most cases
            the direct join should be faster than a full table scan. If the calculation is producing
            less than optimal results adjust the threshold for automatic joins, or turn the
            optimization off.
