Using Spark predicate push down in Spark SQL queries
Spark predicate push down to the database allows for better optimized Spark SQL queries. A predicate is a condition on a query that returns true or false, typically located in the WHERE clause. A predicate push down filters the data in the database query, reducing the number of entries retrieved from the database and improving query performance. By default, the Spark Dataset API automatically pushes down valid WHERE clauses to the database.
You can also use predicate push down on DSE Search indices within SearchAnalytics data centers.
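For example, registering a Dataset as a temporary view and querying it with Spark SQL lets a WHERE clause on a database column be pushed down automatically. The following is a minimal sketch, assuming the test.words table created in the example later in this article:
// Minimal sketch, assuming the test.words table created later in this article.
// The WHERE clause filters on a database column, so the Dataset API pushes the
// predicate down to the database instead of filtering the rows in Spark.
val words = spark.read.cassandraFormat("words", "test").load
words.createOrReplaceTempView("words")
spark.sql("SELECT * FROM words WHERE word > 'ham'").explain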
Restrictions on column filters
Partition key columns can be pushed down as long as:
- All partition key columns are included in the filter.
- No more than one equivalence predicate per column.
Use an IN clause to specify multiple restrictions for a particular column:
val primaryColors = List("red", "yellow", "blue")

val df = spark.read.cassandraFormat("cars", "inventory").load
df.filter(df("car_color").isin(primaryColors: _*))
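If the table has a composite partition key, every partition key column must be restricted for push down to occur. A minimal sketch, assuming a hypothetical telemetry.readings table with a composite partition key (sensor_id, day):
// Hypothetical table telemetry.readings with a composite partition key (sensor_id, day).
// Both partition key columns are restricted, with one equality predicate each,
// so the filter is eligible for push down.
val readings = spark.read.cassandraFormat("readings", "telemetry").load
readings.filter(readings("sensor_id") === "s-001" && readings("day") === "2018-11-11").explain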
Clustering key columns can be pushed down with the following rules (see the example after this list):
- Only the last predicate in the filter can be a non-equivalence predicate.
- If there is more than one predicate for a column, the predicates cannot be equivalence predicates.
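For example, the following sketch assumes a hypothetical timeline.events table with PRIMARY KEY ((day), hour, minute). The clustering predicate on hour is an equivalence predicate, and the single non-equivalence predicate on minute comes last, so both clustering predicates are eligible for push down:
// Hypothetical table timeline.events with PRIMARY KEY ((day), hour, minute).
// The partition key (day) is restricted, the clustering predicate on hour is an
// equality, and the only non-equivalence predicate (minute > 30) comes last.
val events = spark.read.cassandraFormat("events", "timeline").load
events.filter(events("day") === "2018-11-11" && events("hour") === 12 && events("minute") > 30).explain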
When predicate push down occurs
When a Dataset has no push down filters, all requests on the Dataset do a full unfiltered table scan. Adding predicate filters on the Dataset for eligible database columns modifies the underlying query to narrow its scope.
Determining if predicate push down is being used in queries
By using the explain method on a Dataset (or EXPLAIN in Spark SQL), queries can be analyzed to see whether the predicates are being pushed down or whether they need to be cast to the correct data type first. For example, create the following CQL table:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };

USE test;

CREATE TABLE words (
    user TEXT,
    word TEXT,
    count INT,
    PRIMARY KEY (user, word)
);

INSERT INTO words (user, word, count) VALUES ('Russ', 'dino', 10);
INSERT INTO words (user, word, count) VALUES ('Russ', 'fad', 5);
INSERT INTO words (user, word, count) VALUES ('Sam', 'alpha', 3);
INSERT INTO words (user, word, count) VALUES ('Zebra', 'zed', 100);
Then create a Spark Dataset in the Spark console using that table and look for PushedFilters in the output of the explain method:
val df = spark.read.cassandraFormat("words", "test").load
df.explain
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user#0,word#1,count#2]
ReadSchema: struct<user:string,word:string,count:int>
Because this query doesn't filter on columns capable of being pushed down, there are no PushedFilters in the physical plan.
Adding a filter, however, does change the physical plan to include PushedFilters:
val dfWithPushdown = df.filter(df("word") > "ham")
dfWithPushdown.explain
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [user#0,word#1,count#2]
PushedFilters: [*GreaterThan(word,ham)], ReadSchema: struct<user:string,word:string,count:int>
The PushedFilters section of the physical plan includes the GreaterThan push down filter. The asterisk indicates that the push down filter will be handled only at the datasource level.
Troubleshooting predicate push down
When writing Spark SQL queries that use comparison operators, verifying that the predicates are pushed down to the database correctly is critical to retrieving the correct data with the best performance.
For example, given a CQL table with the following schema:
CREATE TABLE test.common (
    year int,
    birthday timestamp,
    userid uuid,
    likes text,
    name text,
    PRIMARY KEY (year, birthday, userid)
);
Suppose you want to write a query that selects all entries where the birthday is earlier than a given date:
SELECT * FROM test.common WHERE birthday < '2001-1-1';
Use the EXPLAIN command to see the query plan:
EXPLAIN SELECT * FROM test.common WHERE birthday < '2001-1-1';
== Physical Plan ==
*Filter (cast(birthday#1 as string) < 2001-1-1)
+- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [year#0,birthday#1,userid#2,likes#3,name#4]
   ReadSchema: struct<year:int,birthday:timestamp,userid:string,likes:string,name:string>
Time taken: 0.72 seconds, Fetched 1 row(s)
Note that the Filter directive is treating the birthday column, a CQL TIMESTAMP, as a string. The query optimizer looks at this comparison and needs to make the types match before generating a predicate. In this case the optimizer decides to cast the birthday column as a string to match the string '2001-1-1', but cast functions cannot be pushed down. The predicate isn't pushed down, and it doesn't appear in PushedFilters. A full table scan will be performed at the database layer, with the results returned to Spark for further processing.
To push down the correct predicate for this query, use the cast function to specify that the predicate is comparing the birthday column to a TIMESTAMP, so the types match and the optimizer can generate the correct predicate:
EXPLAIN SELECT * FROM test.common WHERE birthday < cast('2001-1-1' as TIMESTAMP);
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [year#0,birthday#1,userid#2,likes#3,name#4]
PushedFilters: [*LessThan(birthday,2001-01-01 00:00:00.0)], ReadSchema: struct<year:int,birthday:timestamp,userid:string,likes:string,name:string>
Time taken: 0.034 seconds, Fetched 1 row(s)
Note the PushedFilters section indicating that the LessThan predicate will be pushed down for the birthday column data. This should speed up the query, as a full table scan will be avoided.
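The same fix can be applied when using the Dataset API directly. The following is a minimal sketch, assuming the test.common table above, that casts the string literal to a timestamp so the comparison is timestamp-to-timestamp and the predicate can be pushed down:
// Sketch using the Dataset API against the test.common table defined above.
// Casting the string literal to a timestamp keeps the optimizer from casting the
// birthday column to a string, so the LessThan predicate can be pushed down.
import org.apache.spark.sql.functions.lit

val common = spark.read.cassandraFormat("common", "test").load
common.filter(common("birthday") < lit("2001-1-1").cast("timestamp")).explain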