DSE Analytics and Search integration

An integrated DSE SearchAnalytics cluster allows analytics jobs to be performed using search queries. This integration allows finer-grained control over the types of queries that are used in analytics workloads, and improves performance by reducing the amount of data that is processed. However, a DSE SearchAnalytics cluster does not provide workload isolation and there are no detailed guidelines for provisioning and performance in production environments.

Nodes that are started in SearchAnalytics mode allow you to create analytics queries that use DSE Search indexes. These queries return RDDs that are used by Spark jobs to analyze the returned data.

The following code shows how to use a DSE Search query from the DSE Spark console.

val table = sc.cassandraTable("music","solr")
val result = table.select("id","artist_name").where("solr_query='artist_name:Miles*'")

You can use Spark Datasets/DataFrames instead of RDDs.

val table = spark.read.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "music", "table" -> "solr"))
      .load()
val result = table.select("id","artist_name").where("solr_query='artist_name:Miles*'")

Alternatively, you can use a Spark SQL query.

val result = spark.sql("SELECT id, artist_name FROM music.solr WHERE solr_query = 'artist_name:Miles*' LIMIT 10")

Configuring a DSE SearchAnalytics cluster

  1. Create DSE SearchAnalytics nodes in a mixed-workload cluster, as described in Initializing a single datacenter per workload type.

    The name of the datacenter is set to SearchAnalytics when using the DseSimpleSnitch. Do not modify existing search or analytics nodes that use DseSimpleSnitch to be SearchAnalytics nodes. If you use another snitch, such as GossipingPropertyFileSnitch, you can run mixed workloads within a datacenter.

  2. Perform load testing to ensure your hardware has enough CPU and memory for the additional resource overhead that is required by Spark and Solr.

    SearchAnalytics nodes always use driver paging settings. See Using pagination (cursors) with CQL Solr queries.

    SearchAnalytics nodes might consume more resources than search or analytics nodes. Resource requirements of the nodes greatly depend on the type of query patterns you are using.
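As a sketch of what driver paging looks like at the query level, the paging mode can be requested inside the solr_query JSON (the "paging":"driver" option is assumed from DSE's CQL Solr query syntax; the keyspace, table, and column names follow the earlier examples):

```sql
-- Hypothetical example: request driver (cursor) paging for a CQL Solr query.
-- The "paging":"driver" option is assumed from DSE's CQL Solr query syntax.
SELECT id, artist_name
FROM music.solr
WHERE solr_query = '{"q":"artist_name:Miles*", "paging":"driver"}';
```

Running such a query requires a DSE SearchAnalytics node with a search index on the table.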

Considerations for DSE SearchAnalytics clusters

Take care when enabling both search and analytics workloads on a DSE node. Because both workloads run simultaneously, ensure that sufficient memory and compute resources are provisioned to accommodate the indexing, query, and processing load appropriate to the use case.

SearchAnalytics clusters are appropriate for production environments, provided they have sufficient resources for the specific workload, as is true of all DSE clusters.

All of the fields that are queried on DSE SearchAnalytics clusters must be defined in the search index schema. Columns that are not defined in the search index schema are excluded from the results returned by Spark queries.
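For example, the columns referenced by the Spark queries above could be included when creating the search index. This is a sketch using DSE's CREATE SEARCH INDEX CQL syntax; the keyspace, table, and column names follow the earlier examples:

```sql
-- Sketch: index the columns that Spark queries will reference.
-- Columns omitted from the search index schema are excluded from results.
CREATE SEARCH INDEX ON music.solr WITH COLUMNS id, artist_name;
```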

Using predicate push down in Spark SQL

Solr predicate push down allows Spark SQL queries in SearchAnalytics datacenters to filter on Solr-indexed columns, so that the filter is evaluated by the Solr index rather than by Spark after a full table scan.
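A minimal sketch of using push down from the DSE Spark console, assuming DSE's spark.sql.dse.solr.enable_optimization property (the keyspace, table, and column names follow the earlier examples):

```scala
// Sketch: enable Solr predicate push down for this session
// (assumes DSE's spark.sql.dse.solr.enable_optimization property).
spark.conf.set("spark.sql.dse.solr.enable_optimization", "true")

val df = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "music", "table" -> "solr"))
  .load()

// This filter on a Solr-indexed column can be pushed down to the search
// index instead of being evaluated in Spark.
val result = df.filter("artist_name LIKE 'Miles%'").select("id", "artist_name")
```

Running this sketch requires a DSE SearchAnalytics cluster with a search index on the table.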


© 2024 DataStax
