Indexing a column for different analysis

DSE Search supports indexing a CQL table column for different types of analysis using the Solr copyField directive.

For a complete explanation, see the Solr Reference Guide Copying fields.

When specified during search index creation, DSE automatically defines a new index string field and sets up the data copy. The new field is not stored in the database or returned in query results.

Restriction: Copying from/to the same dynamic field and setting the maximum number of characters (maxChars) in the copyField definition are unsupported.

The following example uses copy fields to copy various CQL columns, such as a twitter name and email, to a multiValued field. You can then query the multiValued field using a term to search for all columns in a single query.

The DSE per-segment filter cache is moved off-heap by using native memory to reduce on-heap memory consumption and garbage collection overhead.

The off-heap filter cache is enabled by default. To disable, pass the offheap JVM system property at startup time: -Dsolr.offheap.enable.

Procedure

  1. Create a keyspace using the replication strategy and replication factor that makes sense for your environment. The following example is for a single node test cluster:

    CREATE KEYSPACE user_info
      WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
  2. Create a table:

    CREATE TABLE user_info.users (
      id text PRIMARY KEY,
      name text,
      email text,
      skype text,
      irc text,
      twitter text
    ) ;
  3. Insert some data:

    INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
      ('user1', 'john smith', 'jsmith@abc.com', 'johnsmith', 'smitty', '@johnsmith');
    
    INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
      ('user2', 'elizabeth doe', 'lizzy@swbell.net', 'roadwarriorliz', 'elizdoe',  '@edoe576');
    
    INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
      ('user3', 'dan graham', 'etnaboy1@aol.com', 'danielgra', 'dgraham', '@dannyboy');
    
    INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
      ('user4', 'john smith', 'jonsmit@fyc.com', 'johnsmith', 'jsmith345', '@johnrsmith');
    
    INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
      ('user5', 'john smith', 'jds@adeck.net', 'jdsmith', 'jdansmith',  '@smithjd999');
    
    INSERT INTO user_info.users (id, name, email, skype, irc, twitter) VALUES
      ('user6', 'dan graham', 'hacker@legalb.com', 'dangrah', 'dgraham', '@graham222');
  4. Create a search index on the table:

    CREATE SEARCH INDEX ON user_info.users;
  5. Create a field that is only in the index that will contain all the data:

    ALTER SEARCH INDEX SCHEMA ON user_info.users 
    ADD fields.field[ @name='all', 
                      @type='StrField', 
                      @multiValued='true'];
  6. Use copyField to copy the data from all the CQL columns into the new all field of the index:

    ALTER SEARCH INDEX SCHEMA ON user_info.users 
       ADD copyField[@source='id', @dest='all'];
    ALTER SEARCH INDEX SCHEMA ON user_info.users 
       ADD copyField[@source='name', @dest='all'];
    ALTER SEARCH INDEX SCHEMA ON user_info.users 
      ADD copyField[@source='email', @dest='all'];
    ALTER SEARCH INDEX SCHEMA ON user_info.users 
      ADD copyField[@source='skype', @dest='all'];
    ALTER SEARCH INDEX SCHEMA ON user_info.users 
      ADD copyField[@source='irc', @dest='all'];
    ALTER SEARCH INDEX SCHEMA ON user_info.users 
      ADD copyField[@source='twitter', @dest='all'];
  7. To allow faceting on the name column, set docValues to true:

    ALTER SEARCH INDEX SCHEMA ON user_info.users
      SET fields.field[@name='name']@docValues='true';
  8. Reload the schema to make the pending changes active:

    RELOAD SEARCH INDEX ON user_info.users;
  9. Rebuild the index to apply the new schema to the existing data:

    REBUILD SEARCH INDEX ON user_info.users;
  10. Filter the query using the index to return all records that contain smitty in any of the columns.

    SELECT * FROM user_info.users WHERE solr_query = 'all:smitty';

    The output is:

     id    | email          | irc    | name       | skype     | solr_query | twitter
    -------+----------------+--------+------------+-----------+------------+------------
     user1 | jsmith@abc.com | smitty | john smith | johnsmith |       null | @johnsmith
    
    (1 rows)
  11. Get a count of unique names (skip nulls):

    SELECT name FROM user_info.users 
       WHERE solr_query= '{"q":"*","facet":{"field":"name","mincount":"1"}}';

    At the bottom of the output, the facet results appear: 3 instances of john smith, 2 instances of dan graham, and 1 instance of elizabeth doe.

     facet_fields
    ------------------------------------------------------------
     {"name":{"john smith":3,"dan graham":2,"elizabeth doe":1}}
    
    (1 rows)

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com