Creating a schema and data modeling

A search schema defines the relationship between data in a table and a Solr core. The schema identifies the columns to index in Solr and maps column names to Solr types.

This topic provides an overview of the search schema. For details about all the options and search schema settings, see the Solr wiki.

Table and schema definition 

A CQL table must be created before creating the Solr core. DSE Search maps the schema fields and the unique key specification to the Cassandra key components, and generates a synthetic unique key for Solr.

Apache Solr™ and Apache Lucene® limitations apply to DSE Search, including field name policies.

Solr field name policy applies to DSE Search field names:
  • Every field must have a name.
  • Field names must consist of alphanumeric or underscore characters only.
  • Field names cannot start with a digit.
  • Names with both leading and trailing underscores (for example, _version_) are reserved.
Note: Although the field name policy is not strictly enforced, non-compliant field names are not supported by all components. Backward compatibility is not guaranteed.
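
Because the policy is not strictly enforced, it can help to check names before adding them to a schema. The following Python sketch applies the four rules listed above. The function name check_field_name is hypothetical and not part of DSE Search or Solr, and the sketch assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits only.
# The first character may be a letter or underscore, never a digit.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_field_name(name: str) -> str:
    """Classify a field name as 'ok', 'reserved', or 'invalid'."""
    # Every field must have a name, made of letters, digits, and underscores.
    if not name or _NAME_RE.fullmatch(name) is None:
        return "invalid"
    # Names with both leading and trailing underscores are reserved.
    if name.startswith("_") and name.endswith("_"):
        return "reserved"
    return "ok"


for candidate in ("num_smokers", "2nd_visit", "first-name", "_version_"):
    print(candidate, "->", check_field_name(candidate))
```

A name such as _partitionKey, which has a leading underscore only, passes this check as a regular name.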

For example, the following excerpts from the Basic tutorial show a CQL table and the corresponding Solr schema.

CREATE TABLE nhanes (
  "id" INT,
  "num_smokers" INT,
  "age" INT,
  . . .
  PRIMARY KEY ("id", "age")
);

<schema name="solr_quickstart" version="1.1">
 <types>
 . . .
 <fields>
   <field name="id" type="int" indexed="true"  stored="true"/>
   <field name="num_smokers" type="int" indexed="true"  stored="true"/>
   <field name="age" type="int" indexed="true"  stored="true"/>
 . . .
 <uniqueKey>(id,age)</uniqueKey>
 . . .

The schema must have a unique key. The unique key is like a primary key in SQL. It maps to the Cassandra primary key, which DataStax Enterprise uses to route documents to cluster nodes.

Fields with indexed="true" are indexed and searched by Lucene. Indexed fields are stored in Cassandra, not in Lucene, regardless of the value of the stored attribute. Copy fields are the exception.
  • To store a field with indexed="false" in Cassandra and enable the field to be returned on search queries, set stored="true".
  • To ignore the field, set both indexed="false" and stored="false".
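
The attribute combinations described above can be summarized in a short Python sketch. The function name field_handling and the returned descriptions are hypothetical, chosen only for illustration. The sketch does not model the copy field exception.

```python
def field_handling(indexed: bool, stored: bool) -> str:
    """Describe how a field is treated for a given indexed/stored pair.

    Illustrative only: copy fields, which are the exception noted in the
    text, are not modeled here.
    """
    if indexed:
        # The stored attribute does not change where the value is kept.
        return "indexed by Lucene; value stored in Cassandra"
    if stored:
        return "not indexed; value stored in Cassandra and returned on queries"
    return "ignored"


for indexed in (True, False):
    for stored in (True, False):
        print(f"indexed={indexed}, stored={stored}: {field_handling(indexed, stored)}")
```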

Defining the unique key 

DataStax Enterprise supports CQL tables using simple or compound primary keys, as shown in the Solr query join example, and composite partition keys.

When using legacy type mappings, the unique key must be defined as a string in the Solr schema.

Sample schema 

The following example from Querying a CQL collection set uses a simple primary key.
<schema name="my_search_demo" version="1.5">
  <types>
    <fieldType class="solr.StrField" multiValued="true" name="StrCollectionField"/>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text" class="solr.TextField"/>
    <fieldType class="solr.TextField" name="textcollection" multiValued="true">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id"  type="string" indexed="true"  stored="true"/>
    <field name="quotes"  type="textcollection" indexed="true"  stored="true"/>
    <field name="name"  type="text" indexed="true"  stored="true"/>
    <field name="title"  type="text" indexed="true"  stored="true"/>
  </fields>
  <defaultSearchField>quotes</defaultSearchField>
  <uniqueKey>id</uniqueKey>
</schema>

DSE Search indexes the id, quotes, name, and title fields.

Mapping CQL primary keys and Solr unique keys 

The schema for a CQL table that uses a compound primary key or a composite partition key requires a different syntax than the schema for a simple primary key. In the Solr schema, enclose the unique key columns in parentheses:
  • List each compound primary key column that appears in the CQL table in the Solr schema as a field, just like any other column.
  • Declare the unique key using the key columns enclosed in parentheses.
  • Order the keys in the uniqueKey element as the keys are ordered in the CQL table.
  • When using composite partition keys, do not carry the inner set of parentheses from the CQL PRIMARY KEY clause over into the Solr uniqueKey.
    Cassandra partition key    CQL syntax                                             Solr uniqueKey syntax
    Simple CQL primary key     CREATE TABLE ( . . . a <type> PRIMARY KEY, . . . );    <uniqueKey>a</uniqueKey>
    Compound primary key       CREATE TABLE ( . . . PRIMARY KEY ( a, b, c ) );        <uniqueKey>(a, b, c)</uniqueKey>
    Composite partition key    CREATE TABLE ( . . . PRIMARY KEY ( ( a, b ), c ) );    <uniqueKey>(a, b, c)</uniqueKey>

    Simple CQL primary key: a is both the partition key and the primary key. Note: Parentheses are not required for a single key.

    Compound primary key: a is the partition key, and a, b, and c form the primary key. The Basic tutorial contains a schema for a Cassandra table that uses a CQL compound primary key.

    Composite partition key: a and b form the partition key, and a, b, and c form the primary key.
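
The mapping rules above can be sketched in Python as follows. The helper name unique_key_element and its parameters are hypothetical and not part of DSE Search. The sketch only formats the uniqueKey element from the key columns in the order they are declared in the CQL table.

```python
from typing import Sequence


def unique_key_element(partition_key: Sequence[str],
                       clustering: Sequence[str] = ()) -> str:
    """Build a Solr uniqueKey element from CQL primary key columns.

    partition_key: partition key columns, in CQL order.
    clustering: remaining primary key columns, in CQL order.
    """
    # Flatten the key: the inner parentheses of a composite partition key
    # are not repeated in the Solr uniqueKey.
    columns = [*partition_key, *clustering]
    if not columns:
        raise ValueError("a primary key needs at least one column")
    if len(columns) == 1:
        # Parentheses are not required for a single key.
        return f"<uniqueKey>{columns[0]}</uniqueKey>"
    return f"<uniqueKey>({', '.join(columns)})</uniqueKey>"


print(unique_key_element(["a"]))              # simple primary key
print(unique_key_element(["a"], ["b", "c"]))  # compound primary key
print(unique_key_element(["a", "b"], ["c"]))  # composite partition key
```

The compound and composite cases produce the same element because only the column order matters to the uniqueKey.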

Overriding _partitionKey when not using joins 

The special _partitionKey field is used internally for joins. If you do not plan to use joins, you can override the declaration of this field in the schema.xml file, but only its docValues and indexed properties. For example, the following declaration sets indexed="false":
<field name="_partitionKey" type="string" indexed="false"/>
To disable doc values, add docValues="false".
<field name="_partitionKey" type="string" docValues="false"/>

Document level boosting (Deprecated) 

Document-level boosting is deprecated and will be removed in a future release. To add document-level boosting on CQL tables, add a column named "_docBoost" of type float to the table. Fields belonging to that document are boosted at indexing time.

Changing a schema 

Changing the Solr schema requires reloading the Solr core. Re-indexing can be disruptive, and users can be affected by the resulting performance hit. Change the schema only when absolutely necessary, and do so during scheduled downtime.