Joining cores

Requirements for joining Solr documents.

DSE Search supports the OS Solr query time join through a custom implementation. You can join Solr documents, including those having different Solr cores under these conditions:

Solr cores need to have the same keyspace and same Cassandra partition key.
Both Cassandra tables that support the Solr cores to be joined have to be either Thrift- or CQL-compatible. You cannot have one that is Thift-compatible and one that is CQL-compatible.
The type of the unique key (Cassandra key validator of the partition key) are the same.
The order of table partition keys and schema unique keys are the same.

DataStax Enterprise 4.5 and later provides faster DocValues-based joins than earlier versions of DataStax Enterprise, such as 4.0.2. In the earlier version, using the simplified syntax shown in the next section for a join query requires reindexing the CQL Solr core, but not the Thrift Solr core. In DataStax Enterprise, using the simplified syntax automatically takes advantage of faster joins in the case of a CQL Solr core. In the case of a Thrift Solr core, to use the simplified syntax, reindex, and in the from field of the query, use docValues=true.

Simplified syntax

This simplified syntax is recommended for joining Solr cores:

q={!join fromIndex=test.from}field:value

The custom implementation eliminates the need to use to/from parameters required by OS Solr. Based on the key structure, DataStax Enterprise can determine what the parameters are. For backward compatibility with applications, the verbose, legacy syntax is also supported.

Example of using a query time join

This example creates two tables, songs and lyrics. The tables use the same partition key. The songs table uses a simple primary key, the UUID of a song. The primary key of the songs table is its partition key. The lyrics table uses a compound primary: id and song, both of type UUID. After joining cores, you construct a single query to retrieve information about songs having lyrics that include "love".

You can copy CQL commands, Solr HTTP requests, and the query from the downloaded commands.txt file.

Download and unzip the file containing the Solr schemas, Solr configuration files, and commands for this example.
This action creates /songs and /lyrics directories, schemas, and Solr configuration files for indexing data in the songs and lyrics tables.
Start cqlsh, and then create and use a keyspace named internet.
You can copy/paste from the downloaded commands.txt file.
Create two tables, song and lyrics, that share the internet keyspace and use the same partition key.
```
cqlsh> CREATE TABLE songs (song uuid PRIMARY KEY, title text, artist text);
cqlsh> CREATE TABLE lyrics (song uuid, id uuid, words text, PRIMARY KEY (song, id));
```
Both tables share the song partition key, a uuid. The second table also contains the id clustering column.
Insert the data from the downloaded file into the songs table.
Insert data into the lyrics table.
The lyrics of songs by Big Data and John Cedrick mention love.

Navigate to the songs directory that you created in step 1, and take a look at the Solr schema.xml. Navigate to the lyrics directory and take a look at the schema. Notice that the order of the unique key in the schema and the partition key of the lyrics table are the same: (song, id). Using (id, song) does not work.

<schema name="songs_schema" version="1.5">
  <types>
     <fieldType name="uuid" class="solr.UUIDField" />
     <fieldType name="text" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
     </fieldType>
  </types>
  <fields>
    <field name="song" type="uuid" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="artist" type="text" indexed="true" stored="true"/>
  </fields>
  <defaultSearchField>artist</defaultSearchField>
  <uniqueKey>song</uniqueKey>
</schema>
   
<schema name="lyrics_schema" version="1.5">
  <types>
    <fieldType name="uuid" class="solr.UUIDField" />
    <fieldType name="text" class="solr.TextField" >
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
   </fieldType>
  </types>
    <fields>
      <field name="song" type="uuid" indexed="true" stored="true"/>
      <field name="id" type="uuid" indexed="true" stored="true"/>
      <field name="words" type="text" indexed="true" stored="true"/>
    </fields>
  <defaultSearchField>words</defaultSearchField>
  <uniqueKey>(song, id)</uniqueKey>
</schema>

In the songs directory, post solrconfig.xml and schema.xml for the internet.songs core, and create the Solr core for internet.songs.
In the lyrics directory, post the solrconfig.xml and schema.xml for the internet.lyrics core, and create the Solr core for internet.lyrics.

Search for songs that have lyrics about love.

http://localhost:8983/solr/internet.songs/select/?q={!join+fromIndex=internet.lyrics}words:love&indent=true&wt=json

The output includes two songs having the word "love" in the lyrics, one by Big Data and the other by John Cedrick:

"response":{"numFound":2,"start":0,"docs":[
     {
       "song":"a3e64f8f-bd44-4f28-b8d9-6938726e34d4",
       "title":"Dangerous",
       "artist":"Big Data"},
     {
       "song":"8a172618-b121-4136-bb10-f665cfc469eb",
       "title":"Internet Love Song",
       "artist":"John Cedrick"}]
}}

Recursive join support

You can nest a join query to use the result of one join as an input for another join, and another, recursively. All joined data must reside on the same partition. To embed one query in the Solr query string of another, use the magic field name _query_.

Use the following syntax to construct a query that recursively joins cores.

F1:V1 AND _query_:"{!join fromIndex=keyspace.table}(F2:V2 AND _query_:\"{!join fromIndex=keyspace.table}(F3:V3)\")"

Where the top level from query includes a nested join query. The nested join in this example is:

_query_:\"{!join fromIndex=keyspace.table}(F3:V3)\"

Like an SQL SELECT IN ... (SELECT IN ...) query, Solr executes the nested join queries first, enabling multiple nested join queries if required.

A Solr join query is not a relational join where the values from the nested join queries are returned in the results.

Example of a recursive join query

This example builds on the solr query time join example. Embed in the query to join songs and lyrics having words:"love" a second query to join award-winning videos using AND _query_:"award:true".

You can copy CQL commands, Solr HTTP requests, and the query from the downloaded commands.txt file.

In cqlsh, create a videos table that shares the internet keyspace and uses the same partition key as the songs and lyrics tables.
```
cqlsh> CREATE TABLE videos (song uuid, award boolean, title text, PRIMARY KEY (song));
```
All three tables use the song partition key, a uuid.
Insert the data from the downloaded file into the videos table. The video data sets the award field to true for the videos featuring songs by Big Data and Brad Paisley.
Navigate to the videos directory that was created when you unzipped the downloaded file.
In the videos directory, post solrconfig.xml and schema.xml, and create the Solr core for internet.videos.

Use a nested join query to recursively join the songs and lyrics documents with the videos document, and to select the song that mentions love and also won a video award.

http://localhost:8983/solr/internet.songs/select/?q=
         {!join+fromIndex=internet.lyrics}words:love AND _query_:
         {!join+fromIndex=internet.videos}award:true&indent=true&wt=json

Output is:

"response":{"numFound":1,"start":0,"docs":[
    {
      "song":"a3e64f8f-bd44-4f28-b8d9-6938726e34d4",
      "title":"Dangerous",
      "artist":"Big Data"}]
}}

Support for the legacy join query

DataStax Enterprise supports using the legacy syntax that includes to/from fields in the query. The requirements for using the legacy syntax are:

Tables do not use composite partition key.
The query includes the force=true local parser parameter, as shown in this example that joins mytable1 and mytable2 in mykeyspace.

Legacy syntax example

curl 'http://localhost:8983/solr/mykeyspace.mytable1/select/?q=\{!join+from=id+to=id+fromIndex=mykeyspace.mytable2+force=true\}'