Getting started with Solr in DataStax Enterprise

DataStax Enterprise supports Open Source Solr (OSS) tools and APIs, simplifying migration from Solr to DataStax Enterprise.

DataStax Enterprise is built on top of Solr 4.6. By default, DataStax Enterprise turns off virtual nodes (vnodes). If want to you use vnodes on Solr nodes, DataStax recommends a range of 32 to 64 vnodes, which increases overhead by approximately 30%.

Introduction to Solr

The Apache Lucene project, Solr features robust full-text search, hit highlighting, and rich document (PDF, Microsoft Word, and so on) handling. Solr also provides more advanced features like aggregation, grouping, and geo-spatial search. Solr powers the search and navigation features of many of the largest Internet sites. Near real-time indexing can be performed.

Solr integration into DataStax Enterprise

The unique combination of Cassandra, DSE Search with Solr, and DSE Hadoop bridges the gap between online transaction processing (OLTP) and online analytical processing (OLAP). DSE Search in Cassandra offers a way to aggregate and look at data in many different ways in real-time using the Solr HTTP API or the Cassandra Query Language (CQL). Cassandra speed can compensate for typical MapReduce performance problems. By integrating Solr into the DataStax Enterprise big data platform, DataStax extends Solr’s capabilities.

DSE Search is easily scalable. You add search capacity to your cluster in the same way as you add Hadoop or Cassandra capacity to your cluster. You can have a hybrid cluster of nodes, provided the Solr nodes are in a separate data center, some running Cassandra, some running search, and some running Hadoop. If you have a hybrid cluster of nodes, follow instructions on isolating workloads. If you don't need Cassandra or Hadoop, migrate to DSE strictly for Solr and create an exclusively Solr cluster. You can use one data center and run Solr on all nodes.

Storage of indexes and data

Data that is added to Cassandra is locally indexed in Solr. Data that is added to Solr is locally indexed in Cassandra. Data changes to one also apply to the other. Solr has its own set of data files. Like Cassandra data files, you can control where the Solr data files are saved on the server. See Managing the location of Solr data.

The solr implementation in DataStax Enterprise does not support JBOD mode. Indexes are stored on the local disk inside Lucene, actual data is stored in Cassandra.

Sources of information about Solr

Covering all of the features of OSS is beyond the scope of DataStax Enterprise documentation. Because DSE Search/Solr supports all Solr tools and APIs, see the Solr documentation for more information, such as how to construct Solr query strings to retrieve indexed data. Additional information is available, including the following resources:

Apache Solr documentation
Solr Tutorial on the Solr site
Solr Tutorial on Apache Lucene site
Solr data import handler
Comma-Separated-Values (CSV) file importer
JSON importer
Solr cell project, including a tool for importing data from PDFs
Solr 4.x Deep Dive tutorial and reference book by Jack Krupansky

Benefits of using Solr in DataStax Enterprise

Solr offers real-time querying of files. Search indexes remain tightly in line with live data. There are significant benefits of running your enterprise search functions through DataStax Enterprise instead of OSS, including:

A fully fault-tolerant, no-single-point-of-failure search architecture
Linear performance scalability--add new search nodes online
Automatic indexing of data ingested into Cassandra
Automatic and transparent data replication
Isolation of all real-time, Hadoop, and search/Solr workloads to prevent competition between workloads for either compute resources or data
The capability to read/write to any Solr node, which overcomes the Solr write bottleneck
Selective updates of one or more individual fields (a full re-index operation is still required)
Search indexes that can span multiple data centers (OSS cannot)
Solr/search support for CQL for querying Solr documents (Solr HTTP API may also be used)
Creation of Solr indexes from existing tables created with CQL