Term and phrase searches using the Wikipedia demo

The Wikipedia demo scripts automatically download 3,000+ Wikipedia articles, create a CQL keyspace and table, insert the articles, and create a search index on both the title and body columns.

Prerequisites

The demo scripts connect to Solr on localhost. Ensure that the Solr interface is accessible at 127.0.0.1:8983.
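One way to check this prerequisite from the command line is sketched below. The `solr_port_open` helper is hypothetical (it is not part of the demo scripts) and relies on bash's `/dev/tcp` feature, so it assumes the check runs in bash on the same machine as the search node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the demo scripts): succeeds if a TCP
# connection to host:port can be opened. Relies on bash's /dev/tcp feature.
solr_port_open() {
  local host="${1:-127.0.0.1}" port="${2:-8983}"
  # Open the socket in a subshell so it is closed again when the subshell exits.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if solr_port_open 127.0.0.1 8983; then
  echo "Solr port 8983 is reachable"
else
  echo "Solr port 8983 is not reachable"
fi
```

A successful connection only shows that something is listening on the port. It does not confirm that the listener is the Solr interface.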

Procedure

This procedure was written for DSE 5.1 but can also be performed with DSE 6.9.

  1. Start DataStax Enterprise as a search node.

  2. Go to <installation_directory>/demos/wikipedia.

  3. Run the script to add the Wikipedia schema:

    ./1-add-schema.sh

    This script creates the wiki keyspace with a single table, solr.

  4. To use the demo in a cluster that has more than one node, change the keyspace replication from SimpleStrategy to NetworkTopologyStrategy, and set the replication factor to 1 in each datacenter:

    cqlsh -e "ALTER KEYSPACE wiki WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra' : 1, 'Solr' : 1};"

    In this example, the cluster has two datacenters, Cassandra and Solr. Datacenter names are case sensitive.
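For clusters whose datacenters have other names, the statement can be generated rather than typed by hand. The following is a minimal sketch: `build_alter_keyspace` is a hypothetical helper, and it assumes a replication factor of 1 for every datacenter passed to it. Pass the datacenter names exactly as the cluster reports them (for example, in the `Datacenter:` lines of `nodetool status`), because they are case sensitive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an ALTER KEYSPACE statement that switches the
# given keyspace to NetworkTopologyStrategy with a factor of 1 per datacenter.
# Usage: build_alter_keyspace <keyspace> <datacenter> [<datacenter> ...]
build_alter_keyspace() {
  local keyspace="$1"; shift
  local opts="'class': 'NetworkTopologyStrategy'"
  local dc
  for dc in "$@"; do
    # Append one "'<datacenter>': 1" entry per datacenter name.
    opts="${opts}, '${dc}': 1"
  done
  printf 'ALTER KEYSPACE %s WITH replication = {%s};\n' "$keyspace" "$opts"
}

# Print the statement for the two datacenters used in this example.
build_alter_keyspace wiki Cassandra Solr
```

The printed statement can then be passed to cqlsh, for example with `cqlsh -e "$(build_alter_keyspace wiki Cassandra Solr)"`.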

  5. Load the data and index the table using the second script (2-index.sh):

    ./2-index.sh --wikifile wikipedia-sample.bz2

    More than 3,000 articles are loaded into the solr table and then indexed.

    Start indexing wikipedia...
    ------------> config properties:
    docs.file = wikipedia-sample.bz2
    keep.image.only.docs = false
    -------------------------------
    Indexed 1000
    Indexed 2000
    Indexed 3000
    Finished
    Visit http://localhost:8983/demos/wikipedia/ to see data

  6. Verify that the data was successfully loaded into the keyspace/table:

    cqlsh -e 'DESC KEYSPACE wiki; SELECT count(*) FROM wiki.solr;'

    The results show the details of the keyspace, table schema, index settings, and number of articles.

    CREATE KEYSPACE wiki WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;
    
    CREATE TABLE wiki.solr (
        id text PRIMARY KEY,
        body text,
        date text,
        solr_query text,
        title text
    ) WITH bloom_filter_fp_chance = 0.01
        AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
        AND comment = ''
        AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
        AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
        AND crc_check_chance = 1.0
        AND dclocal_read_repair_chance = 0.1
        AND default_time_to_live = 0
        AND gc_grace_seconds = 864000
        AND max_index_interval = 2048
        AND memtable_flush_period_in_ms = 0
        AND min_index_interval = 128
        AND read_repair_chance = 0.0
        AND speculative_retry = '99PERCENTILE';
    CREATE CUSTOM INDEX wiki_solr_solr_query_index ON wiki.solr (solr_query) USING 'com.datastax.bdp.search.solr.Cql3SolrSecondaryIndex';
    
     count
    -------
      3579
    
    (1 rows)
    
    Warnings :
    Aggregation query used without partition key
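
When this check runs from a script, the row count can be pulled out of the cqlsh output. The following is a rough sketch: `extract_count` is a hypothetical helper that assumes the tabular output format shown above, and the sample text piped into it is copied from that output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read cqlsh output on stdin and print the first line
# that consists only of an integer (the count row in the table above).
extract_count() {
  awk '/^[ \t]*[0-9]+[ \t]*$/ { print $1; exit }'
}

# Demonstration using the sample output from this step.
printf ' count\n-------\n  3579\n\n(1 rows)\n' | extract_count
```

Against a running node, the same helper could be used as `cqlsh -e 'SELECT count(*) FROM wiki.solr;' | extract_count`.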
  7. Start cqlsh using the wiki keyspace.

    cqlsh -k wiki

    A CQL shell session starts on localhost in the wiki keyspace.

    Connected to pw-search at 127.0.0.1:9042.
    [cqlsh 5.0.1 | Cassandra 3.11.0.1805 | DSE 5.1.3 | CQL spec 3.4.4 | Native protocol v4]
    Use HELP for help.
    cqlsh:wiki>

  8. Disable paging for faster query results on small data sets:

    PAGING off

    Paging is turned off only for the current session and is enabled again the next time cqlsh starts. Use a cqlshrc file to change the default startup parameters for cqlsh.

    Disabled Query paging.

  9. Display the solr table search index schema:

    DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON solr;

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <schema name="autoSolrSchema" version="1.5">
      <types>
        <fieldType class="org.apache.solr.schema.TextField" name="TextField">
          <analyzer>
            <tokenizer class="solr.WikipediaTokenizerFactory"/>
          </analyzer>
        </fieldType>
        <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
      </types>
      <fields>
        <field indexed="true" multiValued="false" name="body" stored="true" type="TextField"/>
        <field indexed="true" multiValued="false" name="title" stored="true" type="TextField"/>
        <field docValues="true" indexed="true" multiValued="false" name="id" stored="true" type="StrField"/>
        <field docValues="true" indexed="true" multiValued="false" name="date" stored="true" type="StrField"/>
      </fields>
      <uniqueKey>id</uniqueKey>
    </schema>

  10. Execute queries against the table using the index:

    • Return the titles of articles whose title contains the word national:

      SELECT title FROM solr WHERE solr_query='title:national';

      Seven records are returned.

       title
      --------------------------------------------------------------------------
                                            Bolivia national football team 1999
                                            Bolivia national football team 2000
                                          Kenya national under-20 football team
                                            Bolivia national football team 2001
                                            Bolivia national football team 2002
                                       Israel men's national inline hockey team
       List of French born footballers who have played for other national teams
      
      (7 rows)
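
The query above is a term search. A phrase search uses the same solr_query column with the phrase in double quotes, following standard Solr query syntax. The sketch below builds both kinds of statement from the shell. `title_query` is a hypothetical helper, the phrase national football is only an illustration, and no result counts are implied for it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a CQL SELECT that runs the given Solr query
# against the solr table (assumes the session uses the wiki keyspace).
title_query() {
  local escaped
  # Double any single quotes so the query stays inside the CQL string literal.
  escaped=$(printf '%s' "$1" | sed "s/'/''/g")
  printf "SELECT title FROM solr WHERE solr_query='%s';\n" "$escaped"
}

# Term search: titles containing the word national.
title_query 'title:national'

# Phrase search: titles containing the two words next to each other, in order.
title_query 'title:"national football"'
```

To run a generated statement, pass it to cqlsh in the wiki keyspace, for example `cqlsh -k wiki -e "$(title_query 'title:"national football"')"`. Building the statement in a function keeps the nested quoting (shell, CQL string, Solr phrase) in one place.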

Using secure cluster

Information about running the Wikipedia demo on a secure cluster.
