Multi-faceted search using healthcare data

This quickstart example provides an overview of creating and altering search indexes using CQL index management commands.

Prerequisites

  1. Create a demo keyspace with a replication factor of at least 1 in the search datacenter, see Creating a demo keyspace for tutorials.

  2. Download the health_data.csv onto a search node.

Procedure

  1. Launch cqlsh on a search node:

    1. Determine which nodes in the cluster are running a search workload:

      dsetool status
      DSE Search operations are available only on search-enabled nodes. DataStax recommends single workload datacenters.

      The following example shows a development environment where all nodes in the cluster are in the same physical location, on the same rack, and the nodes have been separated into datacenters based on their workloads:

      DC: Main       Workload: Cassandra       Graph: no
      ======================================================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --   Address          Load             Owns               VNodes       Rack         Health [0,1]
      UN   10.10.10.111   15.51 MiB        ?                    8            rack1        0.90
      UN   10.10.10.113   19.51 MiB        ?                    8            rack1        0.90
      
      DC: Search            Workload: Search          Graph: no
      ======================================================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --   Address          Load             Owns               VNodes       Rack         Health [0,1]
      UN   10.10.10.108   18.13 MiB        ?                    8            rack1        0.90
      UN   10.10.10.110   17.4 MiB         ?                    8            rack1        0.90
    2. Launch a cqlsh session on a search node from the directory that contains the health_data.csv:

      cd ~ && cqlsh -k demo

      A CQL sessions starts using the demo keyspace.

      Connected to cluster1 at 10.10.10.108:9042.
      [cqlsh 5.0.1 | Cassandra 3.11.0.1805 | DSE 5.1.3 | CQL spec 3.4.4 | Native protocol v4]
      Use HELP for help.
      cqlsh:demo>
      The active keyspace name appears in the cqlsh prompt.
  2. Set up the health_data table with data:

    1. Create the table:

      USE healthcare;
      
      // tag::create-table[]
      CREATE TABLE IF NOT EXISTS healthcare.health_data (
      "id" INT,
      "num_smokers" INT,
      "age" INT,
      "age_unit" VARCHAR,
      "age_months" INT,
      "major_medical_coverage" VARCHAR,
      "dental_coverage" VARCHAR,
      "routine_medical_coverage" VARCHAR,
      "employer_paid_plan" VARCHAR,
      "secondary_smoke" VARCHAR,
      "county" VARCHAR,
      "screening_month" VARCHAR,
      "pets" VARCHAR,
      "asthma" VARCHAR,
      "bronchitis" VARCHAR,
      "goiter" VARCHAR,
      "hay_fever" VARCHAR,
      "thyroid_disease" VARCHAR,
      "chronic_bronchitis" VARCHAR,
      "diagnosed_asthma" VARCHAR,
      "diagnosed_cataracts" VARCHAR,
      "diagnosed_emphysema" VARCHAR,
      "diagnosed_goiter" VARCHAR,
      "diagnosed_gout" VARCHAR,
      "diagnosed_hay_fever" VARCHAR,
      "diagnosed_lupus" VARCHAR,
      "diagnosed_other_cancer" VARCHAR,
      "diagnosed_skin_cancer" VARCHAR,
      "diagnosed_stroke" VARCHAR,
      "diagnosed_thyroid_disease" VARCHAR,
      "diagnosed_congestive_heart_failure" VARCHAR,
      "ethnicity" VARCHAR,
      "exam_status" VARCHAR,
      "family_sequence" INT,
      "family_size" INT,
      "fips" VARCHAR,
      "grade_completed" VARCHAR,
      "household_size" INT,
      "health_status" VARCHAR,
      "marital_status" VARCHAR,
      "bird" VARCHAR,
      "cat" VARCHAR,
      "dog" VARCHAR,
      "fish" VARCHAR,
      "other_pet" VARCHAR,
      "race" VARCHAR,
      "race_ethnicity" VARCHAR,
      "gender" VARCHAR,
      "birthplace" VARCHAR,
      "annual_income_20000" VARCHAR,
      "income_group" INT,
      "monthly_income_total" INT,
      PRIMARY KEY ("id", "age"))
      WITH gc_grace_seconds = 0;
      // end::create-table[]
      After loading data that contains null values, temporarily set the grace period to zero to clean up tombstones.
    2. Load data from the CSV file:

      COPY health_data (
         "id",
         "num_smokers",
         "age",
         "age_unit",
         "age_months",
         "major_medical_coverage",
         "dental_coverage",
         "routine_medical_coverage",
         "employer_paid_plan",
         "secondary_smoke",
         "county",
         "screening_month",
         "pets",
         "asthma",
         "bronchitis",
         "goiter",
         "hay_fever",
         "thyroid_disease",
         "chronic_bronchitis",
         "diagnosed_asthma",
         "diagnosed_cataracts",
         "diagnosed_emphysema",
         "diagnosed_goiter",
         "diagnosed_gout",
         "diagnosed_hay_fever",
         "diagnosed_lupus",
         "diagnosed_other_cancer",
         "diagnosed_skin_cancer",
         "diagnosed_stroke",
         "diagnosed_thyroid_disease",
         "diagnosed_congestive_heart_failure",
         "ethnicity",
         "exam_status",
         "family_sequence",
         "family_size",
         "fips",
         "grade_completed",
         "household_size",
         "health_status",
         "marital_status",
         "bird",
         "cat",
         "dog",
         "fish",
         "other_pet",
         "race",
         "race_ethnicity",
         "gender",
         "birthplace",
         "annual_income_20000",
         "income_group",
         "monthly_income_total") 
      FROM 'health_data.csv';
    3. Verify the number of records:

      SELECT COUNT(*) FROM demo.health_data ;
       count
      -------
       20050
      
      (1 rows)
      
      Warnings :
      Aggregation query used without partition key
  3. Create the search index:

    CREATE SEARCH INDEX ON demo.health_data
    WITH COLUMNS * {excluded:false}, age_months, monthly_income_total {excluded: true};
    • Only columns identified in the COLUMNS options are included. All columns are included when this option is omitted.

  4. Display the schema:

    DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON demo.health_data;
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <schema name="autoSolrSchema" version="1.5">
      <types>
        <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
        <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
      </types>
      <fields>
        <field indexed="true" multiValued="false" name="grade_completed" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_thyroid_disease" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="pets" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="secondary_smoke" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_lupus" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="gender" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="birthplace" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="income_group" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="marital_status" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="age_months" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="bird" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="hay_fever" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_hay_fever" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="routine_medical_coverage" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="annual_income_20000" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="exam_status" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="other_pet" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_stroke" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="employer_paid_plan" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="family_sequence" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="diagnosed_cataracts" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="major_medical_coverage" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_gout" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="age_unit" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="goiter" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="chronic_bronchitis" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="county" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="num_smokers" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="screening_month" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_emphysema" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_other_cancer" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="id" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="dental_coverage" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="health_status" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="fish" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="dog" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="asthma" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="ethnicity" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="age" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="diagnosed_asthma" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="race_ethnicity" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_congestive_heart_failure" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="family_size" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="race" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="thyroid_disease" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="bronchitis" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="household_size" stored="true" type="TrieIntField"/>
        <field indexed="true" multiValued="false" name="cat" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_goiter" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="diagnosed_skin_cancer" stored="true" type="StrField"/>
        <field indexed="true" multiValued="false" name="fips" stored="true" type="StrField"/>
      </fields>
      <uniqueKey>(id,age)</uniqueKey>
    </schema>

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com