Multi-faceted search using healthcare data
This quick start example provides an overview of creating and altering search indexes using CQL index management commands.
Prerequisites
-
Create a
healthcarekeyspace with a replication factor of at least 1 in the search datacenter, see Creating a healthcare keyspace for tutorials. -
Download and extract
health_data.csv.ziponto a search node.
Procedure
|
This procedure was written for DSE 5.1 but can also be performed with DSE 6.8. |
-
Launch cqlsh on a search node:
-
Determine which nodes in the cluster are running a search workload:
dsetool statusDSE Search operations are available only on search-enabled nodes. DataStax recommends single workload datacenters.
The following example shows a development environment where all nodes in the cluster are in the same physical location, on the same rack, and the nodes have been separated into datacenters based on their workloads:
DC: Main Workload: Cassandra Graph: no ====================================================== Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Owns VNodes Rack Health [0,1] UN 10.10.10.111 15.51 MiB ? 8 rack1 0.90 UN 10.10.10.113 19.51 MiB ? 8 rack1 0.90 DC: Search Workload: Search Graph: no ====================================================== Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Owns VNodes Rack Health [0,1] UN 10.10.10.108 18.13 MiB ? 8 rack1 0.90 UN 10.10.10.110 17.4 MiB ? 8 rack1 0.90 -
Launch a
cqlshsession on a search node from the directory that contains thehealth_data.csv:cd ~ && cqlsh -k healthcareA CQL sessions starts using the demo keyspace.
Connected to cluster1 at 10.10.10.108:9042. [cqlsh 5.0.1 | Cassandra 3.11.0.1805 | {product-short} 5.1.3 | CQL spec 3.4.4 | Native protocol v4] Use HELP for help. cqlsh:healthcare>The active keyspace name appears in the
cqlshprompt.
-
-
Set up the
health_datatable with data:-
Create the table:
USE healthcare; // START-create-table CREATE TABLE IF NOT EXISTS healthcare.health_data ( "id" INT, "num_smokers" INT, "age" INT, "age_unit" VARCHAR, "age_months" INT, "major_medical_coverage" VARCHAR, "dental_coverage" VARCHAR, "routine_medical_coverage" VARCHAR, "employer_paid_plan" VARCHAR, "secondary_smoke" VARCHAR, "county" VARCHAR, "screening_month" VARCHAR, "pets" VARCHAR, "asthma" VARCHAR, "bronchitis" VARCHAR, "goiter" VARCHAR, "hay_fever" VARCHAR, "thyroid_disease" VARCHAR, "chronic_bronchitis" VARCHAR, "diagnosed_asthma" VARCHAR, "diagnosed_cataracts" VARCHAR, "diagnosed_emphysema" VARCHAR, "diagnosed_goiter" VARCHAR, "diagnosed_gout" VARCHAR, "diagnosed_hay_fever" VARCHAR, "diagnosed_lupus" VARCHAR, "diagnosed_other_cancer" VARCHAR, "diagnosed_skin_cancer" VARCHAR, "diagnosed_stroke" VARCHAR, "diagnosed_thyroid_disease" VARCHAR, "diagnosed_congestive_heart_failure" VARCHAR, "ethnicity" VARCHAR, "exam_status" VARCHAR, "family_sequence" INT, "family_size" INT, "fips" VARCHAR, "grade_completed" VARCHAR, "household_size" INT, "health_status" VARCHAR, "marital_status" VARCHAR, "bird" VARCHAR, "cat" VARCHAR, "dog" VARCHAR, "fish" VARCHAR, "other_pet" VARCHAR, "race" VARCHAR, "race_ethnicity" VARCHAR, "gender" VARCHAR, "birthplace" VARCHAR, "annual_income_20000" VARCHAR, "income_group" INT, "monthly_income_total" INT, PRIMARY KEY ("id", "age")) WITH gc_grace_seconds = 0; // END-create-tableAfter loading data that contains null values, temporarily set the grace period to zero to clean up tombstones.
-
Use DataStax Bulk Loader to load data from the
health_data.csvfile:dsbulk load -f dsbulkConfigFile.conf -url health_data.csv -k healthcare -t health_data \ -header false --schema.allowExtraFields true --schema.allowMissingFields trueIf the
health_data.csvis not in the directory where you launchcqlsh, specify the full path to the file. Also, thedsbulkConfigFile.conffile referenced in thedsbulkcommand includes theschema.mappingdefinition for thehealth_datatable.For example:
schema.mapping = "0=id, 1=num_smokers, 2=age, 3=age_unit, 4=age_months, 5=major_medical_coverage, 6=dental_coverage, 7=routine_medical_coverage, 8=employer_paid_plan, 9=secondary_smoke, 10=county, 11=screening_month, 12=pets, ... 50=income_group, 51=monthly_income_total"
The script loads 20050 rows.
Operation directory: /home/myaccount/search-demos/healthcare_demo/logs/LOAD_20190308-194845-563653 total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches 20,050 | 0 | 2,295 | 0.62 | 0.28 | 83.97 | 274.73 | 333.45 | 1.00 Operation LOAD_20190308-194845-563653 completed successfully in 8 seconds. Last processed positions can be found in positions.txt -
Verify the number of records:
SELECT COUNT(*) FROM healthcare.health_data ;count ------- 20050 (1 rows) Warnings : Aggregation query used without partition key
-
-
Create the search index:
CREATE SEARCH INDEX ON healthcare.health_data WITH COLUMNS * {excluded:false}, age_months, monthly_income_total {excluded: true} ;-
Only columns identified in the
COLUMNSoptions are included. All columns are included when this option is omitted.
-
-
Display the schema:
DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON healthcare.health_data;<?xml version="1.0" encoding="UTF-8" standalone="no"?> <schema name="autoSolrSchema" version="1.5"> <types> <fieldType class="org.apache.solr.schema.StrField" name="StrField"/> <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/> </types> <fields> <field indexed="true" multiValued="false" name="grade_completed" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_thyroid_disease" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="pets" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="secondary_smoke" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_lupus" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="gender" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="birthplace" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="income_group" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="marital_status" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="age_months" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="bird" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="hay_fever" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_hay_fever" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="routine_medical_coverage" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="annual_income_20000" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="exam_status" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="other_pet" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_stroke" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="employer_paid_plan" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="family_sequence" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="diagnosed_cataracts" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="major_medical_coverage" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_gout" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="age_unit" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="goiter" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="chronic_bronchitis" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="county" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="num_smokers" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="screening_month" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_emphysema" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_other_cancer" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="id" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="dental_coverage" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="health_status" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="fish" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="dog" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="asthma" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="ethnicity" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="age" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="diagnosed_asthma" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="race_ethnicity" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_congestive_heart_failure" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="family_size" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="race" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="thyroid_disease" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="bronchitis" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="household_size" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="cat" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_goiter" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_skin_cancer" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="fips" stored="true" type="StrField"/> </fields> <uniqueKey>(id,age)</uniqueKey> </schema>