Multi-faceted search using healthcare data
This quick start example provides an overview of creating and altering search indexes using CQL index management commands.
Find more information about this example in the healthcare tutorial repo in GitHub.
Prerequisites
-
Create a
healthcarekeyspace with a replication factor of at least 1 in the search datacenter, see Creating a healthcare keyspace for tutorials. -
Download the
health_data.csvonto a search node.
Procedure
|
This procedure was written for DSE 5.1 but can also be performed with DSE 6.8. |
-
Launch cqlsh on a search node:
-
Determine which nodes in the cluster are running a search workload:
dsetool statusDSE Search operations are available only on search-enabled nodes. DataStax recommends single workload datacenters.
The following example shows a development environment where all nodes in the cluster are in the same physical location, on the same rack, and the nodes have been separated into datacenters based on their workloads:
DC: Main Workload: Cassandra Graph: no ====================================================== Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Owns VNodes Rack Health [0,1] UN 10.10.10.111 15.51 MiB ? 8 rack1 0.90 UN 10.10.10.113 19.51 MiB ? 8 rack1 0.90 DC: Search Workload: Search Graph: no ====================================================== Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Owns VNodes Rack Health [0,1] UN 10.10.10.108 18.13 MiB ? 8 rack1 0.90 UN 10.10.10.110 17.4 MiB ? 8 rack1 0.90 -
Launch a
cqlshsession on a search node from the directory that contains thehealth_data.csv:cd ~ && cqlsh -k healthcareA CQL sessions starts using the demo keyspace.
Connected to cluster1 at 10.10.10.108:9042. [cqlsh 5.0.1 | Cassandra 3.11.0.1805 | {product-short} 5.1.3 | CQL spec 3.4.4 | Native protocol v4] Use HELP for help. cqlsh:healthcare>The active keyspace name appears in the
cqlshprompt.
-
-
Set up the
health_datatable with data:-
Create the table:
USE healthcare; // START-create-table CREATE TABLE IF NOT EXISTS healthcare.health_data ( "id" INT, "num_smokers" INT, "age" INT, "age_unit" VARCHAR, "age_months" INT, "major_medical_coverage" VARCHAR, "dental_coverage" VARCHAR, "routine_medical_coverage" VARCHAR, "employer_paid_plan" VARCHAR, "secondary_smoke" VARCHAR, "county" VARCHAR, "screening_month" VARCHAR, "pets" VARCHAR, "asthma" VARCHAR, "bronchitis" VARCHAR, "goiter" VARCHAR, "hay_fever" VARCHAR, "thyroid_disease" VARCHAR, "chronic_bronchitis" VARCHAR, "diagnosed_asthma" VARCHAR, "diagnosed_cataracts" VARCHAR, "diagnosed_emphysema" VARCHAR, "diagnosed_goiter" VARCHAR, "diagnosed_gout" VARCHAR, "diagnosed_hay_fever" VARCHAR, "diagnosed_lupus" VARCHAR, "diagnosed_other_cancer" VARCHAR, "diagnosed_skin_cancer" VARCHAR, "diagnosed_stroke" VARCHAR, "diagnosed_thyroid_disease" VARCHAR, "diagnosed_congestive_heart_failure" VARCHAR, "ethnicity" VARCHAR, "exam_status" VARCHAR, "family_sequence" INT, "family_size" INT, "fips" VARCHAR, "grade_completed" VARCHAR, "household_size" INT, "health_status" VARCHAR, "marital_status" VARCHAR, "bird" VARCHAR, "cat" VARCHAR, "dog" VARCHAR, "fish" VARCHAR, "other_pet" VARCHAR, "race" VARCHAR, "race_ethnicity" VARCHAR, "gender" VARCHAR, "birthplace" VARCHAR, "annual_income_20000" VARCHAR, "income_group" INT, "monthly_income_total" INT, PRIMARY KEY ("id", "age")) WITH gc_grace_seconds = 0; // END-create-tableAfter loading data that contains null values, temporarily set the grace period to zero to clean up tombstones.
-
Use DataStax Bulk Loader to load data from the
health_data.csvfile:dsbulk load -f dsbulkConfigFile.conf -url health_data.csv -k healthcare -t health_data \ -header false --schema.allowExtraFields true --schema.allowMissingFields trueIf the
health_data.csvis not in the directory where you launchcqlsh, specify the full path to the file. Also, thedsbulkConfigFile.conffile referenced in thedsbulkcommand includes theschema.mappingdefinition for thehealth_datatable.For example:
schema.mapping = "0=id, 1=num_smokers, 2=age, 3=age_unit, 4=age_months, 5=major_medical_coverage, 6=dental_coverage, 7=routine_medical_coverage, 8=employer_paid_plan, 9=secondary_smoke, 10=county, 11=screening_month, 12=pets, ... 50=income_group, 51=monthly_income_total"
The script loads 20050 rows.
Operation directory: /home/myaccount/search-demos/healthcare_demo/logs/LOAD_20190308-194845-563653 total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms | batches 20,050 | 0 | 2,295 | 0.62 | 0.28 | 83.97 | 274.73 | 333.45 | 1.00 Operation LOAD_20190308-194845-563653 completed successfully in 8 seconds. Last processed positions can be found in positions.txt -
Verify the number of records:
SELECT COUNT(*) FROM healthcare.health_data ;count ------- 20050 (1 rows) Warnings : Aggregation query used without partition key
-
-
Create the search index:
CREATE SEARCH INDEX ON healthcare.health_data WITH COLUMNS * {excluded:false}, age_months, monthly_income_total {excluded: true} ;-
Only columns identified in the
COLUMNSoptions are included. All columns are included when this option is omitted.
-
-
Display the schema:
DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON healthcare.health_data;<?xml version="1.0" encoding="UTF-8" standalone="no"?> <schema name="autoSolrSchema" version="1.5"> <types> <fieldType class="org.apache.solr.schema.StrField" name="StrField"/> <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/> </types> <fields> <field indexed="true" multiValued="false" name="grade_completed" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_thyroid_disease" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="pets" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="secondary_smoke" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_lupus" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="gender" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="birthplace" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="income_group" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="marital_status" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="age_months" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="bird" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="hay_fever" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_hay_fever" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="routine_medical_coverage" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="annual_income_20000" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="exam_status" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="other_pet" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_stroke" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="employer_paid_plan" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="family_sequence" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="diagnosed_cataracts" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="major_medical_coverage" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_gout" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="age_unit" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="goiter" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="chronic_bronchitis" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="county" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="num_smokers" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="screening_month" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_emphysema" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_other_cancer" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="id" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="dental_coverage" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="health_status" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="fish" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="dog" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="asthma" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="ethnicity" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="age" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="diagnosed_asthma" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="race_ethnicity" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_congestive_heart_failure" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="family_size" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="race" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="thyroid_disease" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="bronchitis" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="household_size" stored="true" type="TrieIntField"/> <field indexed="true" multiValued="false" name="cat" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_goiter" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="diagnosed_skin_cancer" stored="true" type="StrField"/> <field indexed="true" multiValued="false" name="fips" stored="true" type="StrField"/> </fields> <uniqueKey>(id,age)</uniqueKey> </schema>