Example: Analyzing spatial data

Use DataStax Enterprise with the integrated GIS Tools for Hadoop and custom ESRI-import tool.

This example shows how to use DataStax Enterprise with the integrated GIS Tools for Hadoop and custom ESRI-import tool for the following tasks:
  • Create a CQL table to accommodate ESRI earthquake data.
  • Load ESRI earthquake data from a CSV file into Cassandra.
  • Load county geographic information from a JSON file into Hive.
  • Analyze the data to determine the location of earthquakes.

The example assumes that you started DataStax Enterprise as a Hadoop-enabled analytics node.

Procedure

  1. Download the CSV and JSON files from the DataStax web site for this example.
  2. Unzip the file into a directory.

    The gis.zip file contains earthquakes.csv and california-counties.json.

  3. In cqlsh, create and use a keyspace.
    cqlsh> CREATE KEYSPACE gis WITH replication = {'class': 'NetworkTopologyStrategy', 'Analytics': 1 };
    
    cqlsh> USE gis;
  4. Create a schema for the earthquake data in earthquakes.csv.
    cqlsh:gis> CREATE TABLE earthquakes ( 
                 datetime text PRIMARY KEY, 
                 latitude double, 
                 longitude double, 
                 depth double, 
                 magnitude double, 
                 magtype text, 
                 nbstations int, 
                 gap double, 
                 distance double, 
                 rms double, 
                 source text, 
                 eventid int 
               );
    Although the earthquake dates are in ISO 8601 format, the schema uses the text type for the datetime column because 1898 - 2011 is outside the timestamp type range.
  5. Copy the data in the CSV file to the table using the path that you chose for the CSV file.
    cqlsh:gis> COPY earthquakes (datetime, latitude, longitude, depth, magnitude, magtype, nbstations, gap, distance, rms, source, eventid) FROM 'path/earthquakes.csv' WITH HEADER = 'true';
  6. Start a Hive client.
  7. From Hive, access the gis database in Cassandra.
    hive> USE gis;
  8. In Hive, create a managed table named counties that defines a schema for the California counties data.
    
    hive> CREATE TABLE IF NOT EXISTS counties (
            Area string, 
            Perimeter string, 
            State string, 
            County string, 
            Name string, 
            BoundaryShape binary
          )
          ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
          STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  9. Load the ESRI county data into the table. Use the path to the california-counties.json file you downloaded.
    
    hive> LOAD DATA LOCAL INPATH 'path/california-counties.json' OVERWRITE INTO TABLE counties;
    The output looks something like this:
    Copying data from file:/Users/me/builds/dse-4.x/bin/california-counties.json
    Copying file: file:/Users/me/builds/dse-4.x/bin/california-counties.json
    Loading data to table gis.counties
    Table gis.counties stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1028330, raw_data_size: 0]
    OK
  10. In Hive, create temporary functions for the geometry API calls.
    hive> create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
                    
    hive> create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
  11. Join the counties and earthquake tables, and query the data to determine the number of earthquakes in each county.
    hive> SELECT counties.name, count(*) cnt FROM counties
            JOIN earthquakes
            WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
            GROUP BY counties.name
            ORDER BY cnt desc;
    The MapReduce job runs, and the output appears.
    Kern 36
    San Bernardino 35
    Imperial 28
    Inyo 20
    Los Angeles 18
    Monterey 14
    Riverside 14
    Santa Clara 12
    Fresno 11
    San Benito 11
    San Diego 7
    Santa Cruz 5
    San Luis Obispo 3
    Ventura 3
    Orange 2
    San Mateo 1