Use DataStax Enterprise with the integrated GIS Tools for Hadoop and custom
ESRI-import tool.
This example shows how to use DataStax Enterprise with the integrated GIS Tools for
Hadoop and custom ESRI-import tool for the following tasks:
- Create a CQL table to accommodate ESRI earthquake data.
- Load ESRI earthquake data from a CSV file into Cassandra.
- Load county geographic information from a JSON file into Hive.
- Analyze the data to determine the location of earthquakes.
The example assumes that you started DataStax
Enterprise as a Hadoop-enabled analytics node.
Procedure
-
Download the CSV and JSON files from
the DataStax web site for this example.
-
Unzip the file into a directory.
The gis.zip file contains earthquakes.csv and california-counties.json.
-
In cqlsh, create and use a keyspace.
cqlsh> CREATE KEYSPACE gis WITH replication = {'class': 'NetworkTopologyStrategy', 'Analytics': 1 };
cqlsh> USE gis;
-
Create a schema for the earthquake data in earthquakes.csv.
cqlsh:gis> CREATE TABLE earthquakes (
datetime text PRIMARY KEY,
latitude double,
longitude double,
depth double,
magnitude double,
magtype text,
nbstations int,
gap double,
distance double,
rms double,
source text,
eventid int
);
Although the earthquake dates are in ISO 8601 format, the schema uses the text
type for the datetime column because 1898 - 2011 is outside the
timestamp type range.
-
Copy the data in the CSV file to the table using the path that you chose for
the CSV file.
cqlsh:gis> COPY earthquakes (datetime, latitude, longitude, depth, magnitude, magtype, nbstations, gap, distance, rms, source, eventid) FROM 'path/earthquakes.csv' WITH HEADER = 'true';
-
Start a Hive client.
-
From Hive, access the gis database in Cassandra.
-
In Hive, create a managed table named counties that defines a schema for the
California counties data.
hive> CREATE TABLE IF NOT EXISTS counties (
Area string,
Perimeter string,
State string,
County string,
Name string,
BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
-
Load the ESRI county data into the table. Use the path to the
california-counties.json file you downloaded.
hive> LOAD DATA LOCAL INPATH 'path/california-counties.json' OVERWRITE INTO TABLE counties;
The output looks something like
this:
Copying data from file:/Users/me/builds/dse-4.x/bin/california-counties.json
Copying file: file:/Users/me/builds/dse-4.x/bin/california-counties.json
Loading data to table gis.counties
Table gis.counties stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1028330, raw_data_size: 0]
OK
-
In Hive, create temporary functions for the geometry API calls.
hive> create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
hive> create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
-
Join the counties and earthquake tables, and query the data to determine the
number of earthquakes in each county.
hive> SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;
The MapReduce job runs, and the output
appears.
Kern 36
San Bernardino 35
Imperial 28
Inyo 20
Los Angeles 18
Monterey 14
Riverside 14
Santa Clara 12
Fresno 11
San Benito 11
San Diego 7
Santa Cruz 5
San Luis Obispo 3
Ventura 3
Orange 2
San Mateo 1