Example: Explore library data
Install library data that is encoded in UTF-8 format and use a pig script.
This example uses library data from the Institute of Library and Museum Services, encoded in UTF-8 format. Download the formatted data for this example now. You can use two files installed with DataStax Enterprise to copy/paste code and run a pig script instead of stepping through this example manually.
- Packaged installs: /usr/share/dse-demos/pig/cql
- You can copy/paste the commands in steps 2-3 from this document or from the library-populate-cql.txt file.
- You can execute steps 7-10 automatically by running the library-cql.pig script.
- Tarball installs: install-location/demos/pig/cql
Procedure
-
Unzip libdata.csv.zip and give yourself permission to access
the downloaded file. On the Linux command line, for example:
$ chmod 777 libdata.csv
-
Create and use a keyspace called libdata.
cqlsh:libdata> CREATE KEYSPACE libdata WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; cqlsh:libdata> USE libdata;
-
Create a table for the library data that you downloaded.
cqlsh:libdata> CREATE TABLE libout ("STABR" TEXT, "FSCSKEY" TEXT, "FSCS_SEQ" TEXT, "LIBID" TEXT, "LIBNAME" TEXT, "ADDRESS" TEXT, "CITY" TEXT, "ZIP" TEXT, "ZIP4" TEXT, "CNTY" TEXT, "PHONE" TEXT, "C_OUT_TY" TEXT, "C_MSA" TEXT, "SQ_FEET" INT, "F_SQ_FT" TEXT, "L_NUM_BM" INT, "F_BKMOB" TEXT, "HOURS" INT, "F_HOURS" TEXT, "WKS_OPEN" INT, "F_WKSOPN" TEXT, "YR_SUB" INT, "STATSTRU" INT, "STATNAME" INT, "STATADDR" INT, "LONGITUD" FLOAT, "LATITUDE" FLOAT, "FIPSST" INT, "FIPSCO" INT, "FIPSPLAC" INT, "CNTYPOP" INT, "LOCALE" TEXT, "CENTRACT" FLOAT, "CENBLOCK" INT, "CDCODE" TEXT, "MAT_CENT" TEXT, "MAT_TYPE" INT, "CBSA" INT, "MICROF" TEXT, PRIMARY KEY ("FSCSKEY", "FSCS_SEQ"));
-
Import data into the libout table from the libdata.csv
file that you downloaded.
cqlsh:libdata> COPY libout ("STABR","FSCSKEY","FSCS_SEQ","LIBID","LIBNAME", "ADDRESS","CITY","ZIP","ZIP4","CNTY","PHONE","C_OUT_TY", "C_MSA","SQ_FEET","F_SQ_FT","L_NUM_BM","F_BKMOB","HOURS", "F_HOURS","WKS_OPEN","F_WKSOPN","YR_SUB","STATSTRU","STATNAME", "STATADDR","LONGITUD","LATITUDE","FIPSST","FIPSCO","FIPSPLAC", "CNTYPOP","LOCALE","CENTRACT","CENBLOCK","CDCODE","MAT_CENT", "MAT_TYPE","CBSA","MICROF") FROM 'libdata.csv' WITH HEADER=TRUE;
In the FROM clause of the COPY command, use the path to libdata.csv in your environment.
-
Check that the libout table contains the data you copied from the downloaded
file.
cqlsh:libdata> SELECT count(*) FROM libdata.libout LIMIT 20000; count ------- 17598
-
Create a table to hold results of Pig relations.
cqlsh:libdata> CREATE TABLE libsqft ( year INT, state TEXT, sqft BIGINT, PRIMARY KEY (year, state) );
-
Using Pig, add a plan to load the data from the Cassandra libout table to a Pig
relation.
grunt> libdata = LOAD 'cql://libdata/libout' USING CqlStorage();
-
Add logic to remove data about outlet types other than books-by-mail (BM). The
C_OUT_TY column uses BM and other abbreviations to identify these library outlet
types:
- CE–Central Library
- BR–Branch Library
- BS–Bookmobile(s)
- BM–Books-by-Mail Only
grunt> book_by_mail = FILTER libdata BY C_OUT_TY == 'BM'; grunt> DUMP book_by_mail;
-
Add logic to filter out the library data that has missing building size data,
define the schema for libdata_buildings, and group data by state. The STABR
column contains the state codes. GROUP creates the state_grouped relation. Pig
gives the grouping field the default alias group. Process each row to
generate a derived set of rows that aggregate the square footage of each state
group.
grunt> libdata_buildings = FILTER libdata BY SQ_FEET > 0; grunt> state_flat = FOREACH libdata_buildings GENERATE STABR AS State,SQ_FEET AS SquareFeet; grunt> state_grouped = GROUP state_flat BY State; grunt> state_footage = FOREACH state_grouped GENERATE group as State,SUM(state_flat.SquareFeet) AS TotalFeet:int; grunt> DUMP state_footage;
The MapReduce job completes successfully and the output shows the square footage of the buildings.
. . . (UT,1510353) (VA,4192931) (VI,31875) (VT,722629) (WA,3424639) (WI,5661236) (WV,1075356) (WY,724821)
-
Add logic to filter the data by year, state, and building size, and save the
relation to Cassandra using the cql:// URL. The URL includes a prepared
statement, described
later.
grunt> insert_format= FOREACH state_footage GENERATE TOTUPLE(TOTUPLE('year',2011),TOTUPLE('state',State)),TOTUPLE(TotalFeet); grunt> STORE insert_format INTO 'cql://libdata/libsqft?output_query=UPDATE%20libdata. libsqft%20SET%20sqft%20%3D%20%3F' USING CqlStorage;
When the MapReduce job completes, a message appears that the records were written successfully.
-
In CQL, query the libsqft table to see the Pig results now stored in
Cassandra.
SELECT * FROM libdata.libsqft;
year | state | sqft ------+-------+---------- 2011 | AK | 570178 2011 | AL | 2792246 . . . 2011 | WV | 1075356 2011 | WY | 724821