Example: Explore library data

Install library data that is encoded in UTF-8 format and explore it using a Pig script.

This example uses library data from the Institute of Museum and Library Services, encoded in UTF-8 format. Download the formatted data for this example now.

DataStax Enterprise installs files that you can use to run through this example with a Pig script instead of running the Pig commands manually. Depending on the type of installation, the files are in one of the following directories:

  • Installer-Services and Package installations: /usr/share/demos/pig/cql
  • Installer-No Services and Tarball installations: install-location/demos/pig/cql
Using the files is optional. To use them, copy and paste the commands for steps 2-6 from the library-populate-cql.txt file, and run the library-cql.pig script to execute steps 7-10 automatically, as shown below.
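
For example, on a Package installation you might run the Pig portion of the example in batch mode like this (a sketch; the script path and the location of the dse command depend on your installation type):

    $ dse pig /usr/share/demos/pig/cql/library-cql.pig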

Procedure

  1. Unzip libdata.csv.zip and give yourself permission to access the downloaded file. On the Linux command line, for example:
    $ chmod 777 libdata.csv
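
    The extraction itself can be done, for example, with the standard unzip utility if it is installed:
    $ unzip libdata.csv.zip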
  2. Create and use a keyspace called libdata.
    cqlsh:libdata> CREATE KEYSPACE libdata WITH replication =
                     {'class': 'SimpleStrategy', 'replication_factor': 1 };
    
    cqlsh:libdata> USE libdata;
  3. Create a table for the library data that you downloaded.
    cqlsh:libdata> CREATE TABLE libout ("STABR" TEXT, "FSCSKEY" TEXT, "FSCS_SEQ" TEXT,
                     "LIBID" TEXT, "LIBNAME" TEXT, "ADDRESS" TEXT, "CITY" TEXT,
                     "ZIP" TEXT, "ZIP4" TEXT, "CNTY" TEXT, "PHONE" TEXT, "C_OUT_TY" TEXT,
                     "C_MSA" TEXT, "SQ_FEET" INT, "F_SQ_FT" TEXT, "L_NUM_BM" INT,
                     "F_BKMOB" TEXT, "HOURS" INT, "F_HOURS" TEXT, "WKS_OPEN" INT,
                     "F_WKSOPN" TEXT, "YR_SUB" INT, "STATSTRU" INT, "STATNAME" INT,
                     "STATADDR" INT, "LONGITUD" FLOAT, "LATITUDE" FLOAT, "FIPSST" INT,
                     "FIPSCO" INT, "FIPSPLAC" INT, "CNTYPOP" INT, "LOCALE" TEXT,
                     "CENTRACT" FLOAT, "CENBLOCK" INT, "CDCODE" TEXT, "MAT_CENT" TEXT,
                     "MAT_TYPE" INT, "CBSA" INT, "MICROF" TEXT,
                     PRIMARY KEY ("FSCSKEY", "FSCS_SEQ"));
  4. Import data into the libout table from the libdata.csv file that you downloaded.
    cqlsh:libdata> COPY libout ("STABR","FSCSKEY","FSCS_SEQ","LIBID","LIBNAME",
                     "ADDRESS","CITY","ZIP","ZIP4","CNTY","PHONE","C_OUT_TY",
                     "C_MSA","SQ_FEET","F_SQ_FT","L_NUM_BM","F_BKMOB","HOURS",
                     "F_HOURS","WKS_OPEN","F_WKSOPN","YR_SUB","STATSTRU","STATNAME",
                     "STATADDR","LONGITUD","LATITUDE","FIPSST","FIPSCO","FIPSPLAC",
                     "CNTYPOP","LOCALE","CENTRACT","CENBLOCK","CDCODE","MAT_CENT",
                     "MAT_TYPE","CBSA","MICROF") FROM 'libdata.csv' WITH HEADER=TRUE;

    In the FROM clause of the COPY command, use the path to libdata.csv in your environment.
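
    For example, if the file were saved to the hypothetical location /tmp/libdata.csv, the end of the COPY command would read:
    ... FROM '/tmp/libdata.csv' WITH HEADER=TRUE;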

  5. Check that the libout table contains the data you copied from the downloaded file.
    cqlsh:libdata> SELECT count(*) FROM libdata.libout LIMIT 20000;
    
      count
     -------
      17598
  6. Create a table to hold results of Pig relations.
    cqlsh:libdata> CREATE TABLE libsqft (
                     year INT,
                     state TEXT,
                     sqft BIGINT,
                     PRIMARY KEY (year, state)
                   );
  7. Using Pig, add a plan to load the data from the Cassandra libout table to a Pig relation.
    grunt> libdata = LOAD 'cql://libdata/libout' USING CqlNativeStorage();
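
    If you are not already at the grunt> prompt, start the Pig shell first (in DataStax Enterprise, typically with the dse pig command). After the LOAD, you can optionally confirm the schema that CqlNativeStorage mapped from the table:
    grunt> DESCRIBE libdata;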
  8. Add logic to remove data about outlet types other than books-by-mail (BM). The C_OUT_TY column uses BM and other abbreviations to identify these library outlet types:
    • CE–Central Library
    • BR–Branch Library
    • BS–Bookmobile(s)
    • BM–Books-by-Mail Only
    grunt> book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';
    grunt> DUMP book_by_mail;
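
    Optionally, to count the books-by-mail outlets instead of inspecting every dumped row, you could add a grouping and COUNT step; this is not part of the demo script:
    grunt> bm_group = GROUP book_by_mail ALL;
    grunt> bm_count = FOREACH bm_group GENERATE COUNT(book_by_mail);
    grunt> DUMP bm_count;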
  9. Add logic to filter out library data that is missing building size data, define the schema for libdata_buildings, and group the data by state. The STABR column contains the state codes. GROUP creates the state_grouped relation, and Pig gives the grouping field the default alias group. Finally, process each row to generate a derived set of rows that aggregate the square footage of each state group.
    grunt> libdata_buildings = FILTER libdata BY SQ_FEET > 0;
    grunt> state_flat = FOREACH libdata_buildings GENERATE STABR AS State, SQ_FEET AS SquareFeet;
    grunt> state_grouped = GROUP state_flat BY State;
    grunt> state_footage = FOREACH state_grouped GENERATE
                             group AS State, SUM(state_flat.SquareFeet)
                             AS TotalFeet:int;
    grunt> DUMP state_footage;

    The MapReduce job completes successfully, and the output shows the total square footage of library buildings in each state.

    . . .
    (UT,1510353)
    (VA,4192931)
    (VI,31875)
    (VT,722629)
    (WA,3424639)
    (WI,5661236)
    (WV,1075356)
    (WY,724821)
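
    The order of the dumped tuples is not guaranteed. If you want the states listed alphabetically, you can optionally sort the relation before dumping it (not part of the demo script):
    grunt> state_sorted = ORDER state_footage BY State ASC;
    grunt> DUMP state_sorted;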
  10. Add logic to format the data as year, state, and square footage values, and save the relation to Cassandra using the cql:// URL. The URL includes a prepared statement, described below.
    grunt> insert_format = FOREACH state_footage GENERATE
             TOTUPLE(TOTUPLE('year',2011),TOTUPLE('state',State)),TOTUPLE(TotalFeet);
    grunt> STORE insert_format INTO
             'cql://libdata/libsqft?output_query=UPDATE%20libdata.libsqft%20USING%20TTL%20300%20SET%20sqft%20%3D%20%3F'
             USING CqlNativeStorage;
    The prepared statement includes a TTL of 300 seconds, which causes the data to expire five minutes after it is written. Decoded, the prepared statement looks like this:
    UPDATE libdata.libsqft USING TTL 300 SET sqft = ?
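
    After the STORE completes, you can optionally verify from cqlsh that the TTL was applied by checking the remaining time to live on one of the stored values (the AK row here is just an example):
    cqlsh> SELECT TTL(sqft) FROM libdata.libsqft WHERE year = 2011 AND state = 'AK';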
  11. In CQL, query the libsqft table to see the Pig results now stored in Cassandra.
    cqlsh> SELECT * FROM libdata.libsqft;
    
     year | state | sqft
    ------+-------+----------
     2011 |    AK |   570178
     2011 |    AL |  2792246
    
     . . .
    
     2011 |    WV |  1075356
     2011 |    WY |   724821