Example: Explore library data

Install library data that is encoded in UTF-8 format and explore it using a Pig script.

This example uses library data from the Institute of Museum and Library Services, encoded in UTF-8 format. Download the formatted data for this example now.

DataStax Enterprise installs files that you can use to run through this example with a Pig script instead of running the Pig commands manually. Depending on the type of installation, the files are in one of the following directories:

  • Installer-Services and Package installations: /usr/share/demos/pig/cql
  • Installer-No Services and Tarball installations: install-location/demos/pig/cql
Using the files is optional. To use them, copy and paste the commands for steps 2-6 from the library-populate-cql.txt file, and run the library-cql.pig script to execute steps 7-10 automatically, as shown below.
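
For example, on a Package installation you might run the Pig portion of the example in batch mode like this (a sketch; the script path and the location of the dse command depend on your installation type):

    $ dse pig /usr/share/demos/pig/cql/library-cql.pig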

Procedure

  1. Unzip libdata.csv.zip and give yourself permission to access the downloaded file. On the Linux command line, for example:
    $ chmod 777 libdata.csv
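
    The extraction itself can be done, for example, with the standard unzip utility if it is installed:
    $ unzip libdata.csv.zip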
  2. Create and use a keyspace called libdata.
    cqlsh:libdata> CREATE KEYSPACE libdata WITH replication =
                     {'class': 'SimpleStrategy', 'replication_factor': 1 };
    
    cqlsh:libdata> USE libdata;
  3. Create a table for the library data that you downloaded.
    cqlsh:libdata> CREATE TABLE libout ("STABR" TEXT, "FSCSKEY" TEXT, "FSCS_SEQ" TEXT,
                     "LIBID" TEXT, "LIBNAME" TEXT, "ADDRESS" TEXT, "CITY" TEXT,
                     "ZIP" TEXT, "ZIP4" TEXT, "CNTY" TEXT, "PHONE" TEXT, "C_OUT_TY" TEXT,
                     "C_MSA" TEXT, "SQ_FEET" INT, "F_SQ_FT" TEXT, "L_NUM_BM" INT,
                     "F_BKMOB" TEXT, "HOURS" INT, "F_HOURS" TEXT, "WKS_OPEN" INT,
                     "F_WKSOPN" TEXT, "YR_SUB" INT, "STATSTRU" INT, "STATNAME" INT,
                     "STATADDR" INT, "LONGITUD" FLOAT, "LATITUDE" FLOAT, "FIPSST" INT,
                     "FIPSCO" INT, "FIPSPLAC" INT, "CNTYPOP" INT, "LOCALE" TEXT,
                     "CENTRACT" FLOAT, "CENBLOCK" INT, "CDCODE" TEXT, "MAT_CENT" TEXT,
                     "MAT_TYPE" INT, "CBSA" INT, "MICROF" TEXT,
                     PRIMARY KEY ("FSCSKEY", "FSCS_SEQ"));
  4. Import data into the libout table from the libdata.csv file that you downloaded.
    cqlsh:libdata> COPY libout ("STABR","FSCSKEY","FSCS_SEQ","LIBID","LIBNAME",
                     "ADDRESS","CITY","ZIP","ZIP4","CNTY","PHONE","C_OUT_TY",
                     "C_MSA","SQ_FEET","F_SQ_FT","L_NUM_BM","F_BKMOB","HOURS",
                     "F_HOURS","WKS_OPEN","F_WKSOPN","YR_SUB","STATSTRU","STATNAME",
                     "STATADDR","LONGITUD","LATITUDE","FIPSST","FIPSCO","FIPSPLAC",
                     "CNTYPOP","LOCALE","CENTRACT","CENBLOCK","CDCODE","MAT_CENT",
                     "MAT_TYPE","CBSA","MICROF") FROM 'libdata.csv' WITH HEADER=TRUE;

    In the FROM clause of the COPY command, use the path to libdata.csv in your environment.
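
    For example, if the file were saved to the hypothetical location /tmp/libdata.csv, the end of the COPY command would read:
    ... FROM '/tmp/libdata.csv' WITH HEADER=TRUE;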

  5. Check that the libout table contains the data you copied from the downloaded file.
    cqlsh:libdata> SELECT count(*) FROM libdata.libout LIMIT 20000;
    
      count
     -------
      17598
  6. Create a table to hold results of Pig relations.
    cqlsh:libdata> CREATE TABLE libsqft (
                     year INT,
                     state TEXT,
                     sqft BIGINT,
                     PRIMARY KEY (year, state)
                   );
  7. Using Pig, add a plan to load the data from the Cassandra libout table to a Pig relation.
    grunt> libdata = LOAD 'cql://libdata/libout' USING CqlNativeStorage();
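
    If you are not already at the grunt> prompt, start the Pig shell first (in DataStax Enterprise, typically with the dse pig command). After the LOAD, you can optionally confirm the schema that CqlNativeStorage mapped from the table:
    grunt> DESCRIBE libdata;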
  8. Add logic to remove data about outlet types other than books-by-mail (BM). The C_OUT_TY column uses BM and other abbreviations to identify these library outlet types:
    • CE–Central Library
    • BR–Branch Library
    • BS–Bookmobile(s)
    • BM–Books-by-Mail Only
    grunt> book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';
    grunt> DUMP book_by_mail;
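
    Optionally, to count the books-by-mail outlets instead of inspecting every dumped row, you could add a grouping and COUNT step; this is not part of the demo script:
    grunt> bm_group = GROUP book_by_mail ALL;
    grunt> bm_count = FOREACH bm_group GENERATE COUNT(book_by_mail);
    grunt> DUMP bm_count;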
  9. Add logic to filter out library data that is missing building size data, define the schema for libdata_buildings, and group the data by state. The STABR column contains the state codes. GROUP creates the state_grouped relation, and Pig gives the grouping field the default alias group. Finally, process each row to generate a derived set of rows that aggregate the square footage of each state group.
    grunt> libdata_buildings = FILTER libdata BY SQ_FEET > 0;
    grunt> state_flat = FOREACH libdata_buildings GENERATE STABR AS State, SQ_FEET AS SquareFeet;
    grunt> state_grouped = GROUP state_flat BY State;
    grunt> state_footage = FOREACH state_grouped GENERATE
                             group AS State, SUM(state_flat.SquareFeet)
                             AS TotalFeet:int;
    grunt> DUMP state_footage;

    The MapReduce job completes successfully, and the output shows the total square footage of library buildings in each state.

    . . .
    (UT,1510353)
    (VA,4192931)
    (VI,31875)
    (VT,722629)
    (WA,3424639)
    (WI,5661236)
    (WV,1075356)
    (WY,724821)
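
    The order of the dumped tuples is not guaranteed. If you want the states listed alphabetically, you can optionally sort the relation before dumping it (not part of the demo script):
    grunt> state_sorted = ORDER state_footage BY State ASC;
    grunt> DUMP state_sorted;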
  10. Add logic to format the data as year, state, and square footage values, and save the relation to Cassandra using the cql:// URL. The URL includes a prepared statement, described below.
    grunt> insert_format = FOREACH state_footage GENERATE
             TOTUPLE(TOTUPLE('year',2011),TOTUPLE('state',State)),TOTUPLE(TotalFeet);
    grunt> STORE insert_format INTO
             'cql://libdata/libsqft?output_query=UPDATE%20libdata.libsqft%20USING%20TTL%20300%20SET%20sqft%20%3D%20%3F'
             USING CqlNativeStorage;
    The prepared statement includes a TTL of 300 seconds, which causes the data to expire five minutes after it is written. Decoded, the prepared statement looks like this:
    UPDATE libdata.libsqft USING TTL 300 SET sqft = ?
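
    After the STORE completes, you can optionally verify from cqlsh that the TTL was applied by checking the remaining time to live on one of the stored values (the AK row here is just an example):
    cqlsh> SELECT TTL(sqft) FROM libdata.libsqft WHERE year = 2011 AND state = 'AK';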
  11. In CQL, query the libsqft table to see the Pig results now stored in Cassandra.
    cqlsh> SELECT * FROM libdata.libsqft;
    
     year | state | sqft
    ------+-------+----------
     2011 |    AK |   570178
     2011 |    AL |  2792246
    
     . . .
    
     2011 |    WV |  1075356
     2011 |    WY |   724821