Example: Explore library data

Import library data that is encoded in UTF-8 format and analyze it using a Pig script.

This example uses library data from the Institute of Museum and Library Services, encoded in UTF-8 format. Download the formatted data for this example now. Instead of stepping through this example manually, you can use two files installed with DataStax Enterprise to copy/paste code and run a Pig script.

  • Packaged installs: /usr/share/dse-demos/pig/cql
    • You can copy/paste the commands in steps 2-3 from this document or from the library-populate-cql.txt file.
    • You can execute steps 7-10 automatically by running the library-cql.pig script (an example invocation appears after this list).
  • Tarball installs: install-location/demos/pig/cql
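
For example, after you complete steps 1-6 of the procedure, one way to run the library-cql.pig script from a packaged installation is with the dse command; adjust the path to match your installation:

  $ dse pig /usr/share/dse-demos/pig/cql/library-cql.pig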

Procedure

  1. Unzip libdata.csv.zip and give yourself permission to access the downloaded file. On the Linux command line, for example:
    $ chmod 777 libdata.csv
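
    If you have not yet extracted the archive, and assuming the unzip utility is installed, a command such as the following extracts libdata.csv before you change its permissions:
    $ unzip libdata.csv.zip
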
  2. Create and use a keyspace called libdata.
    cqlsh:libdata> CREATE KEYSPACE libdata WITH replication =
                     {'class': 'SimpleStrategy', 'replication_factor': 1 };
    
    cqlsh:libdata> USE libdata;
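
    To confirm the keyspace settings, you can use the standard cqlsh DESCRIBE command:
    cqlsh:libdata> DESCRIBE KEYSPACE libdata;
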
  3. Create a table for the library data that you downloaded.
    cqlsh:libdata> CREATE TABLE libout ("STABR" TEXT, "FSCSKEY" TEXT, "FSCS_SEQ" TEXT,
                     "LIBID" TEXT, "LIBNAME" TEXT, "ADDRESS" TEXT, "CITY" TEXT,
                     "ZIP" TEXT, "ZIP4" TEXT, "CNTY" TEXT, "PHONE" TEXT, "C_OUT_TY" TEXT,
                     "C_MSA" TEXT, "SQ_FEET" INT, "F_SQ_FT" TEXT, "L_NUM_BM" INT,
                     "F_BKMOB" TEXT, "HOURS" INT, "F_HOURS" TEXT, "WKS_OPEN" INT,
                     "F_WKSOPN" TEXT, "YR_SUB" INT, "STATSTRU" INT, "STATNAME" INT,
                     "STATADDR" INT, "LONGITUD" FLOAT, "LATITUDE" FLOAT, "FIPSST" INT,
                     "FIPSCO" INT, "FIPSPLAC" INT, "CNTYPOP" INT, "LOCALE" TEXT,
                     "CENTRACT" FLOAT, "CENBLOCK" INT, "CDCODE" TEXT, "MAT_CENT" TEXT,
                     "MAT_TYPE" INT, "CBSA" INT, "MICROF" TEXT,
                     PRIMARY KEY ("FSCSKEY", "FSCS_SEQ"));
  4. Import data into the libout table from the libdata.csv file that you downloaded.
    cqlsh:libdata> COPY libout ("STABR","FSCSKEY","FSCS_SEQ","LIBID","LIBNAME",
                     "ADDRESS","CITY","ZIP","ZIP4","CNTY","PHONE","C_OUT_TY",
                     "C_MSA","SQ_FEET","F_SQ_FT","L_NUM_BM","F_BKMOB","HOURS",
                     "F_HOURS","WKS_OPEN","F_WKSOPN","YR_SUB","STATSTRU","STATNAME",
                     "STATADDR","LONGITUD","LATITUDE","FIPSST","FIPSCO","FIPSPLAC",
                     "CNTYPOP","LOCALE","CENTRACT","CENBLOCK","CDCODE","MAT_CENT",
                     "MAT_TYPE","CBSA","MICROF") FROM 'libdata.csv' WITH HEADER=TRUE;

    In the FROM clause of the COPY command, use the path to libdata.csv in your environment.

  5. Check that the libout table contains the data you copied from the downloaded file.
    cqlsh:libdata> SELECT count(*) FROM libdata.libout LIMIT 20000;
    
      count
     -------
      17598
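
    To spot-check a few rows rather than just the row count, a query with a LIMIT clause works; for example:
    cqlsh:libdata> SELECT "STABR", "LIBNAME", "SQ_FEET" FROM libdata.libout LIMIT 5;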
  6. Create a table to hold results of Pig relations.
    cqlsh:libdata> CREATE TABLE libsqft (
                     year INT,
                     state TEXT,
                     sqft BIGINT,
                     PRIMARY KEY (year, state)
                   );
  7. Using Pig, add a plan to load the data from the Cassandra libout table to a Pig relation.
    grunt> libdata = LOAD 'cql://libdata/libout' USING CqlStorage();
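
    If the Pig shell is not already running, start it with the dse pig command. After the LOAD, Pig's DESCRIBE operator shows the schema that CqlStorage inferred from the table:
    grunt> DESCRIBE libdata;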
  8. Add logic to remove data about outlet types other than books-by-mail (BM). The C_OUT_TY column uses BM and other abbreviations to identify these library outlet types:
    • CE–Central Library
    • BR–Branch Library
    • BS–Bookmobile(s)
    • BM–Books-by-Mail Only
    grunt> book_by_mail = FILTER libdata BY C_OUT_TY == 'BM';
    grunt> DUMP book_by_mail;
  9. Add logic to filter out library data that is missing building size information, define the schema for the libdata_buildings relation, and group the data by state. The STABR column contains the state codes. GROUP creates the state_grouped relation, and Pig gives the grouping field the default alias group. Then process each group to generate the total square footage per state.
    grunt> libdata_buildings = FILTER libdata BY SQ_FEET > 0;
    grunt> state_flat = FOREACH libdata_buildings GENERATE STABR AS State, SQ_FEET AS SquareFeet;
    grunt> state_grouped = GROUP state_flat BY State;
    grunt> state_footage = FOREACH state_grouped GENERATE
                             group AS State, SUM(state_flat.SquareFeet)
                             AS TotalFeet:int;
    grunt> DUMP state_footage;

    The MapReduce job completes successfully, and the output shows the total building square footage for each state.

    . . .
    (UT,1510353)
    (VA,4192931)
    (VI,31875)
    (VT,722629)
    (WA,3424639)
    (WI,5661236)
    (WV,1075356)
    (WY,724821)
  10. Add logic to format each row as a key tuple (year and state) and a value tuple (total square footage), and save the relation to Cassandra using the cql:// URL. The URL includes a prepared statement, described later.
    grunt> insert_format = FOREACH state_footage GENERATE
             TOTUPLE(TOTUPLE('year',2011),TOTUPLE('state',State)),TOTUPLE(TotalFeet);
    grunt> STORE insert_format INTO 'cql://libdata/libsqft?output_query=UPDATE%20libdata.
             libsqft%20SET%20sqft%20%3D%20%3F' USING CqlStorage;
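
    URL-decoded, the output_query parameter is the prepared statement that writes each tuple; the key tuples in insert_format identify the row by year and state, and TotalFeet binds to the ? placeholder:

    UPDATE libdata.libsqft SET sqft = ?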

    When the MapReduce job completes, a message confirms that the records were written successfully.

  11. In CQL, query the libsqft table to see the Pig results now stored in Cassandra.
    SELECT * FROM libdata.libsqft;
    
     year | state | sqft
    ------+-------+----------
     2011 |    AK |   570178
     2011 |    AL |  2792246
    
     . . .
    
     2011 |    WV |  1075356
     2011 |    WY |   724821
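
    Because year is the partition key and state is a clustering column, you can also retrieve a single row directly; for example:

    SELECT * FROM libdata.libsqft WHERE year = 2011 AND state = 'AK';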