Using Shark to query data

Shark stores metadata in the Cassandra keyspace called HiveMetaStore. External tables are not stored unless explicitly requested. Shark depends on Hive for parsing and for some optimization translations.

You can use Shark just as you use Hive. The following example assumes that you ran the Portfolio Manager demo using Hadoop to generate the data for the example. For more examples, refer to the Hive documentation. The backend implementations of Hive and Shark differ, but the user interface and query language are interchangeable for the most part.

Note: DataStax Enterprise does not support SharkServer2.

Limitations 

  • Adding the shortcut "_cached" suffix to Shark table names does not work in Shark 1.1. Use TBLPROPERTIES ("shark.cache" = "true") instead.

    See the Spark User Guide.

  • When you create a table that includes TBLPROPERTIES ("shark.cache" = "true"), the table continues to exist after the session ends, which is reasonable for sharkserver.

    For other Shark jobs, you should drop the table when the session ends. Do not design applications to depend on this persistence between sessions.

  • The Shark Java API does not work with Spark 1.1.

    The workaround is to change the Spark serializer to Kryo. Set spark.serializer to org.apache.spark.serializer.KryoSerializer as described in "Spark 1.1.0 Available Properties". The dse spark-submit command reads configuration options, including spark.serializer, from the spark-defaults.conf file.

  • When using the Distribute By syntax to distribute the rows among reducers, also use the Sort By syntax to guarantee sorting.
    For example:
    select p_mfgr, p_brand, s, round(sum(s) over w1 ,2) as s1 from mfgr_price_view 
      window w1 as (distribute by p_mfgr sort by p_brand rows between 2 preceding and current row);
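
The serializer workaround in the list above amounts to a single line of configuration. A minimal sketch, assuming you edit the spark-defaults.conf file that dse spark-submit reads (its location depends on your installation and is not shown here):

```
# spark-defaults.conf
# Switch the Spark serializer to Kryo, per the Shark Java API workaround above.
spark.serializer    org.apache.spark.serializer.KryoSerializer
```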

After starting a Spark node, use dse commands to launch Shark. You can use the Cassandra-specific properties (-Dname=value) to start Shark.

DataStax Enterprise supports these commands for launching Shark on the DataStax Enterprise command line:

dse shark
Launches the Shark shell.
dse shark --service sharkserver -p <port>
Launches the Shark server.

Starting and stopping a Shark client 

If you do not need to keep Shark memory tables persistent between sessions, start a standalone Shark client by running this dse command on the command line. On Ubuntu, for example:

$ dse shark

Use the -skipRddReload flag to skip reloading data into memory tables when you start Shark.

The shark command line prompt appears:
Starting the Shark Command Line Client
    
    shark>

To stop the Shark client:

shark> exit;

You can also start Shark as a server to provide Shark service to clients.

Starting the Shark server 

You can keep Shark memory tables persistent and run applications between sessions if you use the Shark server instead of the client. To start the Shark server:

$ dse shark --service sharkserver -p <port number>

For example:

$ dse shark --service sharkserver -p 10000

Connect a Shark client to the server:

$ dse shark -h localhost -p 10000
    [localhost:10000] shark>
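
As an illustration of a memory table, the following sketch creates a cached copy of a table using the TBLPROPERTIES ("shark.cache" = "true") syntax, queries it, and then drops it. The table name stocks_mem is hypothetical, and the Stocks table is assumed to come from the Portfolio Manager demo in the PortfolioDemo keyspace. Treat it as a starting point rather than a tested recipe.

```sql
-- Hypothetical table name: stocks_mem. Stocks is assumed to exist in PortfolioDemo.
USE PortfolioDemo;

-- Create a cached (in-memory) copy of the Stocks table.
CREATE TABLE stocks_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM Stocks;

-- Query the cached copy.
SELECT count(*) FROM stocks_mem;

-- Drop the table when it is no longer needed.
DROP TABLE stocks_mem;
```

The final DROP TABLE matters most for jobs that do not run against the Shark server, because applications should not depend on the table persisting between sessions.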

Using Shark

  1. Start DataStax Enterprise in Spark mode.
  2. Start Shark.
    $ dse shark
    Starting the Shark Command Line Client
    . . .
    2014-03-14 20:37:09.315:INFO:oejs.AbstractConnector:Started SelectChannelConnector@0.0.0.0:4040
    Reloading cached RDDs from previous Shark sessions... (use -skipRddReload flag to skip reloading)
  3. Enter these queries to analyze the portfolio data.
    shark> USE PortfolioDemo;
    OK
    Time taken: 0.384 seconds
    
    shark> DESCRIBE StockHist;
    Output is:
    OK
    key                   string                from deserializer   
    column1               string                from deserializer   
    value                 double                from deserializer   
    Time taken: 0.208 seconds
  4. Continue querying the data by selecting the count from the Stocks table and then selecting ten stocks, ordered by value.
    shark> SELECT count(*) FROM Stocks;
    OK
    2759
    Time taken: 9.899 seconds
    
    shark> SELECT * FROM Stocks ORDER BY value DESC LIMIT 10;
    OK
    XIN price 99.95643836954761
    JQC price 99.92873883263657
    SBH price 99.87928626341066
    CCJ price 99.83980527070464
    QXM price 99.72161816290533
    DPC price 99.70004934561737
    AVT price 99.69106570398871
    ANW price 99.69009660302422
    PMO price 99.67491825839043
    WMT price 99.67281873305834
    Time taken: 2.204 seconds
  5. Use the EXPLAIN command in Shark to get specific Hive and Shark information.
    shark> EXPLAIN SELECT * FROM Stocks ORDER BY value DESC LIMIT 10;

    After listing some Hive information in the abstract syntax tree, the output shows the Shark query plan. At this point, the Spark Worker page lists the Shark application that you are running.
  6. Exit Shark.
    shark> exit;