Data access using storage handlers

To execute Pig programs directly on data that is stored in Cassandra, use one of the DataStax Enterprise storage handlers.

The DataStax Enterprise Pig driver uses the Cassandra File System (CassandraFS) instead of the Hadoop Distributed File System (HDFS). The Pig driver included with Apache Cassandra, by contrast, uses HDFS.

DataStax Enterprise provides the following storage handlers:

Table Format     Storage Handler      URL            Availability
CQL              CqlNativeStorage()   cql://         DataStax Enterprise 4.0.4 and later
CQL              CqlStorage()         cql://
storage engine   CassandraStorage()   cassandra://

The CqlStorage handler is deprecated and slated for removal in a future release. Use the CqlNativeStorage handler and the cql:// URL for new Pig applications. Migrate all tables to CqlNativeStorage as soon as possible in preparation for the removal of the CqlStorage handler.

Migrating compact tables with clustering columns to CqlNativeStorage format 

The CqlNativeStorage handler uses native paging through the DataStax Java driver to communicate with the underlying Cassandra cluster. In DataStax Enterprise 4.0.4, applications that use compact tables with clustering columns in the CqlStorage format must migrate those tables to the CqlNativeStorage format. Attempting to run Pig commands on compact tables in the CqlStorage format results in an exception. You can, however, run Pig commands on non-compact tables in the CqlStorage format.

To migrate tables from CqlStorage to CqlNativeStorage format:
  1. Identify Pig functions that interact with compact tables in CqlStorage format. For example, suppose a command loads data into a Pig relation from the compact table tab in keyspace ks:
    x = LOAD 'cql://ks/tab' USING CqlStorage();       -- Old function
  2. Change CqlStorage() to CqlNativeStorage():
    x = LOAD 'cql://ks/tab' USING CqlNativeStorage(); -- New function

URL format for CqlNativeStorage

The URL format for CqlNativeStorage is:

cql://[username:password@]<keyspace>/<table>[?
  [page_size=<size>]
  [&columns=<col1,col2>]
  [&output_query=<prepared_statement_query>]
  [&input_cql=<prepared_statement_query>]
  [&where_clause=<clause>]
  [&split_size=<size>]
  [&partitioner=<partitioner>]
  [&use_secondary=true|false]
  [&init_address=<host>]
  [&native_port=<port>]]
Where:
  • page_size -- the number of rows per page
  • columns -- the columns to select in the CQL query
  • output_query -- the CQL query for writing, in prepared statement format
  • input_cql -- the CQL query for reading, in prepared statement format
  • where_clause -- the WHERE clause on the indexed columns, which must be URL-encoded
  • split_size -- the number of rows per split
  • partitioner -- the Cassandra partitioner
  • use_secondary -- enables Pig filter partition pushdown
  • init_address -- the IP address of the target node
  • native_port -- the native transport port of the target node
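
As a rough illustration of the URL format above, the following sketch loads a table through CqlNativeStorage with a few optional parameters. The keyspace ks and table tab are reused from the migration example. The indexed integer column age, the parameter values, and the node address and port are hypothetical. The where_clause value age = 30 is URL-encoded as age%20%3D%2030.

```pig
-- Assumes keyspace ks with table tab and a secondary index on a hypothetical int column age.
-- 127.0.0.1 and 9042 are placeholder values for init_address and native_port.
filtered = LOAD 'cql://ks/tab?page_size=500&where_clause=age%20%3D%2030&init_address=127.0.0.1&native_port=9042' USING CqlNativeStorage();
DUMP filtered;
```

Every parameter is optional, so a URL as short as cql://ks/tab also works, as in the migration example.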

URL format for CqlStorage

The URL format for CqlStorage is:

cql://[username:password@]<keyspace>/<table>[?
  [page_size=<size>]
  [&columns=<col1,col2>]
  [&output_query=<prepared_statement_query>]
  [&where_clause=<clause>]
  [&split_size=<size>]
  [&partitioner=<partitioner>]
  [&use_secondary=true|false]
  [&init_address=<host>]
  [&rpc_port=<port>]]
Where:
  • page_size -- the number of rows per page
  • columns -- the columns to select in the CQL query
  • output_query -- the CQL query for writing, in prepared statement format
  • where_clause -- the WHERE clause on the indexed columns, which must be URL-encoded
  • split_size -- the number of rows per split
  • partitioner -- the Cassandra partitioner
  • use_secondary -- enables Pig filter partition pushdown
  • init_address -- the IP address of the target node
  • rpc_port -- the RPC (Thrift) port of the target node
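
For existing scripts that still rely on the deprecated handler, a CqlStorage URL might look like the sketch below. The non-compact table users, its columns id and name, and the address and port values are hypothetical and serve only to show how the parameters combine.

```pig
-- Assumes keyspace ks with a hypothetical non-compact table users that has columns id and name.
-- 127.0.0.1 and 9160 are placeholder values for init_address and rpc_port.
names = LOAD 'cql://ks/users?page_size=1000&columns=id,name&init_address=127.0.0.1&rpc_port=9160' USING CqlStorage();
DUMP names;
```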

Working with legacy Cassandra tables

Use the CassandraStorage() handler and the cassandra:// URL to work in Pig with Cassandra tables that are in the storage engine (CLI/Thrift) format. Legacy tables are created using Thrift, the CLI, or the WITH COMPACT STORAGE directive in CQL. Thrift applications require that you configure Cassandra to accept connections from your application over rpc instead of the default native_transport connection.

URL format for CassandraStorage

The URL format for CassandraStorage is:

cassandra://[username:password@]<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>
   [&reversed=true]
   [&limit=1]
   [&allow_deletes=true]
   [&widerows=true]
   [&use_secondary=true]
   [&comparator=<comparator>]
   [&split_size=<size>]
   [&partitioner=<partitioner>]
   [&init_address=<host>]
   [&rpc_port=<port>]]
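
A minimal sketch of this format in use, assuming a hypothetical keyspace ks and legacy column family cf. The slice bounds, limit, address, and port are illustrative values, not recommendations.

```pig
-- Assumes a hypothetical legacy (Thrift/CLI) column family cf in keyspace ks.
-- slice_start and slice_end bound the column names that are read.
-- 127.0.0.1 and 9160 are placeholder values for init_address and rpc_port.
legacy = LOAD 'cassandra://ks/cf?slice_start=a&slice_end=m&limit=100&init_address=127.0.0.1&rpc_port=9160' USING CassandraStorage();
DUMP legacy;
```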