Data access using storage handlers (deprecated)

To execute Pig programs directly on data that is stored in Cassandra, use one of the DataStax Enterprise storage handlers.

Hadoop is deprecated for use with DataStax Enterprise: both DSE Hadoop and BYOH (Bring Your Own Hadoop) are deprecated. Pig is also deprecated and will be removed when Hadoop is removed.

The DataStax Enterprise Pig driver uses the Cassandra File System (CFS) instead of the Hadoop Distributed File System (HDFS). The Pig driver included with open-source Apache Cassandra, by contrast, uses HDFS.

DataStax Enterprise provides the following storage handlers:

Table format     Storage handler       URL           Description
CQL              CqlNativeStorage()    cql://        Use with DataStax Enterprise 4.7 and later.
CQL              CqlStorage()          cql://        Deprecated.
Storage engine   CassandraStorage()    cassandra://  Deprecated.
Note: The CqlStorage and CassandraStorage handlers are deprecated and will be removed in a future Cassandra release. Use the CqlNativeStorage handler and the cql:// URL for new Pig applications. DataStax recommends migrating all tables to CqlNativeStorage as soon as possible in preparation for the removal of the CqlStorage and CassandraStorage handlers.

Migrating compact tables with clustering columns to CqlNativeStorage format 

The CqlNativeStorage handler uses native paging through the DataStax Java driver to communicate with the underlying Cassandra cluster. To use applications that have compact tables with clustering columns in the CqlStorage format, you need to migrate tables to the CqlNativeStorage format. Attempting to run Pig commands on compact tables in the CqlStorage format results in an exception. You can, however, run Pig commands on non-compact tables in the CqlStorage format.
To migrate tables from CqlStorage to CqlNativeStorage format:
  1. Identify Pig functions that interact with compact tables in CqlStorage format. For example, suppose you identify a command that adds logic to load the data to a Pig relation from the compact table tab in keyspace ks.
    x = LOAD 'cql://ks/tab' USING CqlStorage();             -- Old function
    x = LOAD 'cassandra://ks/tab' USING CassandraStorage(); -- Old function
  2. Change the handler to CqlNativeStorage(). If migrating from CassandraStorage(), also change the URL scheme from cassandra:// to cql://.
    x = LOAD 'cql://ks/tab' USING CqlNativeStorage(); -- New function
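When many Pig scripts are involved, step 1 amounts to searching them for the deprecated handlers. This is an illustrative Python sketch of such a search (the function name and regex are our own, not part of DSE):

```python
import re

# Matches LOAD/STORE statements that use one of the deprecated handlers.
DEPRECATED = re.compile(r"USING\s+(CqlStorage|CassandraStorage)\s*\(", re.IGNORECASE)

def find_deprecated_handlers(pig_script: str) -> list:
    """Return the deprecated handler names referenced in a Pig script."""
    return [m.group(1) for m in DEPRECATED.finditer(pig_script)]

script = "x = LOAD 'cql://ks/tab' USING CqlStorage();"
print(find_deprecated_handlers(script))  # ['CqlStorage']
```

Note that the pattern does not match CqlNativeStorage(), so already-migrated statements are not flagged.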

URL format for CqlNativeStorage

The URL format for CqlNativeStorage is:

cql://[username:password@]keyspace/table[?
  [page_size=size]
  [&columns=col1,col2]
  [&output_query=prepared_statement_query]
  [&input_cql=prepared_statement_query]
  [&where_clause=clause]
  [&split_size=size]
  [&partitioner=partitioner]
  [&use_secondary=true|false]
  [&init_address=host]
  [&native_port=port]]
where:
  • page_size -- the number of rows per page
  • columns -- the columns to select in the CQL query
  • output_query -- the CQL query for writing, in prepared statement format
  • input_cql -- the CQL query for reading, in prepared statement format
  • where_clause -- the WHERE clause on the index columns, which must be URL-encoded
  • split_size -- the number of rows per split
  • partitioner -- the Cassandra partitioner
  • use_secondary -- true to enable Pig filter partition pushdown
  • init_address -- the IP address of the target node
  • native_port -- the native transport port of the target node
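Because the where_clause value must be URL-encoded, building the URL programmatically avoids encoding mistakes. Here is a minimal Python sketch; the helper name is our own, and only the URL grammar above comes from the documentation:

```python
from urllib.parse import quote, urlencode

def build_cql_url(keyspace: str, table: str, **params) -> str:
    """Build a cql:// URL for CqlNativeStorage, percent-encoding parameter values."""
    url = f"cql://{keyspace}/{table}"
    if params:
        # quote_via=quote percent-encodes spaces as %20 rather than '+'.
        url += "?" + urlencode(params, quote_via=quote)
    return url

url = build_cql_url("ks", "tab", where_clause="age > 10", native_port=9042)
print(url)  # cql://ks/tab?where_clause=age%20%3E%2010&native_port=9042
```

The resulting string can be used directly in a LOAD statement, for example LOAD '<url>' USING CqlNativeStorage().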

Working with legacy Cassandra tables

Use the CqlNativeStorage() handler and the cql:// URL to work with Cassandra tables that are in the legacy storage engine (CLI/Thrift) format in Pig. Legacy tables are created using Thrift, the CLI, or the WITH COMPACT STORAGE directive in CQL. Thrift applications require that you configure Cassandra to accept connections from your application over RPC instead of the default native transport that CqlNativeStorage uses.
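In Cassandra releases that still include the Thrift server, enabling RPC is a cassandra.yaml change. The fragment below is illustrative only; the address is an example value and defaults vary by release:

```yaml
# cassandra.yaml -- enable the Thrift RPC server for legacy (Thrift/CLI) clients
start_rpc: true        # start the Thrift server on startup
rpc_address: 10.0.0.1  # address the Thrift server binds to (example value)
rpc_port: 9160         # Thrift port (9160 is the traditional default)
```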