Data access using Pig

To execute Pig programs directly on data that is stored in Cassandra, use one of the DataStax Enterprise storage handlers.

DataStax Enterprise includes custom storage handlers for Cassandra that you use to execute Pig programs directly on data stored in Cassandra. The DataStax Enterprise Pig driver uses the Cassandra File System (CassandraFS) instead of the Hadoop Distributed File System (HDFS). The Pig driver included with Apache Cassandra, by contrast, uses HDFS.

In DataStax Enterprise 3.1.2 and later, use one of these storage handlers and URLs to transform Cassandra data using Pig:

Table Format	Storage Handler	URL
CQL	CqlStorage()	cql://
storage engine	CassandraStorage()	cassandra://
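
As a minimal sketch of how a storage handler and its URL pair up in a Pig LOAD statement, the following assumes a hypothetical keyspace named demo_ks that contains a CQL table named users. Neither name comes from this document; substitute your own.

```pig
-- Hypothetical keyspace (demo_ks) and table (users), for illustration only.
-- Load a CQL table through the CqlStorage() handler and the cql:// URL.
users = LOAD 'cql://demo_ks/users' USING CqlStorage();

-- Print the loaded relation to the console.
DUMP users;
```

A table in the storage engine format would be loaded the same way, with a cassandra:// URL and the CassandraStorage() handler in place of cql:// and CqlStorage().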

Working with legacy Cassandra tables

To work in Pig with Cassandra tables that are in the storage engine (CLI/Thrift) format, use the CassandraStorage() handler and cassandra:// URL instead of CqlStorage() and cql://. Legacy tables are those created using the WITH COMPACT STORAGE directive in CQL, or created using Thrift or the CLI.

URL format for CassandraStorage

The URL format for CassandraStorage is:

cassandra://[username:password@]<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>
   [&reversed=true]
   [&limit=1]
   [&allow_deletes=true]
   [&widerows=true]
   [&use_secondary=true]
   [&comparator=<comparator>]
   [&split_size=<size>]
   [&partitioner=<partitioner>]
   [&init_address=<host>]
   [&rpc_port=<port>]]
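
For illustration, a LOAD statement that follows this URL format might look like the sketch below. The keyspace (demo_ks), table (scores), slice bounds, host, and port values are all hypothetical, and how the slice bounds are interpreted depends on the table's comparator.

```pig
-- Hypothetical keyspace, table, and parameter values, for illustration only.
-- slice_start and slice_end bound the columns read from each row;
-- init_address and rpc_port name the node and Thrift port to connect to.
scores = LOAD 'cassandra://demo_ks/scores?slice_start=a&slice_end=m&init_address=127.0.0.1&rpc_port=9160' USING CassandraStorage();

-- Print the loaded relation to the console.
DUMP scores;
```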