Data access using storage handlers
To execute Pig programs directly on data that is stored in Cassandra, use one of the DataStax Enterprise storage handlers.
The DataStax Enterprise Pig driver uses the Cassandra File System (CFS) instead of the Hadoop Distributed File System (HDFS). The Pig driver included with Apache Cassandra, by contrast, uses HDFS.
The following table lists the available storage handlers:
Table Format | Storage Handler | URL | Description |
---|---|---|---|
CQL | CqlNativeStorage() | cql:// | Use with DataStax Enterprise 4.7. |
CQL | CqlStorage() | cql:// | Deprecated. |
storage engine | CassandraStorage() | cassandra:// | Use with Cassandra tables in the storage engine (CLI/Thrift) format. |
The CqlStorage handler is deprecated and will be removed in a future Cassandra release. Use the CqlNativeStorage handler and the cql:// URL for new Pig applications. DataStax recommends migrating all tables to CqlNativeStorage as soon as possible in preparation for the removal of the CqlStorage handler.
Migrating compact tables with clustering columns to CqlNativeStorage format
The CqlNativeStorage handler uses native paging through the DataStax Java driver to communicate with the underlying Cassandra cluster. If your applications use compact tables with clustering columns in the CqlStorage format, you need to migrate the tables to the CqlNativeStorage format. Attempting to run Pig commands on compact tables in the CqlStorage format results in an exception. You can, however, run Pig commands on non-compact tables in the CqlStorage format.
- Identify Pig functions that interact with compact tables in CqlStorage format. For example, suppose you identify a command that loads data into a Pig relation from the compact table tab in keyspace ks.
x = LOAD 'cql://ks/tab' USING CqlStorage(); -- Old function
- Change USING CqlStorage() to USING CqlNativeStorage().
x = LOAD 'cql://ks/tab' USING CqlNativeStorage(); -- New function
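When many Pig scripts need the same change, the handler swap can be scripted. The following Python sketch is illustrative only: the helper name `migrate_handler` and its regular expression are assumptions, not part of DataStax Enterprise. It rewrites the handler name only and leaves the URL and its parameters untouched.

```python
import re

# Matches "USING CqlStorage" with or without "()". The USING keyword is
# matched case-insensitively; the handler name is matched exactly, so
# "CqlNativeStorage" is never touched.
_OLD_HANDLER = re.compile(r"\b(?i:USING)\s+CqlStorage\b(?:\s*\(\s*\))?")


def migrate_handler(pig_source: str) -> str:
    """Return pig_source with each CqlStorage handler call replaced."""
    return _OLD_HANDLER.sub("USING CqlNativeStorage()", pig_source)


old_line = "x = LOAD 'cql://ks/tab' USING CqlStorage();"
print(migrate_handler(old_line))
```

The sketch also rewrites matches inside Pig comments and does not rename URL parameters such as rpc_port, so review each changed script by hand.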
URL format for CqlNativeStorage
The URL format for CqlNativeStorage is:
cql://[username:password@]<keyspace>/<table>[?
[page_size=<size>]
[&columns=<col1,col2>]
[&output_query=<prepared_statement_query>]
[&input_cql=<prepared_statement_query>]
[&where_clause=<clause>]
[&split_size=<size>]
[&partitioner=<partitioner>]
[&use_secondary=true|false]
[&init_address=<host>]
[&native_port=<port>]]
- page_size -- the number of rows per page
- columns -- the columns to select in the CQL query
- output_query -- the CQL query for writing in a prepared statement format
- input_cql -- the CQL query for reading in a prepared statement format
- where_clause -- the where clause on the index columns, which needs URL encoding
- split_size -- number of rows per split
- partitioner -- Cassandra partitioner
- use_secondary -- enables Pig filter partition push down
- init_address -- the IP address of the target node
- native_port -- the native transport port of the target node
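Because where_clause must be URL encoded, it can help to assemble the URL programmatically rather than by hand. The following Python sketch is an illustration under stated assumptions: the helper `cql_native_url` is hypothetical, it percent-encodes every parameter value (the text above requires encoding only for where_clause), and it omits the optional username:password@ prefix.

```python
from urllib.parse import quote


def cql_native_url(keyspace, table, **params):
    """Assemble a cql:// URL from a keyspace, a table, and query parameters."""
    base = f"cql://{keyspace}/{table}"
    if not params:
        return base
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            # The URL format expects lowercase true|false.
            value = "true" if value else "false"
        # Percent-encode the value; commas stay literal for column lists.
        parts.append(f"{name}={quote(str(value), safe=',')}")
    return base + "?" + "&".join(parts)


url = cql_native_url("ks", "tab", page_size=500, columns="id,name",
                     where_clause="age > 30")
print(url)  # cql://ks/tab?page_size=500&columns=id,name&where_clause=age%20%3E%2030
```

The resulting string is what goes between the quotes in a LOAD or STORE statement, for example LOAD '<url>' USING CqlNativeStorage().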
URL format for CqlStorage
The URL format for CqlStorage is:
cql://[username:password@]<keyspace>/<table>[?
[page_size=<size>]
[&columns=<col1,col2>]
[&output_query=<prepared_statement_query>]
[&where_clause=<clause>]
[&split_size=<size>]
[&partitioner=<partitioner>]
[&use_secondary=true|false]
[&init_address=<host>]
[&rpc_port=<port>]]
- page_size -- the number of rows per page
- columns -- the columns to select in the CQL query
- output_query -- the CQL query for writing in a prepared statement format
- where_clause -- the where clause on the index columns, which needs URL encoding
- split_size -- number of rows per split
- partitioner -- Cassandra partitioner
- use_secondary -- enables Pig filter partition push down
- init_address -- the IP address of the target node
- rpc_port -- the RPC port of the target node
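The two cql:// URL formats differ in only a few parameter names: CqlNativeStorage lists input_cql and native_port, while CqlStorage lists rpc_port. The following Python sketch flags parameters that the chosen handler does not document, which is useful when converting old URLs. The function `unknown_params` and the `ALLOWED` table are hypothetical names, built solely from the parameter lists in this section.

```python
from urllib.parse import parse_qsl, urlsplit

# Parameter names documented for both handlers in this section.
COMMON = {"page_size", "columns", "output_query", "where_clause",
          "split_size", "partitioner", "use_secondary", "init_address"}

# Handler-specific names, taken from the parameter lists above.
ALLOWED = {
    "CqlNativeStorage": COMMON | {"input_cql", "native_port"},
    "CqlStorage": COMMON | {"rpc_port"},
}


def unknown_params(url, handler):
    """Return the query parameter names not documented for the handler."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    return [name for name in names if name not in ALLOWED[handler]]


legacy_url = "cql://ks/tab?page_size=100&rpc_port=9160"
print(unknown_params(legacy_url, "CqlNativeStorage"))  # ['rpc_port']
print(unknown_params(legacy_url, "CqlStorage"))        # []
```

The check assumes that parameter values are already URL encoded, so a raw & inside a where_clause would be misread as a parameter separator.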
Working with legacy Cassandra tables
Use the CassandraStorage() handler and the cassandra:// URL to work in Pig with Cassandra tables that are in the storage engine (CLI/Thrift) format. Legacy tables are created using Thrift, the CLI, or the WITH COMPACT STORAGE directive in CQL. Thrift applications require that you configure Cassandra to accept RPC connections, instead of the default native transport, for CassandraStorage connections.
The URL format for CassandraStorage is:
cassandra://[username:password@]<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>
[&reversed=true]
[&limit=1]
[&allow_deletes=true]
[&widerows=true]
[&use_secondary=true]
[&comparator=<comparator>]
[&split_size=<size>]
[&partitioner=<partitioner>]
[&init_address=<host>]
[&rpc_port=<port>]]