Data access using storage handlers (deprecated)
To execute Pig programs directly on data that is stored in Cassandra, use one of the DataStax Enterprise storage handlers.
Hadoop is deprecated for use with DataStax Enterprise. DSE Hadoop and BYOH (Bring Your Own Hadoop) are deprecated. Pig is also deprecated and will be removed when Hadoop is removed.
The DataStax Enterprise Pig driver uses the Cassandra File System (CFS) instead of the Hadoop Distributed File System (HDFS). The Pig driver included with Apache Cassandra, by contrast, uses HDFS.
The following storage handlers are available:
Table Format | Storage Handler | URL | Description |
---|---|---|---|
CQL | CqlNativeStorage() | cql:// | Use with DataStax Enterprise 4.7 and later. |
CQL | CqlStorage() | cql:// | Deprecated. |
storage engine | CassandraStorage() | cassandra:// | Deprecated. |
Migrating compact tables with clustering columns to CqlNativeStorage format
The CqlNativeStorage handler uses native paging through the DataStax Java driver to communicate with the underlying Cassandra cluster. To use applications that have compact tables with clustering columns in the CqlStorage format, you need to migrate the tables to the CqlNativeStorage format. Attempting to run Pig commands on compact tables in the CqlStorage format results in an exception. You can, however, run Pig commands on non-compact tables in the CqlStorage format.
- Identify Pig functions that interact with compact tables in CqlStorage format. For example, suppose you identify a command that loads data into a Pig relation from the compact table tab in keyspace ks.
x = LOAD 'cql://ks/tab' USING CqlStorage(); -- Old function
x = LOAD 'cassandra://ks/tab' USING CassandraStorage(); -- Old function
- Change CqlStorage() or CassandraStorage() to CqlNativeStorage() in the USING clause, and make sure the URL uses the cql:// scheme.
x = LOAD 'cql://ks/tab' USING CqlNativeStorage(); -- New function
URL format for CqlNativeStorage
The URL format for CqlNativeStorage is:
cql://[username:password@]keyspace/table[?
[page_size=size]
[&columns=col1,col2]
[&output_query=prepared_statement_query]
[&input_cql=prepared_statement_query]
[&where_clause=clause]
[&split_size=size]
[&partitioner=partitioner]
[&use_secondary=true|false]
[&init_address=host]
[&native_port=port]]
- page_size -- the number of rows per page
- columns -- the columns to select in the CQL query
- output_query -- the CQL query for writing, in prepared statement format
- input_cql -- the CQL query for reading, in prepared statement format
- where_clause -- the WHERE clause on the indexed columns; the value must be URL-encoded
- split_size -- the number of rows per split
- partitioner -- the Cassandra partitioner
- use_secondary -- set to true to enable Pig filter partition push-down
- init_address -- the IP address of the target node
- native_port -- the native transport port of the target node
- rpc_port -- the RPC (Thrift) port of the target node
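As an illustration of how these parameters combine in a URL, the following is a minimal sketch rather than a tested configuration. The keyspace ks and table tab come from the earlier migration example; the columns id, name, and age, the index on age, the user pig_user, its password, and the node address 10.0.0.1 are all hypothetical. The where_clause value age=30 is URL-encoded as age%3D30.

```pig
-- Hypothetical columns id, name, age; age is assumed to be an indexed column.
-- Reads 500 rows per page, selects three columns, and filters with
-- where_clause=age=30, URL-encoded so that "=" becomes %3D.
x = LOAD 'cql://ks/tab?page_size=500&columns=id,name,age&where_clause=age%3D30' USING CqlNativeStorage();

-- Hypothetical credentials and node address.
-- 9042 is the default native transport port in Cassandra.
y = LOAD 'cql://pig_user:secret@ks/tab?init_address=10.0.0.1&native_port=9042' USING CqlNativeStorage();
```

The first parameter follows the ? and each later parameter is joined with &, as in the URL format above.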
Working with legacy Cassandra tables
Use the CqlNativeStorage() handler and cql:// URL to work in Pig with Cassandra tables that are in the storage engine (CLI/Thrift) format. Legacy tables are those created using Thrift, the CLI, or the WITH COMPACT STORAGE directive in CQL. Thrift applications require that you configure Cassandra to accept connections from your application over RPC, rather than over the default native transport that CqlNativeStorage connections use.