Creating or altering CQL data from Hive

Use a Hive external table to create or alter CQL data from Hive. A counterpart to the Hive database/external table must pre-exist in Cassandra as a keyspace/table.

When you use a Hive database name that matches a Cassandra keyspace name, DataStax Enterprise 4.0.4 and later automatically generates a Hive external table for each table in the keyspace. If the auto-created external table does not suit your needs, create a custom external table with different TBLPROPERTIES and SERDEPROPERTIES by using the CREATE EXTERNAL TABLE statement.

To use Hive with legacy tables, such as those created using Thrift or the CLI, see the DataStax Enterprise 3.0 documentation. Thrift applications require that you configure Cassandra to connect to your application over RPC connections instead of the default native_transport connection.

Creating a custom external table 

This example assumes you created cassandra_keyspace and exampletable in "Browsing through Cassandra tables in Hive". The Hive example table is auto-created when you run the USE cassandra_keyspace command on the Hive command line. If you want to use a Hive database or table with a different name than the auto-created ones, but with the same or a similar schema, customize the auto-created external table, as shown in this example. The example uses a Hive database named bigdata instead of cassandra_keyspace and a table named MyHiveTable instead of exampletable. The example specifies the CQL keyspace and table names in the external table definition using the TBLPROPERTIES clause, so the CQL-defined schema is used.

hive> CREATE DATABASE bigdata;
hive> USE bigdata;
hive> CREATE EXTERNAL TABLE MyHiveTable 
        ( key int,  data string ) 
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' 
        TBLPROPERTIES ( "cassandra.ks.name" = "cassandra_keyspace" , 
          "cassandra.cf.name" = "exampletable" , 
          "cassandra.ks.repfactor" = "2" , 
          "cassandra.ks.strategy" = 
            "org.apache.cassandra.locator.SimpleStrategy" );

Updating metadata in Hive when altering tables

When you run ALTER TABLE, the metadata in Hive is not updated and subsequent Hive and SparkSQL queries fail.

Workaround

  1. Enter the hive shell:
    $ dse hive
  2. In the hive shell, drop the table:
    hive> DROP TABLE your_keyspace.your_table;
  3. To allow Hive to refresh the metadata:
    hive> USE your_keyspace;

Hive to Cassandra type mapping 

In previous versions of DataStax Enterprise, the Hive date, decimal, and varchar data types were not supported. The following table, which shows the mapping between CQL, Cassandra internal storage engine (used by legacy tables), and Hive data types, includes these Hive types:
CQL         Cassandra Internal    Hive
ascii       AsciiType             string
bigint      LongType              bigint
boolean     BooleanType           boolean
counter     CounterColumnType     bigint
decimal     DecimalType           decimal (new)
double      DoubleType            double
float       FloatType             float
inet        InetAddressType       binary
int         Int32Type             int
text        UTF8Type              string
timestamp   TimestampType         date (new)
timestamp   TimestampType         timestamp (change)
timeuuid    TimeUUIDType          binary
uuid        UUIDType              binary
varint      IntegerType           binary
varchar     UTF8Type              varchar (new)
other       other                 binary

The InetAddressType stores the raw IP address in network byte order.
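
For example, per the mapping above, a hypothetical CQL table named event_log with uuid, timestamp, varint, and decimal columns is declared in a custom external table using the Hive types binary, timestamp, binary, and decimal. This is a sketch only; the keyspace, table, and column names are assumptions:

hive> CREATE EXTERNAL TABLE event_log
        ( id binary, created timestamp, responses binary, price decimal )
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
        TBLPROPERTIES ( "cassandra.ks.name" = "cassandra_keyspace" ,
          "cassandra.cf.name" = "event_log" );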

Using TBLPROPERTIES and SERDEPROPERTIES 

In an external table definition, the TBLPROPERTIES clause maps the Hive external table to a CQL table and can include MapReduce properties, Cassandra database configuration, and native protocol properties for the table. The SERDEPROPERTIES clause specifies the properties used when serializing/deserializing data passed between the Hive table and Cassandra. You can add a WITH SERDEPROPERTIES clause to map meaningful column names in Hive to the Cassandra partition key, column names, and column values. You can change these properties on the fly: using the Hive SET command, you can configure properties in the Hive session, and the settings take effect for the next query.
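
For example, a minimal sketch of changing properties for the current Hive session (the values shown are illustrative):

hive> SET cassandra.consistency.level=QUORUM;
hive> SET cassandra.input.split.size=131072;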

The following table lists general properties used in the TBLPROPERTIES or SERDEPROPERTIES clause or both. The next section lists optional native protocol properties for use with the DataStax Java Driver.

TBL and SERDE properties
General Property                TBL/SERDE    Description
cassandra.cf.name               both         Cassandra table name
cassandra.columns.mapping       both         Mapping of Hive to legacy Cassandra columns
cassandra.consistency.level     both         Consistency level - default ONE
cassandra.cql3.type             both         CQL types
cassandra.host                  both         IP of a Cassandra node to connect to
cassandra.input.split.size      both         MapReduce split size
cassandra.ks.name               both         Cassandra keyspace name
cassandra.page.size             SERDE        Fetch tables with many columns by page size
cassandra.partitioner           both         Partitioner (default = configured partitioner)
cassandra.port                  both         Cassandra RPC port - default 9160
cql3.output.query               TBL          A prepared statement for storing alterations to a CQL users table
cql3.partition.key              both         CQL partition key, a comma-separated list of partition and clustering keys (DataStax Enterprise 4.0.4 and later)
cql3.pushdown.enable            TBL          True (default) enables the pushdown predicate
cql3.update.columns             both         Used with INSERT INTO SELECT

Required table properties

When you create an external table in Hive, you need to specify these properties:

  • cassandra.ks.name
  • cassandra.cf.name

Other frequently-used properties are:

  • cql3.output.query
  • cql3.partition.key (DataStax Enterprise 4.0.4 and later)

In DataStax Enterprise 4.0.4 and later, use the SHOW CREATE TABLE <CQL table name> command at the Hive prompt to see the auto-created external table. The output helps you see how to format the cql3.partition.key property in your custom external table. For example, for a table that has the following CQL composite partition key, the output includes the Hive property syntax 'cql3.partition.key'='key,event_id' (a sketch of a corresponding custom external table follows the key definition):
PRIMARY KEY ((key, event_id), num_responses)
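
As a sketch, a custom external table for a CQL table with that composite partition key passes both partition key columns in the property. The table name and column types here are hypothetical:

hive> CREATE EXTERNAL TABLE MyResponseTable
        ( key string, event_id string, num_responses int )
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
        TBLPROPERTIES ( "cassandra.ks.name" = "cassandra_keyspace" ,
          "cassandra.cf.name" = "responses" ,
          "cql3.partition.key" = "key,event_id" );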

Required storage handler

Also required in the external table definition is the CQL storage handler: org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler. The storage handler accesses Cassandra data and stores data back to Cassandra.

About the cassandra.input.split.size

The cassandra.input.split.size property configures the number of rows processed per mapper. The default is 64 * 1024 (64K rows per split).

Partitioner use by Hive

You do not need to specify cassandra.partitioner. Your configured partitioner is used by Hive. For example, Hive uses this property value if you use the Cassandra 2.0 default partitioner:

"cassandra.partitioner" = "org.apache.cassandra.dht.Murmur3Partitioner"

"Creating or altering CQL data from Hive" and "MapReduce performance" show examples of using some of these properties.

Accessing tables with many columns

To fetch a table that has many columns a page at a time, set the cassandra.page.size property in the external table definition.
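
A minimal sketch, assuming the keyspace and table from the earlier example and an arbitrary page size of 1000; because the property table above lists cassandra.page.size as a SerDe property, this sketch places it in the WITH SERDEPROPERTIES clause:

hive> CREATE EXTERNAL TABLE PagedTable
        ( key int, data string )
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
        WITH SERDEPROPERTIES ( "cassandra.page.size" = "1000" )
        TBLPROPERTIES ( "cassandra.ks.name" = "cassandra_keyspace" ,
          "cassandra.cf.name" = "exampletable" );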

Optional native protocol properties 

DataStax Enterprise 4.0.4 and later support the following optional properties for the native protocol; a usage sketch follows the list.
  • cassandra.input.native.port
  • cassandra.input.native.core.connections.per.host
  • cassandra.input.native.max.connections.per.host
  • cassandra.input.native.min.simult.reqs.per.connection
  • cassandra.input.native.max.simult.reqs.per.connection
  • cassandra.input.native.connection.timeout
  • cassandra.input.native.read.connection.timeout
  • cassandra.input.native.receive.buffer.size
  • cassandra.input.native.send.buffer.size
  • cassandra.input.native.solinger
  • cassandra.input.native.tcp.nodelay
  • cassandra.input.native.reuse.address
  • cassandra.input.native.keep.alive
  • cassandra.input.native.auth.provider
  • cassandra.input.native.ssl.trust.store.path
  • cassandra.input.native.ssl.key.store.path
  • cassandra.input.native.ssl.trust.store.password
  • cassandra.input.native.ssl.key.store.password
  • cassandra.input.native.ssl.cipher.suites
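
Because the TBLPROPERTIES clause can include native protocol properties, you can pass them in an external table definition. A minimal sketch (the port value 9042 is the default native transport port; the keyspace and table names are carried over from the earlier example):

hive> CREATE EXTERNAL TABLE MyNativeTable
        ( key int, data string )
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
        TBLPROPERTIES ( "cassandra.ks.name" = "cassandra_keyspace" ,
          "cassandra.cf.name" = "exampletable" ,
          "cassandra.input.native.port" = "9042" );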

Inspecting an auto-created external table (DataStax Enterprise 4.0.4 and later)

In Hive, you can use the SHOW CREATE TABLE <CQL table name> command to see the schema of an auto-created external table. The output of this command can help you construct a custom Hive external table definition. Assuming you created the table in "Browsing through Cassandra tables in Hive", use the SHOW CREATE TABLE command to see the schema of exampletable.
hive> SHOW CREATE TABLE exampletable;
OK
CREATE EXTERNAL TABLE exampletable(
  key int COMMENT 'from deserializer', 
  data string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe' 
STORED BY 
  'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' 
WITH SERDEPROPERTIES ( 
  'serialization.format'='1', 
  'cassandra.columns.mapping'='key,data')
LOCATION
  'cfs://127.0.0.1/user/hive/warehouse/cassandra_keyspace.db/exampletable'
TBLPROPERTIES (
  'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner', 
  'cql3.partition.key'='key', 
  'cassandra.ks.name'='cassandra_keyspace', 
  'cassandra.cf.name'='exampletable', 
  'auto_created'='true')
Time taken: 0.028 seconds, Fetched: 18 row(s)

The CREATE EXTERNAL TABLE definition uses the cql3.partition.key TBL property, available only in DataStax Enterprise 4.0.4 and later.
