Creating or altering CQL data from Hive

Use a Hive external table to create or alter CQL data from Hive. A counterpart to the Hive database/external table must pre-exist in Cassandra as an keyspace/table.

You need to use a Hive external table to create or alter CQL data from Hive. First, you map a Hive database to a Cassandra keyspace by using the keyspace name in the table properties of the CREATE EXTERNAL TABLE definition. Optionally, if your Hive database and Cassandra keyspace use different names, you can declare keyspace properties in your external table definition using the TBLPROPERTIES clause. The keyspace and table must pre-exist in Cassandra.

SimpleStrategy keyspace example

hive> CREATE EXTERNAL TABLE MyHiveTable 
        ( m string , n string , o string , p string ) 
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' 
        TBLPROPERTIES ( "cassandra.ks.name" = "MyCassandraKS" , 
          "cassandra.cf.name" = "mycasstable" , 
          "cassandra.ks.repfactor" = "2" , 
          "cassandra.ks.strategy" = 
            "org.apache.cassandra.locator.SimpleStrategy" );

NetworkToplogyStrategy keyspace example

To create a Hive database that maps to an existing keyspace that uses the NetworkTopologyStrategy replication strategy, use the cassandra.ks.stratOptions property to define the replication factors for data centers:

hive> CREATE EXTERNAL TABLE MyHiveTable
        (m string, n string, o string, p string)
        STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
        TBLPROPERTIES ( "cassandra.ks.name" = "MyCassandraKS",
          "cassandra.cf.name" = "mycasstable",
          "cassandra.ks.stratOptions" = "DC1:1, DC2:2, DC3:1",
          "cassandra.ks.strategy" =
            "org.apache.cassandra.locator.NetworkTopologyStrategy");

To create the table in a single node cluster, use:

 "cassandra.ks.stratOptions" = "Analytics:1",

Next, you can use a Hive client and custom storage handlers to access data in the Cassandra tables. After modifying data in Hive, you can store changes back to Cassandra using the CQL 3 storage handler: org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler.

A CQL table can use a simple CQL primary key, a compound primary key, or composite partition key. The way you access data in a CQL table from Hive is the same regardless of the type of key used by the CQL table.

To use Hive with legacy tables, such as those created using Thrift or the CLI, see DataStax Enterprise 3.0 documentation.

Hive to Cassandra type mapping

To create Hive tables, you need to understand how data types in Cassandra map to Hive data types. This table shows CQL, Cassandra internal storage engine (used by legacy tables), and Hive data type mapping:

CQL 3	Cassandra Internal	Hive
ascii	AsciiType	string
bigint	LongType	bigint
boolean	BooleanType	boolean
counter	CounterColumnType	bigint
decimal	DecimalType	binary
double	DoubleType	double
float	FloatType	float
inet	InetAddressType	binary
int	Int32Type	int
text/varchar	UTF8Type	string
timestamp	DateType	timestamp
timeuuid	TimeUUIDType	binary
uuid	UUIDType	binary
varint	IntegerType	binary
other	other	binary

The InetAddressType stores the raw IP address in network byte order.

Using TBLPROPERTIES and SERDEPROPERTIES

In an external table definition, the TBLPROPERTIES clause specifies CassandraFS and MapReduce properties for the table. The SERDEPROPERTIES clause specifies the properties used when serializing/deserializing data passed between the Hive table and Cassandra. You can add a WITH SERDEPROPERTIES clause to map meaningful column names in Hive to the Cassandra partition key, column names, and column values. You can change these properties on the fly. Using the Hive SET command, you can configure properties in the hive session. The settings become effective for the next query.

The following table lists properties used in the TBLPROPERTIES or SERDEPROPERTIES clause or both. The Type column lists how to declare properties in the table definition, as a TBLPROPERTIES (TBL), a SERDEPROPERTIES (SERDE) or both.

Property	Used in Hive Clause	Description
cassandra.cf.name	both	Cassandra table name
cassandra.columns.mapping	both	Mapping of Hive to legacy Cassandra columns
cassandra.consistency.level	both	Consistency level - default ONE
cassandra.cql3.type	both	CQL types
cassandra.host	both	IP of a Cassandra node to connect to
cassandra.input.split.size	both	MapReduce split size
cassandra.ks.name	both	Cassandra keyspace name
cassandra.ks.repfactor	both	Cassandra replication factor
cassandra.ks.strategy	both	Replication strategy class
cassandra.ks.stratOptions	both	Strategy options
cassandra.page.size	SERDE	Fetch tables with many columns by page size
cassandra.partitioner	both	Partitioner (default = configured partitioner)
cassandra.port	both	Cassandra RPC port - default 9160
cql3.output.query	TBL	A prepared stagement for storing alterations to a CQL users table
cql3.pushdown.enable	TBL	True (default) enable pushdown predicate

When you create an external table in Hive, you need to specify these properties:

cassandra.ks.name
cassandra.cf.name
cassandra.ks.repfactor (if SimpleStrategy is used)
cassandra.ks.strategy
cassandra.ks.stratOptions (if NetworkTopologyStrategy is used)
cql3.output.query

cassandra.input.split.size

The cassandra.input.split.size property configures the number of rows processed per mapper (64k rows per split). The default is 64 * 1024

Partitioner use by Hive

You do not need to specify cassandra.partitioner. Your configured partitioner is used by Hive. For example, Hive uses this property value if you use the Cassandra 1.2 default partitioner:

"cassandra.partitioner" = "org.apache.cassandra.dht.Murmur3Partitioner"

The "Creating or altering CQL data from Hive" and "MapReduce performance" sections show examples of using some of these properties.

Accessing tables with many columns

In the TBLPROPERTIES clause, set the cassandra.page.size to fetch a table with many columns at once by page size.