Using Hive

DataStax Enterprise includes a Cassandra-enabled Hive MapReduce client. Hive is a data warehouse system for Hadoop that projects a relational structure onto data stored in Hadoop-compatible file systems. You query the data using a SQL-like language called HiveQL.

You can start the Hive client on any analytics node and run MapReduce queries directly on data already stored in Cassandra. Using the DataStax Enterprise ODBC driver for Hive, an ODBC-compliant client application can connect to Hive through the Hive server.
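
For example, a minimal sketch of starting the Hive client or the Hive server on an analytics node (the exact options depend on your DataStax Enterprise installation):

    $ dse hive                          # Start the interactive Hive client.
    hive> SHOW TABLES;

    $ dse hive --service hiveserver     # Start the Hive server on this node.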

Why Hive 

By using Hive, you typically eliminate boilerplate MapReduce code and enjoy productivity gains. The large base of SQL users can master HiveQL quickly. Hive has a large set of standard functions, such as mathematical and string functions. You can use Hive for queries that Cassandra, as a NoSQL database, does not support, such as joins. DataStax Enterprise support for Hive also facilitates migrating data to DataStax Enterprise from an existing Hive warehouse. Hive capabilities are extensible through Hive user-defined functions (UDFs), which DataStax Enterprise supports.

Typical uses for Hive are:

  • Reporting

    User engagement and impression click count applications

  • Ad hoc analysis

  • Machine learning

    Advertising optimization

Hive in DataStax Enterprise 

DataStax Enterprise analytics nodes store Hive table structures in the CassandraFS instead of in a Hadoop Distributed File System (HDFS). You layer a Hive table definition onto a directory in the file system or onto a Cassandra CQL table. The Hive table definition describes the layout of the data and is stored in the HiveMetaStore keyspace. DataStax Enterprise implements the Hive metastore as the HiveMetaStore keyspace within Cassandra. Unlike open source Hive, there is no need to run the metastore as a standalone database to support multiple users.

The consistency level of Hadoop nodes is ONE by default. When processing Hive queries, however, if DataStax Enterprise can guarantee that all replicas are in the same data center, the consistency level LOCAL_ONE is used.

You can map an existing CQL table to a Hive table by creating an external table. The external table's data source is external to Hive, located in CQL. When you drop a Hive external table, only the table metadata stored in the HiveMetaStore keyspace is removed. The data persists in CQL.
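
For example, a minimal sketch of manually defining an external Hive table over a hypothetical CQL table in the keyspace mykeyspace (the keyspace, table, and column names are illustrative):

    CREATE EXTERNAL TABLE users (
      user_id int,
      name string)
    STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
    TBLPROPERTIES ("cassandra.ks.name" = "mykeyspace");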

Instead of an external table, you can use a Hive managed table. Hive manages storing and deleting the data in this type of table. The data source can be a flat file that you put on the CassandraFS (using a hadoop fs command), or the file can be elsewhere, such as on an operating system file system. To load the file, use the LOAD [LOCAL] DATA INPATH, INSERT INTO, or INSERT OVERWRITE Hive commands.
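
For example, a minimal sketch of creating a managed table and loading a flat file from the local file system (the table, columns, and path are illustrative):

    CREATE TABLE invoices (
      id int,
      amount double)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    LOAD DATA LOCAL INPATH '/tmp/invoices.csv' INTO TABLE invoices;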

Hive metastore configuration 

The HiveMetaStore in DataStax Enterprise supports multiple users and requires no configuration except increasing the default replication factor of the keyspace. The default replication for system keyspaces is 1. This replication factor is suitable for development and testing, not for a production environment. To avoid production problems, alter the replication factor of these system keyspaces from 1 to a higher number.

  • HiveMetaStore
  • cfs
  • cfs_archive

To prevent missing data problems or data unavailable exceptions after altering keyspaces that contain any data, run nodetool repair, as shown in the following example.
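
A minimal sketch, assuming a single analytics data center named Analytics and a target replication factor of 3 (adjust the data center name and factor for your cluster):

    cqlsh> ALTER KEYSPACE "HiveMetaStore"
           WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'Analytics': 3};

    $ nodetool repair HiveMetaStore

Repeat the ALTER KEYSPACE and nodetool repair steps for the cfs and cfs_archive keyspaces.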

Supported Hive features 

The Hive component in DataStax Enterprise 4.0 has been upgraded to Hive 0.12, which includes new querying capabilities, data type mappings, and performance enhancements. The following Hive 0.12 features are supported (a sample query follows the list):

  • Windowing functions
    • RANK
    • LEAD/LAG
    • ROW_NUMBER
    • FIRST_VALUE, LAST_VALUE
  • Aggregate OVER functions with PARTITION BY and ORDER BY
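
For example, a minimal sketch of a windowing query that ranks hypothetical page hits per user (the table and column names are illustrative):

    SELECT user_id,
           page,
           hits,
           RANK() OVER (PARTITION BY user_id ORDER BY hits DESC) AS hit_rank
    FROM page_hits;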

DataStax Enterprise supports most CQL and Cassandra internal data types. DataStax provides a Hive user-defined function (UDF) for working with unsupported types, such as blob:

org.apache.hadoop.hive.cassandra.ql.udf.UDFStringToCassandraBinary

This UDF converts Hive strings to native Cassandra types. Due to limitations in Hive, the UDF can convert only individual Hive strings to primitive types, not collections such as arrays and maps of strings. It is not possible, for example, to use the UDF to convert an array of strings representing inet addresses to an array of InetAddress columns in Cassandra.
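
A minimal sketch of registering and applying the UDF (the function alias and table names are illustrative):

    CREATE TEMPORARY FUNCTION to_cassandra_binary
      AS 'org.apache.hadoop.hive.cassandra.ql.udf.UDFStringToCassandraBinary';

    INSERT INTO TABLE blob_table
      SELECT id, to_cassandra_binary(payload_hex)
      FROM staging_table;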

CQL collections and CQL composite partition keys are now supported.
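
For example, a CQL table that uses a composite partition key and a collection can be queried from Hive. A minimal sketch, assuming the keyspace is mapped into Hive (the keyspace, table, and column names are illustrative, and the CQL map surfaces in Hive as a map type):

    -- CQL definition
    CREATE TABLE mykeyspace.sensor_readings (
      sensor_id int,
      day text,
      readings map<text, double>,
      PRIMARY KEY ((sensor_id, day))
    );

    -- HiveQL query against the mapped table
    USE mykeyspace;
    SELECT sensor_id, day, readings FROM sensor_readings;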

The Hive examples in this documentation show how to use these new features, as well as how to use external and managed tables.