Using Hive with BYOH (Deprecated)

BYOH Hive is deprecated and will be removed in a future release.

Apache Hive is a data warehouse system for Hadoop that projects a relational structure onto data stored in Hadoop-compatible file systems. The DataStax Enterprise DSE Hadoop documentation provides a general introduction to Hive for new users.

BYOH capabilities connect DataStax Enterprise to a Hive MapReduce client in the external Hadoop system for querying the data using a SQL-like language called HiveQL.

Start Hive on a Cassandra BYOH node, and then run MapReduce queries directly on data inside or outside Cassandra. Use a Hive managed table to query data outside of Cassandra; Hive stores and deletes the data in a managed table. Use a Hive external table to query data in Cassandra; Cassandra stores and deletes the data in an external table.
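
In HiveQL terms, the distinction between the two table types looks like this. The following is a minimal sketch with hypothetical table and column names:

-- Managed table: Hive owns the storage, so dropping the table also deletes the rows.
CREATE TABLE flatfile_data (id INT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DROP TABLE flatfile_data;    -- the data is deleted along with the table

-- External table mapped to Cassandra: only the mapping lives in Hive, so
-- DROP TABLE removes the Hive metadata while the rows remain in Cassandra.
-- The CREATE EXTERNAL TABLE statement is generated by dse hive-schema (see below).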

Starting Hive 

To start Hive, use this byoh command:

$ bin/byoh hive

The output should look something like this:

/usr/lib/dse/resources/cassandra/conf

Logging initialized using configuration in jar:file:/usr/lib/hive12/lib/hive-common-0.12.0.jar!/hive-logback.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-logback12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive12/lib/slf4j-logback12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/dse/resources/dse/lib/slf4j-logback12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.LogbackLoggerFactory]
hive>

Accessing data outside Cassandra 

At the Hive prompt, you can create and query a Hive managed table. For example, you can query a flat file that you put on HDFS (using a hadoop fs command), or the file can be elsewhere, such as on an operating system file system.
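
For example, here is a minimal sketch, assuming a comma-delimited file that was already copied to HDFS with hadoop fs -put; the path, table, and column names are hypothetical.

hive> CREATE TABLE mydata (id INT, info STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/user/hadoop/mydata.csv' INTO TABLE mydata;
hive> SELECT * FROM mydata LIMIT 10;

Because mydata is a managed table, LOAD DATA moves the file into the Hive warehouse directory, and dropping the table would delete the data with it.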

Accessing data in Cassandra 

Use the DataStax Enterprise custom metastore in the BYOH model to map Cassandra tables to Hive tables automatically. The keyspace and table must already exist in Cassandra. You create a schema representing your table using the dse hive-schema command. The command dumps your entire schema, or part of it, to standard output. Next, you pass the file containing the mapping to Hive using the byoh hive -f command. DataStax Enterprise creates the Hive external table. Finally, create or alter CQL data from Hive.

The syntax of the hive-schema command is:

$ bin/dse hive-schema -keyspace testks -table testa testb -exclude testc testd

The hive-schema command options are:

-all
    Include all keyspaces and tables.
-decimal
    Decimal parameters in the form precision,scale (Hive 0.13 and later).
-exclude
    Exclude these tables.
-help
    Provide hive-schema command usage.
-keyspace
    Include these keyspaces.
-table
    Include these tables.

To dump all Cassandra keyspaces and tables to a file called byoh_automap, for example, use this command:

$ dse hive-schema -all > byoh_automap

To start Hive and pass it the generated schema file:

$ byoh hive -f byoh_automap
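
After the external table exists, you can also write to the mapped Cassandra table from Hive. The following is a minimal sketch, assuming the cassandra_keyspace.exampletable mapping created in the demo below; Hive releases before 0.14 have no INSERT ... VALUES syntax, so the constants are selected from an existing row.

hive> USE cassandra_keyspace;
hive> INSERT INTO TABLE exampletable SELECT 2, 'Written from Hive' FROM exampletable LIMIT 1;

The new row should then be visible both from Hive and from cqlsh.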

Running the Hive demo 

The Hive demo creates a keyspace and table in Cassandra using cqlsh, creates a Hive external table, and then queries the table from Hive.

Note: DataStax Demos do not work with either LDAP or internal authentication (username/password) enabled.
  1. Create a Cassandra keyspace and table using cqlsh.
    cqlsh> CREATE KEYSPACE cassandra_keyspace WITH replication =
             {'class': 'NetworkTopologyStrategy', 'Cassandra': 1};
    cqlsh> use cassandra_keyspace;
    cqlsh:cassandra_keyspace> CREATE TABLE exampletable ( key int PRIMARY KEY , data text );
    cqlsh:cassandra_keyspace> INSERT INTO exampletable (key, data ) VALUES ( 1, 'This data can be read automatically in hive');
  2. On the command line, use the dse hive-schema command to create an automap file:
    $ bin/dse hive-schema -keyspace cassandra_keyspace -table exampletable > byoh_automap

    The generated mapping in byoh_automap is:

    CREATE DATABASE IF NOT EXISTS cassandra_keyspace;
    
    USE cassandra_keyspace;
    
    CREATE EXTERNAL TABLE IF NOT EXISTS exampletable ( 
      key int  COMMENT 'Auto-created based on org.apache.cassandra.db.marshal.Int32Type from Column Family meta data',
      data string  COMMENT 'Auto-created based on org.apache.cassandra.db.marshal.UTF8Type from Column Family meta data')
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe' 
    STORED BY
      'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' 
    WITH SERDEPROPERTIES (
      'serialization.format'='1',
      'cassandra.columns.mapping'='key,data')
    TBLPROPERTIES (
      'auto_created' = 'true',
      'cassandra.partitioner' = 'org.apache.cassandra.dht.Murmur3Partitioner',
      'cql3.partition.key' = 'key',
      'cassandra.ks.name' = 'cassandra_keyspace',
      'cassandra.cf.name' = 'exampletable');
  3. To start Hive and pass it the generated schema file:
    $ byoh hive -f byoh_automap
    SLF4J: Found binding in [jar:file:/home/automaton/dse-4.6.0/resources/dse/lib/slf4j-logback12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.LogbackLoggerFactory]
    OK
    Time taken: 5.15 seconds
    OK
    Time taken: 0.008 seconds
    OK
    Time taken: 3.085 seconds
  4. Start Hive using the byoh hive command to access the Cassandra table.
    $ bin/byoh hive
  5. In Hive, use the Cassandra keyspace and query the Cassandra table.
    hive> use cassandra_keyspace;
    OK
    Time taken: 5.264 seconds
    
    hive> select * from exampletable;
    OK
    1 This data can be read automatically in hive
    Time taken: 3.815 seconds, Fetched: 1 row(s)