Using Hive with BYOH (Deprecated)
Apache Hive is a data warehouse system for Hadoop that projects a relational structure onto data that is stored in Hadoop-compatible file systems. BYOH Hive is deprecated and will be removed in a future release.
BYOH Hive is deprecated and will be removed in a future release.
Apache Hive is a data warehouse system for Hadoop that projects a relational structure onto data stored in Hadoop-compatible file systems. Documentation about DataStax Enterprise DSE Hadoop provides a general introduction to Hive for new users.
BYOH capabilities connect DataStax Enterprise to a Hive MapReduce client in the external Hadoop system for querying the data using a SQL-like language called HiveQL.
Start Hive on a Cassandra BYOH node, and then run MapReduce queries directly on data outside or inside Cassandra. Use a Hive managed table to query data outside of Cassandra. Hive manages storing and deleting the data in a Hive managed table. Use a Hive external table to query data in Cassandra. Cassandra manages storing and deleting the data in a Hive external table.
Starting Hive
To start Hive use this byoh command:
$ bin/byoh hive
The output should look something like this:
/usr/lib/dse/resources/cassandra/conf
Logging initialized using configuration in jar:file:/usr/lib/hive12/lib/hive-common-0.12.0.jar!/hive-logback.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-logback12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive12/lib/slf4j-logback12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/dse/resources/dse/lib/slf4j-logback12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.LogbackLoggerFactory]
hive>
Accessing data outside Cassandra
At the Hive prompt, you can create and query the Hive managed table. For example, you can query a flat file that you put on the HDFS (using a hadoop -fs command) or the file can be elsewhere, such as on an operating system file system.
Accessing data in Cassandra
Use the DataStax Enterprise custom metastore in the BYOH model to map Cassandra tables to
Hive tables automatically. The keyspace and table must pre-exist in Cassandra. You create a
schema representing your table using the dse hive-schema
command. The
command dumps your entire schema, or part of it, to standard output. Next, in the Hive
client, you pass the table containing the map to Hive using the byoh hive
-f
command. DataStax Enterprise creates the Hive external table. Finally, create
or alter CQL data from Hive.
The syntax of the hive-schema command is:
bin/dse hive-schema -keyspace testks -table testa testb -exclude testc testd
The hive-schema
command options are:
- -all
- Include all keyspaces and tables
- -decimal
- Decimal parameters in form precision, scale for Hive 0.13 and later
- -exclude
- Exclude these tables
- -help
- Provide hive-schema command usage
- -keyspace
- Include these keyspaces
- -table
- Include these tables
To dump all Cassandra keyspaces and tables to a file called byoh_automap, for example, use this command:
$ dse hive-schema -all > byoh_automap
To start Hive and pass the hive-schema:
$ byoh hive -f byoh_automap
Running the Hive demo
The Hive demo creates a keyspace and table in Cassandra using cqlsh, creates a Hive external table, and then queries the table from Hive.
- Create a Cassandra keyspace and table using cqlsh.
cqlsh> CREATE KEYSPACE cassandra_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra': 1}; cqlsh> use cassandra_keyspace; cqlsh:cassandra_keyspace> CREATE TABLE exampletable ( key int PRIMARY KEY , data text ); cqlsh:cassandra_keyspace> INSERT INTO exampletable (key, data ) VALUES ( 1, 'This data can be read automatically in hive');
- On the command line, use the dse hive-schema command to create an automap
file:
$ bin/dse hive-schema -keyspace cassandra_keyspace -table exampletable
The output is:
CREATE DATABASE IF NOT EXISTS cassandra_keyspace; USE cassandra_keyspace; CREATE EXTERNAL TABLE IF NOT EXISTS exampletable ( key int COMMENT 'Auto-created based on org.apache.cassandra.db.marshal.Int32Type from Column Family meta data', data string COMMENT 'Auto-created based on org.apache.cassandra.db.marshal.UTF8Type from Column Family meta data') ROW FORMAT SERDE 'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe' STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' WITH SERDEPROPERTIES ( 'serialization.format'='1', 'cassandra.columns.mapping'='key,data') TBLPROPERTIES ( 'auto_created' = 'true', 'cassandra.partitioner' = 'org.apache.cassandra.dht.Murmur3Partitioner', 'cql3.partition.key' = 'key', 'cassandra.ks.name' = 'cassandra_keyspace', 'cassandra.cf.name' = 'exampletable');
- To start Hive and pass the
hive-schema:
$ byoh hive -f byoh_automap SLF4J: Found binding in [jar:file:/home/automaton/dse-4.6.0/resources/dse/lib/slf4j-logback12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.LogbackLoggerFactory] OK Time taken: 5.15 seconds OK Time taken: 0.008 seconds OK Time taken: 3.085 seconds
- Start Hive using the byoh hive command to access the Cassandra
table.
$ bin/byoh hive
- In Hive, use the Cassandra keyspace and query the Cassandra
table.
hive> use cassandra_keyspace; OK Time taken: 5.264 seconds hive> select * from exampletable; OK 1 This data can be read automatically in hive Time taken: 3.815 seconds, Fetched: 1 row(s)