Example: Work with an unsupported data type

DataStax Enterprise provides a user defined function (UDF) for converting Hive binary data into string representations of CQL types.

DataStax Enterprise provides a user defined function (UDF) for converting Hive binary data into string representations of CQL types. Hive cannot auto-create an external table that maps to the unsupported types, such as Cassandra blobs. You have to create a custom external table in Hive and map these types to binary. To read the data in Hive, use a provided UDF to convert the data.

Create the keyspace and two tables in cqlsh

This example first creates a keyspace and two tables in cqlsh and inserts data of every supported type into the tables.

  1. Start cqlsh. For example, on Linux.
    ./cqlsh
  2. In cqlsh, create a keyspace:
    cqlsh> CREATE KEYSPACE cql3ks WITH replication =
          { 'class': 'NetworkTopologyStrategy',
          'Analytics': '1' };
  3. Using cqlsh, create a table in the cql3ks keyspace having columns of every CQL data type.
    cql> USE cql3ks;
    
    cql3ks> CREATE TABLE genericAdd (
              key ascii PRIMARY KEY, a bigint, b blob, c boolean,
              d decimal, e double, f float, g inet, h int, i text,
              j timestamp, k uuid, l timeuuid, m varint);
  4. Insert some data into the table.
    cql3ks> INSERT INTO genericAdd (
              key,a,b,c,d,e,f,g,h,i,j,k,l,m)
              VALUES ('KeyOne', 100005, 0xBEEFFEED, true, 3.5,-1231.4,
              3.14, '128.2.4.1', 42, 'SomeText', '2008-10-03',
              e3d81c40-1961-11e3-8ffd-0800200c9a66,
              f078d660-1961-11e3-8ffd-0800200c9a66, 1000000);
  5. Create a second table, genericToAdd, containing every data type and insert different data into the table.
    cql3ks> CREATE TABLE genericToAdd (
              id int PRIMARY KEY, key ascii, a bigint, b blob, c boolean,
              d decimal, e double, f float, g inet, h int, i text,
              j timestamp, k uuid, l timeuuid, m varint);
  6. Insert some data into the second table.
    cql3ks> INSERT INTO genericToAdd (
              id,key,a,b,c,d,e,f,g,h,i,j,k,l,m)
              VALUES (1,'Oneness',1, 0x11111111, true, 1.11,-1111.1,1.11,
              '111.1.1.1', 11,'11111','1999-11-01',
              e3d81c40-1961-11e3-8ffd-0800200c9a66,
              f078d660-1961-11e3-8ffd-0800200c9a66, 1);

Create an external table in Hive

Next, create an external table in Hive that maps to the table in Cassandra. You cannot use the auto-created table because Hive cannot represent the blob type in a comprehensible format. After creating the custom external table, you can perform alterations of the CQL tables from Hive. You insert data from the second CQL table into the first CQL table from Hive. Using a UDF, you query the external table in Hive. You need to use the UDF because the data is of the unsupported blob type.
  1. Create a table in Hive that includes a cql3.output.query property that has the value of a prepared statement for inserting the data from the second, genericToAdd, table into the first, genericAdd, table.

    The last couple of lines in the following statement need to be free of line breaks. If you copy/paste this statement directly from the documentation and do not remove line breaks, an error occurs in the subsequent step.

    hive> CREATE EXTERNAL TABLE hive_genericadd ( key string, a bigint, b binary, c boolean, d decimal, e double, f float, g binary, h int, i string, j timestamp, k binary, l binary, m binary) STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler' TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks", "cassandra.cf.name" = "genericadd", "cql3.partition.key"="key", 
    "cql3.output.query" = "INSERT INTO cql3ks.genericadd (key,a,b,c,d,e,f,g,h,i,j,k,l,m) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
  2. Use the INSERT statement to start the MapReduce job that inserts the data from the second CQL table into the first one.
    hive> INSERT INTO TABLE hive_genericadd SELECT key,a,b,c,d,e,f,g,h,i,j,k,l,m FROM cql3ks.generictoadd;

    The MapReduce job runs.

    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    . . .
    Job 0: Map: 2   HDFS Read: 0 HDFS Write: 0 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 33.278 seconds
  3. Create an alias for the UDF provided by DataStax.
    hive> CREATE TEMPORARY FUNCTION c_to_string AS 'org.apache.hadoop.hive.cassandra.ql.udf.UDFCassandraBinaryToString';
  4. Select the data of the unsupported blob type from the Hive table by calling the UDF.
    hive> select c_to_string(b, 'blob') from hive_genericadd;

The MapReduce job runs, and the output correctly displays the values:

Total MapReduce jobs = 1
. . .
Job 0: Map: 2   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
beeffeed
11111111