Using an external file system (deprecated)

You can map a file in an external file system, such as the S3 native file system, to a table in Hive.

Hadoop is deprecated for use with DataStax Enterprise; both DSE Hadoop and BYOH (Bring Your Own Hadoop) are deprecated. Hive is also deprecated and will be removed when Hadoop is removed.

The data source is external to Hive, located in S3 for example, while the DSE Hadoop cluster continues to use the Cassandra File System (CFS). You create a Hive external table to query the data in the external file system. When you drop the external table, only the table metadata stored in the HiveMetaStore keyspace is removed; the data persists in the external file system.
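
In outline, the pattern looks like the following HiveQL sketch. The table name and S3 path here are placeholders; the procedure below walks through a complete working example.

    hive> CREATE EXTERNAL TABLE example_table (key STRING, value INT)
            ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
            STORED AS TEXTFILE
            LOCATION 's3n://example-bucket/example_dir/';
    hive> -- Removes only the metadata from the HiveMetaStore keyspace;
    hive> -- the data in s3n://example-bucket/example_dir/ is untouched.
    hive> DROP TABLE example_table;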



First, set up the hive-site.xml and core-site.xml files, and then create an external table as described in this procedure.

Procedure

  1. Open the hive-site.xml file for editing.
    There are two instances of the hive-site.xml file.

    For use with Spark, the default location of the hive-site.xml file is:

    Installer-Services and Package installations: /etc/dse/spark/hive-site.xml
    Installer-No Services and Tarball installations: install_location/resources/spark/conf/hive-site.xml

    For use with Hive, the default location of the hive-site.xml file is:

    Installer-Services and Package installations: /etc/dse/hive/hive-site.xml
    Installer-No Services and Tarball installations: install_location/resources/hive/conf/hive-site.xml
  2. Add a property to hive-site.xml that sets the default file system to the S3 native file system. Use fs.default.name as the property name and the location of the S3 bucket as the value. For example, if the S3 bucket name is mybucket:
    <property>
      <name>fs.default.name</name>
      <value>s3n://mybucket</value>
    </property>
  3. Save the file.
  4. Open the core-site.xml file for editing.
    The default location of the core-site.xml file depends on the type of installation:
    Installer-Services and Package installations: /etc/dse/hadoop/conf/core-site.xml
    Installer-No Services and Tarball installations: install_location/resources/hadoop/conf/core-site.xml
  5. Add these properties to core-site.xml to specify the AWS access key ID and secret access key credentials for accessing the S3 native file system:
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>ID</value>
    </property>
    
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>Secret</value>
    </property>
  6. Save the file and restart Cassandra.
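    For example, on an Installer-Services or Package installation, a command like the following typically restarts the node (for Installer-No Services and Tarball installations, restart the dse process from install_location/bin instead):
    $ sudo service dse restart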
  7. Create a directory in s3n://mybucket named, for example, mydata_dir.
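    One way to create the directory is with the Hadoop file system shell that ships with DSE; the AWS console or any other S3 client works as well. For example:
    $ dse hadoop fs -mkdir s3n://mybucket/mydata_dir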
  8. Create a data file named mydata.txt, for example. Delimit fields using =.
    "key1"=100
    "key2"=200 
    "key3"=300
  9. Put the data file that you created into s3n://mybucket/mydata_dir.
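    For example, using the same Hadoop file system shell (any S3 upload tool also works):
    $ dse hadoop fs -put mydata.txt s3n://mybucket/mydata_dir/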
  10. Using cqlsh, create a keyspace and a CQL table schema to accommodate the data on S3.
    cqlsh> CREATE KEYSPACE s3_counterpart WITH replication =
             {'class': 'NetworkTopologyStrategy', 'Analytics': 1};
    cqlsh> USE s3_counterpart;
    cqlsh:s3_counterpart> CREATE TABLE mytable
                                ( key text PRIMARY KEY , data int );
  11. Start Hive, and on the Hive command line, create an external table for the data on S3. Specify the S3 location as shown in this example.
    hive> CREATE EXTERNAL TABLE mytable (key STRING, value INT)
            ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
            STORED AS TEXTFILE
            LOCATION 's3n://mybucket/mydata_dir/';
    Now that the S3 data is mapped to a table in Hive, you can query it using Hive.
  12. Select all the data in the file on S3.
    SELECT * from mytable;
    OK
    "key1"  100
    "key2"  200
    "key3"  300
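
Because mytable is an external table, dropping it removes only the table metadata from the HiveMetaStore keyspace; mydata.txt remains in S3. One way to verify this, assuming the DSE Hadoop file system shell is available (any S3 client can also list the bucket), is to drop the table in Hive:

    hive> DROP TABLE mytable;

Then list the S3 directory from the operating system command line to confirm that the data file is still there:

    $ dse hadoop fs -ls s3n://mybucket/mydata_dir/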