Configuring an external Hadoop system

Perform configuration tasks after you install DataStax Enterprise.

After installing DataStax Enterprise, perform the following configuration tasks:
  • Configure Kerberos on the Hadoop cluster.
  • Configure Java on the Hadoop cluster.
  • If necessary, install a supported Hive version, such as Hive 0.12, on the Hadoop cluster.
  • Configure BYOH environment variables on nodes in the BYOH datacenter.

Configuring Kerberos (optional) 

To use Kerberos to protect your data, configure Hadoop security under Kerberos on your Hadoop cluster. For information about configuring Hadoop security, see "Using Cloudera Manager to Configure Hadoop Security" or the Hortonworks documentation.

Configuring Java 

BYOH requires that the external Hadoop system use the same Java version as DataStax Enterprise. Ensure that your Cloudera or Hortonworks cluster is configured to use that version.

Configuring Hive 

Configure nodes to use Hive or Pig, generally the versions provided with Cloudera or Hortonworks. No additional configuration is required for BYOH with Apache or Cloudera Hive versions 0.11 through 0.14.

  1. If your Hadoop distribution includes a Hive version other than 0.11 through 0.14, follow these steps to install a supported version.
  2. Download a supported version, such as Hive 0.12 from http://apache.mirrors.pair.com/hive/hive-0.12.0/hive-0.12.0.tar.gz.
  3. Unpack the archive to install Hive 0.12.
    $ tar -xzvf hive-0.12.0.tar.gz
  4. When you move the Hive installation, avoid overwriting the earlier version that was installed by Cloudera Manager or Ambari. For example, rename the Hive directory if necessary.
  5. Move the Hive you installed to the following location:
    $ sudo mv hive-0.12.0 /usr/lib/hive12
After making the changes, restart the external Hadoop system. For example, restart the CDH cluster from the Cloudera Manager-Cloudera Management Service drop-down. Finally, configure BYOH environment variables before using DataStax Enterprise.
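The Hive install steps above can be sketched as a single shell sequence. The version below is a simulation that runs in a scratch directory with a placeholder archive, so it needs neither sudo nor network access; a real installation uses the downloaded hive-0.12.0.tar.gz and /usr/lib/hive12.

```shell
# Simulated sketch of the Hive 0.12 install steps; the scratch directory and
# placeholder archive are stand-ins for the real download and /usr/lib.
cd "$(mktemp -d)"
mkdir hive-0.12.0 && touch hive-0.12.0/LICENSE
tar -czf hive-0.12.0.tar.gz hive-0.12.0 && rm -r hive-0.12.0  # stand-in for the download
tar -xzvf hive-0.12.0.tar.gz             # unpack the archive
mkdir -p usr/lib
mv hive-0.12.0 usr/lib/hive12            # rename so the Cloudera/Ambari Hive is not overwritten
ls usr/lib/hive12
```

Renaming the target directory (hive12 rather than hive) is what keeps the distribution-provided Hive intact alongside the new version.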

Configuring BYOH environment variables 

The DataStax Enterprise installation includes the byoh-env.sh configuration file that sets up the DataStax Enterprise environment. Make these changes on all nodes in the BYOH datacenter. BYOH automatically extracts the Hive version from the $HIVE_HOME/lib/hive-exec*.jar file name.
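The version extraction can be sketched as a small shell snippet. This is an illustration only, not BYOH's actual implementation; a scratch directory with an example jar name stands in for a real Hive installation.

```shell
# Sketch: derive the Hive version from the hive-exec jar name, in the spirit
# of BYOH's lookup of $HIVE_HOME/lib/hive-exec*.jar.
HIVE_HOME="$(mktemp -d)"                        # stand-in for a real HIVE_HOME
mkdir -p "$HIVE_HOME/lib"
touch "$HIVE_HOME/lib/hive-exec-0.12.0.jar"     # example jar name

jar=$(ls "$HIVE_HOME"/lib/hive-exec-*.jar)
HIVE_VERSION=$(basename "$jar" .jar | sed 's/^hive-exec-//')
echo "$HIVE_VERSION"                            # prints 0.12.0
```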

  1. Open the byoh-env.sh file.
    The default location of the byoh-env.sh file depends on the type of installation:
    • Installer-Services and Package installations: /etc/dse/byoh-env.sh
    • Installer-No Services and Tarball installations: install_location/bin/byoh-env.sh
  2. Set the DSE_HOME environment variable to the DataStax Enterprise installation directory.
    • Package installations:
      export DSE_HOME="/etc/dse"
    • Installer-Services installations:
      export DSE_HOME="/usr/share/dse"
    • Installer-No Services and Tarball installations:
      export DSE_HOME="install_location"
  3. Edit the byoh-env.sh file to point the BYOH configuration to the Hive version and the Pig version.
    HIVE_HOME="/usr/lib/hive" 
    PIG_HOME="/usr/lib/pig"
    Note: You can manually change the Hive version in the HIVE_VERSION environment variable in byoh-env.sh.
  4. Check that other configurable variables match the location of components in your environment.
  5. To use Pig, configure byoh-env.sh by editing the IP addresses to reflect your environment. On a single-node cluster, for example:
    export PIG_INITIAL_ADDRESS=127.0.0.1
    export PIG_OUTPUT_INITIAL_ADDRESS=127.0.0.1
    export PIG_INPUT_INITIAL_ADDRESS=127.0.0.1
  6. If a Hadoop data node is not running on the local machine, configure the DATA_NODE_LIST and NAME_NODE variables as follows:
    • DATA_NODE_LIST

      Provide a comma-separated list of Hadoop data node IP addresses that this machine can access. The list is set as mapreduce.job.hdfs-servers in the client configuration.

    • NAME_NODE
      Provide the host name or IP address of the name node.
    For example:
      export DATA_NODE_LIST="192.168.1.1, 192.168.1.2, 192.168.1.3"
      export NAME_NODE="localhost"
    If a Hadoop data node is running on the local machine, leave these variables blank. For example:
    export DATA_NODE_LIST=
    export NAME_NODE=
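As a rough sketch of how the comma-separated DATA_NODE_LIST value could feed the mapreduce.job.hdfs-servers client property: the exact wiring is internal to BYOH, and the space-stripping normalization below is an assumption for illustration.

```shell
# Assumed sketch: normalize DATA_NODE_LIST into the comma-separated value
# that ends up in mapreduce.job.hdfs-servers in the client configuration.
export DATA_NODE_LIST="192.168.1.1, 192.168.1.2, 192.168.1.3"
hdfs_servers=$(printf '%s' "$DATA_NODE_LIST" | tr -d ' ')   # strip spaces
echo "mapreduce.job.hdfs-servers=$hdfs_servers"
```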