BYOH Introduction

DataStax Enterprise (DSE) works with external Hadoop systems in a bring your own Hadoop (BYOH) model. Use BYOH to run DSE with a separate Hadoop cluster from a different vendor.

Hadoop is a software framework for distributed processing of large data sets using MapReduce programs. Supported vendors are:

Hadoop 2.x data warehouse implementations Cloudera 4.5, 4.6, and 5.0.x
Hortonworks 1.3.3 and 2.0.x

You can use Hadoop in one of the following modes:

External Hadoop
Uses the Hadoop distribution provided by Cloudera (CDH) or Hortonworks (HDP).
Internal Hadoop
Uses the DSE Hadoop integrated with DataStax Enterprise.

For legacy purposes, DataStax Enterprise 4.5 includes DSE Hadoop 1.0.4 with built-in Hadoop trackers.

Use cases for BYOH are:

Bi-directional data movement between Cassandra in DataStax Enterprise and the Hadoop Distributed File System (HDFS)
Hive queries against Cassandra data in DataStax Enterprise
Data combination (joins) between Cassandra and HDFS data
ODBC access to Cassandra data through Hive

Components

This table compares DSE Hadoop with the external Hadoop system in the BYOH model:

**Comparison of DSE Hadoop and the BYOH model**
Component	DSE Integrated Hadoop Owner	BYOH Owner	DSE Interaction
Job tracker	DSE Cluster	Hadoop Cluster	Optional
Task tracker	DSE Cluster	Hadoop Cluster	Co-located with BYOH nodes
Pig	Distributed with DSE	Distribution chosen by operator	Can launch from task trackers
Hive	Distributed with DSE	Distribution chosen by operator	Can launch from task trackers
HDFS/CFS	CFS	HDFS	Block storage

BYOH installation and configuration overview

The procedure for installing and configuring DataStax Enterprise for BYOH is straightforward. First, ensure that you meet the prerequisites. Next, install DataStax Enterprise on all nodes in the Cloudera or Hortonworks cluster and on additional nodes outside the Hadoop cluster. Then install several Cloudera or Hortonworks components on the additional nodes and deploy those nodes in a virtual BYOH data center. Finally, configure DataStax Enterprise BYOH environment variables on each node in the BYOH data center to point to the Hadoop cluster, as shown in the following diagram:

DataStax Enterprise runs only on BYOH nodes, and uses Hadoop components to integrate BYOH and Hadoop. You never start up the DataStax Enterprise installations on the Hadoop cluster.
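The final configuration step, pointing environment variables on each BYOH node at the Hadoop cluster, can be sanity-checked with a small script. The following is a minimal sketch, not part of DataStax Enterprise. The variable names HADOOP_HOME, HIVE_HOME, and PIG_HOME are assumptions chosen for illustration; substitute the names that your DSE release documents.

```python
import os
from typing import List, Mapping, Sequence

# Hypothetical variable names used for illustration only; replace them
# with the BYOH variables documented for your DSE release.
REQUIRED_VARS = ("HADOOP_HOME", "HIVE_HOME", "PIG_HOME")


def find_problems(env: Mapping[str, str],
                  required: Sequence[str] = REQUIRED_VARS) -> List[str]:
    """Return one message per variable that is unset or is not a directory."""
    problems = []
    for name in required:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    # Check the environment of the current shell on a BYOH node.
    for problem in find_problems(os.environ):
        print(problem)
```

Running the script on each node in the BYOH data center prints nothing when every listed variable resolves to an existing directory.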

MapReduce process

In a typical Hadoop cluster, Task Tracker and Data Node services run on each node. A Job Tracker service running on one of the master nodes coordinates MapReduce jobs between the Task Trackers, which pull data from the local Data Node. In the latest versions of Hadoop, which use YARN, Node Manager services replace Task Trackers and the Resource Manager service replaces the Job Tracker.

In contrast with the typical Hadoop cluster, in the BYOH model DSE Cassandra services can take the place of the Data Node service in MapReduce jobs, providing data directly to the Task Trackers/Node Managers, as shown in the following diagram. For simplicity, the diagram uses the following nomenclature:

Task Tracker--Means Task Tracker or Node Manager.
Job Tracker--Means Job Tracker or Resource Manager.

A MapReduce service runs on each BYOH node along with optional MapReduce, Hive, and Pig clients. To take advantage of the performance benefits offered by Cassandra, BYOH handles frequently accessed hot data. The Hadoop cluster handles less frequently and rarely accessed cold data. You design the MapReduce application to store output in either Cassandra or Hadoop.

The following diagram shows the data flow of a job in a BYOH data center. The Job Tracker/Resource Manager (JT/RM) receives MapReduce input from the client application. The JT/RM sends a MapReduce job request to the Task Trackers/Node Managers (TT/NM) and to the optional MapReduce, Hive, and Pig clients. The data is written to Cassandra and the results are sent back to the client.

BYOH workflow

BYOH clients submit Hive jobs to the Hadoop job tracker, or to the ResourceManager in the case of YARN. If Cassandra is the source of the data, the job tracker evaluates the job, and the ColumnFamilyInputFormat creates input splits and assigns tasks to the task trackers on the Cassandra nodes, which gives the tasks local data access. The Hadoop job runs until the output phase.
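The idea of assigning tasks so that they read local data can be illustrated with a toy model. The sketch below is not the ColumnFamilyInputFormat or the Hadoop scheduler; the `Split` type, the host names, and the least-loaded tie-breaking rule are all assumptions made for illustration. Each split lists the hosts that hold a copy of its data, and the assignment prefers a task tracker on one of those hosts.

```python
from typing import Dict, List, NamedTuple


class Split(NamedTuple):
    token_range: str      # label for a slice of the data, e.g. "0-99"
    replicas: List[str]   # hosts that hold a copy of that slice


def assign_splits(splits: List[Split],
                  trackers: List[str]) -> Dict[str, List[str]]:
    """Give each split to the least-loaded tracker, preferring replica hosts."""
    load: Dict[str, List[str]] = {t: [] for t in trackers}
    for split in splits:
        local = [t for t in trackers if t in split.replicas]
        # Fall back to any tracker when no tracker is co-located with the data.
        candidates = local or trackers
        chosen = min(candidates, key=lambda t: len(load[t]))
        load[chosen].append(split.token_range)
    return load


if __name__ == "__main__":
    splits = [
        Split("0-99", ["n1", "n2"]),
        Split("100-199", ["n2", "n3"]),
        Split("200-299", ["n3", "n1"]),
    ]
    for tracker, ranges in assign_splits(splits, ["n1", "n2", "n3"]).items():
        print(tracker, ranges)
```

In this model every split that has a replica on a tracker host is read locally, which is the property the workflow above relies on.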

During the output phase, if Cassandra is the target of the output, the HiveCqlOutputFormat writes the data back into Cassandra from the various reducers. When the reduce step writes data back to Cassandra, locality is not a concern and the data is written normally into the cluster. This pattern is the same for Hadoop in general: when results are spilled to disk, each reducer writes its partial results to a separate file, and when results are written to HDFS, each reducer writes its own data back.
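The "one partial result per reducer" pattern can be sketched in a few lines. This is a simplified model, not Hadoop code: keys are routed to reducers with a CRC32 hash as a stand-in for Hadoop's partitioner, and the output names follow the `part-r-NNNNN` convention that Hadoop uses for reducer output files. The function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Route a key to a reducer index using a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_to_parts(pairs: Iterable[Tuple[str, int]],
                    num_reducers: int) -> Dict[str, List[Tuple[str, int]]]:
    """Sum values per key, then group the sums by reducer output name."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    # One output per reducer, even if a reducer receives no keys.
    parts: Dict[str, List[Tuple[str, int]]] = {
        f"part-r-{i:05d}": [] for i in range(num_reducers)
    }
    for key in sorted(totals):
        name = f"part-r-{partition(key, num_reducers):05d}"
        parts[name].append((key, totals[key]))
    return parts


if __name__ == "__main__":
    pairs = [("cassandra", 1), ("hdfs", 1), ("cassandra", 1), ("hive", 1)]
    for name, rows in reduce_to_parts(pairs, 3).items():
        print(name, rows)
```

Because every key maps to exactly one reducer, the separate outputs never overlap, and a complete result is the union of all parts.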

Intermediate MapReduce files are stored on the local disk or in temporary HDFS tables, depending on configuration, but never in CFS. Using the BYOH model, Hadoop MapReduce jobs can access Cassandra as a data source and write results back to Cassandra or Hadoop.