About DSE Analytics

Use DSE Analytics to analyze huge databases. DSE Analytics provides real-time, streaming, and batch analytics through built-in integration with Apache Spark, a distributed, parallel data processing engine. BYOH (bring your own Hadoop) and DSE Hadoop are deprecated for use with DataStax Enterprise and will be removed in DataStax Enterprise 5.1.

DSE Analytics features

No single point of failure: DSE Hadoop (deprecated) supports a peer-to-peer, distributed cluster for running MapReduce jobs. Because all nodes are peers, any node in the cluster can load data files, and any analytics node can assume the responsibilities of Job Tracker for MapReduce jobs.
Job Tracker management: DSE Analytics provides automatic Job Tracker and Spark Master management.
Analytics without ETL: Using DSE Hadoop, you run MapReduce jobs directly against data in Cassandra. You can perform real-time and analytics workloads at the same time without one workload affecting the performance of the other. Starting some cluster nodes as Hadoop analytics nodes and others as pure Cassandra real-time nodes automatically replicates data between nodes.
Hive support: Hive, a data warehouse system, facilitates data summarization, ad hoc queries, and the analysis of large data sets that are stored in Hadoop-compatible file systems. Any ODBC or JDBC compliant user interface connects to Hive from the server. Using the Cassandra-enabled Hive MapReduce client in DataStax Enterprise, you project a relational structure onto Hadoop data in CFS, and query the data using HiveQL, an SQL-like language.
DataStax Enterprise file system (DSEFS): DSEFS is a new distributed file system within DataStax Enterprise that is intended primarily to provide fault tolerance for Spark streaming use cases and Write Ahead Logging (WAL). DSEFS performs better than CFS (Cassandra File System).
Hadoop MapReduce with multiple Cassandra File Systems (CFS): Cassandra File System (CFS) is a Hadoop Distributed File System (HDFS)-compatible storage layer. DataStax Enterprise replaces HDFS with CFS to run MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant, and scalable architecture. You can create additional CFS to organize and optimize data.
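
The write-ahead logging that DSEFS is intended to back can be pictured with a small, generic sketch. The Python below is not DSEFS or Spark code; the WriteAheadLog class, the events.wal file name, and the batch keys are hypothetical names chosen for illustration. It shows only the general idea: record each update durably before applying it, so that state can be rebuilt by replaying the log after a failure.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy write-ahead log (hypothetical, for illustration only)."""

    def __init__(self, path):
        self.path = path
        self.state = {}

    def put(self, key, value):
        # 1. Append the update to the log and force it to disk first.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the update to the in-memory state.
        self.state[key] = value

    @classmethod
    def recover(cls, path):
        """Rebuild the state by replaying the log from the beginning."""
        wal = cls(path)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    wal.state[record["key"]] = record["value"]
        return wal


def demo():
    """Write three updates, discard the in-memory state, and recover it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.wal")
        wal = WriteAheadLog(path)
        wal.put("batch-1", "received")
        wal.put("batch-1", "processed")
        wal.put("batch-2", "received")
        del wal  # simulate losing the process that held the state
        return WriteAheadLog.recover(path).state


if __name__ == "__main__":
    print(demo())
```

In this sketch the log lives on a local disk, so it is lost if the machine is lost. Keeping such a log on a distributed, fault-tolerant file system is what lets recovery happen on a different node.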

The following DSE Analytics features are deprecated for use with DataStax Enterprise:

Pig support (deprecated)

The Cassandra-enabled Pig MapReduce client that is included with DSE Hadoop is a high-level platform for creating MapReduce programs to use with Hadoop. You can analyze large data sets by running jobs in MapReduce mode and Pig programs directly on data that is stored in Cassandra.

Mahout support (deprecated)

Apache Mahout, included with DSE Hadoop, offers machine learning libraries. Machine learning lets a system improve based on past experience or examples. One such system is a recreation of the Google Priority Inbox.

BYOH (deprecated)

Hadoop is deprecated for use with DataStax Enterprise. DSE Hadoop and BYOH (Bring Your Own Hadoop) are also deprecated. The BYOH model gives organizations that already run a Hadoop distribution from Cloudera or Hortonworks a way to use that distribution with DataStax Enterprise. Through a custom, better-tuned Hadoop, this model provides better performance than earlier DataStax Enterprise versions.

Integration of Apache Sqoop (deprecated)

You can import RDBMS data to Cassandra and export Cassandra CQL data to an RDBMS.

DSE Hadoop (deprecated)

Hadoop is integrated with DataStax Enterprise through the following Hive and Pig tools:

Support for the native protocol in Hive.
Auto-creation of Hive databases and external tables for each CQL keyspace and table.
A cql3.partition.key property that maps Hive tables to CQL compound primary keys and composite partition keys.
Support for HiveServer2.
Integration of the Beeline command shell.
Support for expiring data in columns by setting TTL (time to live) on Hive tables.
Support for expiring data by setting TTL on Pig data using the cql:// URL, which includes a prepared statement. See step 10 of the library demo.
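
To make the distinction between a compound primary key and a composite partition key concrete, here is a minimal Python sketch that splits a CQL PRIMARY KEY clause into its partition key columns and clustering columns. It illustrates CQL key structure only and is not the DataStax Enterprise implementation. The split_primary_key function and the column names are hypothetical, and the sketch does not show the value format that cql3.partition.key expects.

```python
import re


def split_primary_key(clause):
    """Return (partition_key_columns, clustering_columns) for a CQL
    PRIMARY KEY clause. Hypothetical helper, for illustration only."""
    match = re.search(r"PRIMARY\s+KEY\s*\((.*)\)\s*$", clause.strip(),
                      re.IGNORECASE | re.DOTALL)
    if not match:
        raise ValueError("not a PRIMARY KEY clause: %r" % clause)
    body = match.group(1).strip()

    if body.startswith("("):
        # Composite partition key: inner parentheses group its columns.
        end = body.index(")")
        partition = [c.strip() for c in body[1:end].split(",") if c.strip()]
        rest = body[end + 1:]
    else:
        # Single-column partition key: the first column listed.
        first, _, rest = body.partition(",")
        partition = [first.strip()]

    # Any remaining columns are clustering columns.
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


if __name__ == "__main__":
    # Compound primary key: one partition column plus a clustering column.
    print(split_primary_key("PRIMARY KEY (user_id, event_time)"))
    # Composite partition key: two partition columns plus a clustering column.
    print(split_primary_key("PRIMARY KEY ((user_id, day), event_time)"))
```

For the first clause the partition key is user_id and event_time is a clustering column. For the second, user_id and day together form the partition key.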