About DSE Analytics

DataStax Enterprise targets the analytics market with significant features for analyzing huge databases:

Apache Spark
A fast alternative to Hadoop. Spark is a distributed, parallel, batch data processing engine based on the Resilient Distributed Datasets (RDD) concept rather than on MapReduce, the model on which Hadoop is based.
Shark
A Hive-like language built on top of Spark. Shark's Hive-like language simplifies the transition for Hive users. The connection of Spark to Cassandra runs analytical queries independently of Hadoop and provides faster data analysis than the typical MapReduce job.
BYOH
The bring your own Hadoop (BYOH) model gives organizations that already run recent versions of Hadoop from Cloudera or Hortonworks a way to use these implementations with DataStax Enterprise. Because the Hadoop installation can be customized and tuned, this model provides better performance than previous DataStax Enterprise versions.
Improved integration of Apache Sqoop
You can import RDBMS data to Cassandra and export Cassandra CQL data to an RDBMS.
DSE Hadoop
Hadoop is integrated with DataStax Enterprise and includes the following Hive and Pig features:
- Support for the native protocol in Hive.
- Auto-creation of Hive databases and external tables for each CQL keyspace and table.
- A cql3.partition.key property that maps Hive tables to CQL compound primary keys and composite partition keys.
- Support for HiveServer2.
- Integration of the HiveServer2 Beeline command shell.
- Support for expiring data in columns by setting TTL (time to live) on Hive tables.
- Support for expiring data by setting the TTL on Pig data using the cql:// URL, which includes a prepared statement. See step 10 of the library demo.
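
The contrast drawn above between MapReduce and the RDD concept can be sketched in plain Python. This is only an illustration of the two programming models under stated assumptions: it uses neither the Hadoop nor the Spark API, and the LazyDataset class, its method names, and the sample lines are hypothetical. The first half counts words with explicit map, shuffle, and reduce phases. The second half records a chain of transformations and evaluates them only when collect() is called, which is roughly how an RDD pipeline is expressed.

```python
from collections import defaultdict

# Hypothetical sample input; not taken from any DataStax demo.
lines = ["cassandra spark", "spark shark", "cassandra spark hive"]


# --- MapReduce style: explicit map, shuffle, and reduce phases ---
def map_phase(records):
    """Emit a (word, 1) pair for every word in every record."""
    for line in records:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


mapreduce_counts = reduce_phase(shuffle_phase(map_phase(lines)))


# --- RDD style: chained transformations, evaluated lazily ---
class LazyDataset:
    """Hypothetical stand-in for an RDD; this is not the Spark API."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, in order

    def _with(self, kind, fn):
        # Each transformation returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._steps + ((kind, fn),))

    def flat_map(self, fn):
        return self._with("flat_map", fn)

    def map(self, fn):
        return self._with("map", fn)

    def reduce_by_key(self, fn):
        return self._with("reduce_by_key", fn)

    def collect(self):
        """Action: replay the recorded transformations and return the result."""
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "flat_map":
                data = [out for item in data for out in fn(item)]
            elif kind == "map":
                data = [fn(item) for item in data]
            else:  # reduce_by_key
                merged = {}
                for key, value in data:
                    merged[key] = fn(merged[key], value) if key in merged else value
                data = list(merged.items())
        return data


rdd_counts = dict(
    LazyDataset(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)

print(mapreduce_counts == rdd_counts)
```

Both halves produce the same word counts. The sketch shows only how the work is expressed, not how either engine distributes or schedules it.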

DSE Analytics features

No single point of failure
DSE Hadoop supports a peer-to-peer, distributed cluster for running MapReduce jobs. Because all nodes are peers, any node in the cluster can load data files, and any analytics node can assume the responsibilities of Job Tracker for MapReduce jobs.
Job Tracker management
DSE Hadoop can automatically select the Job Tracker node and a reserve Job Tracker node that takes over if a problem affects availability. You can also set the Job Tracker and reserve Job Tracker nodes explicitly.
Multiple Job Trackers
You can run one or more Job Tracker services across multiple data centers and create multiple keyspaces per data center. Using this capability has performance, data replication, and other benefits.
Hadoop MapReduce using multiple Cassandra File Systems (CFS)
Cassandra File System (CFS) is a Hadoop Distributed File System (HDFS)-compatible storage layer. DataStax Enterprise replaces HDFS with CFS to run MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant, and scalable architecture. You can create additional CFS to organize and optimize data.
Analytics without ETL
Using DSE Hadoop, you run MapReduce jobs directly against data in Cassandra. You can perform real-time and analytics workloads at the same time without one workload affecting the performance of the other. When you start some cluster nodes as Hadoop analytics nodes and others as pure Cassandra real-time nodes, data is replicated between the nodes automatically.
Hive support
Hive, a data warehouse system, facilitates data summarization, ad hoc queries, and the analysis of large data sets that are stored in Hadoop-compatible file systems. Any JDBC-compliant user interface can connect to Hive from the server. Using the Cassandra-enabled Hive MapReduce client in DataStax Enterprise, you project a relational structure onto Hadoop data in CFS and query the data using a SQL-like language.
Pig support
The Cassandra-enabled Pig MapReduce client that is included with DSE Hadoop is a high-level platform for creating MapReduce programs used with Hadoop. You can analyze large data sets by running Pig programs and MapReduce-mode jobs directly on data that is stored in Cassandra.
Mahout support
Apache Mahout, included with DSE Hadoop, offers machine learning libraries. Machine learning improves a system, such as one that recreates the Google Priority Inbox, based on past experience or examples.
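
As a rough illustration of improving a system based on past examples, the following Python sketch scores a message subject by how often its words appeared in previously labeled mail. It does not use Mahout or any of its algorithms. The training data, the function names, and the simple word-count scoring rule are all assumptions chosen to keep the example small.

```python
from collections import Counter

# Hypothetical past examples: (subject line, was the message important?)
past_messages = [
    ("quarterly budget review", True),
    ("budget approval needed", True),
    ("weekend sale discount", False),
    ("discount coupon inside", False),
]


def train(examples):
    """Count how often each word appeared in important and unimportant mail."""
    important, other = Counter(), Counter()
    for text, is_important in examples:
        (important if is_important else other).update(text.split())
    return important, other


def priority_score(text, model):
    """Positive scores lean important; negative scores lean unimportant."""
    important, other = model
    return sum(important[word] - other[word] for word in text.split())


model = train(past_messages)
print(priority_score("budget meeting", model))
print(priority_score("discount sale", model))
```

Adding more labeled messages to past_messages and retraining changes the scores, which is the sense in which the system improves with experience. A production classifier would use a proper statistical model rather than raw word counts.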