About DSE Analytics

Use DSE Analytics to analyze huge databases. DSE Analytics includes integration with Apache Spark, BYOH (bring your own Hadoop), and DSE Hadoop.

Apache Spark
A fast alternative to Hadoop. Spark is a distributed, parallel, batch data processing engine based on the Resilient Distributed Datasets (RDD) concept.
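The RDD idea can be illustrated without Spark itself. The following is a minimal pure-Python sketch, not Spark's API: the class name ToyRDD and its methods are hypothetical. It shows the two properties the RDD concept rests on. Data is split into partitions that can be processed independently, and transformations are recorded lazily as lineage, so a lost partition can be recomputed from the source data.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD (not the Spark API)."""

    def __init__(self, partitions, lineage=()):
        # Source data, already split into partitions.
        self._partitions = [list(p) for p in partitions]
        # Recorded transformations; they are applied only on demand.
        self._lineage = tuple(lineage)

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        # Replay the lineage on one partition. Calling this again is how a
        # lost partition would be rebuilt from the source data.
        rows = self._partitions[index]
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def collect(self):
        out = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out

    def reduce(self, fn):
        return reduce(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())                   # [4, 16, 36]
print(even_squares.reduce(lambda a, b: a + b))  # 56
```

In this toy model, each partition is computed on its own, which is what allows a real engine to spread the work across nodes.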
BYOH
The bring your own Hadoop (BYOH) model gives organizations that already run a Hadoop distribution from Cloudera or Hortonworks a way to use that distribution with DataStax Enterprise. Because you can use a custom, better-tuned Hadoop, this model provides better performance than earlier DataStax Enterprise versions.
Improved integration of Apache Sqoop
You can import RDBMS data to Cassandra and export Cassandra CQL data to an RDBMS.
DSE Hadoop
Hadoop is integrated with DataStax Enterprise and provides the following Hive and Pig capabilities:
  • Support for the native protocol in Hive.
  • Auto-creation of Hive databases and external tables for each CQL keyspace and table.
  • A cql3.partition.key property that maps Hive tables to CQL compound primary keys and composite partition keys.
  • Support for HiveServer2.
  • Integration of the Beeline command shell.
  • Support for expiring data in columns by setting TTL (time to live) on Hive tables.
  • Support for expiring data by setting TTL on Pig data using the cql:// URL, which includes a prepared statement. See step 10 of the library demo.
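The TTL behavior in the last two items can be pictured with a small model. This is a sketch in plain Python, not the Hive or Pig integration: TtlColumnStore and its methods are hypothetical names, and the clock is passed in explicitly so the example is deterministic. A value written with a TTL (in seconds) stops being returned once that many seconds have elapsed.

```python
class TtlColumnStore:
    """Hypothetical in-memory model of column values that expire after a TTL."""

    def __init__(self):
        # key -> (value, expires_at); expires_at is None when no TTL was set.
        self._cells = {}

    def write(self, key, value, now, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        self._cells[key] = (value, expires_at)

    def read(self, key, now):
        cell = self._cells.get(key)
        if cell is None:
            return None
        value, expires_at = cell
        if expires_at is not None and now >= expires_at:
            return None  # The TTL has elapsed, so the value is treated as gone.
        return value


store = TtlColumnStore()
store.write("session", "abc123", now=0, ttl_seconds=60)
store.write("user", "alice", now=0)  # No TTL: never expires.

print(store.read("session", now=59))   # abc123
print(store.read("session", now=60))   # None
print(store.read("user", now=10_000))  # alice
```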

DSE Analytics features

No single point of failure
DSE Hadoop supports a peer-to-peer, distributed cluster for running MapReduce jobs. Because all nodes are peers, any node in the cluster can load data files, and any analytics node can assume the responsibilities of Job Tracker for MapReduce jobs.
Job Tracker management
DSE Hadoop can automatically select the Job Tracker and the reserve Job Tracker nodes, which take over if a problem affects availability. You can also set the Job Tracker and reserve Job Tracker nodes explicitly.
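A rough model of that behavior, written in plain Python for illustration only. The class JobTrackerPool, its method names, and the rule of promoting the first reserve in sorted order are assumptions, not the selection logic that DSE Hadoop actually uses.

```python
class JobTrackerPool:
    """Hypothetical model of one active Job Tracker plus ordered reserves."""

    def __init__(self, nodes, active=None, reserves=None):
        ordered = sorted(nodes)
        if not ordered:
            raise ValueError("at least one analytics node is required")
        # Explicit settings win; otherwise select automatically
        # (assumption: first node in sorted order).
        self.active = active if active is not None else ordered[0]
        if reserves is not None:
            self.reserves = list(reserves)
        else:
            self.reserves = [n for n in ordered if n != self.active]

    def node_failed(self, node):
        """Drop a failed node; promote the first reserve if it was active."""
        if node == self.active:
            self.active = self.reserves.pop(0) if self.reserves else None
        elif node in self.reserves:
            self.reserves.remove(node)


pool = JobTrackerPool(["10.0.0.3", "10.0.0.1", "10.0.0.2"])
print(pool.active, pool.reserves)  # 10.0.0.1 ['10.0.0.2', '10.0.0.3']

pool.node_failed("10.0.0.1")
print(pool.active, pool.reserves)  # 10.0.0.2 ['10.0.0.3']
```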
Multiple Job Trackers
You can run one or more Job Tracker services across multiple datacenters and create multiple keyspaces per datacenter. Using this capability has performance, data replication, and other benefits.
Hadoop MapReduce with multiple Cassandra File Systems (CFS)
Cassandra File System (CFS) is a Hadoop Distributed File System (HDFS)-compatible storage layer. DataStax Enterprise replaces HDFS with CFS to run MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant, and scalable architecture. You can create additional CFS to organize and optimize data.
Analytics without ETL
Using DSE Hadoop, you run MapReduce jobs directly against data in Cassandra. You can perform real-time and analytics workloads at the same time without one workload affecting the performance of the other. When you start some cluster nodes as Hadoop analytics nodes and others as pure Cassandra real-time nodes, data is automatically replicated between the nodes.
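To make the idea concrete, here is a minimal sketch of the map, shuffle, and reduce phases applied directly to table rows, with no export step in between. It is plain Python rather than Hadoop code, and the sample rows, column names, and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows, standing in for a Cassandra table that is read in place.
rows = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "alice", "page": "/cart"},
    {"user": "carol", "page": "/home"},
]


def map_phase(row):
    # Emit one (key, value) pair per row.
    yield row["page"], 1


def shuffle(pairs):
    # Group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse each group to a single result.
    return key, sum(values)


def run_job(input_rows):
    pairs = (pair for row in input_rows for pair in map_phase(row))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


page_views = run_job(rows)
print(page_views)  # {'/home': 3, '/cart': 1}
```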
Hive support
Hive, a data warehouse system, facilitates data summarization, ad hoc queries, and the analysis of large data sets that are stored in Hadoop-compatible file systems. Any ODBC- or JDBC-compliant user interface can connect to Hive from the server. Using the Cassandra-enabled Hive MapReduce client in DataStax Enterprise, you project a relational structure onto Hadoop data in CFS, and query the data using HiveQL, an SQL-like language.
Pig support
The Cassandra-enabled Pig MapReduce client that is included with DSE Hadoop is a high-level platform for creating MapReduce programs to use with Hadoop. You can analyze large data sets by running jobs in MapReduce mode and Pig programs directly on data that is stored in Cassandra.
Mahout support
Apache Mahout, included with DSE Hadoop, offers machine learning libraries. Machine learning lets a system, such as one that recreates the Google Priority Inbox, improve based on past experience or examples.