Introduction to DSE Analytics

DataStax Enterprise targets the analytics market with significant features for analyzing huge databases:

Apache Spark
A fast alternative to Hadoop. Spark is a distributed, parallel, batch data processing engine based on the Resilient Distributed Datasets (RDD) concept instead of the MapReduce concept on which Hadoop is based.
Apache Shark
A Hive-like language built on top of Spark that simplifies the transition for Hive users. The connection of Spark to Cassandra runs analytical queries independently of Hadoop and provides faster data analysis than the typical MapReduce job.
BYOH
A bring your own Hadoop (BYOH) model gives organizations that already run recent versions of Hadoop from Cloudera or Hortonworks a way to use these implementations with DataStax Enterprise. Because you can use a custom, better-tuned Hadoop, this model provides better performance than previous DataStax Enterprise versions.
Improved integration of Apache Sqoop
You can import RDBMS data to Cassandra and export Cassandra CQL data to an RDBMS.
DSE Hadoop
The legacy Hadoop 1.0.4 that is integrated with DataStax Enterprise includes the following Hive and Pig features:
- Support for the native protocol in Hive including the addition of 19 new Hive TBLPROPERTIES to support the native protocol
- Auto-creation of Hive databases and external tables for each CQL keyspace and table
- A new cql3.partition.key property that maps Hive tables to CQL compound primary keys and composite partition keys
- Support for HiveServer2
- Integration of the HiveServer2 Beeline command shell
- Support for expiring data in columns by setting TTL (time to live) on Hive tables
- Support for expiring data by setting the TTL on Pig data using the cql:// URL, which includes a prepared statement shown in step 10 of the library demo
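
To make the RDD concept mentioned under Apache Spark more concrete, the following is a minimal, single-process sketch in plain Python. It does not use the Spark API: ToyRDD and the sample rows are hypothetical names invented for illustration. The sketch shows only the core idea that a dataset records the chain of transformations applied to it (its lineage) and computes nothing until a result is requested, which also lets the result be recomputed from the source data.

```python
# Toy illustration of the RDD idea, NOT the Spark API.
# ToyRDD is a hypothetical class invented for this sketch.

class ToyRDD:
    def __init__(self, source, lineage=()):
        self._source = source      # base data, e.g. rows read from a table
        self._lineage = lineage    # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: record the step and return a new dataset.
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        rows = iter(self._source)
        for kind, fn in self._lineage:
            # Inside a method, map and filter refer to the Python built-ins.
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


# Hypothetical sample data: (city, temperature) rows.
readings = ToyRDD([("nyc", 21), ("sf", 15), ("nyc", 25)])

# Nothing is computed here; only the lineage grows.
warm_cities = readings.filter(lambda row: row[1] > 20).map(lambda row: row[0])

# The action triggers the computation.
result = warm_cities.collect()
```

Because every dataset keeps its lineage, calling collect() again replays the same steps from the source. Real RDDs apply the same principle across partitions on many nodes, which this sketch does not attempt to model.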

DSE Analytics features

No Single Point of Failure
DSE Hadoop supports a peer-to-peer, distributed cluster for running MapReduce jobs. Because the nodes are peers, any node in the cluster can load data files, and any analytics node can assume the responsibilities of job tracker for MapReduce jobs.
Reserve Job Tracker
DSE Hadoop keeps a job tracker in reserve to take over in the event of a problem that would affect availability.
Multiple Job Trackers
In the Cassandra File System (CFS), you can run one or more job tracker services across multiple data centers and create multiple keyspaces per data center. Using this capability has performance, data replication, and other benefits.
Hadoop MapReduce using Multiple Cassandra File Systems
CassandraFS is an HDFS-compatible storage layer. DataStax Enterprise replaces HDFS with CassandraFS to run MapReduce jobs on Cassandra's peer-to-peer, fault-tolerant, and scalable architecture. You can create additional CFSs to organize and optimize Hadoop data.
Analytics Without ETL
Using DSE Hadoop, you run MapReduce jobs directly against your data in Cassandra. You can perform real-time and analytics workloads at the same time without one workload affecting the performance of the other. When you start some cluster nodes as Hadoop analytics nodes and others as pure Cassandra real-time nodes, data is replicated between the nodes automatically.
Hive Support
Hive, a data warehouse system, facilitates data summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Any JDBC-compliant user interface connects to Hive from the server. Using the Cassandra-enabled Hive MapReduce client in DataStax Enterprise, you project a relational structure onto Hadoop data in the Cassandra file systems, and query the data using a SQL-like language. Cassandra nodes share the Hive metastore automatically, eliminating repetitive Hive configuration steps.
Pig Support
The Cassandra-enabled Pig MapReduce client that is included with DSE Hadoop is a high-level platform for creating MapReduce programs used with Hadoop. You can analyze large data sets by running jobs in MapReduce mode and Pig programs directly on data stored in Cassandra.
Mahout Support
Apache Mahout, included with DSE Hadoop, offers machine learning libraries. Machine learning improves a system based on past experience or examples. One such system is an application that recreates the Google priority inbox.
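
Several of the features above refer to MapReduce jobs. As a rough illustration of that processing model, the following single-process Python sketch walks through the map, shuffle, and reduce phases. It is not the Hadoop or DSE Hadoop API, and it is not distributed. The function names (map_phase, shuffle_phase, reduce_phase, run_job) and the sample rows are assumptions made for this example.

```python
# Single-process sketch of the MapReduce model, NOT the Hadoop API.
# All names below are hypothetical and exist only for illustration.
from collections import defaultdict


def map_phase(records, mapper):
    # Apply the mapper to every record; each call yields (key, value) pairs.
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    # Collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


# Hypothetical rows standing in for data stored in a Cassandra table.
page_views = [
    {"user": "ann", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "ann", "page": "/cart"},
]


def view_mapper(row):
    # Emit one count per page view, keyed by user.
    yield (row["user"], 1)


def view_reducer(user, counts):
    # Sum the counts emitted for one user.
    return sum(counts)


views_per_user = run_job(page_views, view_mapper, view_reducer)
```

In a real cluster the map and reduce tasks run in parallel on many nodes and the shuffle moves data across the network; the sketch keeps everything in one process so that the three phases are easy to follow.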