Architecture in brief
Essential information for understanding and using DataStax Enterprise.
cassandra.yaml
The location of the cassandra.yaml file depends on the type of installation:Package installations | /etc/dse/cassandra/cassandra.yaml |
Tarball installations | installation_location/resources/cassandra/conf/cassandra.yaml |
This page provides essential information for understanding and using DataStax Enterprise. Because DataStax Enterprise differs from a relational database, you will save yourself a lot of time by reading and understanding each section on this page.
How DataStax Enterprise works
Built on Apache Cassandra®, DataStax Enterprise (DSE) adds operational reliability hardened by the Fortune 100 and supports more workloads from graph to search to analytics. Easily deploy the only active-everywhere database platform that runs wherever needed: on-premises, across regions or clouds.
DSE is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur.
DSE addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes where data is distributed among all nodes in the cluster. Each node frequently exchanges state information about itself and other nodes across the cluster using peer-to-peer gossip communication protocol. A sequentially written commit log on each node captures write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache.
Each time the memory structure is full, the data is written to disk in an SSTables data file. All writes are automatically partitioned and replicated throughout the cluster. DSE periodically consolidates SSTables using compaction, which discards obsolete data marked for deletion with a tombstone. A tombstone is a marker in a row that indicates a column will be deleted. During compaction, marked columns are deleted. To ensure all data across the cluster stays consistent, various repair mechanisms are employed.
The DSE database is a partitioned row store database, where rows are organized into tables with a (required) primary key. The database's architecture allows any authorized user to connect to any node in any datacenter and access data using the CQL language. For ease of use, CQL uses a similar syntax to SQL and works with table data. Developers can access CQL through CQL shell (cqlsh) commands, DataStax Studio, and by using drivers for application languages. Typically, a cluster has one keyspace per application composed of many different tables.
Client read or write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator node for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.
Key structures
- Node
- Where you store your data. It is the basic database infrastructure component.
- Cluster
- A group of distributed nodes for storing data. A cluster can have a single node, single datacenter, or multiple datacenters.
- Datacenter
- A group of related nodes configured together within a cluster for replication purposes. A datacenter can be a physical datacenter or virtual datacenter. Using separate datacenters prevents transactions from being impacted by other workloads and lowers latency. Depending on the replication factor, data can be written to multiple datacenters. Datacenters must never span physical locations.
- Replication
- The process of storing copies of data on multiple nodes. Replication ensures reliability and fault tolerance. The number of copies is set by the replication factor.
- Commit log
- All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
- SSTable
- A sorted string table (SSTable) is an immutable data file to which the database writes memtables periodically. SSTables are append only and stored on disk sequentially and maintained for each database table.
- tombstone
- A marker in a row that indicates a column will be deleted. During compaction, marked columns are deleted.
- CQL Table
- A collection of ordered columns fetched by table row. A table consists of columns and has a primary key.
Key components for configuring DataStax Enterprise
- Gossip
- A peer-to-peer communication protocol to discover and share location and state information about the other nodes in a DSE cluster. Gossip information is persisted locally by each node to use immediately when a node restarts.
- Partitioner
- A partitioner distributes data evenly across the nodes in the cluster for load balancing.
- Replication factor
- Replication is the process of storing copies of data on multiple nodes. Replication ensures reliability and fault tolerance. The number of copies is set by the replication factor.
- Replica placement strategy
- A replication strategy determines which nodes to place replicas on. The first replica of data is simply the first copy; it is not unique in any sense. The NetworkTopologyStrategy is highly recommended for most deployments because it is much easier to expand to multiple datacenters when required by future expansion.
- Snitch
- A snitch maps from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. Snitches inform the database about the network topology so that requests are routed efficiently and allows the database to distribute replicas by grouping machines into datacenters and racks.
- cassandra.yaml configuration file
- The main configuration file for setting the initialization properties for a cluster, caching parameters for tables, properties for tuning and resource utilization, timeout settings, client connections, backups, and security. By default, a node is configured to store the data it manages in a directory set in the cassandra.yaml file.
- The configuration file for DSE Advanced Security, DSE Search, DSE Graph, and DSE Analytics.
- System keyspace table properties
- You set storage configuration attributes on a per-keyspace or per-table basis programmatically or using a client application, such as CQL shell (cqlsh).
- Loading and unloading data
- Use the DataStax Bulk Loader tool to efficiently load and unload DSE data. The tool
migrates data into DSE from another DSE or Apache Cassandra™ cluster.
- Unloads data from any Cassandra 2.1 or later data source
- Loads data into DSE 5.0 or later
- Supports CSV and JSON formats
Basic concepts for data modeling
- Design the data model
- The design of the data model is based on the queries you want to perform, not on modeling entities and relationships like you do for relational databases.
- Keyspace
- The outermost grouping of data, similar to a schema in a relational database. All tables belong to a keyspace. A keyspace is the defining container for replication.
- Table
- A table stores data based on a primary key, which consists of a partition key and
optional clustering columns. Materialized views can also be added for high cardinality data.
- A partition key defines the node on which the data is stored, and divides data into logical groups. Define partition keys that evenly distribute the data and also satisfy specific queries. Query and write requests across multiple partitions should be avoided if possible.
- A clustering column defines the sort order of rows within a partition. When defining a clustering column, consider the purpose of the data. For example, retrieving the most recent transactions, sorted by date, in descending order.
- Materialized views are tables built from another table's data with a new primary key and new properties. Queries are optimized by the primary key definition. Data in the materialized view is automatically updated by changes to the source table.
Note: In earlier versions of DataStax Enterprise and Apache Cassandra™, a column family was synonymous in many respects, to a table. - More information on data modeling
-
- in the CQL documentation.
- Using CQL in the CQL documentation.
- in the CQL documentation.
- Getting Started with Time Series Data Modeling white paper.
- Getting Started with User Profile Data Modeling white paper.
- Become a Super Modeler webinar.
- The Data Model is Dead, Long Live the Data Model webinar.