About the Cassandra File System (CFS)
Analytics jobs often require a distributed file system. DataStax Enterprise provides a replacement for the Hadoop Distributed File System (HDFS) called the Cassandra File System (CFS).
See also the DataStax Enterprise file system (DSEFS). DSEFS is a new distributed file system within DataStax Enterprise that is intended primarily for Spark streaming use cases and Write Ahead Logging (WAL).
When an analytics node starts up, DataStax Enterprise creates a default CFS rooted at cfs:/ and an archive file system named cfs-archive, which is rooted at cfs-archive:/. CFS is available only on analytics nodes.
DataStax Enterprise creates a keyspace for the cfs-archive file system, and every other CFS file system. The keyspace name is similar to the file system name except the hyphen in the name is replaced by an underscore. For example, the cfs-archive file system keyspace is cfs_archive.
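The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper name cfs_keyspace_name is hypothetical, and replacing every hyphen (rather than a single one) is an assumption.

```python
def cfs_keyspace_name(file_system_name: str) -> str:
    """Derive the keyspace name for a CFS file system name.

    Assumption: every hyphen in the file system name becomes an underscore.
    """
    return file_system_name.replace("-", "_")


# The archive file system maps to the keyspace named in the text.
print(cfs_keyspace_name("cfs-archive"))  # cfs_archive
```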
The Cassandra File System (CFS) is accessed as part of the Hadoop File System (HDFS) using the configured authentication. If you encrypt the sblocks and inode tables in the CFS keyspace, all CFS data is encrypted.
Increasing the replication factor of default CFS keyspaces
You must increase the replication factor of default CFS keyspaces to prevent problems when running analytics jobs.
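As a rough sketch, the following Python helper builds the CQL ALTER KEYSPACE statements that such a change might use for the cfs and cfs_archive keyspaces. The helper name, the datacenter name Analytics, and the replication factor 3 are all assumptions for illustration; substitute the values that match your cluster.

```python
def alter_replication_cql(keyspace: str, datacenter: str, replication_factor: int) -> str:
    """Build a CQL statement that sets a keyspace's replication settings.

    Assumption: the keyspace uses NetworkTopologyStrategy with one datacenter.
    """
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{datacenter}': {replication_factor}}};"
    )


# 'Analytics' and 3 are placeholder values, not recommendations.
for keyspace in ("cfs", "cfs_archive"):
    print(alter_replication_cql(keyspace, "Analytics", 3))
```

The generated statements would then be run in cqlsh or through a driver.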
Configuring a CFS superuser
The CFS superuser is the DataStax Enterprise daemon user, that is, the user who starts DataStax Enterprise. A Cassandra superuser, set up using the CQL CREATE ROLE command, is also a CFS superuser.
A CFS superuser can modify files in the CFS without any restrictions. Files that a superuser adds to the CFS are password-protected.
Deleting files from the CFS
Cassandra does not immediately remove deleted data from disk when you use the dse hadoop fs -rm file command. Instead, Cassandra treats the deleted data like any data that is deleted from Cassandra. A tombstone is written to indicate the new data status. Data that is marked with a tombstone exists for a configured time period (defined by the gc_grace_seconds value that is set on the table). When the grace period expires, the compaction process permanently deletes the data. You do not have to manually remove expired data.
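To make the timing concrete, this small Python sketch computes the earliest moment at which compaction could purge a deleted file's data. It assumes Cassandra's general default gc_grace_seconds of 864000 (10 days); the CFS tables in your cluster may be set differently, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Cassandra's general default grace period (10 days).
DEFAULT_GC_GRACE_SECONDS = 864000


def earliest_purge_time(deleted_at: datetime,
                        gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Return the time after which a tombstoned row is eligible for removal."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_purge_time(deleted_at))  # 2024-01-11 00:00:00+00:00
```

The data is removed only when a compaction runs after this time, so the actual purge can happen later.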
Checkpointing with the CFS
DataStax Enterprise does not support checkpointing to CFS.
Managing the CFS consistency level
The default read and write consistency level for CFS is LOCAL_QUORUM or QUORUM, depending on the keyspace replication strategy, SimpleStrategy or NetworkTopologyStrategy, respectively. You can change the consistency level by specifying a value for the dse.consistencylevel.read and dse.consistencylevel.write properties in the core-site.xml file.
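For illustration, the Python sketch below generates the two property elements in the standard Hadoop configuration format (property, name, value), which would sit inside the configuration element of core-site.xml. The function name and the QUORUM values are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


def consistency_properties(read_level: str, write_level: str) -> str:
    """Render the CFS consistency-level properties as Hadoop <property> elements."""
    settings = (
        ("dse.consistencylevel.read", read_level),
        ("dse.consistencylevel.write", write_level),
    )
    rendered = []
    for name, value in settings:
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        rendered.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(rendered)


# QUORUM is a placeholder; choose the level your deployment requires.
print(consistency_properties("QUORUM", "QUORUM"))
```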
Using multiple Cassandra File Systems
Typical reasons for using an additional CFS include:
- To isolate analytics jobs
- To configure keyspace replication by job
- To segregate file systems in different physical datacenters
- To separate analytics data in some other way
Procedure
To create an additional CFS: