About the Cassandra File System (CFS) - deprecated

Analytics jobs require a distributed file system. DataStax Enterprise provides the Cassandra File System (CFS) as a replacement for the Apache Hadoop® Distributed File System (HDFS). See also the DataStax Enterprise file system (DSEFS), which is the default distributed file system on DSE Analytics nodes.

When an analytics node starts up with CFS enabled, DataStax Enterprise creates a CFS rooted at cfs:/ and an archive file system named cfs-archive, which is rooted at cfs-archive:/. CFS is available only on analytics nodes. DataStax Enterprise creates a keyspace for the cfs-archive file system and for every other CFS. The keyspace name matches the file system name, except that the hyphen in the name is replaced by an underscore. For example, the keyspace for the cfs-archive file system is cfs_archive.

CFS locations must be specified using the cfs:// prefix followed by the hostname of an analytics node. For example, cfs://node2/tmp.

The default location of the core-site.xml file depends on the type of installation:

Package installations and Installer-Services installations: /etc/dse/hadoop2-client/core-site.xml
Tarball installations and Installer-No Services installations: <installation_location>/resources/hadoop2-client/conf/core-site.xml

Increasing the replication factor of default CFS keyspaces

You must increase the replication factor of default CFS keyspaces to prevent problems when running analytics jobs.

Encrypting CFS keyspace data

Spark accesses the Cassandra File System (CFS) as part of the Apache Hadoop® Distributed File System (HDFS) using the configured authentication. If you encrypt the sblocks and inode tables in the CFS keyspace, all CFS data is encrypted.

Configuring a CFS superuser

A CFS superuser is the DataStax Enterprise daemon user, the user who starts DataStax Enterprise. A cassandra superuser, set up using the CQL CREATE ROLE command, is also a CFS superuser.

A CFS superuser can modify files in the CFS without any restrictions. Files that a superuser adds to the CFS are password-protected.

Deleting files from the CFS

DSE does not immediately remove deleted data from disk when you use the dse hadoop fs -rm file command. Instead, DSE treats the deleted data like any data that is deleted from the database, and it writes a tombstone to indicate the status of the data. Data that is marked with a tombstone exists for a configured time period (defined by the gc_grace_seconds value that is set on the table). When the grace period expires, the compaction process permanently deletes the data. You do not have to manually remove expired data.

Checkpointing with the CFS

DataStax Enterprise does not support checkpointing to CFS.

Managing the CFS consistency level

The default read and write consistency level for CFS depends on the keyspace replication strategy: QUORUM for SimpleStrategy and LOCAL_QUORUM for NetworkTopologyStrategy. You can change the consistency level by specifying values for the dse.consistencylevel.read and dse.consistencylevel.write properties in the core-site.xml file.
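
As a minimal sketch, the following core-site.xml fragment sets both properties. The property names come from the text above. The value QUORUM is only an illustration and assumes that the properties accept standard consistency level names, so verify the accepted values for your DSE version before relying on it.

```xml
<configuration>
  <!-- existing properties stay as they are -->
  <!-- Illustrative values only: assumes standard consistency level names are accepted -->
  <property>
    <name>dse.consistencylevel.read</name>
    <value>QUORUM</value>
  </property>
  <property>
    <name>dse.consistencylevel.write</name>
    <value>QUORUM</value>
  </property>
</configuration>
```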

Setting CFS as the default distributed file system in DataStax Enterprise

DSEFS is the default distributed file system.

To make CFS the default file system, add the following properties to the core-site.xml Hadoop configuration file:

<configuration>
  ...
  <property>
    <name>fs.default.name</name>
    <value>cfs://127.0.0.1/</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>cfs://127.0.0.1/</value>
  </property>
</configuration>

Replace 127.0.0.1 with the value of broadcast_rpc_address set in cassandra.yaml.

The location of the cassandra.yaml file depends on the type of installation:

Package installations and Installer-Services installations: /etc/dse/cassandra/cassandra.yaml
Tarball installations and Installer-No Services installations: <installation_location>/resources/cassandra/conf/cassandra.yaml

Using multiple Cassandra File Systems

You can use more than one CFS. Typical reasons for using an additional CFS are:

To isolate analytics jobs
To configure keyspace replication by job
To segregate file systems in different physical datacenters
To separate analytics data in some other way

Procedure

To create an additional CFS:

Open the core-site.xml file for editing.
Add one or more property elements to core-site.xml using this format:
```
<property>
  <name>fs.cfs-file_system_name.impl</name>
  <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>
```
When you use multiple CFS, you must override the default file system name for the newly created CFS to avoid conflicts with existing CFS in other datacenters. Each datacenter requires a unique default file system. For example, instead of the default value cfs://127.0.0.1/, specify a unique file system name for the new CFS, such as cfs-myfs://127.0.0.1/:
```
<property>
  <name>fs.cfs-myfs.impl</name>
  <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>cfs-myfs://127.0.0.1/</value>
</property>
```
Save the file and restart DSE.

DataStax Enterprise creates the new CFS.
To access the new CFS, construct a URL using the following format:
```
cfs-file_system_name:path
```
For example, assuming the new file system name is NewCassandraFS, use the dse commands to put data on the new CFS:
```
dse hadoop fs -put /tmp/giant_log.gz cfs-NewCassandraFS://hostname/tmp &
dse hadoop fs distcp hdfs:/// cfs-NewCassandraFS://hostname/
```
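
To illustrate the requirement that each datacenter has a unique default file system, the sketch below shows the core-site.xml properties that nodes in a second datacenter might use. The file system name cfs-dc2 and the address 10.0.2.15 are hypothetical placeholders; substitute your own file system name and the broadcast_rpc_address of a node in that datacenter.

```xml
<!-- core-site.xml on nodes in the second datacenter (hypothetical name and address) -->
<property>
  <name>fs.cfs-dc2.impl</name>
  <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>cfs-dc2://10.0.2.15/</value>
</property>
```

Following the keyspace naming rule described earlier, the keyspace for this hypothetical file system would be expected to be named cfs_dc2.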