About the Cassandra File System (CFS)

A Hive or Pig analytics job requires a Hadoop file system to function. For use with DSE Hadoop, DataStax Enterprise provides a replacement for the Hadoop Distributed File System (HDFS) called the Cassandra File System (CFS).

When an analytics node starts up, DataStax Enterprise creates a default CFS rooted at cfs:/ and an archive file system named cfs-archive, which is rooted at cfs-archive:/. Cassandra creates a keyspace for the cfs-archive file system, and for every other CFS file system. The keyspace name matches the file system name, except that the hyphen is replaced by an underscore. For example, the keyspace for the cfs-archive file system is cfs_archive. Increase the replication factor of the default CFS keyspaces to prevent problems when running Hadoop jobs.
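
For example, a minimal sketch of raising the replication factor on the default CFS keyspaces, assuming a single analytics data center named Analytics and a target replication factor of 3 (adjust the data center name and factor to match your cluster):

    ALTER KEYSPACE cfs WITH replication =
      {'class': 'NetworkTopologyStrategy', 'Analytics': 3};
    ALTER KEYSPACE cfs_archive WITH replication =
      {'class': 'NetworkTopologyStrategy', 'Analytics': 3};

After changing a replication factor, run nodetool repair on the affected keyspaces so that existing data is copied to the additional replicas.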

Configuring a CFS superuser 

A CFS superuser is the DataStax Enterprise daemon user, the user who starts DataStax Enterprise. A Cassandra superuser, set up using the CQL CREATE USER command, is also a CFS superuser.
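
For example, a minimal sketch of creating an additional Cassandra superuser with CQL (the user name and password here are placeholders, not values from this documentation):

    CREATE USER hadoop_admin WITH PASSWORD 'choose-a-strong-password' SUPERUSER;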

A CFS superuser can modify files in the CFS without any restrictions. Files that a superuser adds to the CFS are password-protected.

Deleting files from the CFS 

Cassandra does not immediately remove deleted data from disk when you use the dse hadoop fs -rm file command. Instead, Cassandra treats the deleted data like any data that is deleted from Cassandra. A tombstone is written to indicate the new data status. Data that is marked with a tombstone exists for a configured time period (defined by the gc_grace_seconds value that is set on the table). When the grace period expires, the compaction process permanently deletes the data. You do not have to manually remove expired data.
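
For example, removing a file writes a tombstone rather than freeing disk space immediately; the path here is illustrative:

    dse hadoop fs -rm cfs:/tmp/giant_log.gz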

Checkpointing with the CFS 

Note: DataStax Enterprise does not support checkpointing to CFS.
When internal DataStax Enterprise authentication and checkpointing are both enabled, checkpointing uses a Hadoop configuration that is separate from the configuration values in the spark-env.sh Spark configuration file that are passed to the application. DataStax Enterprise uses all of the default generated Hadoop XML files in /dse/resources/hadoop/conf. You can manually add properties to the Hadoop configuration files for parameters that depend on the submitted application. For example, to pass the Cassandra password authentication parameters from the Spark configuration to the Hadoop configuration that is used for checkpointing:
    // Build the Hadoop configuration used for checkpointing and copy the
    // Cassandra credentials into it from the Spark configuration.
    Configuration hadoopConf = new Configuration();
    hadoopConf.set("cassandra.username", sparkConf.get("cassandra.username"));
    hadoopConf.set("cassandra.password", sparkConf.get("cassandra.password"));
    // Recover the streaming context from the checkpoint directory, or create it
    // with the supplied factory if no checkpoint exists.
    return JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, hadoopConf, contextFactory);

Managing the CFS consistency level 

The default read and write consistency level for CFS is LOCAL_QUORUM or QUORUM, depending on the keyspace replication strategy, SimpleStrategy or NetworkTopologyStrategy, respectively. You can change the consistency level by specifying values for the dse.consistencylevel.read and dse.consistencylevel.write properties in the core-site.xml file.
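
For example, a sketch of setting both properties in core-site.xml (the consistency levels shown are illustrative; choose levels that match your replication strategy):

    <property>
      <name>dse.consistencylevel.read</name>
      <value>LOCAL_QUORUM</value>
    </property>
    <property>
      <name>dse.consistencylevel.write</name>
      <value>LOCAL_QUORUM</value>
    </property>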

Using multiple Cassandra File Systems 

You can use more than one CFS. Some typical reasons for using an additional CFS are:
  • To isolate Hadoop-related jobs
  • To configure keyspace replication by job
  • To segregate file systems in different physical data centers
  • To separate Hadoop data in some other way

Creating an additional CFS 

Procedure

  1. Open the core-site.xml file for editing.
    The default location of the core-site.xml file depends on the type of installation:
      • Installer-Services and Package installations: /etc/dse/hadoop/conf/core-site.xml
      • Installer-No Services and Tarball installations: install_location/resources/hadoop/conf/core-site.xml
  2. Add one or more property elements to core-site.xml using this format:
    <property>
      <name>fs.cfs-<filesystemname>.impl</name>
      <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
    </property>
  3. Save the file and restart Cassandra.

    DataStax Enterprise creates the new CFS.

  4. To access the new CFS, construct a URL using the following format:
    cfs-<filesystemname>:<path>

    For example, assuming the new file system name is NewCassandraFS, use the dse commands to put data on the new CFS or copy data to it from another file system.

    dse hadoop fs -put /tmp/giant_log.gz cfs-NewCassandraFS://cassandrahost/tmp
    
    dse hadoop distcp hdfs:/// cfs-NewCassandraFS:///
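
    To confirm that the data arrived, list the destination file system (assuming the same file system name):

    dse hadoop fs -ls cfs-NewCassandraFS:///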