About the Cassandra File System
A Hive or Pig analytics job requires a Hadoop file system to function. For use with DSE Hadoop, DataStax Enterprise provides a replacement for the Hadoop Distributed File System (HDFS) called the Cassandra File System (CFS). When an analytics node starts up, DataStax Enterprise creates a default CFS rooted at cfs:/ and an archive file system named cfs-archive.
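As a quick sanity check, you can list both file systems from the DSE Hadoop shell once the analytics node is running. A minimal sketch; it assumes the dse binary is on your PATH:

    # List the root of the default Cassandra File System
    dse hadoop fs -ls cfs:/

    # List the root of the archive file system
    dse hadoop fs -ls cfs-archive:/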
Configuring a CFS superuser
A CFS superuser is the DSE daemon user, the user who starts DataStax Enterprise. A Cassandra superuser, set up using the CQL CREATE USER command, is also a CFS superuser.
A CFS superuser can modify files in the CFS without any restrictions. Files that a superuser adds to the CFS are password-protected.
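For example, you could grant CFS superuser rights to an additional account by creating a Cassandra superuser in cqlsh. A minimal sketch; the user name and password are placeholder values:

    -- Creating a Cassandra superuser also makes the account a CFS superuser
    CREATE USER hadoop_admin WITH PASSWORD 'ChangeMe' SUPERUSER;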
Deleting files from the Cassandra File System
Cassandra does not immediately remove deleted data from disk when you use the dse hadoop fs -rm file command. Instead, Cassandra treats the deleted data like any other data deleted from Cassandra: a tombstone is written to indicate the new data status. Data marked with a tombstone exists for a configured time period, defined by the gc_grace_seconds value set on the table. When the grace period expires, the compaction process permanently deletes the data. You do not have to remove expired data manually.
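For illustration, the following sketch deletes a file and then shortens the grace period so that the tombstoned blocks become eligible for purging at the next compaction. The cfs keyspace and its sblocks table are assumptions based on the default CFS layout, and 3600 seconds is only an example value; verify the table names in your cluster before changing anything:

    # Delete a file; CFS writes a tombstone instead of reclaiming disk space immediately
    dse hadoop fs -rm cfs:/tmp/old-results.csv

Then, in cqlsh:

    -- Assumption: the default CFS layout stores file blocks in cfs.sblocks
    ALTER TABLE cfs.sblocks WITH gc_grace_seconds = 3600;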
Using multiple Cassandra File Systems
You can use more than one CFS. Typical reasons for creating additional file systems include the following (a usage sketch follows this list):
- To isolate Hadoop-related jobs
- To configure keyspace replication by job
- To segregate file systems in different physical data centers
- To separate Hadoop data in some other way
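Because each file system has its own URI scheme and root, Hadoop shell commands can address them explicitly, which is how you move data between them. A minimal sketch using the default cfs and cfs-archive file systems; the paths are illustrative:

    # Copy a file from the default CFS into the archive file system
    dse hadoop fs -cp cfs:/user/hive/warehouse/results.csv cfs-archive:/2013/results.csv

    # Confirm the file landed in the second file system
    dse hadoop fs -ls cfs-archive:/2013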