About the Cassandra File System
A Hive or Pig analytics job requires a Hadoop file system to function. For use with DSE Hadoop, DataStax Enterprise provides a replacement for the Hadoop Distributed File System (HDFS) called the Cassandra File System (CFS, also written CassandraFS). When an analytics node starts up, DataStax Enterprise creates a default CassandraFS rooted at cfs:/ and an archive file system named cfs-archive.
The DSE daemon user, that is, the user who starts DataStax Enterprise, is a CFS superuser. A Cassandra superuser, set up using the CQL CREATE USER command, is also a CFS superuser.
A CFS superuser can modify files in the CassandraFS without any restrictions. Files that a superuser adds to the CassandraFS are password-protected.
Cassandra does not immediately remove deleted data from disk when you use the dse hadoop fs -rm file command. Instead, Cassandra treats the deleted data like any other data deleted from Cassandra: it writes a tombstone to indicate the new data status. Data marked with a tombstone exists for a configured time period, defined by the gc_grace_seconds value set on the table. When the grace period expires, the compaction process permanently deletes the data. You do not have to manually remove expired data.
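The grace-period arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not part of DataStax Enterprise: the function names are hypothetical, and the default of 864000 seconds (10 days) is Cassandra's standard gc_grace_seconds default, which a table may override. The sketch computes the earliest time a compaction could drop tombstoned data; the actual removal happens whenever a compaction next processes that data.

```python
from datetime import datetime, timedelta, timezone

# Cassandra's standard default for gc_grace_seconds (10 days).
# Assumption: the CFS tables use this default; a table may override it.
DEFAULT_GC_GRACE_SECONDS = 864_000


def earliest_purge_time(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Return the earliest moment a compaction may drop tombstoned data."""
    if gc_grace_seconds < 0:
        raise ValueError("gc_grace_seconds must be non-negative")
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def is_purgeable(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once the grace period has expired for data deleted at deleted_at."""
    return now >= earliest_purge_time(deleted_at, gc_grace_seconds)


# Hypothetical deletion time for a file removed with "dse hadoop fs -rm".
deleted = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_purge_time(deleted))  # 2024-01-11 00:00:00+00:00
```

Until that time is reached, the tombstoned data still occupies disk space, so a large delete does not free space right away.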
Typical reasons for creating an additional CassandraFS include:
- To isolate Hadoop-related jobs
- To configure keyspace replication by job
- To segregate file systems in different physical data centers
- To separate Hadoop data in some other way
Open the core-site.xml file for editing. The location of the file depends on your installation:
- Packaged installs: /etc/dse/hadoop
- Tarball installs: install_location/resources/hadoop/conf
Add one or more property elements to core-site.xml using this format:
<property>
  <name>fs.cfs-<filesystemname>.impl</name>
  <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
</property>
Replace <filesystemname> with the name of the new file system.
Save the file and restart Cassandra.
DSE creates the new CassandraFS.
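As a rough illustration of the property format, the following Python sketch builds one property element for a given file system name using the standard library. The helper name cfs_property is hypothetical, and the sketch assumes the property name follows the pattern fs.cfs-<filesystemname>.impl; it only produces the XML text and does not edit core-site.xml for you.

```python
import xml.etree.ElementTree as ET

# Implementation class named in the property format above.
CFS_IMPL = "com.datastax.bdp.hadoop.cfs.CassandraFileSystem"


def cfs_property(filesystem_name):
    """Build a <property> element registering an additional CassandraFS.

    Assumption: the property name is fs.cfs-<filesystemname>.impl.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = f"fs.cfs-{filesystem_name}.impl"
    ET.SubElement(prop, "value").text = CFS_IMPL
    return prop


# NewCassandraFS is an example file system name.
snippet = ET.tostring(cfs_property("NewCassandraFS"), encoding="unicode")
print(snippet)
```

The printed element can be pasted inside the configuration element of core-site.xml, one element per additional file system.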
To access the new CassandraFS, construct a URL whose scheme is cfs- followed by the file system name.
For example, assuming the new file system name is NewCassandraFS, use the dse commands to put data on the new CassandraFS:
dse hadoop fs -put /tmp/giant_log.gz cfs-NewCassandraFS://cassandrahost/tmp
dse hadoop distcp hdfs:/// cfs-NewCassandraFS:///
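If you script such uploads, the URL form used in the example above can be assembled programmatically. The Python sketch below is a minimal illustration under stated assumptions: the helpers cfs_url and put_command are hypothetical names, the URL layout is inferred from the example (scheme cfs- plus the file system name, an optional host, then an absolute path), and the script only prints the command line rather than running it.

```python
def cfs_url(filesystem_name, path="/", host=""):
    """Build a URL for an additional CassandraFS.

    Assumption: the layout is cfs-<filesystemname>://<host><path>,
    with the host left empty to use the default.
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return f"cfs-{filesystem_name}://{host}{path}"


def put_command(local_path, filesystem_name, remote_path, host=""):
    """Return the argument list for a 'dse hadoop fs -put' upload."""
    return ["dse", "hadoop", "fs", "-put", local_path,
            cfs_url(filesystem_name, remote_path, host)]


cmd = put_command("/tmp/giant_log.gz", "NewCassandraFS", "/tmp",
                  host="cassandrahost")
print(" ".join(cmd))
# dse hadoop fs -put /tmp/giant_log.gz cfs-NewCassandraFS://cassandrahost/tmp
```

Returning an argument list rather than a single string means the result can be passed to subprocess.run without shell quoting concerns.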