Using the cfs-archive to store huge files

The Cassandra File System (CFS) consists of two layers: cfs and cfs-archive. Using cfs-archive is recommended for long-term storage of huge files.

CFS consists of two layers, cfs and cfs-archive, which you access using the following Hadoop shell commands and URIs:

  • cfs:// for the Cassandra layer
  • cfs-archive:// for the Cassandra archive layer
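
For example, to check that both layers are accessible, you can list the top level of each one from the DataStax Enterprise installation directory; the triple slash indicates that no host is specified:

    bin/dse hadoop fs -ls cfs:///
    bin/dse hadoop fs -ls cfs-archive:///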

Using cfs-archive is highly recommended for long-term storage of huge files, such as those containing terabytes of data. The cfs layer, in contrast, is not recommended for this purpose because data on that layer undergoes periodic compaction, as it should: Hadoop uses the cfs layer for many small files and temporary data, which need to be cleaned up after deletions occur. If you store huge files on the cfs layer instead of the cfs-archive layer, compacting them can take an impractically long time, for example days. Files stored on the cfs-archive layer, on the other hand, do not undergo compaction automatically; you can start compaction manually with the nodetool compact command.
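
For example, a manual compaction of the archive layer might look like the following sketch, which assumes the default DSE schema, where cfs-archive data is kept in a keyspace named cfs_archive:

    bin/nodetool compact cfs_archive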

Example: Store a file on cfs-archive 

This example shows how to store a file on the cfs-archive layer using Hadoop shell commands run from the DataStax Enterprise installation directory on Linux:

  1. Create a directory on the cfs-archive layer. You need to use an additional forward slash, as described earlier:
    bin/dse hadoop fs -mkdir cfs-archive:///20140401
  2. Use the Hadoop shell put command and an absolute path name to store the file on the cfs-archive layer.
    bin/dse hadoop fs -put big_archive.csv cfs-archive:///20140401/big_archive.csv
  3. Verify that the file is stored on the cfs-archive layer.
    bin/dse hadoop fs -ls cfs-archive:///20140401/
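
To copy an archived file back out of CFS later, the standard Hadoop shell get command works against the same URI. A minimal sketch, using /tmp as an illustrative local destination:

    bin/dse hadoop fs -get cfs-archive:///20140401/big_archive.csv /tmp/big_archive.csv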

Example: Migrate data from MySQL to text files on cfs-archive

This example shows how to migrate the data from the MySQL npa_nxx table to the archive directory cfs-archive/npa_nxx.

  1. Run the sqoop demo, which sets up the npa_nxx table in the MySQL npa_nxx_demo database.
  2. Use the dse command in the bin directory to migrate the data from the MySQL table to text files in the npa_nxx directory of cfs-archive. Specify the IP address of the host in the --target-dir option.
    $ sudo ./dse sqoop import --connect \
        jdbc:mysql://127.0.0.1/npa_nxx_demo \
        --username root \
        --password <password> \
        --table npa_nxx \
        --target-dir cfs-archive://127.0.0.1/npa_nxx
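
After the import completes, you can verify that the data reached the archive layer by listing the target directory. Sqoop typically writes the rows as comma-separated text files named part-m-00000, part-m-00001, and so on:

    bin/dse hadoop fs -ls cfs-archive://127.0.0.1/npa_nxx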