Using DSEFS

Steps to use DSEFS, configure data replication, and perform other functions, including setting the Kafka log retention.

You must configure data replication. You can optionally configure multiple DSEFS file systems in a datacenter, and perform other functions, including setting the Kafka log retention.
The location of the dse.yaml file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/dse.yaml
  • Installer-No Services and Tarball installations: install_location/resources/dse/conf/dse.yaml

DSEFS limitations 

Know these limitations when you configure and tune DSEFS. The following functionality and features are not supported:
  • Authentication, authorization, and encryption.

    Use operating system access controls to protect the local DSEFS data directories.

  • File system consistency checks (fsck).
  • File repair.
  • Forced rebalancing, although the cluster will eventually reach balance.
  • Data compression.
  • Checksums.
  • Automatic backups.
  • Multi-datacenter replication.
  • Symbolic links (soft links, symlinks) and hard links.
  • Snapshots.

Procedure

  1. Required: Configure replication for the metadata and the data blocks.
    You must set the replication factor appropriately to prevent data loss in the case of node failure. Replication factors must be set for both the metadata and the data blocks.
    1. Globally: set replication for the metadata in the dsefs keyspace that is stored in the Cassandra database.
      For example, use a CQL statement to configure a replication factor of 3 on the Analytics datacenter using NetworkTopologyStrategy:
      ALTER KEYSPACE dsefs WITH replication = {'class': 'NetworkTopologyStrategy', 'Analytics': '3'};
    2. Locally: set replication per DSEFS file or directory where the data blocks are stored.
      For example, use the command line:
      Installer-Services and Package installations:
      $ sudo dse cassandra-stop
      $ sudo dse cassandra options
      Installer-No Services and Tarball installations:
      $ install_location/bin/dse cassandra-stop
      $ install_location/bin/dse cassandra options

      When a replication factor (RF) is not specified, the RF is inherited from the parent directory.
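
The inheritance rule above can be pictured with a small toy model. This is only an illustrative sketch in Python, not DSEFS code: the function name effective_rf, the explicit_rf mapping, and the root_rf value are all hypothetical, and the sketch assumes that a path with no RF of its own takes the RF of its nearest ancestor that has one.

```python
import posixpath


def effective_rf(path, explicit_rf, root_rf):
    """Return the replication factor that applies to `path` in this toy model.

    explicit_rf: dict mapping paths to an RF that was set explicitly.
    root_rf: hypothetical RF assumed when no ancestor has an explicit RF.
    """
    current = posixpath.normpath(path)
    while True:
        if current in explicit_rf:
            return explicit_rf[current]
        parent = posixpath.dirname(current)
        if parent == current:  # reached the top; nothing was set explicitly
            return root_rf
        current = parent


# /data was given RF 4; everything beneath it inherits that value.
explicit = {"/data": 4}
print(effective_rf("/data/logs/app.log", explicit, root_rf=3))  # 4
print(effective_rf("/tmp/scratch", explicit, root_rf=3))        # 3
```

In this model, setting an RF on a directory is enough to cover the files created beneath it, which is why the RF of the parent directory matters when no RF is given.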

  2. Optional: Configure multiple DSEFS file systems within a single datacenter:
    1. In the dse.yaml file, specify a separate DSEFS keyspace for each logical datacenter.
      For example, on a cluster with logical datacenters DC1 and DC2:
      On each node in DC1:
      dsefs_options:
          ...
          keyspace_name: dsefs1
      On each node in DC2:
      dsefs_options:
          ...
          keyspace_name: dsefs2
    2. Restart the nodes.
    3. Alter the replication of each keyspace so that it is replicated only in its own datacenter.
      On DC1:
      ALTER KEYSPACE dsefs1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'};
      On DC2:
      ALTER KEYSPACE dsefs2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'};
    For example, in a cluster with multiple datacenters, the keyspace names dsefs1 and dsefs2 define a separate file system in each datacenter.
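
When there are more than two datacenters, the per-datacenter statements can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical mapping of datacenter names to DSEFS keyspace names; the helper alter_keyspace_cql is not part of DSE and only builds the CQL text, which still has to be run in cqlsh or through a driver.

```python
def alter_keyspace_cql(keyspace, datacenter, rf):
    """Build an ALTER KEYSPACE statement that replicates `keyspace`
    only in `datacenter`, using NetworkTopologyStrategy."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{datacenter}': '{rf}'}};"
    )


# Hypothetical layout: one DSEFS keyspace per logical datacenter.
layout = {"DC1": "dsefs1", "DC2": "dsefs2"}
for datacenter, keyspace in layout.items():
    print(alter_keyspace_cql(keyspace, datacenter, 3))
```

The keyspace names in the mapping must match the keyspace_name values set in dse.yaml on the nodes of each datacenter.
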
  3. When bouncing (restarting) a streaming application, verify the Kafka log configuration, especially log.retention.check.interval.ms and log.retention.bytes. Ensure that the Kafka log retention policy keeps data for at least as long as it is expected to take to bring the application and its consumers back up.
    For example, if the log retention policy is too conservative with disk space and deletes or rolls logs very frequently, users are likely to encounter issues when attempting to recover from a checkpoint that references offsets that the Kafka logs no longer retain.
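
A rough way to sanity-check size-based retention against an expected outage is sketched below. It is a back-of-the-envelope estimate, not a Kafka API. The function names, the ingest rate, and the safety factor are assumptions made for illustration. The estimate ignores segment granularity and the retention check interval, and it treats the size limit as applying to each partition.

```python
def size_retention_window_s(retention_bytes, ingest_bytes_per_s):
    """Approximate how many seconds of data one partition keeps
    under a size-based retention limit."""
    if retention_bytes < 0:  # a negative value (such as -1) means no size limit
        return float("inf")
    if ingest_bytes_per_s <= 0:  # nothing is being written, so nothing ages out by size
        return float("inf")
    return retention_bytes / ingest_bytes_per_s


def retention_covers_outage(retention_bytes, ingest_bytes_per_s,
                            outage_s, safety_factor=2.0):
    """True if the estimated window exceeds the outage with some headroom.
    safety_factor is a hypothetical margin, not a Kafka setting."""
    window = size_retention_window_s(retention_bytes, ingest_bytes_per_s)
    return window >= outage_s * safety_factor


# 1 GiB per partition at 1 MiB/s keeps roughly 1024 seconds of data.
print(retention_covers_outage(1024**3, 1024**2, outage_s=600))  # False
print(retention_covers_outage(1024**3, 1024**2, outage_s=300))  # True
```

If the check fails for a realistic restart time, checkpointed offsets may already have been deleted by the time the consumers come back.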