Configuring DSEFS

You must configure data replication. You can optionally configure multiple DSEFS file systems in a cluster, one per datacenter, and perform other functions, including setting the Kafka log retention.

DSEFS does not span datacenters. Create a separate DSEFS instance in each datacenter, as described in the steps below.

DSEFS limitations

Know these limitations when you configure and tune DSEFS. The following functionality and features are either unsupported or have only limited support:

  • Encryption.

    Use operating system access controls to protect the local DSEFS data directories.

  • File system consistency checks (fsck) and file repair have only limited support. Running fsck re-replicates blocks that were under-replicated because a node was taken out of a cluster (see the example after this list).

  • Forced rebalancing, although the cluster eventually reaches balance.

  • Checksum.

  • Automatic backups.

  • Multi-datacenter replication.

  • Symbolic links (soft links, symlinks) and hardlinks.

  • Snapshots.
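
Although fsck support is limited, you can run the consistency check described above from the DSEFS shell. For example, a minimal invocation (available options vary by DSE version):

  $ dse fs fsck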

Procedure

  1. Configure replication for the metadata and the data blocks.

    DSEFS keyspace creation uses SimpleStrategy with a replication factor of 1. After starting the cluster for the first time, you must alter the keyspace to use NetworkTopologyStrategy with an appropriate replication factor.

    You must set the replication factor appropriately to prevent data loss if a node fails. Set replication factors for both the metadata and the data blocks. A replication factor of 3 for data blocks is suitable for most use cases.

    1. Globally: set replication for the metadata in the dsefs keyspace that is stored in the database.

      For example, use a CQL statement to configure a replication factor of 3 on the Analytics datacenter using NetworkTopologyStrategy:

      ALTER KEYSPACE dsefs
      WITH REPLICATION = {
         'class': 'NetworkTopologyStrategy',
         'Analytics': '3'};

      Datacenter names are case sensitive. Verify the case of the datacenter name with a utility such as dsetool status.
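
      You can also query the datacenter name of the node that you are connected to directly from cqlsh:

      SELECT data_center FROM system.local;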

    2. Run nodetool repair on the DSEFS keyspace.

      $ nodetool repair dsefs
    3. Locally: set the replication factor on a specific DSEFS file or directory where the data blocks are stored.

      For example, create a directory with a replication factor of 4 from the command line:

      $ dse fs mkdir -n 4 newdirectory

      When a replication factor (RF) is not specified, the RF is inherited from the parent directory.
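
      To confirm the settings on a DSEFS path, inspect it from the DSEFS shell. For example, the stat command displays the status of a file or directory (the exact fields shown vary by DSE version):

      $ dse fs stat newdirectory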

      Where is the dse.yaml file?

      The location of the dse.yaml file depends on the type of installation:

      • Package installations and Installer-Services installations: /etc/dse/dse.yaml

      • Tarball installations and Installer-No Services installations: <installation_location>/resources/dse/conf/dse.yaml

  2. If you have multiple Analytics datacenters, you must configure each DSEFS file system to replicate within its own datacenter:

    1. In the dse.yaml file, specify a separate DSEFS keyspace for each logical datacenter.

      For example, on a cluster with logical datacenters DC1 and DC2:

      On each node in DC1:

      dsefs_options:
          ...
          keyspace_name: dsefs1

      On each node in DC2:

      dsefs_options:
          ...
          keyspace_name: dsefs2
    2. Restart the nodes.
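
      For example, on a package installation, restart each node with the following command. (On a tarball installation, stop and start DSE from <installation_location>/bin instead.)

      $ sudo service dse restart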

    3. Alter each keyspace's replication so that it exists only in its own datacenter.

      On DC1:

      ALTER KEYSPACE dsefs1
      WITH REPLICATION = {
         'class': 'NetworkTopologyStrategy',
         'DC1': '3'};

      On DC2:

      ALTER KEYSPACE dsefs2
      WITH REPLICATION = {
         'class': 'NetworkTopologyStrategy',
         'DC2': '3'};
    4. Run nodetool repair on each DSEFS keyspace.

      On DC1:

      $ nodetool repair dsefs1

      On DC2:

      $ nodetool repair dsefs2

    In this example, the keyspace names dsefs1 and dsefs2 define separate DSEFS file systems, one in each datacenter.
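
    To verify that each keyspace replicates only within its own datacenter, inspect the schema from cqlsh:

    SELECT keyspace_name, replication
    FROM system_schema.keyspaces
    WHERE keyspace_name IN ('dsefs1', 'dsefs2');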

  3. When bouncing a streaming application, verify the Kafka log retention configuration, especially log.retention.check.interval.ms and log.retention.bytes. Ensure the Kafka log retention policy is robust enough to cover the length of time expected to bring the application and its consumers back up.

    For example, if the log retention policy is too conservative and deletes or rolls the logs very frequently to save disk space, users are likely to encounter issues when attempting to recover from a checkpoint that references offsets no longer retained in the Kafka logs.
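
    These retention settings are defined in the Kafka broker's server.properties file. The following values are Kafka's defaults and are shown for illustration only; tune them so that retention comfortably exceeds the expected recovery window:

    # server.properties (Kafka broker); default values, shown for illustration
    log.retention.hours=168
    log.retention.bytes=-1
    log.retention.check.interval.ms=300000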
