sstableloader

Bulk loads external data into a cluster, loads existing SSTables into another cluster with a different number of nodes or replication strategy, and restores snapshots.

The sstableloader provides the ability to:
  • Bulk load external data into a cluster.
  • Load existing SSTables into another cluster with a different number of nodes or replication strategy.
  • Restore snapshots.
The sstableloader streams a set of SSTable data files to a live cluster. It does not simply copy the set of SSTables to every node, but transfers the relevant part of the data to each node, conforming to the replication strategy of the cluster. The table into which the data is loaded does not need to be empty.
Warning: DSE verifies that the contents of the SSTables match the schema of the tables you are loading. User-defined types (UDTs) are a part of the keyspace, so loading an SSTable with a UDT from a different keyspace is incompatible, and will be rejected. A table is only allowed to use UDTs that exist in the same keyspace as the table.
Warning: Running the sstableloader against the live data directory can cause snapshots to fail. Specify the snapshots directory when running the sstableloader.

In the /var/lib/cassandra/data directory, select a keyspace and a table to access the associated snapshots directory, as shown in the following example.

cd /var/lib/cassandra/data/keyspace/table_name/snapshots/snapshot_name
Run sstableloader, specifying the path to the SSTables and the location of the target cluster. When using sstableloader, be aware of the following:
  • Repairing tables that have been loaded into a different cluster does not repair the source tables.
  • If required, upgrade the SSTable version to a version that is compatible with the DataStax Distribution of Apache Cassandra (DDAC).

    For SSTable compatibility and upgrading, see SSTable compatibility.

Prerequisites

  • The source data loaded by sstableloader must be in SSTables.
  • Because sstableloader uses the streaming protocol, it requires a direct connection over port 7000 (storage port) to each connected node.
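As a rough pre-flight check for the storage-port requirement above, a sketch along the following lines could probe each target node before a load starts. It assumes that the nc (netcat) utility is installed and supports the -z and -w flags. The function names and the example addresses are hypothetical and not part of sstableloader.

```shell
#!/bin/sh
# Split a comma-separated host list (the format accepted by -d) into one host per line.
split_hosts() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Return success if a TCP connection to host:port opens within 3 seconds.
# Assumption: nc (netcat) is available; -z only probes the port, -w sets a timeout.
port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Report whether the storage port (default 7000) is reachable on every host.
check_storage_port() {
    hosts=$1
    port=${2:-7000}
    for host in $(split_hosts "$hosts"); do
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port NOT reachable"
        fi
    done
}

# Example (hypothetical addresses):
# check_storage_port "10.0.0.1,10.0.0.2"
```

A successful probe only shows that the port accepts TCP connections from the loader machine; it does not verify SSL settings or authentication.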

Generating SSTables

When using sstableloader to load external data, you must first put the external data into SSTables.

SSTableWriter is the API to create raw data files locally for bulkloading into your cluster. The source code includes the CQLSSTableWriter implementation for creating SSTable files from external data without needing to understand the details of how those map to the underlying storage engine. Import the org.apache.cassandra.io.sstable.CQLSSTableWriter class, and define the schema for the data you want to import, a writer for the schema, and a prepared insert statement.

Taking snapshots

To restore from a snapshot, first take the snapshot with the nodetool snapshot command. You can then use sstableloader to load that snapshot into a cluster.

A snapshot first flushes all in-memory writes to disk, then makes a hard link of the SSTable files for each keyspace. You must have enough free disk space on the node to accommodate making snapshots of your data files. A single snapshot requires little disk space. However, snapshots can cause your disk usage to grow more quickly over time because a snapshot prevents old obsolete data files from being deleted. After the snapshot is complete, you can move the backup files to another location if needed, or you can leave them in place.
Note: Restoring from a snapshot requires the table schema.

See Taking a snapshot for more information.
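The snapshot step can be sketched as two small helpers: one that builds the nodetool snapshot command for a keyspace, and one that prints where the resulting files are expected to appear. The helper names and the tag in the example are hypothetical, and the path assumes the default data directory /var/lib/cassandra/data.

```shell
#!/bin/sh
# Build the nodetool command that snapshots one keyspace under a chosen tag.
# -t names the snapshot; without it nodetool generates a timestamp-based name.
snapshot_cmd() {
    tag=$1
    keyspace=$2
    echo "nodetool snapshot -t $tag $keyspace"
}

# Print the directory where the snapshot of a table is expected to appear,
# assuming the default data directory /var/lib/cassandra/data.
snapshot_dir() {
    keyspace=$1
    table_dir=$2   # table directory name, including its -UUID suffix
    tag=$3
    echo "/var/lib/cassandra/data/$keyspace/$table_dir/snapshots/$tag"
}

# Example (hypothetical tag name); review the command before running it:
# snapshot_cmd pre_load_backup cycling
```

Printing the command instead of running it keeps the sketch safe to try on a machine without a running node.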

Restoring DataStax Distribution of Apache Cassandra 3.11 snapshots

For information about preparing snapshots for sstableloader import, see Restoring from centralized backups.

Importing SSTables from an existing cluster

Before importing existing SSTables, run nodetool flush on each source node to ensure that any data in memtables is written out to the SSTables on disk.
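One way to approach the flush step is a dry run that prints one nodetool flush command per source node, so the list can be reviewed before anything is executed. This is only a sketch: the function name and the addresses are hypothetical, and using nodetool -h against a remote node assumes that remote JMX access is enabled for it.

```shell
#!/bin/sh
# Print one nodetool flush command per source node (dry run).
# hosts: comma-separated list of source nodes; keyspace: keyspace to flush.
flush_cmds() {
    hosts=$1
    keyspace=$2
    for host in $(printf '%s\n' "$hosts" | tr ',' ' '); do
        echo "nodetool -h $host flush $keyspace"
    done
}

# Example: review the commands, then pipe them to sh to execute them.
# flush_cmds "10.0.0.1,10.0.0.2" cycling | sh
```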

Preparing the target environment

Before loading the data, you must define the schema of the target tables with CQL.
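The schema step could look roughly like the following sketch, which emits CQL for the target keyspace and table so that it can be passed to cqlsh. The replication settings and the column definitions shown here are illustrative assumptions, not the schema of any particular source table; the function name and the node address are also hypothetical.

```shell
#!/bin/sh
# Emit CQL that creates the target keyspace and table if they do not exist.
# The replication settings and columns are illustrative assumptions;
# use the schema of the source table instead.
target_schema_cql() {
    cat <<'EOF'
CREATE KEYSPACE IF NOT EXISTS cycling
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS cycling.cyclist_name (
  id UUID PRIMARY KEY,
  lastname text,
  firstname text
);
EOF
}

# Example: apply the schema through cqlsh on a target node (hypothetical address).
# cqlsh 10.0.0.1 -e "$(target_schema_cql)"
```

The target table definition must match the contents of the SSTables being loaded, including any UDTs, which must exist in the same keyspace as the table.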

Usage

install_location/bin/sstableloader -d host_url (,host_url ...) [options] sstable_directory
Table 1. sstableloader options
Short option Long option Description
-alg --ssl-alg <ALGORITHM> Client SSL algorithm (default: SunX509).
-ap --auth-provider <auth provider class name> Allows the use of a third party auth provider. Can be combined with -u <username> and -pw <password> if the auth provider supports plain text credentials.
-ciphers --ssl-ciphers <CIPHER-SUITES> Client SSL. Comma-separated list of encryption suites.
-cph --connections-per-host <connectionsPerHost> Number of concurrent connections-per-host.
-d --nodes <initial_hosts> Required. Connect to a list of (comma separated) hosts for initial cluster information.
-f --conf-path <path_to_config_file> Path to the cassandra.yaml file, which supplies the streaming throughput and client/server SSL settings.
-h --help Display help.
-i --ignore <NODES> Do not stream to this comma separated list of nodes.
-idct --inter_dc_throttle_mbits <MBPS> Inter-datacenter throttle speed in Megabits per second (default unlimited).
-ks --keystore <KEYSTORE> Client SSL. Full path to the keystore.
-kspw --keystore-password <KEYSTORE-PASSWORD> Client SSL. Password for the keystore. Overrides the client_encryption_options option in cassandra.yaml.
--no-progress Do not display progress.
-p --port <rpc port> Native transport port used to connect to the cluster (default: 9042).
-prtcl --ssl-protocol <PROTOCOL> Client SSL. Connections protocol to use (default: TLS). Overrides the server_encryption_options option in cassandra.yaml.
-pw --password <password> Authentication password.
-sp --storage_port <port_num> Port used for inter-node communication (default 7000).
-ssp --ssl_storage_port <port_num> Port used for TLS inter-node communication (default 7001).
-st --store-type <STORE-TYPE> Client SSL. Type of store.
-t --throttle <throttle> Throttle speed in megabits (Mb) per second (default: unlimited). Overrides the stream_throughput_outbound_megabits_per_sec option in cassandra.yaml.
-ts --truststore <TRUSTSTORE> Client SSL. Full path to truststore.
-tspw --truststore-password <TRUSTSTORE-PASSWORD> Client SSL. Password of the truststore.
-u --username <username> User name for authentication.
-v --verbose Verbose output.
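To show how a few of the options in Table 1 combine on one command line, the sketch below assembles an sstableloader invocation from a host list, an SSTable directory, and optional -t and -cph values. The function name, addresses, and values are hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Assemble an sstableloader command line from a few of the options in Table 1.
# Arguments: hosts, SSTable directory, optional throttle (megabits per second),
# optional connections per host.
loader_cmd() {
    hosts=$1
    dir=$2
    throttle=$3
    cph=$4
    cmd="sstableloader -d $hosts"
    if [ -n "$throttle" ]; then
        cmd="$cmd -t $throttle"      # -t / --throttle
    fi
    if [ -n "$cph" ]; then
        cmd="$cmd -cph $cph"         # -cph / --connections-per-host
    fi
    echo "$cmd $dir"
}

# Example (hypothetical hosts and values):
# loader_cmd 10.0.0.1,10.0.0.2 /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d 100 2
```

Throttling with -t can be useful when the target cluster is serving live traffic during the load.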

Loading files

The sstableloader bulk loads the SSTables found in the specified directory to the indicated target cluster. The parent directories of the path determine the target keyspace and table name.

The location of the SSTables to be streamed must end with directories named for the keyspace and table, including the files to load. For example:
ls /var/lib/cassandra/data/keyspace_name/table_name/file_names
In the following path, the keyspace is cycling, the table name is cyclist_name-9e516080f30811e689e40725f37c761d, and the file name is mc-1-big-Data.db.
ls /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/mc-1-big-Data.db
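The path convention described above can be checked before a load with a small helper that prints the keyspace and table directory names implied by an SSTable directory. This is a sketch of the naming rule only; the function name is hypothetical, and it does not inspect the SSTable files themselves.

```shell
#!/bin/sh
# Print the keyspace and table directory names implied by an SSTable directory.
# The table directory is the last path component; the keyspace is its parent.
target_from_path() {
    dir=${1%/}                       # drop a trailing slash, if any
    table_dir=$(basename "$dir")
    keyspace=$(basename "$(dirname "$dir")")
    echo "keyspace=$keyspace table_dir=$table_dir"
}

# Example:
# target_from_path /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d
```

If the printed keyspace or table directory does not match the target schema, move or copy the files into a correctly named directory before running the loader.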

Loading snapshots

The sstableloader bulk loads the SSTables found in the specified directory to the indicated target cluster. The parent directories of the path determine the target keyspace and table name.

For snapshots, the location of the SSTables to be streamed must end with directories named for the keyspace and table, including the snapshot name. By default, snapshots are created in the /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/ directory.
ls /var/lib/cassandra/data/keyspace_name/table_name/snapshots/snapshot_name
In the following path, the keyspace is cycling, the table name is cyclist_name-9e516080f30811e689e40725f37c761d, and the snapshot is 1527686840030.
ls /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030

For more sstableloader options, see sstableloader options.

Using sstableloader

  1. Go to the location of the SSTables and list the SSTable files for the table.
    cd /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/
    ls
    mc-1-big-Data.db
    mc-2-big-Data.db
    ...
    mc-6-TOC.txt
  2. To bulk load the files or snapshots, indicate one or more nodes in the target cluster with the -d flag, which takes a comma-separated list of IP addresses or hostnames. Additionally, specify the path to the files or snapshot on the source machine:

    Loading files

    If loading files, ensure that the files are in a directory path of the following form, in which the keyspace and table directory names match those of the target keyspace and table.
    ../keyspace_name/table_name/file_names
    In this example, ensure the files are in the following directory.
    ../cycling/cyclist_name-9e516080f30811e689e40725f37c761d/mc-1-big-Data.db

    Loading snapshots

    If restoring snapshot data from some other source, ensure that the snapshot files are in a directory path of the following form, in which the keyspace and table directory names match those of the target keyspace and table.
    ../keyspace_name/table_name/snapshots/snapshot_name
    In this example, ensure the snapshot files are in the following directory.
    ../cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
Note: To get the best throughput from SSTable loading, you can run multiple instances of sstableloader to stream across multiple machines. No hard limit exists on the number of sstableloader instances that can run at the same time, so you can add loaders until you see no further improvement.
For example, the following command streams the snapshot 1527686840030 to the target cluster, using the node at 110.82.155.1 for initial cluster information:
install_location/bin/sstableloader -d 110.82.155.1 /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030