sstableloader
Use sstableloader to:
- Bulk load external data into a cluster.
- Load existing SSTables into another cluster with a different number of nodes or replication strategy.
- Restore snapshots.
In the /var/lib/cassandra/data directory, select a keyspace and a table to access the associated snapshots directory, as shown in the following example.
cd /var/lib/cassandra/data/keyspace/table_name/snapshots/snapshot_name
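To see which snapshots exist before choosing one, the directory layout above can be walked with a small helper. This is a minimal sketch: the function name list_snapshots is hypothetical, and it assumes the default data directory layout described in this section.

```shell
#!/bin/sh
# List every snapshot directory found under one keyspace.
# list_snapshots is an illustrative helper, not part of Cassandra.
list_snapshots() {
    data_dir=$1
    keyspace=$2
    for snap in "$data_dir/$keyspace"/*/snapshots/*/; do
        # Skip the unexpanded pattern when no snapshot exists.
        [ -d "$snap" ] || continue
        printf '%s\n' "${snap%/}"
    done
}

# Example: list_snapshots /var/lib/cassandra/data cycling
```

Each printed path has the form data_dir/keyspace/table_name/snapshots/snapshot_name, so it can be passed to sstableloader unchanged.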
- Repairing tables that have been loaded into a different cluster does not repair the source tables.
- If required, upgrade the SSTable version to a version that is compatible
with the DataStax Distribution of Apache Cassandra™ (DDAC).
For SSTable compatibility and upgrading, see SSTable compatibility.
Prerequisites
- The source data loaded by sstableloader must be in SSTables.
- Because sstableloader uses the streaming protocol, it requires a direct connection over port 7000 (storage port) to each connected node.
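The port requirement can be checked before a load is attempted. The following is a sketch under some assumptions: the helper names are hypothetical, bash and the coreutils timeout command are available on the machine running sstableloader, and the 3-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Return success if a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp pseudo-device through an explicit `bash -c`.
check_storage_port() {
    host=$1
    port=${2:-7000}   # 7000 is the default storage port
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Report reachability of the storage port for each host given.
check_nodes() {
    status=0
    for node in "$@"; do
        if check_storage_port "$node"; then
            echo "$node: port 7000 reachable"
        else
            echo "$node: port 7000 NOT reachable"
            status=1
        fi
    done
    return $status
}

# Example: check_nodes 110.82.155.1
```

A successful TCP connection only shows that the port is open; it does not confirm that streaming will be accepted.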
Generating SSTables
When using sstableloader to load external data, you must first put the external data into SSTables.
SSTableWriter is the API for creating raw data files locally for bulk loading into your cluster. The source code includes the CQLSSTableWriter implementation, which creates SSTable files from external data without requiring you to understand how the data maps to the underlying storage engine. Import the org.apache.cassandra.io.sstable.CQLSSTableWriter class, then define the schema for the data you want to import, a writer for that schema, and a prepared insert statement.
Taking snapshots
To restore from a snapshot, first take the snapshot with the nodetool snapshot command. You can then use sstableloader to load that snapshot into a cluster.
See Taking a snapshot for more information.
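As an illustration, a snapshot of a single keyspace can be taken with a named tag so that its directory is easy to find later. This sketch assumes nodetool is on the PATH of the node being snapshotted; the take_snapshot helper and the tag before_migration are hypothetical.

```shell
#!/bin/sh
# NODETOOL can be overridden, for example with a full path to the binary.
NODETOOL=${NODETOOL:-nodetool}

# Take a snapshot of one keyspace on the local node.
take_snapshot() {
    keyspace=$1
    tag=${2:-$(date +%s)}   # default tag: current Unix time in seconds
    "$NODETOOL" snapshot -t "$tag" "$keyspace"
}

# Example: take_snapshot cycling before_migration
```

The tag becomes the snapshot_name directory under each table's snapshots directory.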
Restoring DataStax Distribution of Apache Cassandra™ 3.11 snapshots
For information about preparing snapshots for sstableloader import, see Restoring from centralized backups.
Importing SSTables from an existing cluster
Before importing existing SSTables, run nodetool flush on each source node to ensure that any data in memtables is written out to the SSTables on disk.
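Flushing every source node can be scripted as a loop. The sketch below assumes that nodetool can reach each node's JMX port from the machine where it runs; the flush_nodes helper and the example addresses are hypothetical.

```shell
#!/bin/sh
# NODETOOL can be overridden, for example with a full path to the binary.
NODETOOL=${NODETOOL:-nodetool}

# Flush one keyspace on each listed source node, stopping at the first failure.
flush_nodes() {
    keyspace=$1
    shift
    for node in "$@"; do
        echo "Flushing $keyspace on $node"
        "$NODETOOL" -h "$node" flush "$keyspace" || return 1
    done
}

# Example: flush_nodes cycling 10.0.0.1 10.0.0.2
```

If remote JMX access is not available, run nodetool flush locally on each source node instead.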
Preparing the target environment
Before loading the data, you must define the schema of the target tables with CQL.
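For example, the schema can be written to a file and applied to a target node with cqlsh. This is only a sketch: the replication settings and column definitions are assumptions made for illustration and must be replaced with the schema that actually produced the SSTables being loaded.

```shell
#!/bin/sh
# Write an illustrative schema to a file and apply it to one target node.
# SCHEMA_FILE and TARGET_HOST are placeholders; override them as needed.
SCHEMA_FILE=${SCHEMA_FILE:-${TMPDIR:-/tmp}/cycling_schema.cql}
TARGET_HOST=${TARGET_HOST:-110.82.155.1}

# Assumed schema: it must match the source table exactly.
cat > "$SCHEMA_FILE" <<'EOF'
CREATE KEYSPACE IF NOT EXISTS cycling
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS cycling.cyclist_name (
  id UUID PRIMARY KEY,
  lastname text,
  firstname text
);
EOF

# Apply the schema only when cqlsh is installed on this machine.
if command -v cqlsh >/dev/null 2>&1; then
    cqlsh "$TARGET_HOST" -f "$SCHEMA_FILE"
else
    echo "cqlsh not found; schema written to $SCHEMA_FILE"
fi
```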
Usage
install_location/bin/sstableloader -d host_url (,host_url ...) [options] sstable_directory
Short option | Long option | Description |
---|---|---|
-alg | --ssl-alg <ALGORITHM> | Client SSL algorithm (default: SunX509). |
-ap | --auth-provider <auth provider class name> | Allows the use of a third party auth provider. Can be combined with -u <username> and -pw <password> if the auth provider supports plain text credentials. |
-ciphers | --ssl-ciphers <CIPHER-SUITES> | Client SSL. Comma-separated list of encryption suites. |
-cph | --connections-per-host <connectionsPerHost> | Number of concurrent connections-per-host. |
-d | --nodes <initial_hosts> | Required. Connect to a list of (comma separated) hosts for initial cluster information. |
-f | --conf-path <path_to_config_file> | Path to the cassandra.yaml file for streaming throughput and client/server SSL settings. |
-h | --help | Display help. |
-i | --ignore <NODES> | Do not stream to this comma separated list of nodes. |
-idct | --inter_dc_throttle_mbits <MBPS> | Inter-datacenter throttle speed in Megabits per second (default unlimited). |
-ks | --keystore <KEYSTORE> | Client SSL. Full path to the keystore. |
-kspw | --keystore-password <KEYSTORE-PASSWORD> | Client SSL. Password for the keystore. Overrides the client_encryption_options option in cassandra.yaml. |
| | --no-progress | Do not display progress. |
-p | --port <rpc port> | Port used for native client connections (default: 9042). |
-prtcl | --ssl-protocol <PROTOCOL> | Client SSL. Connections protocol to use (default: TLS). Overrides the server_encryption_options option in cassandra.yaml. |
-pw | --password <password> | Authentication password. |
-sp | --storage_port <port_num> | Port used for inter-node communication (default 7000). |
-ssp | --ssl_storage_port | Port used for TLS inter-node communication (default 7001). |
-st | --store-type <STORE-TYPE> | Client SSL. Type of store. |
-t | --throttle <throttle> | Throttle speed in megabits (Mb) per second (default: unlimited). Overrides the stream_throughput_outbound_megabits_per_sec option in cassandra.yaml. |
-ts | --truststore <TRUSTSTORE> | Client SSL. Full path to truststore. |
-tspw | --truststore-password <TRUSTSTORE-PASSWORD> | Client SSL. Password of the truststore. |
-u | --username <username> | User name for authentication. |
-v | --verbose | Verbose output. |
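
Several of these options are typically combined in one invocation. The sketch below wraps such a call in a hypothetical helper, load_throttled, and assumes sstableloader is on the PATH; the value of 2 connections per host is an arbitrary illustration, not a recommendation.

```shell
#!/bin/sh
# SSTABLELOADER can be overridden, for example with install_location/bin/sstableloader.
SSTABLELOADER=${SSTABLELOADER:-sstableloader}

# Load one SSTable directory with a throttle and without progress output.
load_throttled() {
    hosts=$1      # comma-separated list of initial hosts, passed to -d
    throttle=$2   # megabits per second, passed to -t
    dir=$3        # directory that contains the SSTables
    "$SSTABLELOADER" -d "$hosts" -t "$throttle" -cph 2 --no-progress "$dir"
}

# Example:
# load_throttled 110.82.155.1 50 /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d
```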
Loading files
The sstableloader bulk loads the SSTables found in the specified directory, where the parent directories of the path are used for the target keyspace and table name, to the indicated target cluster.
ls /var/lib/cassandra/data/keyspace_name/table_name/file_names
In the following path, the keyspace is cycling, the table name is cyclist_name-9e516080f30811e689e40725f37c761d, and the file name is mc-1-big-Data.db.
ls /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/mc-1-big-Data.db
Loading snapshots
The sstableloader bulk loads the SSTables found in the specified directory, where the parent directories of the path are used for the target keyspace and table name, to the indicated target cluster.
ls /var/lib/cassandra/data/keyspace_name/table_name/snapshots/snapshot_name
In the following path, the keyspace is cycling, the table name is cyclist_name-9e516080f30811e689e40725f37c761d, and the snapshot is 1527686840030.
ls /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
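The way the parent directories map to a keyspace and table can be illustrated with a small helper that takes either kind of path. This sketch only mirrors the description in this section. It is not how sstableloader itself parses paths, and the name target_from_path is hypothetical.

```shell
#!/bin/sh
# Print "keyspace table_directory" for a table directory or a snapshot directory.
target_from_path() {
    dir=${1%/}                       # drop one trailing slash, if any
    parent=$(dirname "$dir")
    if [ "$(basename "$parent")" = "snapshots" ]; then
        # Layout: .../keyspace/table/snapshots/snapshot_name
        dir=$(dirname "$parent")
    fi
    table=$(basename "$dir")
    keyspace=$(basename "$(dirname "$dir")")
    printf '%s %s\n' "$keyspace" "$table"
}

# Example:
# target_from_path /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
```

Running the helper before a load is a quick way to confirm that a copied directory tree still carries the keyspace and table names expected by the target cluster.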
For the full list of sstableloader options, see the options table under Usage.
Using sstableloader
- Go to the location of the SSTables and view the contents of the table.
cd /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/
ls
mc-1-big-Data.db mc-2-big-Data.db ... mc-6-TOC.txt
- To bulk load the files or snapshots, indicate one or more nodes in the target cluster with the -d flag, which takes a comma-separated list of IP addresses or hostnames. Additionally, specify the path to the files or snapshot on the source machine:
Loading files
If loading files, ensure that the files are in the following directory structure, whose directory names match the keyspace and table names in the target cluster.
../keyspace_name/table_name/file_names
In this example, ensure the files are in the following directory.
../cycling/cyclist_name-9e516080f30811e689e40725f37c761d/mc-1-big-Data.db
Loading snapshots
If restoring snapshot data from some other source, ensure that the snapshot files are in the following directory structure, whose directory names match the keyspace and table names in the target cluster.
../keyspace_name/table_name/snapshots/snapshot_name
In this example, ensure the snapshot files are in the following directory.
../cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots
installation_location/bin/sstableloader -d 110.82.155.1 /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030