sstableloader
The sstableloader provides the ability to:
- Bulk load external data into a cluster.
- Load existing SSTables into another cluster with a different number of nodes or replication strategy.
- Restore snapshots.
The sstableloader streams a set of SSTable data files to a live cluster.
It does not simply copy the set of SSTables to every node, but transfers the relevant part of the data to each node, conforming to the replication strategy of the cluster.
The table into which the data is loaded does not need to be empty.
DSE verifies that the contents of the SSTables match the schema of the tables you are loading. User-defined types (UDTs) are a part of the keyspace, so loading an SSTable with a UDT from a different keyspace is incompatible and therefore rejected. A table is only allowed to use UDTs that exist in the same keyspace as the table.
Running the sstableloader
cd /var/lib/cassandra/data/keyspace/table_name/snapshots/snapshot_name
Run sstableloader, specifying the path to the SSTables and passing it the location of the target cluster.
When using the sstableloader, be aware of the following:
- Repairing tables that have been loaded into a different cluster does not repair the source tables.
- If required, upgrade the SSTable version to a version that is compatible with the current version of DataStax Enterprise. For SSTable compatibility and upgrading, see SSTable compatibility.
Prerequisites
- The source data loaded by sstableloader must be in SSTables.
- Because sstableloader uses the streaming protocol, it requires a direct connection over port 7000 (storage port) to each connected node.
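Because the loader needs a direct connection to the storage port on each node, it can help to confirm reachability before starting a load. The following is a minimal sketch, assuming bash and the coreutils timeout command are available. storage_port_open is a hypothetical helper, not part of DataStax Enterprise, and the hosts to check are whatever you pass on the command line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a TCP connection to <host>:<port> opens.
# Assumes bash (for /dev/tcp redirection) and coreutils `timeout` exist.
# The port defaults to 7000, the storage port named in the prerequisites.
storage_port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "${2:-7000}" 2>/dev/null
}

# Check every host given as a script argument; report the unreachable ones.
for host in "$@"; do
    storage_port_open "$host" 7000 || echo "cannot reach $host:7000" >&2
done
```

A successful check only shows that the port accepts TCP connections; it does not prove that streaming will succeed.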
Generating SSTables
When using sstableloader to load external data, you must first put the external data into SSTables.
SSTableWriter is the API to create raw data files locally for bulk loading into your cluster.
The source code includes the CQLSSTableWriter implementation for creating SSTable files from external data without needing to understand the details of how those map to the underlying storage engine.
Import the org.apache.cassandra.io.sstable.CQLSSTableWriter class, and define the schema for the data you want to import, a writer for the schema, and a prepared insert statement.
Taking snapshots
To restore from a snapshot, first use the nodetool snapshot command to take the snapshot. You can then use sstableloader to load it into a cluster.
A snapshot first flushes all in-memory writes to disk, then makes a hard link of the SSTable files for each keyspace. You must have enough free disk space on the node to accommodate making snapshots of your data files. A single snapshot requires little disk space. However, snapshots can cause your disk usage to grow more quickly over time because a snapshot prevents obsolete data files from being deleted. After the snapshot is complete, you can move the backup files to another location if needed, or you can leave them in place.
Restoring from a snapshot requires the table schema.
See Taking a snapshot for more information.
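Once a snapshot exists, each table in the keyspace has its own snapshot directory, and those are the directories that sstableloader reads. The sketch below lists them for one snapshot. snapshot_dirs is a hypothetical helper, the snapshot name pre_load is illustrative, and the data root is assumed to be the default /var/lib/cassandra/data used elsewhere on this page.

```shell
#!/bin/sh
# A snapshot would first be taken on the node, for example:
#   nodetool snapshot -t pre_load cycling
# (pre_load is an illustrative snapshot name.)

# Hypothetical helper: list the per-table directories of one snapshot.
# Usage: snapshot_dirs <data_root> <keyspace> <snapshot_name>
snapshot_dirs() {
    for dir in "$1/$2"/*/snapshots/"$3"; do
        # An unmatched glob stays literal, so keep only real directories.
        [ -d "$dir" ] && printf '%s\n' "$dir"
    done
    return 0
}

# Prints one line per table that has the snapshot (nothing if none exist).
snapshot_dirs /var/lib/cassandra/data cycling pre_load
```

Each printed path can be passed to sstableloader as the SSTable directory.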
Restoring DataStax Enterprise snapshots
For information about preparing snapshots for sstableloader import, see Restoring from centralized backups.
Importing SSTables from an existing cluster
Before importing existing SSTables, run nodetool flush on each source node to ensure that any data in memtables is written out to the SSTables on disk.
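Flushing every source node can be scripted. The following is a rough sketch, assuming nodetool can reach each node remotely through its -h option (remote JMX access is often disabled, in which case run nodetool flush locally on each node instead). flush_sources and the DRY_RUN variable are hypothetical, and the addresses are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: flush memtables on every source node before export.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
flush_sources() {
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "nodetool -h $host flush"
        else
            nodetool -h "$host" flush || return 1
        fi
    done
}

# Placeholder source node addresses.
flush_sources 10.0.0.1 10.0.0.2
```

Set DRY_RUN=0 to run the commands once the printed list looks right.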
Preparing the target environment
Before loading the data, you must define the schema of the target tables with CQL.
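One way to define the target schema is to export it from the source cluster and replay it on the target. The sketch below only prints the two cqlsh commands involved. schema_copy_cmds is a hypothetical helper, 10.0.0.1 is a placeholder source address, and schema.cql is an assumed file name. The exported CQL includes the keyspace definition, so review its replication settings before applying it to a cluster with a different topology.

```shell
#!/bin/sh
# Hypothetical helper: print the cqlsh commands that copy a keyspace schema.
# Usage: schema_copy_cmds <source_host> <target_host> <keyspace> [<schema_file>]
schema_copy_cmds() {
    file=${4:-schema.cql}
    # Export the CQL that defines the keyspace and its tables on the source.
    printf 'cqlsh -e "DESCRIBE KEYSPACE %s" %s > %s\n' "$3" "$1" "$file"
    # Replay that CQL on the target before running sstableloader.
    printf 'cqlsh -f %s %s\n' "$file" "$2"
}

schema_copy_cmds 10.0.0.1 110.82.155.1 cycling
```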
Usage
sstableloader -d host_url (,host_url ...) [options] sstable_directory
Tarball and Installer No-Services path:
<installation_location>/resources/cassandra/bin
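As an illustration of the synopsis, the sketch below assembles a command line from a directory and one or more hosts, joining the hosts with commas as -d expects. loader_cmd is a hypothetical helper that prints the command rather than running it, and the second host address is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print an sstableloader command line.
# Usage: loader_cmd <sstable_directory> <host> [<host> ...]
# Note: the directory is printed unquoted, so paths with spaces need care.
loader_cmd() {
    sstable_dir=$1
    shift
    hosts=$(IFS=,; printf '%s' "$*")   # join the remaining arguments with commas
    printf 'sstableloader -d %s %s\n' "$hosts" "$sstable_dir"
}

loader_cmd /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d 110.82.155.1 110.82.155.2
```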
| Short option | Long option | Description |
|---|---|---|
| | | Client SSL algorithm (default: SunX509). |
| | | Allows the use of a third party auth provider. Can be combined with |
| | | Client SSL. Comma-separated list of encryption suites. |
| | | Number of concurrent connections per host. |
| -d | | Required. Connect to a comma-separated list of hosts for initial cluster information. |
| | | Path to the |
| | | Path to the |
| | | Display help. |
| | | Do not stream to this comma-separated list of nodes. |
| | | Inter-datacenter throttle speed in megabits (Mb) per second (default: unlimited). |
| | | Client SSL. Full path to the keystore. |
| | | Client SSL. Password for the keystore. Overrides the |
| | | Do not display progress. |
| | | RPC port (default: 9042). |
| | | Client SSL. Connections protocol to use (default: TLS). Overrides the |
| | | Authentication password. |
| | | Port used for inter-node communication (default: 7000). |
| | | Port used for TLS inter-node communication (default: 7001). |
| | | Client SSL. Type of store. |
| | | Throttle speed in megabits (Mb) per second (default: unlimited). Overrides the |
| | | Client SSL. Full path to truststore. |
| | | Client SSL. Password of the truststore. |
| | | User name for authentication. |
| | | Verbose output. |
Loading files
The sstableloader bulk loads the SSTables found in the specified directory, where the parent directories of the path are used for the target keyspace and table name, to the indicated target cluster.
The location of the SSTables to be streamed must end with directories named for the keyspace and table, including the files to load. For example:
ls
/var/lib/cassandra/data/keyspace_name/table_name/file_names
In the following path, the keyspace is cycling, the table name is cyclist_name-9e516080f30811e689e40725f37c761d, and the file name is mc-1-big-Data.db.
ls
/var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/mc-1-big-Data.db
Loading snapshots
The sstableloader bulk loads the SSTables found in the specified directory, where the parent directories of the path are used for the target keyspace and table name, to the indicated target cluster.
For snapshots, the location of the SSTables to be streamed must end with directories named for the keyspace and table, including the snapshot name.
By default, snapshots are created in the /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/ directory.
ls
/var/lib/cassandra/data/keyspace_name/table_name/snapshots/snapshot_name
In the following path, the keyspace is cycling, the table name is cyclist_name-9e516080f30811e689e40725f37c761d, and the snapshot is 1527686840030.
ls
/var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
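To make the path convention concrete, the sketch below derives the keyspace and table directory names from a path in either layout described above. sstable_target is a hypothetical helper that mirrors the documented layout. It is not the parsing logic that sstableloader itself uses.

```shell
#!/bin/sh
# Hypothetical helper: print "<keyspace> <table_dir>" for an SSTable directory.
# Handles both .../keyspace/table and .../keyspace/table/snapshots/<name>.
sstable_target() {
    dir=${1%/}                       # drop one trailing slash, if any
    parent=$(dirname "$dir")
    # Snapshot layout: the table directory sits two levels above the snapshot.
    if [ "$(basename "$parent")" = "snapshots" ]; then
        dir=$(dirname "$parent")
    fi
    table=$(basename "$dir")
    keyspace=$(basename "$(dirname "$dir")")
    printf '%s %s\n' "$keyspace" "$table"
}

# Both calls print: cycling cyclist_name-9e516080f30811e689e40725f37c761d
sstable_target /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d
sstable_target /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
```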
For more sstableloader options, see sstableloader options.
Using sstableloader
- Go to the location of the SSTables and view the contents of the table.
cd /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/
ls
mc-1-big-Data.db mc-2-big-Data.db ... mc-6-TOC.txt
- To bulk load the files or snapshots, indicate one or more nodes in the target cluster with the -d flag, which takes a comma-separated list of IP addresses or hostnames. Additionally, specify the path to the files or snapshot on the source machine:
Loading files
If loading files, ensure that the files are in a directory path of the following form, where the keyspace and table directory names match those of the target.
../keyspace_name/table_name/file_names
In this example, ensure the files are in the following directory.
../cycling/cyclist_name-9e516080f30811e689e40725f37c761d/mc-1-big-Data.db
Loading snapshots
If restoring snapshot data from some other source, ensure that the snapshot files are in a directory path of the following form, where the keyspace and table directory names match those of the target.
../keyspace_name/table_name/snapshots/snapshot_name
In this example, ensure the snapshot files are in the following directory.
../cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots
To get the best throughput from SSTable loading, you can use multiple instances of sstableloader.
Package installation
sstableloader -d 110.82.155.1 /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
Tarball installation
installation_location/bin/sstableloader -d 110.82.155.1 /var/lib/cassandra/data/cycling/cyclist_name-9e516080f30811e689e40725f37c761d/snapshots/1527686840030
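When the SSTable files come from some other location, they first have to be placed under directories named for the keyspace and table, as described above. The following is a minimal sketch of that staging step. stage_sstables is a hypothetical helper, and the /backups and /tmp/staging paths in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: copy SSTable files into <staging>/<keyspace>/<table>.
# Usage: stage_sstables <source_dir> <keyspace> <table> <staging_root>
stage_sstables() {
    dest="$4/$2/$3"
    mkdir -p "$dest" || return 1
    for f in "$1"/*; do
        # Copy regular files only; skip subdirectories such as snapshots/.
        if [ -f "$f" ]; then
            cp "$f" "$dest"/ || return 1
        fi
    done
    printf '%s\n' "$dest"     # print the directory to hand to sstableloader
}

# Example (paths are placeholders):
#   dir=$(stage_sstables /backups/1527686840030 cycling \
#         cyclist_name-9e516080f30811e689e40725f37c761d /tmp/staging)
#   sstableloader -d 110.82.155.1 "$dir"
```

Copying rather than moving leaves the original files untouched if the load has to be repeated.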