sstableloader (Cassandra bulk loader)

The Cassandra bulk loader, also called the sstableloader, provides the ability to:
  • Bulk load external data into a cluster.
  • Load existing SSTables into another cluster with a different number of nodes or replication strategy.
  • Restore snapshots.

The sstableloader streams a set of SSTable data files to a live cluster. It does not simply copy the set of SSTables to every node, but transfers the relevant part of the data to each node, conforming to the replication strategy of the cluster. The table into which the data is loaded does not need to be empty.

Warning: Running the sstableloader against the live data directory can cause snapshots to fail. Specify the snapshots directory when running the sstableloader.

In the /var/lib/cassandra/data directory, select a keyspace and a table to access the associated snapshots directory, as shown in the following example.

$ cd /var/lib/cassandra/data/keyspace/table_name/snapshots/snapshot_name

Run sstableloader, specifying the path to the SSTables and the location of the target cluster. When using sstableloader, be aware of the following:
  • Bulk loading SSTables created in versions prior to Cassandra 3.0 is supported only in Cassandra 3.0.5 and later.
  • Repairing tables that have been loaded into a different cluster does not repair the source tables.

Prerequisites

  • The source data loaded by sstableloader must be in SSTables.
  • Because sstableloader uses the streaming protocol, it requires a direct connection over port 7000 (the storage port) to each connected node.
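
One way to confirm the connectivity prerequisite before a load is to try opening a TCP connection to the storage port of each node. The following is a minimal sketch, not part of sstableloader: the function name check_storage_port and the example addresses are hypothetical, and it assumes that bash (for /dev/tcp) and the coreutils timeout command are available.

```shell
# Report whether a TCP connection to a node's storage port can be opened.
# Usage: check_storage_port HOST [PORT]   (PORT defaults to 7000)
check_storage_port() {
  local host="$1" port="${2:-7000}"
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} unreachable"
    return 1
  fi
}

# Example (hypothetical addresses):
#   for h in 10.0.0.1 10.0.0.2; do check_storage_port "$h"; done
```

A failed check only shows that the port could not be reached from the machine running the script; the cause may be a firewall, a different storage port setting, or a node that is down.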

Generating SSTables

When using sstableloader to load external data, you must first put the external data into SSTables.

SSTableWriter is the API to create raw Cassandra data files locally for bulk load into your cluster. The Cassandra source code includes the CQLSSTableWriter implementation for creating SSTable files from external data without needing to understand the details of how those map to the underlying storage engine. Import the org.apache.cassandra.io.sstable.CQLSSTableWriter class, and define the schema for the data you want to import, a writer for the schema, and a prepared insert statement. For a complete example, see https://www.datastax.com/blog/2014/09/using-cassandra-bulk-loader-updated.

Restoring Cassandra snapshots

For information about preparing snapshots for sstableloader import, see Restoring from centralized backups.

Importing SSTables from an existing cluster

Before importing existing SSTables, run nodetool flush on each source node to ensure that any data in memtables is written out to the SSTables on disk.
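
The flush step could be scripted roughly as follows. This is a sketch under assumptions: the function name flush_nodes, the NODETOOL override variable, the keyspace name, and the node addresses are illustrative, and running nodetool with -h against a remote node assumes that JMX is reachable from the machine running the script.

```shell
# Flush memtables to disk on every source node before loading its SSTables.
# NODETOOL can be overridden (for example with "echo nodetool") to preview the commands.
NODETOOL="${NODETOOL:-nodetool}"

# Usage: flush_nodes KEYSPACE HOST [HOST ...]
flush_nodes() {
  local keyspace="$1"; shift
  local host
  for host in "$@"; do
    # -h selects the node to contact; flush is limited to the given keyspace.
    $NODETOOL -h "$host" flush "$keyspace" || return 1
  done
}

# Example (hypothetical keyspace and addresses):
#   flush_nodes Keyspace1 10.0.0.1 10.0.0.2
```

The function stops at the first node that fails to flush, so a partial failure is noticed before any SSTables are streamed.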

Preparing the target environment

Before loading the data, you must define the schema of the target tables with CQL or Thrift.

Usage

  • Cassandra package installations:
    sstableloader -d host_url (,host_url …) [options] path_to_keyspace
  • Cassandra tarball installations:
    cd install_location
    bin/sstableloader -d host_url (,host_url …) [options] path_to_keyspace

The sstableloader bulk loads the SSTables found in the specified directory, where the parent directories of the path are used for the target keyspace and table name, to the indicated target cluster.
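
To illustrate how the path determines the target, the sketch below extracts the last two path components in the way the paragraph above describes. The helper name sstable_target is hypothetical; it only mirrors the naming rule and is not how sstableloader is implemented.

```shell
# Print the keyspace and table names implied by an SSTable directory path.
# Usage: sstable_target PATH
sstable_target() {
  local path="${1%/}"   # drop a trailing slash, if any
  local table keyspace
  table="$(basename "$path")"                   # last directory: table name
  keyspace="$(basename "$(dirname "$path")")"   # its parent: keyspace name
  echo "keyspace=${keyspace} table=${table}"
}

# Example:
#   sstable_target /var/lib/cassandra/data/Keyspace1/Standard1
```

Running such a check before the load makes it easier to spot a path whose final directories do not name the intended keyspace and table.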

Verify that the location of the SSTables to be streamed ends with directories named for the keyspace and table:
ls /var/lib/cassandra/data/Keyspace1/Standard1/snapshot/snapshot_name
Keyspace1-Standard1-jb-60-CRC.db
Keyspace1-Standard1-jb-60-Data.db
...
Keyspace1-Standard1-jb-60-TOC.txt

For more options, see Table 1 (sstableloader options).

Using sstableloader

  1. If you are restoring snapshot data from another source, make sure that the snapshot files are in a keyspace/tablename/snapshot/snapshot_name directory path whose names match those of the target keyspace and table. In this example, make sure the snapshot files are in /Keyspace1/Standard1/snapshot/snapshot_name.
  2. Go to the location of the SSTables:
    Cassandra package installations:
    cd /var/lib/cassandra/data/Keyspace1/Standard1/snapshot/snapshot_name
    Cassandra tarball installations:
    cd install_location/data/data/Keyspace1/Standard1/snapshot/snapshot_name
  3. List the SSTable files in the directory:
    ls
    Keyspace1-Standard1-jb-60-CRC.db
    Keyspace1-Standard1-jb-60-Data.db
    ...
    Keyspace1-Standard1-jb-60-TOC.txt
  4. To bulk load the files, specify one or more nodes in the target cluster with the -d flag, which takes a comma-separated list of IP addresses or hostnames, and specify the path to ../Keyspace1/Standard1/snapshot/snapshot_name on the source machine. For example:
    sstableloader -d 110.82.155.1 /var/lib/cassandra/data/Keyspace1/Standard1/snapshot/snapshot_name

    This command bulk loads all files.

Note: To get the best throughput from SSTable loading, you can run multiple instances of sstableloader to stream from multiple machines. There is no hard limit on the number of sstableloader instances that can run at the same time, so you can add loaders until you see no further improvement.
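
As a rough sketch of running several loaders at once, the function below starts one sstableloader process per SSTable directory in the background and waits for all of them. The function name load_in_parallel, the SSTABLELOADER override variable, and the paths are assumptions for illustration; the same idea applies when each instance runs on a separate machine.

```shell
# SSTABLELOADER can be overridden (for example with "echo sstableloader")
# to preview the commands without contacting a cluster.
SSTABLELOADER="${SSTABLELOADER:-sstableloader}"

# Usage: load_in_parallel HOSTS DIR [DIR ...]
# HOSTS is the comma-separated node list passed to -d.
load_in_parallel() {
  local hosts="$1"; shift
  local dir pid pids="" status=0
  for dir in "$@"; do
    $SSTABLELOADER -d "$hosts" "$dir" &   # one background loader per directory
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || status=1               # remember whether any loader failed
  done
  return "$status"
}

# Example (hypothetical address and paths):
#   load_in_parallel 10.0.0.1 /data/Keyspace1/Standard1 /data/Keyspace1/Standard2
```

The return status is nonzero if any loader fails, so a wrapper script can decide whether to retry the affected directories.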

Table 1. sstableloader options
Short option Long option Description
-alg --ssl-alg <ALGORITHM> Client SSL algorithm (default: SunX509).
-ap --auth-provider <auth provider class name> Allows the use of a third party auth provider. Can be combined with -u <username> and -pw <password> if the auth provider supports plain text credentials.
-ciphers --ssl-ciphers <CIPHER-SUITES> Client SSL. Comma-separated list of encryption suites.
-cph --connections-per-host <connectionsPerHost> Number of concurrent connections-per-host.
-d --nodes <initial_hosts> Required. Connect to a list of (comma separated) hosts for initial cluster information.
-f --conf-path <path_to_config_file> Path to the cassandra.yaml file, used for streaming throughput and client/server SSL settings.
-h --help Display help.
-i --ignore <NODES> Do not stream to this comma separated list of nodes.
-idct --inter_dc_throttle_mbits <MBPS> Inter-datacenter throttle speed in Megabits per second (default unlimited).
-ks --keystore <KEYSTORE> Client SSL. Full path to the keystore.
-kspw --keystore-password <KEYSTORE-PASSWORD> Client SSL. Password for the keystore. Overrides the client_encryption_options option in cassandra.yaml.

--no-progress Do not display progress.
-p --port <rpc port> RPC port (default: 9160 [Thrift]).
-prtcl --ssl-protocol <PROTOCOL> Client SSL. Connections protocol to use (default: TLS). Overrides the server_encryption_options option in cassandra.yaml.

-pw --password <password> Authentication password.
-sp --storage_port <port_num> Port used for inter-node communication (default 7000).
-ssp --ssl_storage_port Port used for TLS inter-node communication (default 7001).
-st --store-type <STORE-TYPE> Client SSL. Type of store.
-t --throttle <throttle> Throttle speed in megabits (Mb) per second (default: unlimited). Overrides the stream_throughput_outbound_megabits_per_sec option in cassandra.yaml.
-t --throttle_mbits <MBPS> Throttle speed in megabits per second (default: unlimited).
-ts --truststore <TRUSTSTORE> Client SSL. Full path to truststore.
-tspw --truststore-password <TRUSTSTORE-PASSWORD> Client SSL. Password of the truststore.
-u --username <username> User name for authentication.
-v --verbose Verbose output.

The location of the cassandra.yaml file depends on the type of installation:
  • Cassandra package installations: /etc/cassandra/cassandra.yaml
  • Cassandra tarball installations: install_location/cassandra/conf/cassandra.yaml
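
The options in Table 1 can be combined in a single invocation. The wrapper below is an illustrative sketch that uses only -d, -u, -pw, -t, and --no-progress from the table; the function name load_with_options, the SSTABLELOADER override variable, and all argument values are hypothetical.

```shell
# SSTABLELOADER can be overridden (for example with "echo sstableloader")
# to preview the command without contacting a cluster.
SSTABLELOADER="${SSTABLELOADER:-sstableloader}"

# Usage: load_with_options HOSTS USER PASSWORD THROTTLE_MBITS PATH
load_with_options() {
  local hosts="$1" user="$2" password="$3" throttle_mbits="$4" path="$5"
  # -d: initial hosts, -u/-pw: credentials, -t: throttle in megabits per second,
  # --no-progress: suppress the progress display (useful when logging to a file).
  $SSTABLELOADER -d "$hosts" -u "$user" -pw "$password" \
    -t "$throttle_mbits" --no-progress "$path"
}

# Example (hypothetical values):
#   load_with_options 10.0.0.1,10.0.0.2 loader secret 100 /data/Keyspace1/Standard1
```

Setting a throttle keeps a bulk load from saturating the network of a cluster that is also serving live traffic; leave -t out to use the unlimited default.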