Installing DataStax Bulk Loader

Introduction

DataStax Bulk Loader lets you efficiently and reliably load and unload CSV/JSON data in and out of:

  • DataStax Astra DB databases

  • DataStax Enterprise (DSE) 5.1 and 6.8 databases

  • Open source Apache Cassandra® 2.1 and later databases

DataStax Bulk Loader is open-source software. Join the community of developers who contribute to the product! See the public GitHub repo: https://github.com/datastax/dsbulk

DataStax recommends using the latest DataStax Bulk Loader version, which is currently 1.11. DataStax Bulk Loader is supported on Linux, macOS, and Windows platforms.

You can use DataStax Bulk Loader as a standalone tool that connects remotely to a cluster. The tool does not need to run on a cluster node, although it can.

Using DataStax Bulk Loader requires a Java executable, as explained in the post-install requirements and recommendations section.

Installation steps

Use one of the following options to install DSBulk Loader. You agree to the terms of the open-source Apache-2.0 license agreement when you download this DataStax product.

Linux/macOS

  1. Open the releases in the DSBulk Loader GitHub repo. Each release is published as a tarball and a zip file.

  2. Download the package for your OS. DataStax provides a tar file for Linux and macOS.

  3. Unpack the downloaded distribution.

    Linux example:

    tar -xzvf dsbulk-VERSION.tar.gz
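The unpack step can be wrapped in a small shell helper. This is a minimal sketch, not part of DSBulk: the function name unpack_dsbulk and its two arguments are hypothetical, and it assumes the archive contains a single top-level dsbulk-VERSION directory with a bin subdirectory.

```shell
# Hypothetical helper (not shipped with DSBulk).
# Usage: unpack_dsbulk ARCHIVE DEST_DIR
# Extracts ARCHIVE into DEST_DIR and prints the path of the unpacked bin directory.
unpack_dsbulk() {
  archive=$1
  dest=$2
  # Create the destination and extract; stop if either step fails.
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest" || return 1
  # Assumption: the first archive entry is the single top-level directory.
  top=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)
  printf '%s\n' "$dest/$top/bin"
}
```

For example, bin_dir=$(unpack_dsbulk dsbulk-VERSION.tar.gz "$HOME/tools") unpacks the archive under $HOME/tools and stores the resulting bin directory, which you can then add to your PATH.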

Windows

Ensure that Java is installed on your Windows system.

  1. Download the DSBulk Loader zip file for Windows from the DSBulk Loader GitHub repo. Releases are published in various formats. Ensure you select .zip, not .zip.asc.

  2. Extract the contents to a directory. DSBulk attempts to find the Java executable automatically. To specify which Java VM DSBulk uses, define the JAVA_HOME environment variable.

  3. Adjust the port number to match your specific configuration. The default is 9042.

  4. Open a command prompt and navigate to the dsbulk-VERSION\bin directory.

  5. Run the DSBulk commands.

    For example, to load data from a CSV file into your database, you could use the following:

    dsbulk load -url C:\\PATH_TO_CSV -k KEYSPACE_NAME -t TABLE_NAME

    Replace the following:

    • PATH_TO_CSV: The path to the CSV file that you want to load. You must escape all backslashes (\) in Windows paths. For more information, see Escaping command line arguments.

    • KEYSPACE_NAME and TABLE_NAME: The keyspace and table in your database where you want to load the data.
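The backslash rule can be applied mechanically. As a minimal sketch, assuming you prepare the command in a POSIX shell (for example, Git Bash on Windows), the following function doubles every backslash in a path. The helper name escape_backslashes is hypothetical and not part of DSBulk.

```shell
# Hypothetical helper (not part of DSBulk): double every backslash in a path.
escape_backslashes() {
  # printf '%s' prints the argument verbatim; sed replaces each \ with \\.
  printf '%s\n' "$1" | sed 's/\\/\\\\/g'
}
```

Calling escape_backslashes 'C:\data\export.csv' prints C:\\data\\export.csv, which is the form to pass to the -url option.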

DSBulk provides options for data formatting, authentication, performance tuning, and more. Use the dsbulk help command to explore available commands and options. For more information on all available settings, see DSBulk Loader snapshot options.

Post-install requirements and recommendations

Java executable is required

Using DataStax Bulk Loader requires a Java executable.

On Linux, macOS, and other Unix-like systems, the rules used to find a Java executable are applied in this order:

  1. Use $JAVA, if defined

  2. Use ${JAVA_HOME}/bin/java, if JAVA_HOME is defined

  3. Use $(/usr/libexec/java_home)/bin/java, if the java_home utility is available (macOS)

  4. Use the first Java executable found on $PATH
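The lookup order above can be sketched as a shell function. This is an illustration of the documented rules only, not the actual DSBulk launcher script, and the function name find_java is hypothetical.

```shell
# Hypothetical sketch of the documented lookup order (not the real dsbulk script).
# Prints the Java executable that would be chosen, or fails if none is found.
find_java() {
  if [ -n "${JAVA:-}" ]; then
    # Rule 1: an explicit $JAVA wins.
    printf '%s\n' "$JAVA"
  elif [ -n "${JAVA_HOME:-}" ]; then
    # Rule 2: derive the executable from JAVA_HOME.
    printf '%s\n' "$JAVA_HOME/bin/java"
  elif [ -x /usr/libexec/java_home ] && jh=$(/usr/libexec/java_home 2>/dev/null); then
    # Rule 3: ask the macOS java_home utility, if it exists and succeeds.
    printf '%s\n' "$jh/bin/java"
  else
    # Rule 4: fall back to the first java on PATH (non-zero status if absent).
    command -v java
  fi
}
```

Running find_java before invoking dsbulk is a quick way to check which Java executable your current environment would select under these rules.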

On Windows systems, the rules used to find a Java executable are:

  1. Use %JAVA_HOME%\bin\java if defined

  2. Use the first Java executable found on %PATH%

You can pass system properties to the DataStax Bulk Loader process by exporting the DSBULK_JAVA_OPTS environment variable. This can be useful, for example, to configure JMX monitoring or advanced authentication schemes such as Kerberos. For example, on a Linux system:

# Remote JMX configuration
export DSBULK_JAVA_OPTS="$DSBULK_JAVA_OPTS -Dcom.sun.management.jmxremote.port=port-number"
# Kerberos configuration
export DSBULK_JAVA_OPTS="$DSBULK_JAVA_OPTS -Djava.security.krb5.conf=configuration-path-and-filename"
# Invoke DSBulk
bin/dsbulk load -h host1.com -k ks1 -t table1 -url data.csv

Regarding any prior package installs

If you previously used a package install of DSE on the node where you just installed dsbulk, that install included an earlier version of dsbulk, such as 1.9.1. After unpacking the latest version of dsbulk from the standalone tarball, update your PATH so that the new version takes precedence.

For example, on a macOS node, edit your $HOME/.bashrc file, adding a command such as:

export PATH=path-to-unpacked-location/dsbulk-1.11/bin:$PATH

From the command line, source your updated .bashrc, and then verify the dsbulk version. For example:

source ~/.bashrc
dsbulk --version
DataStax Bulk Loader 1.11
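If the version shown is still the older one, the packaged dsbulk is probably earlier on your PATH. The following is a minimal sketch of a check for that; the function name dsbulk_resolves_to is hypothetical, and the argument is the bin directory of your unpacked install.

```shell
# Hypothetical check (not part of DSBulk): succeeds only if the dsbulk found
# first on PATH is the one in the given bin directory.
dsbulk_resolves_to() {
  expected_bin=$1
  # command -v prints the path of the dsbulk that the shell would run.
  resolved=$(command -v dsbulk 2>/dev/null) || return 1
  [ "$resolved" = "$expected_bin/dsbulk" ]
}
```

For example, dsbulk_resolves_to path-to-unpacked-location/dsbulk-1.11/bin || echo "another dsbulk comes first on PATH" reports when the older packaged version still takes precedence.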

What’s next?

Learn how to get started with DataStax Bulk Loader.
