Load data with DSBulk

If your CSV file is larger than 40 MB, load it with the DataStax Bulk Loader (DSBulk). DSBulk provides commands such as dsbulk load, dsbulk unload, and dsbulk count, along with extensive options. For more information, see the DataStax Bulk Loader reference.

  1. Download the dsbulk installation file. DSBulk 1.11.0 or later is required to support the vector CQL data type. The following command automatically downloads the latest DSBulk version:

    curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
    Results
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   242  100   242    0     0    681      0 --:--:-- --:--:-- --:--:--   691
    100 40.4M  100 40.4M    0     0  20.7M      0  0:00:01  0:00:01 --:--:-- 31.6M
  2. Extract the DSBulk archive:

    tar -xzvf dsbulk.tar.gz
    Results

    This example uses DSBulk version 1.11.0:

    x dsbulk-1.11.0/README.md
    x dsbulk-1.11.0/LICENSE.txt
    x dsbulk-1.11.0/manual/
    x dsbulk-1.11.0/manual/driver.template.conf
    x dsbulk-1.11.0/manual/settings.md
    x dsbulk-1.11.0/manual/application.template.conf
    x dsbulk-1.11.0/bin/dsbulk
    x dsbulk-1.11.0/bin/dsbulk.cmd
    x dsbulk-1.11.0/conf/
    x dsbulk-1.11.0/conf/driver.conf
    x dsbulk-1.11.0/conf/application.conf
    x dsbulk-1.11.0/THIRD-PARTY.txt
    x dsbulk-1.11.0/lib/java-driver-core-4.17.0.jar
    x dsbulk-1.11.0/lib/native-protocol-1.5.1.jar
    x dsbulk-1.11.0/lib/netty-handler-4.1.94.Final.jar
       .
       .
       .
    x dsbulk-1.11.0/lib/lz4-java-1.8.0.jar
    x dsbulk-1.11.0/lib/snappy-java-1.1.7.3.jar
    x dsbulk-1.11.0/lib/jansi-1.18.jar
  3. To verify the installation, run the following command from the directory where you extracted DSBulk. Replace VERSION with the version number of the extracted directory, such as 1.11.0:

    dsbulk-VERSION/bin/dsbulk --version
    Results
    DataStax Bulk Loader v1.11.0
  4. Create an application token with the Administrator User role, and then store the token securely.

  5. If you haven’t done so already, create a database.

  6. Download the database’s secure connect bundle.

  7. Create a table in your database:

    1. In the Astra Portal, go to your database, and click CQL Console.

    2. When the token@cqlsh> prompt appears, select the keyspace where you want to create the table:

      use KEYSPACE_NAME;
    3. Create a table to load a sample dataset:

      CREATE TABLE KEYSPACE_NAME.world_happiness_report_2021 (
        country_name text,
        regional_indicator text,
        ladder_score float,
        gdp_per_capita float,
        social_support float,
        healthy_life_expectancy float,
        generosity float,
        PRIMARY KEY (country_name)
      );

      If you want to load your own data, replace world_happiness_report_2021 with your table name, and then adjust the column names and data types for your data.

  8. If you are using the sample table, download the World Happiness Report 2021 sample dataset. This dataset is small, but DSBulk can also load, unload, and count extremely large files.

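  9. Use DSBulk to load data into the table. Replace the uppercase placeholders with your own values. The username passed with -u is the literal string token: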

    dsbulk-VERSION/bin/dsbulk load -url PATH_TO_CSV_FILE -k KEYSPACE_NAME \
    -t TABLE_NAME -b PATH_TO_SECURE_CONNECT_BUNDLE -u token \
    -p APPLICATION_TOKEN
    Results
    Operation directory: /path/to/directory/log/LOAD ...
    total | failed | rows/s |  p50ms |  p99ms | p999ms | batches
      149 |      0 |    400 | 106.65 | 187.70 | 191.89 |    1.00
  10. After the upload completes, you can query the loaded data from the CQL Console:

    SELECT * FROM KEYSPACE_NAME.world_happiness_report_2021;
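
The version check in step 3 can also be scripted, which may be useful if you need to confirm that an installed DSBulk meets the 1.11.0 minimum for the vector CQL data type. The following is a minimal sketch, not part of DSBulk: the function names version_at_least and parse_dsbulk_version are hypothetical helpers, and the sketch assumes the --version output has the form shown in step 3 (DataStax Bulk Loader v1.11.0).

```shell
# Hypothetical helpers (not part of DSBulk) for checking the installed version.

# Print the dotted version number found in a line such as
# "DataStax Bulk Loader v1.11.0". Assumes the output format shown in step 3.
parse_dsbulk_version() {
  printf '%s\n' "$1" | sed -n 's/.*v\([0-9][0-9.]*\).*/\1/p'
}

# Return 0 if version $1 is greater than or equal to version $2.
# Components are compared numerically, so 1.11.0 is newer than 1.9.0.
version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    n = split(have, h, "[.]")
    m = split(want, w, "[.]")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      a = (i <= n) ? h[i] + 0 : 0   # missing components count as 0
      b = (i <= m) ? w[i] + 0 : 0
      if (a > b) exit 0
      if (a < b) exit 1
    }
    exit 0
  }'
}

# Example use (commented out so the sketch has no side effects):
#   installed=$(parse_dsbulk_version "$(dsbulk-VERSION/bin/dsbulk --version)")
#   version_at_least "$installed" 1.11.0 || echo "DSBulk 1.11.0 or later is required" >&2
```

The comparison is numeric per component rather than a plain string comparison, because a string comparison would rank 1.9.0 above 1.11.0.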

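If you run the load command from step 9 repeatedly, a small wrapper can catch missing inputs before DSBulk starts. The following sketch is one possible approach under stated assumptions: check_load_inputs and run_load are hypothetical function names, and ASTRA_TOKEN is an assumed environment variable that holds your application token. Only the dsbulk load options already shown in step 9 are used.

```shell
# Hypothetical wrapper around the load command from step 9.
# ASTRA_TOKEN is an assumed environment variable name, not a DSBulk setting.

# Usage: check_load_inputs CSV_FILE SECURE_CONNECT_BUNDLE
# Return 1 with a message on stderr if an input is missing.
check_load_inputs() {
  if [ ! -f "$1" ]; then
    echo "CSV file not found: $1" >&2
    return 1
  fi
  if [ ! -f "$2" ]; then
    echo "Secure connect bundle not found: $2" >&2
    return 1
  fi
  if [ -z "${ASTRA_TOKEN:-}" ]; then
    echo "ASTRA_TOKEN is not set" >&2
    return 1
  fi
}

# Usage: run_load DSBULK_BIN CSV_FILE KEYSPACE TABLE SECURE_CONNECT_BUNDLE
# Validate the inputs, then run the same command as step 9.
run_load() {
  check_load_inputs "$2" "$5" || return 1
  "$1" load -url "$2" -k "$3" -t "$4" -b "$5" -u token -p "$ASTRA_TOKEN"
}

# Example use (commented out so the sketch has no side effects):
#   export ASTRA_TOKEN=APPLICATION_TOKEN
#   run_load dsbulk-VERSION/bin/dsbulk PATH_TO_CSV_FILE KEYSPACE_NAME \
#     world_happiness_report_2021 PATH_TO_SECURE_CONNECT_BUNDLE
```

Reading the token from an environment variable keeps it out of your shell history, although it is still passed to DSBulk with -p as in step 9.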

© 2024 DataStax | Privacy policy | Terms of use
