Load and unload vector data

You can use dsbulk commands to load and unload CSV or JSON files that include vector<type, dimension> data.

This guide shows how to use DSBulk to load vector data into, and unload it from, an Astra DB database:

  1. Use the Astra DB CQL console or the standalone cqlsh to create a table with a vector column.

    This guide creates a table named foo in a keyspace named ks1. The table has two columns: column i, an int that serves as the primary key, and column j, a vector with three dimensions.

    token@cqlsh> CREATE TABLE ks1.foo (
        i int PRIMARY KEY,
        j vector<float, 3>
    );
  2. Create a Storage-Attached Index (SAI) on the vector column to enable vector search:

    token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';

    You can also use the Astra DB Data API to load vector data, create vector search indexes, and run vector searches on tables. For more information, see Find data with vector search.
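
    By default, no index options are required. SAI also accepts a similarity_function option if your application compares vectors with a specific metric; the option name and value shown here are an assumption based on the SAI documentation, so verify them against the current Astra DB docs before use:

    ```cql
    -- Same index as above, with an explicit similarity metric
    -- ('similarity_function' and 'cosine' are assumptions; check the SAI docs).
    CREATE CUSTOM INDEX ann_index ON ks1.foo (j)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'similarity_function': 'cosine'};
    ```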

  3. Optional: If you created a new table for this guide, run dsbulk unload to confirm that the ks1.foo table is empty and that you can connect to your Astra DB database:

    bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
    -b "path/to/SCB.zip" -u token -p AstraCS:...
    Result

    The result should show zero rows unloaded:

    ...
    total | failed | rows/s | p50ms | p99ms | p999ms
        0 |      0 |      0 |  0.00 |  0.00 |   0.00
    ...
  4. Load and unload vector data using your preferred file format.

    In Astra DB, vector<type, dimension> is restricted to type float32. Use float type syntax in your JSON and CSV files, such as [8, 2.3, 58] for a vector with three dimensions.

    • CSV

    • JSON

    1. Prepare a sample data file with vector data:

      cat ../vector_test_data.csv
      vector_test_data.csv
      i,j
      1,"[8, 2.3, 58]"
      2,"[1.2, 3.4, 5.6]"
      5,"[23, 18, 3.9]"
    2. Load the data:

      bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
      -b "path/to/SCB.zip" -u token -p AstraCS:...
      Result
      ...
      total | failed | rows/s | p50ms | p99ms | p999ms | batches
          3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
      ...
    3. Unload the data in CSV format:

      bin/dsbulk unload -k ks1 -t foo \
      -b "path/to/SCB.zip" -u token -p AstraCS:...
      Result
      ...
      i,j
      5,"[23.0, 18.0, 3.9]"
      2,"[1.2, 3.4, 5.6]"
      1,"[8.0, 2.3, 58.0]"
      total | failed | rows/s | p50ms | p99ms | p999ms
          3 |      0 |     16 |  2.25 |  2.97 |   2.97
      ...

      You can also use vector data in dsbulk unload -query commands; DSBulk's built-in minimal CQL parser supports this kind of statement. In the following example, note the ORDER BY j ANN OF clause, which performs an approximate nearest neighbor (ANN) vector search in CQL:

      bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
      -b "path/to/SCB.zip" -u token -p AstraCS:...
      Result
      ...
      j
      "[1.2, 3.4, 5.6]"
      total | failed | rows/s | p50ms | p99ms | p999ms
          1 |      0 |      7 |  8.21 |  8.22 |   8.22
      ...
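
    The sample CSV from step 1 can be created with a short shell command. This sketch writes the file to the current directory; the dsbulk commands above reference it as ../vector_test_data.csv relative to the DSBulk installation directory, so adjust the path for your layout. Each vector value must be quoted so that the commas inside the brackets aren't read as CSV field separators:

    ```shell
    # Write the sample data file. Quoting each vector keeps the commas
    # inside the brackets from being treated as CSV column separators.
    printf '%s\n' \
        'i,j' \
        '1,"[8, 2.3, 58]"' \
        '2,"[1.2, 3.4, 5.6]"' \
        '5,"[23, 18, 3.9]"' > vector_test_data.csv
    ```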
    1. Create three sample JSON files with vector data, and store them in the same directory. Each file contains data for one row.

      1. Create a sample JSON file for primary key 1:

        cat ../vector_test_data_json/one.json
        one.json
        {
            "i":1,
            "j":[8, 2.3, 58]
        }
      2. Create a sample JSON file for primary key 2:

        cat ../vector_test_data_json/two.json
        two.json
        {
            "i":2,
            "j":[1.2, 3.4, 5.6]
        }
      3. Create a sample JSON file for primary key 5:

        cat ../vector_test_data_json/five.json
        five.json
        {
            "i":5,
            "j":[23, 18, 3.9]
        }
    2. Load the contents of all three sample JSON files from the directory where you created the files:

      bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
      -b "path/to/SCB.zip" -u token -p AstraCS:...
      Result
      ...
      total | failed | rows/s | p50ms | p99ms | p999ms | batches
          3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
      ...
    3. Unload all the rows to a single JSON file with dsbulk unload:

      bin/dsbulk unload -k ks1 -t foo -c json \
      -b "path/to/SCB.zip" -u token -p AstraCS:...
      Result
      ...
      {"i":5,"j":[23.0,18.0,3.9]}
      {"i":1,"j":[8.0,2.3,58.0]}
      {"i":2,"j":[1.2,3.4,5.6]}
      total | failed | rows/s | p50ms | p99ms | p999ms
          3 |      0 |     14 |  2.58 |  2.87 |   2.87
      ...
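
    The three sample JSON files from step 1 can be created with a few shell commands. This sketch writes them to a vector_test_data_json directory under the current directory; the load command above references that directory as ./../vector_test_data_json relative to the DSBulk installation directory, so adjust the path for your layout:

    ```shell
    # Create the directory and the three one-row JSON files used by the
    # load command above.
    mkdir -p vector_test_data_json
    printf '{\n    "i":1,\n    "j":[8, 2.3, 58]\n}\n' > vector_test_data_json/one.json
    printf '{\n    "i":2,\n    "j":[1.2, 3.4, 5.6]\n}\n' > vector_test_data_json/two.json
    printf '{\n    "i":5,\n    "j":[23, 18, 3.9]\n}\n' > vector_test_data_json/five.json
    ```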
  5. Use the Astra DB CQL console or the standalone cqlsh to verify that the data was loaded correctly:

    • Run a vector search:

      token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;
      Result
        j
       -----------------
        [1.2, 3.4, 5.6]
      
       (1 rows)
    • Select all rows (for small tables only):

      token@cqlsh> select * from ks1.foo;
      Result
        i | j
       ---+-----------------
        5 |   [23, 18, 3.9]
        1 |    [8, 2.3, 58]
        2 | [1.2, 3.4, 5.6]
      
       (3 rows)

© 2025 DataStax, an IBM Company