Loading and unloading vector data

DSBulk 1.11 adds support for the vector data type when used with Astra DB Serverless (Vector) databases.

In Astra DB, the vector<type, dimension> data type is currently restricted to 32-bit floats. When you declare a vector column in a CQL CREATE TABLE statement, use float as the type, as shown in the example below.

Using vector data

Once your Astra DB database is Active in the Astra Portal, select it from the list of databases in the left menu.

In this example, the database is named myastra_with_vector_search and the keyspace is ks1.

  1. On your database’s CQL Console tab in the Astra Portal, let’s start by creating a table named foo in the keyspace:

    token@cqlsh> CREATE TABLE ks1.foo (
        i int PRIMARY KEY,
        j vector<float, 3>
    );
  2. Create a Storage-Attached Index (SAI) on the vector column:

    token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';
  3. Run a query to confirm that the table does not yet contain data:

    token@cqlsh> select * from ks1.foo;
    Result
     i | j
    ---+---
    
    (0 rows)
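
Optionally, you can confirm that the column accepts vector literals by inserting and then removing a test row directly in CQL. This is a sketch for illustration only; the row (primary key 9) is not part of the DSBulk examples that follow, so delete it if you add it:

token@cqlsh> INSERT INTO ks1.foo (i, j) VALUES (9, [1.0, 2.0, 3.0]);
token@cqlsh> DELETE FROM ks1.foo WHERE i = 9;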

With the vector support added in DSBulk 1.11, we can now use dsbulk commands with CSV or JSON data that includes the vector<type, dimension> data type.

Examples from the command line, with CSV data

Before you use dsbulk commands with an Astra DB database, you must specify the path to the database’s Secure Connect Bundle file (-b) and the database username (-u) and password (-p). Parameter placeholders are shown in the examples below.
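
Rather than retyping those values in every command, you can set shell variables once per session, as in this minimal sketch (the variable names are illustrative; substitute your own values):

SCB="path/to/secure-connect-database_name.zip"
DB_USER=database_user
DB_PASS=database_password

Each command can then reference them, for example: bin/dsbulk unload -k ks1 -t foo -b "$SCB" -u "$DB_USER" -p "$DB_PASS". The examples below keep the full placeholders for clarity.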

  1. Let’s enter a dsbulk unload command to confirm that the ks1.foo table is empty:

    bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms
        0 |      0 |      0 |  0.00 |  0.00 |   0.00
    ...
  2. Prepare a sample data file (a sketch that creates this file appears after these steps):

    cat ../vector_test_data.csv
    Result
    i,j
    1,"[8, 2.3, 58]"
    2,"[1.2, 3.4, 5.6]"
    5,"[23, 18, 3.9]"
  3. Load the data:

    bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms | batches
        3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
    ...
  4. Unload the data in CSV format:

    bin/dsbulk unload -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    i,j
    5,"[23.0, 18.0, 3.9]"
    2,"[1.2, 3.4, 5.6]"
    1,"[8.0, 2.3, 58.0]"
    total | failed | rows/s | p50ms | p99ms | p999ms
        3 |      0 |     16 |  2.25 |  2.97 |   2.97
    ...
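
If you don’t already have the sample file from step 2, one way to create it, assuming the same relative path used above, is this sketch:

cat > ../vector_test_data.csv <<'EOF'
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
EOF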

Vector support in the -query parameter

DSBulk 1.11 also adds support for vectors in dsbulk unload -query commands. DSBulk provides a built-in, minimal CQL parser that allows it to handle this kind of operation. For example:

bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      7 |  8.21 |  8.22 |   8.22
...

In the SELECT example above, notice the keyword ann, which is short for Approximate Nearest Neighbor (ANN). ANN ordering returns the rows whose vectors are closest to the given query vector, and it is a core feature of Vector Search.
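
The same -query syntax accepts other projections and limits. For example, this sketch, using the same connection placeholders as above, asks for the three nearest vectors instead of one:

bin/dsbulk unload -query "select i, j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 3" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password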

Examples from the command line, with JSON data

The preliminary steps are the same as in the CSV section above. With the database, keyspace, table, and SAI in place, we can proceed with a few sample JSON data files.

Sample JSON file for primary key 1:

cat ../vector_test_data_json/one.json
Result
{
    "i":1,
    "j":[8, 2.3, 58]
}

Sample JSON file for primary key 2:

cat ../vector_test_data_json/two.json
Result
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}

Sample JSON file for primary key 5:

cat ../vector_test_data_json/five.json
Result
{
    "i":5,
    "j":[23, 18, 3.9]
}
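
If you’re following along and need to create these three files, a minimal sketch that assumes the same relative paths is:

mkdir -p ../vector_test_data_json
cat > ../vector_test_data_json/one.json <<'EOF'
{ "i": 1, "j": [8, 2.3, 58] }
EOF
cat > ../vector_test_data_json/two.json <<'EOF'
{ "i": 2, "j": [1.2, 3.4, 5.6] }
EOF
cat > ../vector_test_data_json/five.json <<'EOF'
{ "i": 5, "j": [23, 18, 3.9] }
EOF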

Run some dsbulk commands with the preceding JSON files.

  1. Load all three rows from the JSON files in the specified directory:

    bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms | batches
        3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
    ...
  2. Unload the data in JSON format:

    bin/dsbulk unload -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    {"i":5,"j":[23.0,18.0,3.9]}
    {"i":1,"j":[8.0,2.3,58.0]}
    {"i":2,"j":[1.2,3.4,5.6]}
    total | failed | rows/s | p50ms | p99ms | p999ms
        3 |      0 |     14 |  2.58 |  2.87 |   2.87
    ...
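
In the examples above, unloaded records print to stdout together with the operation summary. To write them to files instead, you can point -url at a destination directory, as in this sketch (the out_json directory name is illustrative):

bin/dsbulk unload -k ks1 -t foo -c json -url ./out_json \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password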

Verification in the Astra Portal

In the Astra Portal, on your database’s CQL Console tab, verify that queries return the expected results. For example:

token@cqlsh> select * from ks1.foo;
Result
  i | j
 ---+-----------------
  5 |   [23, 18, 3.9]
  1 |    [8, 2.3, 58]
  2 | [1.2, 3.4, 5.6]

 (3 rows)
token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;
Result
  j
 -----------------
  [1.2, 3.4, 5.6]

 (1 rows)
