Loading and unloading vector data

DSBulk Loader 1.11 adds support for the vector data type when used with Astra DB databases created with the Vector Search feature.

If you have not yet created an Astra database with Vector Search, see the Astra DB documentation.

In Astra DB, the vector<type, dimension> data type currently supports only float32 components. When you declare a vector column in a CQL CREATE TABLE statement, use float as the type, as shown in the example below.

Using vector data

Once your Astra DB is Active in Astra Portal, select it from the databases in the left menu.

In this example, the database is named myastra_with_vector_search and the keyspace is ks1.

  1. On your database’s CQL Console tab in Astra Portal, let’s start by creating a table named foo in the keyspace:

    token@cqlsh> CREATE TABLE ks1.foo (
        i int PRIMARY KEY,
        j vector<float, 3>
    );
  2. Create a Storage-Attached Index:

    token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';
  3. Run a query to confirm that no data has been added yet:

    token@cqlsh> select * from ks1.foo;
     i | j
    ---+---
    
    (0 rows)

With the vector support added in DSBulk 1.11, we can now use dsbulk commands with CSV or JSON data that includes the vector<type, dimension> data type.

Examples from the command line, with CSV data

Before you use dsbulk commands with an Astra DB database, recall that you’ll need to specify the path to the database’s Secure Connect Bundle file and the username/password. The parameter placeholders are shown in the examples below.

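Because every command below repeats those parameters, you may find it convenient to factor them into shell variables first. This is a minimal sketch; the bundle path, username, and password are placeholders to replace with your own values:

# Placeholders for the Secure Connect Bundle path and the database credentials.
SCB="path/to/secure-connect-database_name.zip"
DB_USER="database_user"
DB_PASS="database_password"

# For example, the unload command in step 1 below could then be written as:
bin/dsbulk unload -k ks1 -t foo -b "$SCB" -u "$DB_USER" -p "$DB_PASS"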
  1. Let’s enter a dsbulk unload command to confirm that the ks1.foo table is empty:

    bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms
        0 |      0 |      0 |  0.00 |  0.00 |   0.00
    ...
  2. This is our sample data file:

    cat ../vector_test_data.csv
    i,j
    1,"[8, 2.3, 58]"
    2,"[1.2, 3.4, 5.6]"
    5,"[23, 18, 3.9]"
  3. Let’s load the data:

    bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms | batches
        3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
    ...
  4. Unload the data in CSV format:

    bin/dsbulk unload -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    ...
    i,j
    5,"[23.0, 18.0, 3.9]"
    2,"[1.2, 3.4, 5.6]"
    1,"[8.0, 2.3, 58.0]"
    total | failed | rows/s | p50ms | p99ms | p999ms
        3 |      0 |     16 |  2.25 |  2.97 |   2.97
    ...

Vector support in the -query parameter

DSBulk Loader 1.11 also adds support for vectors in dsbulk unload -query commands.

DSBulk provides a built-in minimal CQL parser, which makes this kind of operation possible. Example:

bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
...
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      7 |  8.21 |  8.22 |   8.22
...

In the SELECT example above, notice the keyword ann. In this context, that’s short for Approximate Nearest Neighbor (ANN). It’s an important feature of Vector Search.

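As an illustrative variation, you can raise the LIMIT in the same unload -query command to return more than one approximate neighbor. This is a sketch only, reusing the ks1.foo table and sample data from the steps above; the rows returned, and their order, depend on the similarity function used by the index, so the exact output will vary:

bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 2" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
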
Examples from the command line, with JSON data

The preliminary steps are the same as in the CSV section above. After the database, keyspace, table, and SAI index have been created, we can proceed with a few sample JSON data files.

Sample JSON file for primary key 1:

cat ../vector_test_data_json/one.json
{
    "i":1,
    "j":[8, 2.3, 58]
}

Sample JSON file for primary key 2:

cat ../vector_test_data_json/two.json
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}

Sample JSON file for primary key 5:

cat ../vector_test_data_json/five.json
{
    "i":5,
    "j":[23, 18, 3.9]
}

Let’s run some dsbulk commands with those JSON files.

  1. Load all three rows from the JSON files in the specified directory:

    bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms | batches
        3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
    ...
  2. Unload with dsbulk unload:

    bin/dsbulk unload -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    ...
    {"i":5,"j":[23.0,18.0,3.9]}
    {"i":1,"j":[8.0,2.3,58.0]}
    {"i":2,"j":[1.2,3.4,5.6]}
    total | failed | rows/s | p50ms | p99ms | p999ms
        3 |      0 |     14 |  2.58 |  2.87 |   2.87
    ...

Verification in Astra Portal

Back in the Astra Portal, on the CQL Console tab for your database, queries return the expected results. Examples:

token@cqlsh> select * from ks1.foo;
  i | j
 ---+-----------------
  5 |   [23, 18, 3.9]
  1 |    [8, 2.3, 58]
  2 | [1.2, 3.4, 5.6]

 (3 rows)

token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;
  j
 -----------------
  [1.2, 3.4, 5.6]

 (1 rows)

What’s next?

For more, see Astra Vector Search in the Astra DB documentation.
