Loading and unloading vector data

DSBulk 1.11 adds support for the vector data type when used with Astra DB Serverless (Vector) databases.

In Astra DB, the vector<type, dimension> data type is currently restricted to 32-bit floats. When you declare a vector column in a CQL CREATE TABLE statement, use float as the type, as shown in the example below.

Using vector data

Once your Astra DB database is Active in the Astra Portal, select it from the list of databases in the left menu.

In this example, the database is named myastra_with_vector_search and the keyspace is ks1.

  1. On your database’s CQL Console tab in the Astra Portal, let’s start by creating a table named foo in the keyspace:

    token@cqlsh> CREATE TABLE ks1.foo (
        i int PRIMARY KEY,
        j vector<float, 3>
    );
  2. Create a Storage-Attached Index (SAI) on the vector column:

    token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';
  3. Run a query to confirm that the table does not yet contain data:

    token@cqlsh> select * from ks1.foo;
    Result
     i | j
    ---+---
    
    (0 rows)
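
Optionally, you can confirm that the column accepts vector literals by inserting and then removing a test row directly in CQL. This is a sketch for illustration only; the row (primary key 9) is not part of the DSBulk examples that follow, so delete it if you add it:

token@cqlsh> INSERT INTO ks1.foo (i, j) VALUES (9, [1.0, 2.0, 3.0]);
token@cqlsh> DELETE FROM ks1.foo WHERE i = 9;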

With the vector support added in DSBulk 1.11, we can now use dsbulk commands with CSV or JSON data that includes the vector<type, dimension> data type.

Examples from the command line, with CSV data

Before you use dsbulk commands with an Astra DB database, you must specify the path to the database’s Secure Connect Bundle file (-b) and the database username (-u) and password (-p). Parameter placeholders are shown in the examples below.
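
Rather than retyping those values in every command, you can set shell variables once per session, as in this minimal sketch (the variable names are illustrative; substitute your own values):

SCB="path/to/secure-connect-database_name.zip"
DB_USER=database_user
DB_PASS=database_password

Each command can then reference them, for example: bin/dsbulk unload -k ks1 -t foo -b "$SCB" -u "$DB_USER" -p "$DB_PASS". The examples below keep the full placeholders for clarity.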

  1. Let’s enter a dsbulk unload command to confirm that the ks1.foo table is empty:

    bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms
        0 |      0 |      0 |  0.00 |  0.00 |   0.00
    ...
  2. Prepare a sample data file (a sketch that creates this file appears after these steps):

    cat ../vector_test_data.csv
    Result
    i,j
    1,"[8, 2.3, 58]"
    2,"[1.2, 3.4, 5.6]"
    5,"[23, 18, 3.9]"
  3. Load the data:

    bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms | batches
        3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
    ...
  4. Unload the data in CSV format:

    bin/dsbulk unload -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    i,j
    5,"[23.0, 18.0, 3.9]"
    2,"[1.2, 3.4, 5.6]"
    1,"[8.0, 2.3, 58.0]"
    total | failed | rows/s | p50ms | p99ms | p999ms
        3 |      0 |     16 |  2.25 |  2.97 |   2.97
    ...
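
If you don’t already have the sample file from step 2, one way to create it, assuming the same relative path used above, is this sketch:

cat > ../vector_test_data.csv <<'EOF'
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
EOF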

Vector support in the -query parameter

DSBulk 1.11 also adds support for vectors in dsbulk unload -query commands. DSBulk provides a built-in, minimal CQL parser that allows it to handle this kind of operation. For example:

bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      7 |  8.21 |  8.22 |   8.22
...

In the SELECT example above, notice the keyword ann, which is short for Approximate Nearest Neighbor (ANN). ANN ordering returns the rows whose vectors are closest to the given query vector, and it is a core feature of Vector Search.
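
The same -query syntax accepts other projections and limits. For example, this sketch, using the same connection placeholders as above, asks for the three nearest vectors instead of one:

bin/dsbulk unload -query "select i, j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 3" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password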

Examples from the command line, with JSON data

The preliminary steps are the same as in the CSV section above. With the database, keyspace, table, and SAI in place, we can proceed with a few sample JSON data files.

Sample JSON file for primary key 1:

cat ../vector_test_data_json/one.json
Result
{
    "i":1,
    "j":[8, 2.3, 58]
}

Sample JSON file for primary key 2:

cat ../vector_test_data_json/two.json
Result
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}

Sample JSON file for primary key 5:

cat ../vector_test_data_json/five.json
Result
{
    "i":5,
    "j":[23, 18, 3.9]
}
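
If you’re following along and need to create these three files, a minimal sketch that assumes the same relative paths is:

mkdir -p ../vector_test_data_json
cat > ../vector_test_data_json/one.json <<'EOF'
{ "i": 1, "j": [8, 2.3, 58] }
EOF
cat > ../vector_test_data_json/two.json <<'EOF'
{ "i": 2, "j": [1.2, 3.4, 5.6] }
EOF
cat > ../vector_test_data_json/five.json <<'EOF'
{ "i": 5, "j": [23, 18, 3.9] }
EOF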

Run some dsbulk commands with the preceding JSON files.

  1. Load all three rows from the JSON files in the specified directory:

    bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    total | failed | rows/s | p50ms | p99ms | p999ms | batches
        3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
    ...
  2. Unload the data in JSON format:

    bin/dsbulk unload -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
    Result
    ...
    {"i":5,"j":[23.0,18.0,3.9]}
    {"i":1,"j":[8.0,2.3,58.0]}
    {"i":2,"j":[1.2,3.4,5.6]}
    total | failed | rows/s | p50ms | p99ms | p999ms
        3 |      0 |     14 |  2.58 |  2.87 |   2.87
    ...
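
In the examples above, unloaded records print to stdout together with the operation summary. To write them to files instead, you can point -url at a destination directory, as in this sketch (the out_json directory name is illustrative):

bin/dsbulk unload -k ks1 -t foo -c json -url ./out_json \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password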

Verification in the Astra Portal

In the Astra Portal, on your database’s CQL Console tab, verify that queries return the expected results. For example:

token@cqlsh> select * from ks1.foo;
Result
  i | j
 ---+-----------------
  5 |   [23, 18, 3.9]
  1 |    [8, 2.3, 58]
  2 | [1.2, 3.4, 5.6]

 (3 rows)
token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;
Result
  j
 -----------------
  [1.2, 3.4, 5.6]

 (1 rows)
