Loading and unloading vector data
DSBulk 1.11 adds support for the vector data type when used with Astra DB Serverless (Vector) databases.
Using vector data
Once your Astra DB database is Active in the Astra Portal, select it from the databases in the left menu.
In this example, the database is named myastra_with_vector_search and the keyspace is ks1.
-
On your database’s CQL Console tab in the Astra Portal, let’s start by creating a table named foo in the keyspace:
token@cqlsh> CREATE TABLE ks1.foo ( i int PRIMARY KEY, j vector<float, 3> );
-
Create a Storage-Attached Index:
token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';
-
Enter a query, which shows we have not yet added data:
token@cqlsh> select * from ks1.foo;
Result
i | j
---+---
(0 rows)
With the vector support added in DSBulk 1.11, we can now use dsbulk commands with CSV or JSON data that includes the vector<type, dimension> data type.
Examples from the command line, with CSV data
Before you use dsbulk commands with an Astra DB database, recall that you’ll need to specify the path to the database’s Secure Connect Bundle file (-b) and the username and password (-u and -p). The parameter placeholders are shown in the examples below.
-
Let’s enter a dsbulk unload command to confirm that the ks1.foo table is empty:
bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
total | failed | rows/s | p50ms | p99ms | p999ms
0 | 0 | 0 | 0.00 | 0.00 | 0.00
...
-
Prepare a sample data file:
cat ../vector_test_data.csv
Result
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
-
Load the data:
bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
3 | 0 | 22 | 5.10 | 6.91 | 6.91 | 1.00
...
-
Unload the data in CSV format:
bin/dsbulk unload -k ks1 -t foo \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
i,j
5,"[23.0, 18.0, 3.9]"
2,"[1.2, 3.4, 5.6]"
1,"[8.0, 2.3, 58.0]"
total | failed | rows/s | p50ms | p99ms | p999ms
3 | 0 | 16 | 2.25 | 2.97 | 2.97
...
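The sample CSV file used in the steps above can be produced with a short shell snippet. This is a minimal sketch: the DATA_FILE variable is a hypothetical name introduced here, and it defaults to the current directory rather than the ../vector_test_data.csv path used above, so adjust it to match your layout. The awk check assumes each vector is written as a bracketed, comma-separated list, and confirms that every row has exactly 3 components to match the vector<float, 3> column.

```shell
# Hypothetical helper: write the three sample rows used in this walkthrough.
# DATA_FILE is an assumed variable name, not a DSBulk setting.
DATA_FILE="${DATA_FILE:-vector_test_data.csv}"

cat > "$DATA_FILE" <<'EOF'
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
EOF

# Sanity check before loading: every data row must carry exactly 3 components,
# because the column is declared as vector<float, 3>.
awk 'NR > 1 {
  start = index($0, "[")
  end = index($0, "]")
  body = substr($0, start + 1, end - start - 1)
  if (split(body, parts, ",") != 3) {
    print "line " NR ": expected 3 vector components"
    bad = 1
  }
}
END { exit (bad ? 1 : 0) }' "$DATA_FILE"
```

Running the check first is a design choice: it reports a malformed row by line number before any load is attempted.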
Added vector support in the query parameter
DSBulk 1.11 adds support for vectors in dsbulk unload -query commands.
DSBulk includes a built-in minimal CQL parser, which now accepts queries that contain vector literals, such as an ANN search.
For example:
bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
1 | 0 | 7 | 8.21 | 8.22 | 8.22
...
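To see why the query returns [1.2, 3.4, 5.6], the three sample vectors can be ranked by brute force. The sketch below assumes the index compares vectors by cosine similarity. The CREATE CUSTOM INDEX statement in this walkthrough does not set a similarity function, so treat that as an assumption. The function name nearest_row is hypothetical. This is an exact local calculation over the sample rows, not the approximate search the database performs.

```shell
# Hypothetical helper: rank the three sample rows against a query vector
# by cosine similarity (an assumed metric) and print the best row's key.
nearest_row() {
  awk -v q="$1" '
    BEGIN {
      # Split the query vector and accumulate its squared norm.
      n = split(q, qv, " ")
      for (k = 1; k <= n; k++) qnorm += qv[k] * qv[k]
    }
    {
      # Field 1 is the primary key i; fields 2..n+1 are the components of j.
      dot = 0; vnorm = 0
      for (k = 1; k <= n; k++) {
        dot += qv[k] * $(k + 1)
        vnorm += $(k + 1) * $(k + 1)
      }
      sim = dot / (sqrt(qnorm) * sqrt(vnorm))
      if (NR == 1 || sim > best) { best = sim; id = $1 }
    }
    END { print id }
  ' <<'EOF'
1 8 2.3 58
2 1.2 3.4 5.6
5 23 18 3.9
EOF
}

# Query vector from the dsbulk unload -query example.
echo "nearest row: i=$(nearest_row "3.4 7.8 9.1")"
```

For the query vector above, the row with i=2 ranks highest, which agrees with the unload result.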
Examples from the command line, with JSON data
The preliminary steps are the same as in the CSV section above. After the database, keyspace, table, and SAI index have been created, we can proceed with a few sample JSON data files.
Sample JSON file for primary key 1:
cat ../vector_test_data_json/one.json
Result
{
"i":1,
"j":[8, 2.3, 58]
}
Sample JSON file for primary key 2:
cat ../vector_test_data_json/two.json
Result
{
"i":2,
"j":[1.2, 3.4, 5.6]
}
Sample JSON file for primary key 5:
cat ../vector_test_data_json/five.json
Result
{
"i":5,
"j":[23, 18, 3.9]
}
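The three sample files shown above can be recreated with a short shell snippet. This is a minimal sketch: the DATA_DIR variable is a hypothetical name introduced here, and it defaults to a directory under the current location rather than the ../vector_test_data_json path used in this walkthrough, so adjust it to match your layout.

```shell
# Hypothetical helper: recreate the three sample JSON files, one row per file.
# DATA_DIR is an assumed variable name, not a DSBulk setting.
DATA_DIR="${DATA_DIR:-vector_test_data_json}"
mkdir -p "$DATA_DIR"

cat > "$DATA_DIR/one.json" <<'EOF'
{
"i":1,
"j":[8, 2.3, 58]
}
EOF

cat > "$DATA_DIR/two.json" <<'EOF'
{
"i":2,
"j":[1.2, 3.4, 5.6]
}
EOF

cat > "$DATA_DIR/five.json" <<'EOF'
{
"i":5,
"j":[23, 18, 3.9]
}
EOF
```

Keeping the files together in one directory lets a single load command point at the directory instead of at each file.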
Run some dsbulk commands with the preceding JSON files.
-
Load all three rows from the JSON files in the specified directory:
bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
3 | 0 | 16 | 37.18 | 39.58 | 39.58 | 1.00
...
-
Unload the data in JSON format with dsbulk unload:
bin/dsbulk unload -k ks1 -t foo -c json \
-b "path/to/secure-connect-database_name.zip" -u database_user -p database_password
Result
...
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
total | failed | rows/s | p50ms | p99ms | p999ms
3 | 0 | 14 | 2.58 | 2.87 | 2.87
...
Verification in the Astra Portal
In the Astra Portal, from your database’s CQL Console, queries return expected results. For example:
token@cqlsh> select * from ks1.foo;
Result
i | j
---+-----------------
5 | [23, 18, 3.9]
1 | [8, 2.3, 58]
2 | [1.2, 3.4, 5.6]
(3 rows)
token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;
Result
j
-----------------
[1.2, 3.4, 5.6]
(1 rows)