Loading and unloading vector data
DSBulk Loader 1.11 adds support for the vector data type when used with Astra DB databases created with the Vector Search feature. If you have not yet created an Astra DB database with Vector Search, see Astra Vector Search in the Astra DB documentation.
Using vector data
Once your Astra DB is Active in Astra Portal, select it from the databases in the left menu. In this example, the created database was named myastra_with_vector_search and the keyspace is ks1.
- On your database’s CQL Console tab in Astra Portal, let’s start by creating a table named foo in the keyspace:

  token@cqlsh> CREATE TABLE ks1.foo ( i int PRIMARY KEY, j vector<float, 3> );
- Create a Storage-Attached Index:

  token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';
- Enter a query, which shows that we have not yet added data:

  token@cqlsh> select * from ks1.foo;

  Results:

   i | j
  ---+---

  (0 rows)
With the vector support added in DSBulk Loader 1.11, we can now use dsbulk commands with CSV or JSON data that includes the vector<type, dimension> data type.
Examples from the command line, with CSV data
Before you use dsbulk commands with an Astra DB database, recall that you’ll need to specify the path to the database’s Secure Connect Bundle file and the username/password. The parameter placeholders are shown in the examples below.
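To avoid retyping the bundle path and credentials in each command, one convenient option is to keep them in shell variables. This is only a sketch; the variable names below (SCB_PATH, ASTRA_USER, ASTRA_PASS) are illustrative and have no special meaning to dsbulk.

  # Illustrative convenience variables for the examples in this topic:
  SCB_PATH="path/to/secure-connect-database_name.zip"
  ASTRA_USER="database_user"
  ASTRA_PASS="database_password"

  # Any dsbulk command below can then be written as, for example:
  bin/dsbulk unload -k ks1 -t foo -b "$SCB_PATH" -u "$ASTRA_USER" -p "$ASTRA_PASS"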
- Let’s enter a dsbulk unload command to confirm that the ks1.foo table is empty:

  bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

  Results:

  ...
  total | failed | rows/s | p50ms | p99ms | p999ms
      0 |      0 |      0 |  0.00 |  0.00 |   0.00
  ...
- This is our sample data file:

  cat ../vector_test_data.csv

  Results:

  i,j
  1,"[8, 2.3, 58]"
  2,"[1.2, 3.4, 5.6]"
  5,"[23, 18, 3.9]"
- Let’s load the data (a quick row-count check with dsbulk count is sketched after this list):

  bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

  Results:

  ...
  total | failed | rows/s | p50ms | p99ms | p999ms | batches
      3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
  ...
- Unload the data in CSV format:

  bin/dsbulk unload -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

  Results:

  ...
  i,j
  5,"[23.0, 18.0, 3.9]"
  2,"[1.2, 3.4, 5.6]"
  1,"[8.0, 2.3, 58.0]"
  total | failed | rows/s | p50ms | p99ms | p999ms
      3 |      0 |     16 |  2.25 |  2.97 |   2.97
  ...
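As mentioned in the load step, you can optionally confirm the number of rows written without unloading any data. DSBulk also ships a count workflow; assuming it is available in your DSBulk version, a quick sanity check looks like this:

  bin/dsbulk count -k ks1 -t foo \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

If the load above succeeded, the reported total should be 3.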
Added vector support in the -query parameter
DSBulk Loader 1.11 also adds support for vectors in dsbulk unload -query commands. It provides a built-in minimal CQL parser, which allows you to perform this kind of operation. Example:
  bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

  Results:

  ...
  j
  "[1.2, 3.4, 5.6]"
  total | failed | rows/s | p50ms | p99ms | p999ms
      1 |      0 |      7 |  8.21 |  8.22 |   8.22
  ...
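The same pattern works for other ANN queries. As a sketch (this variation is not part of the original walkthrough), the following command selects both columns and returns the two nearest neighbors:

  bin/dsbulk unload -query "select i, j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 2" \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password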
Examples from the command line, with JSON data
The preliminary steps are the same as in the CSV section above. After the database, keyspace, table, and SAI index have been created, we can proceed with a few sample JSON data files.
Sample JSON file for primary key 1:

  cat ../vector_test_data_json/one.json

  Results:

  {
    "i":1,
    "j":[8, 2.3, 58]
  }
Sample JSON file for primary key 2:

  cat ../vector_test_data_json/two.json

  Results:

  {
    "i":2,
    "j":[1.2, 3.4, 5.6]
  }
Sample JSON file for primary key 5:

  cat ../vector_test_data_json/five.json

  Results:

  {
    "i":5,
    "j":[23, 18, 3.9]
  }
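If you want to recreate these three sample files yourself, a minimal shell sketch such as the following produces them; the ../vector_test_data_json directory and file names simply mirror the examples above.

  # Create the sample directory and write one JSON document per file:
  mkdir -p ../vector_test_data_json
  printf '{ "i":1, "j":[8, 2.3, 58] }\n'    > ../vector_test_data_json/one.json
  printf '{ "i":2, "j":[1.2, 3.4, 5.6] }\n' > ../vector_test_data_json/two.json
  printf '{ "i":5, "j":[23, 18, 3.9] }\n'   > ../vector_test_data_json/five.json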
Let’s run some dsbulk commands with those JSON files.
- Load all three rows from the JSON files in the specified directory:

  bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

  Results:

  ...
  total | failed | rows/s | p50ms | p99ms | p999ms | batches
      3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
  ...
- Unload with dsbulk unload (a variation that writes the output to files on disk is sketched after this list):

  bin/dsbulk unload -k ks1 -t foo -c json \
    -b "path/to/secure-connect-database_name.zip" -u database_user -p database_password

  Results:

  ...
  {"i":5,"j":[23.0,18.0,3.9]}
  {"i":1,"j":[8.0,2.3,58.0]}
  {"i":2,"j":[1.2,3.4,5.6]}
  total | failed | rows/s | p50ms | p99ms | p999ms
      3 |      0 |     14 |  2.58 |  2.87 |   2.87
  ...
Verification in Astra Portal
Back in Astra Portal, on the CQL Console tab for your database, queries return the expected results. Examples:
- token@cqlsh> select * from ks1.foo;

  Results:

   i | j
  ---+-----------------
   5 | [23, 18, 3.9]
   1 | [8, 2.3, 58]
   2 | [1.2, 3.4, 5.6]

  (3 rows)
- token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;

  Results:

   j
  -----------------
   [1.2, 3.4, 5.6]

  (1 rows)
What’s next?
For more, see Astra Vector Search in the Astra DB documentation.