Load and unload vector data
You can use dsbulk commands with CSV or JSON data that includes vector<type, dimension> values.
This guide shows how to use DSBulk to load and unload vector data from an Astra DB database.
Create a table with a vector column and a vector index
- Use the Astra DB CQL console or standalone cqlsh to create a table with a vector column.
  This guide creates a table named foo in a keyspace named ks1. The table has two columns: column i is an integer and the primary key, and column j is a vector with three dimensions.
      token@cqlsh> CREATE TABLE ks1.foo ( i int PRIMARY KEY, j vector<float, 3> );
- Create a Storage-Attached Index (SAI) on the vector column to enable vector search:
      token@cqlsh> CREATE CUSTOM INDEX ann_index ON ks1.foo (j) USING 'StorageAttachedIndex';
  You can also use the Astra DB Data API to load vector data, create vector search indexes, and run vector searches on tables. For more information, see Find data with vector search.
- Optional: If you created a new table for this guide, run dsbulk unload to confirm that the ks1.foo table is empty and that you can connect to your Astra DB database:
      bin/dsbulk unload -k ks1 -t foo 2> /dev/null \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
  The result should show zero rows unloaded:
  Result
      ...
      total | failed | rows/s | p50ms | p99ms | p999ms
          0 |      0 |      0 |  0.00 |  0.00 |   0.00
      ...
Load vector data
Load and unload vector data using your preferred file format.
Load vector data from a CSV file
- Prepare a sample data file with vector data:
      cat ../vector_test_data.csv
  vector_test_data.csv
      i,j
      1,"[8, 2.3, 58]"
      2,"[1.2, 3.4, 5.6]"
      5,"[23, 18, 3.9]"
- Load the data:
      bin/dsbulk load -url "./../vector_test_data.csv" -k ks1 -t foo \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
  Result
      ...
      total | failed | rows/s | p50ms | p99ms | p999ms | batches
          3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
      ...
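If the sample file does not exist yet, it could be created with a heredoc and checked before loading. The following is a minimal sketch, not part of DSBulk: the DATA_FILE variable, its default path, and the awk dimension check are assumptions added for illustration. The check only counts comma-separated components inside the quoted vector, so it would flag a row that cannot fit the three-dimension column j.

```shell
#!/bin/sh
# Hypothetical output path; the guide itself reads ../vector_test_data.csv.
DATA_FILE="${DATA_FILE:-./vector_test_data.csv}"

# Write the header row and the three sample rows used in this guide.
cat > "$DATA_FILE" <<'EOF'
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
EOF

# Count data rows whose quoted vector does not have exactly three components.
# Splitting on the double quote puts the vector text in field 2.
BAD_ROWS=$(awk -F'"' '
  NR > 1 { n = split($2, parts, ","); if (n != 3) bad++ }
  END    { print bad + 0 }
' "$DATA_FILE")

echo "rows with wrong dimension: $BAD_ROWS"
```

If the count is not zero, fix the file before running dsbulk load so that the rows match the vector<float, 3> column definition.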
Load vector data from a JSON file
- Create three sample JSON files with vector data, and store them in the same directory. Each file contains data for one row.
  - Create a sample JSON file for primary key 1:
        cat ../vector_test_data_json/one.json
    one.json
        { "i":1, "j":[8, 2.3, 58] }
  - Create a sample JSON file for primary key 2:
        cat ../vector_test_data_json/two.json
    two.json
        { "i":2, "j":[1.2, 3.4, 5.6] }
  - Create a sample JSON file for primary key 5:
        cat ../vector_test_data_json/five.json
    five.json
        { "i":5, "j":[23, 18, 3.9] }
- Load the contents of all three sample JSON files from the directory where you created the files:
      bin/dsbulk load -url "./../vector_test_data_json" -k ks1 -t foo -c json \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
  Result
      ...
      total | failed | rows/s | p50ms | p99ms | p999ms | batches
          3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
      ...
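The three JSON files described in the first step could be generated in one pass. This is a sketch under assumptions: the JSON_DIR variable and its default location are hypothetical, and the guide itself reads the files from ../vector_test_data_json.

```shell
#!/bin/sh
# Hypothetical directory; point dsbulk load -url at whichever directory you use.
JSON_DIR="${JSON_DIR:-./vector_test_data_json}"
mkdir -p "$JSON_DIR"

# One file per row, matching the sample data in this guide.
printf '%s\n' '{ "i":1, "j":[8, 2.3, 58] }'    > "$JSON_DIR/one.json"
printf '%s\n' '{ "i":2, "j":[1.2, 3.4, 5.6] }' > "$JSON_DIR/two.json"
printf '%s\n' '{ "i":5, "j":[23, 18, 3.9] }'   > "$JSON_DIR/five.json"

# List the generated files so you can confirm all three exist before loading.
ls "$JSON_DIR"
```

Keeping only the sample files in the directory matters here because the load command in this guide is pointed at the whole directory rather than at a single file.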
Verify that the data was written to the table
Use the Astra DB CQL console or standalone cqlsh to read from the table and verify that the data was loaded correctly:
- Run a vector search
      token@cqlsh> select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1;
  Result
       j
      -----------------
       [1.2, 3.4, 5.6]
      (1 rows)
- Select all rows (small tables only)
      token@cqlsh> select * from ks1.foo;
  Result
       i | j
      ---+-----------------
       5 | [23, 18, 3.9]
       1 | [8, 2.3, 58]
       2 | [1.2, 3.4, 5.6]
      (3 rows)
Unload vector data
Unload rows to a CSV or JSON file using the dsbulk unload command.
Unload in CSV format
- Unload all rows in CSV format
      bin/dsbulk unload -k ks1 -t foo \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
  Result
      ...
      i,j
      5,"[23.0, 18.0, 3.9]"
      2,"[1.2, 3.4, 5.6]"
      1,"[8.0, 2.3, 58.0]"
      total | failed | rows/s | p50ms | p99ms | p999ms
          3 |      0 |     16 |  2.25 |  2.97 |   2.97
      ...
- Unload specific rows with dsbulk unload -query
  The -query parameter accepts a CQL statement that selects specific rows to unload. The built-in minimal Cassandra Query Language (CQL) parser supports these operations.
  For tables with vector data, you can use a vector search (ann keyword) to select specific rows to unload. For example:
      bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
  Result
      ...
      j
      "[1.2, 3.4, 5.6]"
      total | failed | rows/s | p50ms | p99ms | p999ms
          1 |      0 |      7 |  8.21 |  8.22 |   8.22
      ...
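The unloaded rows print whole numbers as floats (for example, 8 becomes 8.0) and can arrive in a different order than they were loaded, so a plain text diff against the source file would report differences even when the data matches. The sketch below shows one way to compare the two after normalizing them. It is an illustration only: the normalize function and the file names loaded_sample.csv and unloaded_sample.csv are hypothetical, and the second file is a copy of the rows from the unload result in this guide, standing in for real output.

```shell
#!/bin/sh
# Stand-in for the file that was loaded.
cat > ./loaded_sample.csv <<'EOF'
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
EOF

# Stand-in for unloaded output, copied from the result shown in this guide.
cat > ./unloaded_sample.csv <<'EOF'
i,j
5,"[23.0, 18.0, 3.9]"
2,"[1.2, 3.4, 5.6]"
1,"[8.0, 2.3, 58.0]"
EOF

# Print each data row as plain numbers (so 8 and 8.0 compare equal), sorted.
normalize() {
  awk 'NR > 1 {
    gsub(/"/, ""); gsub(/\[/, ""); gsub(/\]/, ""); gsub(/ /, "")
    n = split($0, f, ",")
    line = f[1] + 0
    for (k = 2; k <= n; k++) line = line "," (f[k] + 0)
    print line
  }' "$1" | sort
}

if [ "$(normalize ./loaded_sample.csv)" = "$(normalize ./unloaded_sample.csv)" ]; then
  ROUND_TRIP="match"
else
  ROUND_TRIP="mismatch"
fi
echo "round trip: $ROUND_TRIP"
```

This comparison relies on awk's default number formatting, so it suits short sample vectors like these rather than values that need full float precision.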
Unload in JSON format
- Unload all rows in JSON format
      bin/dsbulk unload -k ks1 -t foo \
        -c json \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
  Result
      ...
      {"i":5,"j":[23.0,18.0,3.9]}
      {"i":1,"j":[8.0,2.3,58.0]}
      {"i":2,"j":[1.2,3.4,5.6]}
      total | failed | rows/s | p50ms | p99ms | p999ms
          3 |      0 |     14 |  2.58 |  2.87 |   2.87
      ...
- Unload specific rows with dsbulk unload -query
  The -query parameter accepts a CQL statement that selects specific rows to unload.
  For tables with vector data, you can use a vector search (ann keyword) to select specific rows to unload. For example:
      bin/dsbulk unload -query "select j from ks1.foo order by j ann of [3.4, 7.8, 9.1] limit 1" \
        -c json \
        -c json \
        -b "path/to/SCB.zip" -u token -p AstraCS:...
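The JSON unload result contains one document per line, while the load example in this guide uses one file per row. If you want to turn unloaded rows back into per-row files, a small loop is enough. This is a sketch under assumptions: the file name unloaded_sample.json, the OUT_DIR variable, and the row_<key>.json naming are hypothetical, and the input is a copy of the rows from the result shown in this guide.

```shell
#!/bin/sh
# Stand-in for unloaded JSON output, copied from the result shown in this guide.
cat > ./unloaded_sample.json <<'EOF'
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
EOF

# Hypothetical output directory for the per-row files.
OUT_DIR="${OUT_DIR:-./unloaded_rows}"
mkdir -p "$OUT_DIR"

# Write each JSON document to its own file, named after the primary key i.
while IFS= read -r row; do
  key=$(printf '%s\n' "$row" | sed -n 's/.*"i":\([0-9][0-9]*\).*/\1/p')
  if [ -n "$key" ]; then
    printf '%s\n' "$row" > "$OUT_DIR/row_$key.json"
  fi
done < ./unloaded_sample.json

ls "$OUT_DIR"
```

The sed expression assumes a non-negative integer key named i, as in the ks1.foo table; a table with a different primary key would need a different pattern or a real JSON parser.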