Load your data
After you create a collection or table, you can load your data.
To load data, you must have the appropriate permissions, such as the Database Administrator role. To load data with the Data API, you need an application token with sufficient permissions.
Load vector data into a collection
You can load structured and unstructured vector data into a vector-enabled collection.
Load structured JSON or CSV data
You can use the Astra Portal or the Data API to load data from a JSON or CSV file into a collection.
Load a sample vector dataset
You can use sample datasets to explore features and test your applications before loading your own data.
In the Astra Portal, you can create a sample collection that automatically loads a sample dataset:
-
In the Astra Portal navigation menu, select your Serverless (Vector) database, and then click Data Explorer.
-
In the Keyspace field, select a keyspace that has no collections. If you don’t have an empty keyspace, create a keyspace for the sample dataset.
-
Click Create Sample Collection.
After you load the sample dataset, you can interact with it using the Data Explorer or the Data API.
You can also do the following:
-
Download the movie_openai_100.csv sample dataset (3.5MB), and manually load it into a collection. To match the vector data in the sample dataset, the collection’s Embedding generation method must be Bring my own and the Dimension must be 1536.
-
Load a sample dataset with the Data API. For an example script, see Astra DB Serverless quickstart.
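Before you manually load a CSV file, it can help to confirm that every embedding has the dimension the collection expects. The following is a minimal sketch, assuming the embeddings are serialized as JSON arrays in a column named `vector`. The column name and the `find_dimension_mismatches` helper are hypothetical, so adjust them to match your file.

```python
import csv
import io
import json


def find_dimension_mismatches(csv_file, vector_column="vector", expected_dimension=1536):
    """Return (row_number, actual_dimension) for rows whose embedding has the wrong length.

    Assumption: each embedding is stored as a JSON array in `vector_column`.
    """
    mismatches = []
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=1):
        embedding = json.loads(row[vector_column])
        if len(embedding) != expected_dimension:
            mismatches.append((row_number, len(embedding)))
    return mismatches


# Tiny in-memory example that targets 3 dimensions instead of 1536.
sample = io.StringIO('title,vector\n"A","[0.1, 0.2, 0.3]"\n"B","[0.4, 0.5]"\n')
print(find_dimension_mismatches(sample, expected_dimension=3))
```

An empty result suggests that the file matches the collection's Dimension setting. Any reported rows would likely trigger a dimension mismatch during loading.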
-
Astra Portal
-
Python
-
TypeScript
-
Java
-
curl
-
In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.
-
Click Data Explorer.
-
Select the keyspace and collection where you want to load data.
-
Click Load Data.
-
If your collection can load unstructured data, click Structured data.
-
Click Select File, and then select the JSON or CSV file that contains your dataset.
After the file uploads, the first ten rows of your data appear in the Data Preview section.
If you get a Selected embedding does not match collection dimensions error, you need to create a new collection with vector dimensions that match your dataset.
-
If your collection has an Astra DB vectorize integration, use the Vector Field to select the field to use to generate embeddings.
The data importer applies the top-level $vectorize key to the selected field, and automatically generates an embedding vector from its contents. The resulting documents in the collection store the actual text in the $vectorize field and the resulting embedding in the $vector field. Documents in the collection do not retain the original field name for the $vectorize field.
For more information, see Vector and vectorize.
-
Optional for CSV files only: In the Data Preview section, select the data type for each field or column. The available types are String, Number, Array, Object, and Vector.
If the data importer detects embeddings in your dataset, it automatically assigns the Vector data type to that field or column. Each collection can have only one vector field.
These data type selections only apply to the initial data that you load, with the exception of Vector, which permanently maps the field to the reserved key $vector. Data type selections aren’t fixed in the schema, and they don’t apply to documents inserted later. For example, the same field can be a string in one document and a number in another. You can also have different sets of fields in different documents in the same collection.
-
Click Load Data.
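The Vector Field behavior described in the preceding steps can be pictured as a simple key rename. The following is an illustrative sketch only, not the data importer's actual implementation. The `apply_vectorize_key` helper and the sample field names are hypothetical.

```python
def apply_vectorize_key(document, vector_field):
    """Return a copy of `document` with `vector_field` moved to the `$vectorize` key.

    Mirrors the documented outcome: the text is kept under `$vectorize`,
    and the original field name is not retained.
    """
    converted = dict(document)
    converted["$vectorize"] = converted.pop(vector_field)
    return converted


row = {"title": "Smart sneakers", "summary": "Chat bot integrated sneakers that talk to you"}
print(apply_vectorize_key(row, "summary"))
```

In this sketch, the `summary` text ends up under `$vectorize`, and the `summary` key no longer exists in the converted document.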
The Data API syntax depends on your embedding generation method and other configurations.
For more information and options, see insertMany and Vector and vectorize.
-
Bring my own embeddings
-
Use an embedding provider integration
# Insert documents with embeddings into the collection.
documents = [
{
"text": "Chat bot integrated sneakers that talk to you",
"$vector": [0.1, 0.15, 0.3, 0.12, 0.05],
},
{
"text": "An AI quilt to help you sleep forever",
"$vector": [0.45, 0.09, 0.01, 0.2, 0.11],
},
{
"text": "A deep learning display that controls your mood",
"$vector": [0.1, 0.05, 0.08, 0.3, 0.6],
},
]
insertion_result = collection.insert_many(documents)
print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")
When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.
# Insert documents into the collection.
# (UUIDs here are version 7.)
documents = [
{
"$vectorize": "Chat bot integrated sneakers that talk to you",
},
{
"$vectorize": "An AI quilt to help you sleep forever",
},
{
"$vectorize": "A deep learning display that controls your mood",
},
]
insertion_result = collection.insert_many(documents)
print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")
The Data API syntax depends on your embedding generation method and other configurations.
For more information and options, see insertMany and Vector and vectorize.
-
Bring my own embeddings
-
Use an embedding provider integration
// Insert documents with embeddings into the collection.
const documents = [
{
idea: 'Chat bot integrated sneakers that talk to you',
$vector: [0.1, 0.15, 0.3, 0.12, 0.05],
},
{
idea: 'An AI quilt to help you sleep forever',
$vector: [0.45, 0.09, 0.01, 0.2, 0.11],
},
{
idea: 'A deep learning display that controls your mood',
$vector: [0.1, 0.05, 0.08, 0.3, 0.6],
},
];
const inserted = await collection.insertMany(documents);
console.log(`* Inserted ${inserted.insertedCount} items.`);
When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.
// Insert documents into the collection (using UUIDv7s)
const documents = [
{
$vectorize: 'Chat bot integrated sneakers that talk to you',
},
{
$vectorize: 'An AI quilt to help you sleep forever',
},
{
$vectorize: 'A deep learning display that controls your mood',
},
];
try {
const inserted = await collection.insertMany(documents);
console.log(`* Inserted ${inserted.insertedCount} items.`);
} catch (e) {
console.log('* Documents found on DB already. Let\'s move on!');
}
The Data API syntax depends on your embedding generation method and other configurations.
For more information and options, see insertMany and Vector and vectorize.
-
Bring my own embeddings
-
Use an embedding provider integration
// Insert documents with embeddings into the collection
collection.insertMany(
new Document("1")
.append("text", "Chat bot integrated sneakers that talk to you")
.vector(new float[]{0.1f, 0.15f, 0.3f, 0.12f, 0.05f}),
new Document("2")
.append("text", "An AI quilt to help you sleep forever")
.vector(new float[]{0.45f, 0.09f, 0.01f, 0.2f, 0.11f}),
new Document("3")
.append("text", "A deep learning display that controls your mood")
.vector(new float[]{0.1f, 0.05f, 0.08f, 0.3f, 0.6f}));
System.out.println("Inserted documents into the collection");
When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.
// Insert documents into the collection
InsertManyResult insertResult = collection.insertMany(
new Document()
.vectorize("Chat bot integrated sneakers that talk to you"),
new Document()
.vectorize("An AI quilt to help you sleep forever"),
new Document()
.vectorize("A deep learning display that controls your mood")
);
System.out.println("Inserted " + insertResult.getInsertedIds().size() + " items.");
The Data API syntax depends on your embedding generation method and other configurations.
For more information and options, see insertMany and Vector and vectorize.
-
Bring my own embeddings
-
Use an embedding provider integration
# Insert documents with embeddings into the collection
curl -sS -L -X POST "$ASTRA_DB_API_ENDPOINT/api/json/v1/default_keyspace/vector_test" \
--header "Token: $ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"insertMany": {
"documents": [
{
"text": "Chat bot integrated sneakers that talk to you",
"$vector": [0.1, 0.15, 0.3, 0.12, 0.05]
},
{
"text": "An AI quilt to help you sleep forever",
"$vector": [0.45, 0.09, 0.01, 0.2, 0.11]
},
{
"text": "A deep learning display that controls your mood",
"$vector": [0.1, 0.05, 0.08, 0.3, 0.6]
}
],
"options": {
"ordered": false
}
}
}' | jq
When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.
# Insert documents into the collection and generate embeddings.
curl -sS -L -X POST "$ASTRA_DB_API_ENDPOINT/api/json/v1/default_keyspace/COLLECTION_NAME" \
--header "Token: $ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"insertMany": {
"documents": [
{
"$vectorize": "Chat bot integrated sneakers that talk to you"
},
{
"$vectorize": "An AI quilt to help you sleep forever"
},
{
"$vectorize": "A deep learning display that controls your mood"
}
],
"options": {
"ordered": false
}
}
}' | jq
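For comparison with the curl examples, here is a minimal sketch that assembles the same insertMany request in Python using only the standard library. The `build_insert_many_request` helper and the placeholder endpoint are hypothetical. The URL path, the `Token` header, and the payload shape are copied from the curl commands above.

```python
import json
import urllib.request


def build_insert_many_request(api_endpoint, token, keyspace, collection, documents):
    """Build (but do not send) an HTTP request for the insertMany command."""
    url = f"{api_endpoint}/api/json/v1/{keyspace}/{collection}"
    payload = {
        "insertMany": {
            "documents": documents,
            "options": {"ordered": False},
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Token": token, "Content-Type": "application/json"},
        method="POST",
    )


request = build_insert_many_request(
    "https://example.invalid",  # placeholder for your database API endpoint
    "APPLICATION_TOKEN",        # placeholder for your application token
    "default_keyspace",
    "vector_test",
    [{"$vectorize": "Chat bot integrated sneakers that talk to you"}],
)
print(request.get_method(), request.full_url)
# To send the request, pass it to urllib.request.urlopen(request).
```

Building the request separately from sending it makes it easy to inspect the URL and payload before any data reaches the database.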
After loading data, you can interact with it using the Data Explorer or the Data API.
Load unstructured vector data
This Astra DB Serverless feature is currently in public preview. Development is ongoing, and the features and functionality are subject to change. Astra DB Serverless, and the use of such, is subject to the DataStax Preview Terms.
The Astra DB Unstructured.io integration transforms your unstructured PDF files into structured vector data. Astra DB processes your files with Unstructured Open Source and loads the resulting JSON or CSV data into your collection.
The Unstructured data loader integration has the following requirements and limitations:
-
You need a Serverless (Vector) database.
-
You need a vector-enabled collection that uses an Astra DB vectorize integration. This integration is not available for collections where the Embedding generation method is Bring my own.
-
This integration is available in the Astra Portal only.
-
This integration supports PDF files only. It does not extract images in PDFs, but it can extract related text, such as captions.
To process PDFs with the Unstructured data loader integration, do the following:
-
In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.
-
Click Data Explorer.
-
Select the keyspace and collection where you want to load data.
-
Click Load Data.
-
Click Unstructured data.
-
Click Select File, and then select up to 10 PDF files to process and load.
You can load unstructured and structured vector data into the same collection, but you can’t load JSON and CSV files together with unstructured file types. You must upload JSON and CSV files separately through the Structured data option.
-
Optional: Modify the chunking configuration:
-
Chunk max characters: Set the maximum chunk length. Unstructured splits oversized chunks to fit this limit. The default is 500. For more information, see chunk_max_characters.
Your embedding model must support the chunk size you set. If the chunk size is larger than your model supports, an error occurs when you try to load data.
-
Chunk character overlap: Set a chunk prefix of the last n characters from the prior chunk. This applies to oversized chunks only. The default is 0. For more information, see chunk_overlap.
-
Click Load Data.
-
Wait while Astra DB processes your files. If necessary, you can cancel processing for any files that have not started processing. Once a file begins processing, you can’t cancel it.
During processing, Astra DB does the following:
-
Uses Unstructured Open Source to process your file into structured data.
-
Loads the resulting JSON or CSV data into your collection.
-
Uses your collection’s Astra DB vectorize integration to generate embeddings based on the extracted text field.
-
Discards the original PDF files.
After loading data, you can interact with it using the Data Explorer or the Data API.
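To build intuition for the Chunk max characters and Chunk character overlap settings, the following sketch splits one oversized piece of text into chunks, where each later chunk starts with the last characters of the previous one. This is a simplified illustration, not Unstructured's actual chunking algorithm, and the `chunk_text` helper is hypothetical.

```python
def chunk_text(text, max_characters=500, overlap=0):
    """Split `text` into chunks of at most `max_characters` characters.

    Each chunk after the first begins with the last `overlap` characters
    of the previous chunk.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be at least 0 and less than max_characters")

    chunks = []
    step = max_characters - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", max_characters=4, overlap=1))
```

With these inputs, the sketch produces `['abcd', 'defg', 'ghij']`. A larger overlap preserves more context between neighboring chunks, at the cost of producing more chunks to embed.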
Count records
After you load data, the Data Explorer in the Astra Portal shows a count of Records in the collection:
-
For collections with fewer than 1,000 documents, Astra DB shows an exact number.
-
For collections with 1,000 or more documents, Astra DB shows an approximate number of records.
Astra DB doesn’t count documents in real time. Particularly after loading many documents, allow time for the Records count to refresh.
To run these counts on demand, you can use the Data API countDocuments and estimatedDocumentCount commands.
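The exact-or-approximate behavior can be sketched as a small helper that tries an exact count first and falls back to an estimate. This sketch assumes a collection object that exposes `count_documents(filter, upper_bound=...)`, which raises when the count exceeds `upper_bound`, and `estimated_document_count()`, as in the Data API Python client. Verify both signatures against your client version. The `count_records` helper and the `FakeCollection` stand-in are hypothetical.

```python
def count_records(collection, exact_limit=999):
    """Return (count, is_exact) for a collection-like object."""
    try:
        return collection.count_documents({}, upper_bound=exact_limit), True
    except Exception:
        # Assumption: the client raises when the collection holds more than
        # `exact_limit` documents. In real code, catch the client's specific
        # exception so that unrelated failures are not hidden.
        return collection.estimated_document_count(), False


class FakeCollection:
    """Stand-in used only to demonstrate the helper without a database."""

    def __init__(self, size):
        self.size = size

    def count_documents(self, filter, upper_bound):
        if self.size > upper_bound:
            raise RuntimeError("too many documents to count exactly")
        return self.size

    def estimated_document_count(self):
        return round(self.size, -2)  # pretend the estimate is coarse


print(count_records(FakeCollection(42)))
print(count_records(FakeCollection(12345)))
```

The second element of the returned tuple tells the caller whether to present the number as exact or approximate.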
Load data into a non-vector collection
To load non-vector data into a collection that isn’t vector enabled, do the following:
-
In the Astra Portal, go to Databases, and then select your Serverless (Non-Vector) database.
-
Click Load Data.
-
In the Data Loader, click Select File and choose a CSV to upload. Wait for the upload to complete.
If you don’t have a CSV file, you can click Load a sample dataset and then select a sample dataset.
-
Click Next.
-
Optional: Change the Table Name.
-
Review the data types for each column and choose the correct data type from the drop-down if you see any that aren’t correct.
-
From the Partition keys drop-down, select the columns to use as the partition keys.
-
Optional: From the Clustering columns drop-down, select the columns to use as the clustering columns.
-
Click Next.
-
Optional: Click the Target Database drop-down and select a different database if you want to upload the dataset to a different database from the one you started with.
-
Click the Target Keyspace drop-down and select the keyspace where you want to create your table.
-
Click Finish.
You receive an email notification when the data import is complete.
Load data into a table
You can use CQL or the Data API to manage table data. For more information, see Cassandra Query Language (CQL) for Astra DB and Work with tables.
Load data with DSBulk
If your CSV file is more than 40MB, you can upload data with the DataStax Bulk Loader (DSBulk).
DSBulk provides commands like dsbulk load, dsbulk unload, and dsbulk count, along with extensive options.
For more information, see the DataStax Bulk Loader reference.
-
Download the dsbulk installation file. DSBulk 1.11.0 or later is required to support the vector CQL data type. The following command automatically downloads the latest DSBulk version:
curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
Results
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   242  100   242    0     0    681      0 --:--:-- --:--:-- --:--:--   691
100 40.4M  100 40.4M    0     0  20.7M      0  0:00:01  0:00:01 --:--:-- 31.6M
-
Extract the DSBulk archive:
tar -xzvf dsbulk.tar.gz
Results
This example uses DSBulk version 1.11.0:
x dsbulk-1.11.0/README.md
x dsbulk-1.11.0/LICENSE.txt
x dsbulk-1.11.0/manual/
x dsbulk-1.11.0/manual/driver.template.conf
x dsbulk-1.11.0/manual/settings.md
x dsbulk-1.11.0/manual/application.template.conf
x dsbulk-1.11.0/bin/dsbulk
x dsbulk-1.11.0/bin/dsbulk.cmd
x dsbulk-1.11.0/conf/
x dsbulk-1.11.0/conf/driver.conf
x dsbulk-1.11.0/conf/application.conf
x dsbulk-1.11.0/THIRD-PARTY.txt
x dsbulk-1.11.0/lib/java-driver-core-4.17.0.jar
x dsbulk-1.11.0/lib/native-protocol-1.5.1.jar
x dsbulk-1.11.0/lib/netty-handler-4.1.94.Final.jar
. . .
x dsbulk-1.11.0/lib/lz4-java-1.8.0.jar
x dsbulk-1.11.0/lib/snappy-java-1.1.7.3.jar
x dsbulk-1.11.0/lib/jansi-1.18.jar
-
To verify the installation, run the following command in the same directory where you extracted DSBulk:
dsbulk-VERSION/bin/dsbulk --version
Results
DataStax Bulk Loader v1.11.0
-
Create an application token with the Administrator User role, and then store the token securely.
-
If you haven’t done so already, create a database. You can use either a Serverless (Vector) or Serverless (Non-Vector) database.
-
Download the database’s Secure Connect Bundle (SCB).
-
Create a table in your database:
-
In the Astra Portal, go to your database, and click CQL Console.
-
When the token@cqlsh> prompt appears, select the keyspace where you want to create the table:
use KEYSPACE_NAME;
-
Create a table to load a sample dataset:
CREATE TABLE KEYSPACE_NAME.world_happiness_report_2021 (
  country_name text,
  regional_indicator text,
  ladder_score float,
  gdp_per_capita float,
  social_support float,
  healthy_life_expectancy float,
  generosity float,
  PRIMARY KEY (country_name)
);
If you want to load your own data, replace world_happiness_report_2021 with your own table name, and then adjust the column names and data types for your data.
-
To load the sample dataset, download the World Happiness Report 2021 sample dataset. This is a small sample dataset, but DSBulk can load, unload, and count extremely large files.
DSBulk can also load vector data. For more information, see Loading and unloading vector data with DSBulk. If you need a sample vector dataset, you can download the movie_openai_100.csv sample dataset (3.5MB).
-
Use DSBulk to load data into the table:
dsbulk-VERSION/bin/dsbulk load -url PATH_TO_CSV_FILE -k KEYSPACE_NAME \
-t TABLE_NAME -b PATH_TO_SECURE_CONNECT_BUNDLE -u token \
-p APPLICATION_TOKEN
Results
Operation directory: /path/to/directory/log/LOAD ...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
149 | 0 | 400 | 106.65 | 187.70 | 191.89 | 1.00
-
After the upload completes, you can query the loaded data from the CQL Console:
-
Serverless (Vector) databases:
SELECT * FROM KEYSPACE_NAME.world_happiness_report_2021;
-
Serverless (Non-Vector) databases:
SELECT * FROM KEYSPACE_NAME.world_happiness_report_2021 LIMIT 1;
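If you script the load step, it can be safer to assemble the DSBulk command as an argument list than as one shell string. The following is a minimal sketch. The `build_dsbulk_load_command` helper and all paths and values are hypothetical placeholders. The flags are the same ones used in the dsbulk load command shown earlier.

```python
import shlex


def build_dsbulk_load_command(dsbulk_path, csv_path, keyspace, table, bundle_path, token):
    """Return the argument list for a `dsbulk load` invocation (does not run it)."""
    return [
        dsbulk_path, "load",
        "-url", csv_path,
        "-k", keyspace,
        "-t", table,
        "-b", bundle_path,
        "-u", "token",  # the literal user name "token", as in the command above
        "-p", token,
    ]


command = build_dsbulk_load_command(
    "dsbulk-1.11.0/bin/dsbulk",
    "world_happiness_report_2021.csv",
    "default_keyspace",
    "world_happiness_report_2021",
    "secure-connect-bundle.zip",
    "APPLICATION_TOKEN",
)
print(shlex.join(command))
# To run it: import subprocess; subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with paths that contain spaces. Avoid printing the command in real scripts, because it includes the application token.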