Load data (create collection)
The DataStax Astra DB Serverless (Vector) documentation site is currently in Public Preview and is provided on an “AS IS” basis, without warranty or indemnity of any kind. For more, see the DataStax Preview Terms.
Load data into a new collection in your database. You can also import a sample dataset to get started right away.
To load data, you must have one of the following roles:
Load your own dataset
You can do this in the Astra Portal or with the Python, TypeScript, or Java clients.
Load your own dataset into a new collection from a JSON or a CSV file.
1. Log into the Astra Portal and select your database.
2. Click Load Data.
3. In the Load Data dialog, click Upload dataset.
4. Select the file on your computer that contains your dataset.
Don’t have a dataset handy? Here’s one you can try: movie_openai_100.csv
Once the file finishes uploading, a preview of your data appears.
Your dataset may or may not already contain embeddings. If your dataset doesn’t contain embeddings, you can still configure the collection and later insert the vector data using one of the client APIs.
5. Enter a name for the new collection in the Collection Name field.
6. (Optional) Enable vectors.
If the data importer detects embeddings in your dataset, the Enable Vector toggle automatically turns on and provides you with a pre-populated configuration based on an analysis of your data. If your dataset doesn’t contain embeddings (or if the data importer doesn’t detect them properly), you can still turn on Enable Vector and fill out the configuration manually.
The displayed fields are:

Vector Field/Column
The field or column in your dataset that contains vector embeddings. This field is mapped to the reserved key $vector in the collection.

Dimensions
The length of the vectors generated by your embedding model. Clicking this dropdown reveals a list of common embedding models and their vector sizes.

Similarity Metric
The criterion used by your embedding model to compare vectors.
The available metrics are:
- Cosine Similarity (default)
- Dot Product (recommended for normalized vectors)
- Euclidean Distance

Dot Product and Cosine Similarity are equivalent for normalized vectors, and the Dot Product algorithm is roughly 50% faster, so Dot Product is the recommended metric when your embeddings are normalized. If your embeddings are not normalized, however, Dot Product silently returns meaningless results in queries, which is why it is not the default.
7. Click Load Data.
Your dataset begins loading into the new collection. Once your data finishes loading, you can start interacting with it using the Data Explorer or the client APIs.
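The Dot Product recommendation above can be checked locally. This sketch (plain Python, no Astra client involved; the vector values are illustrative, not real embeddings) normalizes a vector to unit length and confirms that, for unit vectors, dot product and cosine similarity coincide:

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([0.1, 0.15, 0.3, 0.12, 0.05])
b = normalize([0.45, 0.09, 0.01, 0.2, 0.11])

# For unit vectors the denominator in cosine() is 1, so the two metrics agree.
# This is why Dot Product is safe (and faster) only after normalizing.
assert abs(dot(a, b) - cosine(a, b)) < 1e-9
```

If your embedding model does not already emit unit-length vectors, normalizing them like this before insertion lets you choose Dot Product safely.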
Use the Python client to load documents into your database.
# Assumes `col` is an existing collection object created with your Python client.
documents = [
{
"_id": "1",
"text": "ChatGPT integrated sneakers that talk to you",
"$vector": [0.1, 0.15, 0.3, 0.12, 0.05],
},
{
"_id": "2",
"text": "An AI quilt to help you sleep forever",
"$vector": [0.45, 0.09, 0.01, 0.2, 0.11],
},
{
"_id": "3",
"text": "A deep learning display that controls your mood",
"$vector": [0.1, 0.05, 0.08, 0.3, 0.6],
},
]
res = col.insert_many(documents)
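To illustrate what a similarity query over the collection computes, here is a purely local cosine-similarity ranking of the three sample documents against a query vector (plain Python; the query vector is made up, and no Astra client calls are made):

```python
import math

# The same three sample documents inserted above.
documents = [
    {"_id": "1", "$vector": [0.1, 0.15, 0.3, 0.12, 0.05]},
    {"_id": "2", "$vector": [0.45, 0.09, 0.01, 0.2, 0.11]},
    {"_id": "3", "$vector": [0.1, 0.05, 0.08, 0.3, 0.6]},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = [0.1, 0.05, 0.1, 0.3, 0.55]  # made-up query embedding

# Rank documents by similarity to the query, best match first.
ranked = sorted(documents, key=lambda d: cosine(d["$vector"], query), reverse=True)
```

With this query vector the closest document is `_id` "3", whose vector nearly matches it; a vector search against the real collection performs the same kind of ranking server-side.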
Use the TypeScript client to load documents into your database.
// Assumes `col` is an existing collection object created with your TypeScript client.
const documents = [
{
"_id": "1",
"text": "ChatGPT integrated sneakers that talk to you",
"$vector": [0.1, 0.15, 0.3, 0.12, 0.05],
},
{
"_id": "2",
"text": "An AI quilt to help you sleep forever",
"$vector": [0.45, 0.09, 0.01, 0.2, 0.11],
},
{
"_id": "3",
"text": "A deep learning display that controls your mood",
"$vector": [0.1, 0.05, 0.08, 0.3, 0.6],
}
];
const res = await col.insertMany(documents);
Use the Java client to load documents into your database.
// Assumes `col` is an existing collection object created with your Java client.
col.insertMany(List.of(
new JsonDocument()
.id("1")
.put("text", "ChatGPT integrated sneakers that talk to you")
.vector(new float[]{0.1f, 0.15f, 0.3f, 0.12f, 0.05f}),
new JsonDocument()
.id("2")
.put("text", "An AI quilt to help you sleep forever")
.vector(new float[]{0.45f, 0.09f, 0.01f, 0.2f, 0.11f}),
new JsonDocument()
.id("3")
.put("text", "A deep learning display that controls your mood")
.vector(new float[]{0.1f, 0.05f, 0.08f, 0.3f, 0.6f})
));
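Whichever client you use, every $vector you insert must have the same length as the Dimensions value configured on the collection; mismatched lengths are a common cause of insert errors. A small pre-insert check (plain Python sketch; `EXPECTED_DIM` is a stand-in for your collection's configured dimension):

```python
EXPECTED_DIM = 5  # stand-in for the Dimensions value configured on the collection

documents = [
    {"_id": "1", "$vector": [0.1, 0.15, 0.3, 0.12, 0.05]},
    {"_id": "2", "$vector": [0.45, 0.09, 0.01, 0.2, 0.11]},
    {"_id": "3", "$vector": [0.1, 0.05, 0.08, 0.3, 0.6]},
]

def mismatched(docs, dim):
    """Return the _ids of documents whose $vector length differs from dim."""
    return [d["_id"] for d in docs if len(d.get("$vector", [])) != dim]

bad = mismatched(documents, EXPECTED_DIM)
# An empty list means every document matches the collection's dimension.
```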
Load a sample dataset
Various sample datasets are available to help you explore features and test your applications. Loading a sample dataset creates a pre-configured collection in your database.
1. Log into the Astra Portal and select your database.
2. Click Load Data.
3. In the Load Data dialog, click Import sample dataset.
4. Use the Data Type dropdown to select the type of data you’d like to work with.
5. Click Load Data.
The sample dataset begins loading into the new collection. Once the data loads, you can interact with it using the Data Explorer or the client APIs.