Create a collection
Creates a new collection in a Serverless (Vector) database.
Ready to write code? See the examples for this method to get started. If you are new to the Data API, check out the quickstart.
Result
- Python
- TypeScript
- Java
- curl
Creates a collection with the specified parameters.
Returns a Collection object. You can use this object to work with documents in the collection.
Unless you specify the document_type parameter, the collection is typed as Collection[dict]. For more information, see Typing support.
Example response:
Collection(name="COLLECTION_NAME", keyspace="default_keyspace", database.api_endpoint="API_ENDPOINT", api_options=FullAPIOptions(token=StaticTokenProvider("APPLICATION_TOKEN"...), ...))
Creates a collection with the specified parameters.
Returns a promise that resolves to a Collection<Schema> object. You can use this object to work with documents in the collection.
A Collection is typed as Collection<Schema>, where Schema defaults to SomeDoc (Record<string, any>). Providing a specific Schema type enables stronger typing for collection operations. For more information, see Typing Collections and Tables.
Creates a collection with the specified parameters.
Returns a Collection object. You can use this object to work with documents in the collection.
Creates a collection with the specified parameters.
If the command succeeds, the response indicates success.
Example response:
{
"status": {
"ok": 1
}
}
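Clients that call the HTTP API directly can verify this response programmatically. A minimal Python sketch, using only the standard library and assuming the raw response body has been captured as a string:

```python
import json

# Raw body returned by a successful createCollection call
body = '{ "status": { "ok": 1 } }'

response = json.loads(body)

# A successful createCollection response carries status.ok == 1
if response.get("status", {}).get("ok") == 1:
    print("collection created")
```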
Parameters
You cannot edit a collection’s definition after you create the collection.
- Python
- TypeScript
- Java
- curl
The signature of this method changed in Python client version 2.0. If you are using an earlier version, DataStax recommends upgrading to the latest version. For more information, see Data API client upgrade guide.
Use the create_collection method, which belongs to the astrapy.Database class.
Method signature
create_collection(
name: str,
*,
definition: CollectionDefinition | dict[str, Any] | None,
document_type: type[Any],
keyspace: str,
collection_admin_timeout_ms: int,
embedding_api_key: str | EmbeddingHeadersProvider,
spawn_api_options: APIOptions,
) -> Collection
Most astrapy objects have an asynchronous counterpart for use within the asyncio framework. To get an AsyncCollection, use the create_collection method of an AsyncDatabase instance, or alternatively the to_async method of the synchronous Collection class. See the AsyncCollection client reference for details about the async API.
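For example, the asynchronous flow sketched below mirrors the synchronous call. This is a sketch only: it assumes astrapy 2.x and uses placeholder credentials that you must replace before running.

```python
import asyncio

from astrapy import DataAPIClient


async def main() -> None:
    client = DataAPIClient()
    # get_async_database returns an AsyncDatabase, whose create_collection
    # coroutine must be awaited and returns an AsyncCollection
    async_database = client.get_async_database(
        "API_ENDPOINT",
        token="APPLICATION_TOKEN",
    )
    collection = await async_database.create_collection("COLLECTION_NAME")


asyncio.run(main())
```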
Name | Type | Summary
---|---|---
`name` | `str` | The name of the new collection. Collection names must follow the Data API naming rules.
`definition` | `CollectionDefinition`, `dict`, or `None` | Optional. The full configuration for the collection. You can define the configuration as a `CollectionDefinition` object or as a plain Python dictionary. See Properties of `definition` for details.
`document_type` | `type` | Optional. A formal specifier for the type checker. If provided, the returned object is typed as `Collection[document_type]`. Default: `dict`.
`keyspace` | `str` | Optional. The keyspace in which to create the collection. Default: the working keyspace for the database.
`collection_admin_timeout_ms` | `int` | Optional. A timeout, in milliseconds, to impose on the underlying API request. If not provided, the corresponding default from the API options applies.
`embedding_api_key` | `str` or `EmbeddingHeadersProvider` | Optional. This only applies to collections with a vectorize embedding provider integration. Use this option to provide the API key directly with headers instead of using an API key in the Astra DB KMS. The API key is sent to the Data API for every operation on the collection. It is useful when a vectorize service is configured but no credentials are stored, or when you want to override the stored credentials. For more information, see Auto-generate embeddings with vectorize.
`spawn_api_options` | `APIOptions` | Optional. A complete or partial specification of the APIOptions to override the defaults inherited from the `Database`.
Name | Type | Summary
---|---|---
`vector` | `CollectionVectorOptions` | Optional. The vector configuration for the collection. This includes things like the vector dimension and similarity metric, as well as settings for server-side embedding generation if you want your collection to have vectorize enabled. Required for vector search and hybrid search. See the example with a vector service and the example without a vector service for usage.
`lexical` | `CollectionLexicalOptions` | Optional. The lexical search configuration for the collection. Available only for collections in databases in supported AWS regions. See the example for usage.
`rerank` | `CollectionRerankOptions` | Optional. The reranker configuration for the collection. Available only for collections in databases in supported AWS regions. See the example for usage.
`indexing` | `dict` | Optional. The selective indexing configuration for the collection. See the example to specify which fields to index and the example to specify which fields to not index for usage. Default: all fields of all documents are indexed.
`default_id` | `CollectionDefaultIDOptions` | Optional. Specifies the default ID type for documents in the collection. This is used when you insert a document without an `_id` field. See the example for usage. For more information, see Document IDs.
Use the createCollection method, which belongs to the Db class.
Method signature
async createCollection<Schema extends SomeDoc = SomeDoc>(
name: string,
options?: {
vector?: CollectionVectorOptions,
indexing?: CollectionIndexingOptions<Schema>,
defaultId?: CollectionDefaultIdOptions,
lexical?: CollectionLexicalOptions,
rerank?: CollectionRerankOptions,
logging?: DataAPILoggingConfig,
keyspace?: string,
embeddingApiKey?: string | EmbeddingHeadersProvider,
serdes?: CollectionSerDesConfig,
timeoutDefaults?: TimeoutDescriptor,
timeout?: number | TimeoutDescriptor,
}
): Collection<Schema>
Name | Type | Summary
---|---|---
`name` | `string` | The name of the new collection. Collection names must follow the Data API naming rules.
`options` | `CreateCollectionOptions` | Optional. The options for this operation. See Properties of `options` for details.
Name | Type | Summary
---|---|---
`vector` | `CollectionVectorOptions` | Optional. The vector configuration for the collection. This includes things like the vector dimension and similarity metric, as well as settings for server-side embedding generation if you want your collection to have vectorize enabled. Required for vector search and hybrid search. See the example with a vector service and the example without a vector service for usage.
`lexical` | `CollectionLexicalOptions` | Optional. The lexical search configuration for the collection. Available only for collections in databases in supported AWS regions. See the example for usage.
`rerank` | `CollectionRerankOptions` | Optional. The reranker configuration for the collection. Available only for collections in databases in supported AWS regions. See the example for usage.
`indexing` | `CollectionIndexingOptions<Schema>` | Optional. The selective indexing configuration for the collection. See the example to specify which fields to index and the example to specify which fields to not index for usage. Default: all fields of all documents are indexed.
`defaultId` | `CollectionDefaultIdOptions` | Optional. Specifies the default ID type for documents in the collection. This is used when you insert a document without an `_id` field. See the example for usage. For more information, see Document IDs.
`embeddingApiKey` | `string` or `EmbeddingHeadersProvider` | Optional. This only applies to collections with a vectorize embedding provider integration. Use this option to provide the API key directly with headers instead of using an API key in the Astra DB KMS. The API key is sent to the Data API for every operation on the collection. It is useful when a vectorize service is configured but no credentials are stored, or when you want to override the stored credentials. For more information, see Auto-generate embeddings with vectorize.
`keyspace` | `string` | Optional. The keyspace in which to create the collection. Default: the working keyspace for the database.
`logging` | `DataAPILoggingConfig` | Optional. The configuration for logging events emitted by the DataAPIClient.
`serdes` | `CollectionSerDesConfig` | Optional. The configuration for serialization/deserialization by the DataAPIClient. For more information, see Custom Ser/Des.
`timeoutDefaults` | `TimeoutDescriptor` | Optional. The default timeouts to apply to operations performed on this Collection instance.
`timeout` | `number` or `TimeoutDescriptor` | Optional. The timeout to apply to this method. Default: 60 seconds, unless you specified a different default along the Options Hierarchy.
Use the createCollection method, which belongs to the com.datastax.astra.client.Database class.
Method signature
Collection<Document> createCollection(String collectionName)
Collection<Document> createCollection(
String collectionName,
CollectionDefinition collectionDefinition
)
Collection<Document> createCollection(
String collectionName,
CollectionDefinition collectionDefinition,
CreateCollectionOptions options
)
<T> Collection<T> createCollection(
String collectionName,
Class<T> documentClass
)
<T> Collection<T> createCollection(
String collectionName,
CollectionDefinition collectionDefinition,
Class<T> documentClass
)
<T> Collection<T> createCollection(
String collectionName,
CollectionDefinition collectionDefinition,
Class<T> documentClass,
CreateCollectionOptions options
)
Name | Type | Summary
---|---|---
`collectionName` | `String` | The name of the new collection. Collection names must follow the Data API naming rules.
`collectionDefinition` | `CollectionDefinition` | Optional. Settings for the collection, including vector options, the default ID format, and indexing options.
`options` | `CreateCollectionOptions` | Optional. Options for the operation, including the keyspace.
`documentClass` | `Class<T>` | Optional. Use a specialized bean for documents in the collection instead of the default `Document` class.
Use the createCollection command.
Command signature
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": OPTIONS
}
}'
Name | Type | Summary
---|---|---
`name` | `string` | The name of the new collection. Collection names must follow the Data API naming rules.
`options` | `object` | Optional. The options for this operation. See Properties of `options` for details.
Name | Type | Summary
---|---|---
`defaultId` | `object` | Optional. Specifies the default ID type for documents in the collection. This is used when you insert a document without an `_id` field. See the example for usage. For more information, see Document IDs.
`vector` | `object` | Optional. The vector configuration for the collection. This includes things like the vector dimension and similarity metric, as well as settings for server-side embedding generation if you want your collection to have vectorize enabled. Required for vector search and hybrid search. See the example with a vector service and the example without a vector service for usage.
`lexical` | `object` | Optional. The lexical search configuration for the collection. Available only for collections in databases in supported AWS regions. See the example for usage.
`rerank` | `object` | Optional. The reranker configuration for the collection. Available only for collections in databases in supported AWS regions. See the example for usage.
`indexing` | `object` | Optional. Configures selective indexing for data inserted into the collection. Use `allow` or `deny` to list the document fields to include in or exclude from indexing. See the example to specify which fields to index and the example to specify which fields to not index for usage. Default: all fields of all documents are indexed.
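Taken together, these options form the JSON body sent with the command. A standard-library Python sketch assembling a hypothetical body — the collection name and option values here are illustrative, not defaults:

```python
import json

# Illustrative createCollection body combining several options from the table above
payload = {
    "createCollection": {
        "name": "my_collection",
        "options": {
            # Autogenerate UUID _id values for documents inserted without one
            "defaultId": {"type": "uuid"},
            # Vector-enable the collection with 1024-dimension cosine similarity
            "vector": {"dimension": 1024, "metric": "cosine"},
            # Selective indexing: exclude a hypothetical raw_text field
            "indexing": {"deny": ["raw_text"]},
        },
    }
}

body = json.dumps(payload)
```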
Examples
The following examples demonstrate how to create a collection.
Create a collection that is not vector-enabled
- Python
- TypeScript
- Java
- curl
from astrapy import DataAPIClient
# Get a database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection = database.create_collection("COLLECTION_NAME")
- Typed collections
- Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
// Define the type for the collection
interface User {
name: string;
age?: number;
}
// Create a collection
(async function () {
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
// Create a collection
(async function () {
const collection = await database.createCollection("COLLECTION_NAME");
})();
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
// Create a collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME");
}
}
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {}
}
}'
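The same request can be issued from Python instead of curl. A standard-library sketch: the endpoint, keyspace, and token are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_ENDPOINT = "https://example.apps.astra.datastax.com"  # placeholder
KEYSPACE_NAME = "default_keyspace"
APPLICATION_TOKEN = "APPLICATION_TOKEN"  # placeholder

body = json.dumps({"createCollection": {"name": "COLLECTION_NAME", "options": {}}})

request = urllib.request.Request(
    url=f"{API_ENDPOINT}/api/json/v1/{KEYSPACE_NAME}",
    data=body.encode("utf-8"),
    headers={
        "Token": APPLICATION_TOKEN,
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(request) would send it and return the JSON response
```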
Create a collection that can store vector embeddings
Collections that are vector-enabled can store vector embeddings in the reserved $vector field and work with vector search.
- Python
- TypeScript
- Java
- curl
The Python client supports multiple ways to create a collection:
- You can define the collection parameters in a CollectionDefinition object and then create the collection from the CollectionDefinition object.
- You can use a fluent interface to build the collection definition and then create the collection from the definition.
- CollectionDefinition object
- Fluent interface
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import CollectionDefinition, CollectionVectorOptions
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
dimension=1024,
metric=VectorMetric.COSINE,
),
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
from astrapy.constants import VectorMetric
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(1024)
.set_vector_metric(VectorMetric.COSINE)
.build()
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
- Typed collections
- Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector as an inline field in your interfaces, or you can extend the utility VectorDoc type provided by the client.
import { DataAPIClient, VectorDoc } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
// Define the type for the collection
interface User extends VectorDoc {
name: string;
age?: number;
}
(async function () {
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
{
vector: {
dimension: 1024,
metric: "cosine",
},
},
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, or type-related issues will occur. Consider using a type like VectorDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector field to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
(async function () {
const collection = await database.createCollection("COLLECTION_NAME", {
vector: {
dimension: 1024,
metric: "cosine",
},
});
})();
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.vector.SimilarityMetric;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
// Create a collection
CollectionDefinition collectionDefinition =
new CollectionDefinition().vectorDimension(1024).vectorSimilarity(SimilarityMetric.COSINE);
Collection<Document> collection =
database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": 1024,
"metric": "cosine"
}
}
}
}'
Create a collection that can automatically generate vector embeddings
If you want to automatically generate vector embeddings, create a vector-enabled collection and configure an embedding provider integration for the collection.
The configuration depends on the embedding provider.
- Python
- TypeScript
- Java
- curl
Azure OpenAI
For more detailed instructions, see Integrate Azure OpenAI as an embedding provider.
- CollectionDefinition object
- Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="azureOpenAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
parameters={
"resourceName": "RESOURCE_NAME",
"deploymentId": "DEPLOYMENT_ID",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="azureOpenAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
parameters={
"resourceName": "RESOURCE_NAME",
"deploymentId": "DEPLOYMENT_ID",
},
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Azure OpenAI API key that you want to use. Must be the name of an existing Azure OpenAI API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `text-embedding-3-small`, `text-embedding-3-large`, and `text-embedding-ada-002`. For Azure OpenAI, you must select the model that matches the one deployed to your `DEPLOYMENT_ID` in Azure.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- `RESOURCE_NAME`: The name of your Azure OpenAI Service resource, as defined in the resource’s Instance details. For more information, see the Azure OpenAI documentation.
- `DEPLOYMENT_ID`: Your Azure OpenAI resource’s Deployment name. For more information, see the Azure OpenAI documentation.
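For reference, the same Azure OpenAI configuration can be expressed as a raw JSON body for the HTTP API. This sketch assumes the camelCase field names used by the HTTP API (`modelName`, `providerKey`), and the dimension value is illustrative:

```python
import json

# Raw "options" object for a createCollection call with Azure OpenAI vectorize
options = {
    "vector": {
        "dimension": 1536,  # illustrative; must be supported by the deployed model
        "metric": "cosine",
        "service": {
            "provider": "azureOpenAI",
            "modelName": "text-embedding-3-small",
            "authentication": {"providerKey": "API_KEY_NAME"},
            "parameters": {
                "resourceName": "RESOURCE_NAME",
                "deploymentId": "DEPLOYMENT_ID",
            },
        },
    }
}

body = json.dumps({"createCollection": {"name": "COLLECTION_NAME", "options": options}})
```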
Hugging Face - Dedicated
For more detailed instructions, see Integrate Hugging Face Dedicated as an embedding provider.
- CollectionDefinition object
- Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="huggingfaceDedicated",
model_name="endpoint-defined-model",
authentication={
"providerKey": "API_KEY_NAME",
},
parameters={
"endpointName": "ENDPOINT_NAME",
"regionName": "REGION_NAME",
"cloudName": "CLOUD_NAME",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="huggingfaceDedicated",
model_name="endpoint-defined-model",
authentication={
"providerKey": "API_KEY_NAME",
},
parameters={
"endpointName": "ENDPOINT_NAME",
"regionName": "REGION_NAME",
"cloudName": "CLOUD_NAME",
},
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Hugging Face Dedicated user access token that you want to use. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The only available model is `endpoint-defined-model`. For Hugging Face Dedicated, you must deploy the model as a text embeddings inference (TEI) container. You must set `MODEL_NAME` to `endpoint-defined-model` because this integration uses the model specified in your dedicated endpoint configuration.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- `ENDPOINT_NAME`: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is `https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud`, the endpoint name is `mtp1x7muf6qyn3yh`.
- `REGION_NAME`: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, `us-east-2`.
- `CLOUD_NAME`: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, `aws`.
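The endpoint name, region, and cloud values can all be read directly off the endpoint URL. A small standard-library sketch using the example URL from the list above:

```python
from urllib.parse import urlparse

url = "https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud"

# Hostname layout: <endpoint-name>.<region>.<cloud>.endpoints.huggingface.cloud
host_parts = urlparse(url).hostname.split(".")

endpoint_name = host_parts[0]  # "mtp1x7muf6qyn3yh"
region_name = host_parts[1]    # "us-east-2"
cloud_name = host_parts[2]     # "aws"

print(endpoint_name, region_name, cloud_name)
```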
Hugging Face - Serverless
For more detailed instructions, see Integrate Hugging Face Serverless as an embedding provider.
- CollectionDefinition object
- Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="huggingface",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="huggingface",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
}
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Hugging Face Serverless user access token that you want to use. Must be the name of an existing Hugging Face Serverless user access token in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `sentence-transformers/all-MiniLM-L6-v2`, `intfloat/multilingual-e5-large`, `intfloat/multilingual-e5-large-instruct`, `BAAI/bge-small-en-v1.5`, `BAAI/bge-base-en-v1.5`, and `BAAI/bge-large-en-v1.5`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Jina AI
For more detailed instructions, see Integrate Jina AI as an embedding provider.
- CollectionDefinition object
- Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="jinaAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="jinaAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
}
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Jina AI API key that you want to use. Must be the name of an existing Jina AI API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `jina-embeddings-v2-base-en`, `jina-embeddings-v2-base-de`, `jina-embeddings-v2-base-es`, `jina-embeddings-v2-base-code`, and `jina-embeddings-v2-base-zh`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Mistral AI
For more detailed instructions, see Integrate Mistral AI as an embedding provider.
- CollectionDefinition object
- Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="mistral",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="mistral",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
}
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Mistral AI API key that you want to use. Must be the name of an existing Mistral AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: mistral-embed.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
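Because create_collection also accepts a plain dictionary for its definition parameter (see the method signature), the Mistral configuration above can also be sketched as raw data. The concrete values used here (cosine metric, 1024 dimensions for mistral-embed) are illustrative assumptions, not requirements.

```python
# Sketch only: the Mistral definition expressed as a plain dict, which
# create_collection also accepts for its `definition` parameter. Note the
# Data API's camelCase field names ("modelName", "providerKey").
mistral_definition = {
    "vector": {
        "metric": "cosine",  # assumed for this sketch; DOT_PRODUCT and EUCLIDEAN also exist
        "dimension": 1024,   # assumed; must be supported by the chosen model
        "service": {
            "provider": "mistral",
            "modelName": "mistral-embed",
            "authentication": {"providerKey": "API_KEY_NAME"},
        },
    },
}

# Usage (same call as above, with the dict in place of CollectionDefinition):
# collection = database.create_collection("COLLECTION_NAME", definition=mistral_definition)
```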
NVIDIA
For more detailed instructions, see Integrate NVIDIA as an embedding provider.
-
CollectionDefinition object
-
Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.COSINE,
service=VectorServiceOptions(
provider="nvidia",
model_name="NV-Embed-QA",
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_metric(VectorMetric.COSINE)
.set_vector_service(
provider="nvidia",
model_name="NV-Embed-QA"
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
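Unlike the other provider examples, the NVIDIA example above passes no authentication block and omits the dimension setting, letting the model's default dimension apply. Expressed as the equivalent plain dict that create_collection also accepts, the minimal definition looks like this (sketch only):

```python
# Sketch: the NVIDIA definition above as a plain dict. There is no
# "authentication" block (the example above passes none), and "dimension"
# is omitted so the model's default dimension applies.
nvidia_definition = {
    "vector": {
        "metric": "cosine",
        "service": {
            "provider": "nvidia",
            "modelName": "NV-Embed-QA",
        },
    },
}

# collection = database.create_collection("COLLECTION_NAME", definition=nvidia_definition)
```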
OpenAI
For more detailed instructions, see Integrate OpenAI as an embedding provider.
-
CollectionDefinition object
-
Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="openai",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
parameters={
"organizationId": "ORGANIZATION_ID",
"projectId": "PROJECT_ID",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="openai",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
parameters={
"organizationId": "ORGANIZATION_ID",
"projectId": "PROJECT_ID",
},
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the OpenAI API key that you want to use. Must be the name of an existing OpenAI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.
- PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This cannot be the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.
Upstage
For more detailed instructions, see Integrate Upstage as an embedding provider.
-
CollectionDefinition object
-
Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="upstageAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="upstageAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
}
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Upstage API key that you want to use. Must be the name of an existing Upstage API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: solar-embedding-1-large.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Voyage AI
For more detailed instructions, see Integrate Voyage AI as an embedding provider.
-
CollectionDefinition object
-
Fluent interface
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.SIMILARITY_METRIC,
dimension=MODEL_DIMENSIONS,
service=VectorServiceOptions(
provider="voyageAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
},
)
)
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionVectorOptions,
VectorServiceOptions,
)
# Instantiate the client
client = DataAPIClient()
# Connect to a database
database = client.get_database(
os.environ["API_ENDPOINT"],
token=os.environ["APPLICATION_TOKEN"]
)
# Define the collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(MODEL_DIMENSIONS)
.set_vector_metric(VectorMetric.SIMILARITY_METRIC)
.set_vector_service(
provider="voyageAI",
model_name="MODEL_NAME",
authentication={
"providerKey": "API_KEY_NAME",
}
)
.build()
)
# Create the collection
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
print(f"* Collection: {collection.full_name}\n")
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Voyage AI API key that you want to use. Must be the name of an existing Voyage AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: voyage-2, voyage-code-2, voyage-finance-2, voyage-large-2, voyage-large-2-instruct, voyage-law-2, voyage-multilingual-2.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Azure OpenAI
For more detailed instructions, see Integrate Azure OpenAI as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "azureOpenAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
parameters: {
resourceName: "RESOURCE_NAME",
deploymentId: "DEPLOYMENT_ID",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur.
Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "azureOpenAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
parameters: {
resourceName: "RESOURCE_NAME",
deploymentId: "DEPLOYMENT_ID",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Azure OpenAI API key that you want to use. Must be the name of an existing Azure OpenAI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002. For Azure OpenAI, you must select the model that matches the one deployed to your DEPLOYMENT_ID in Azure.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- RESOURCE_NAME: The name of your Azure OpenAI Service resource, as defined in the resource’s Instance details. For more information, see the Azure OpenAI documentation.
- DEPLOYMENT_ID: Your Azure OpenAI resource’s Deployment name. For more information, see the Azure OpenAI documentation.
Hugging Face - Dedicated
For more detailed instructions, see Integrate Hugging Face Dedicated as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "huggingfaceDedicated",
modelName: "endpoint-defined-model",
authentication: {
providerKey: "API_KEY_NAME",
},
parameters: {
endpointName: "ENDPOINT_NAME",
regionName: "REGION_NAME",
cloudName: "CLOUD_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur.
Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "huggingfaceDedicated",
modelName: "endpoint-defined-model",
authentication: {
providerKey: "API_KEY_NAME",
},
parameters: {
endpointName: "ENDPOINT_NAME",
regionName: "REGION_NAME",
cloudName: "CLOUD_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Hugging Face Dedicated user access token that you want to use. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: endpoint-defined-model. For Hugging Face Dedicated, you must deploy the model as a Text Embeddings Inference (TEI) container. You must set MODEL_NAME to endpoint-defined-model because this integration uses the model specified in your dedicated endpoint configuration.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- ENDPOINT_NAME: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud, the endpoint name is mtp1x7muf6qyn3yh.
- REGION_NAME: The cloud provider region where your Hugging Face Dedicated endpoint is deployed. For example, us-east-2.
- CLOUD_NAME: The cloud provider where your Hugging Face Dedicated endpoint is deployed. For example, aws.
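To make the relationship between the endpoint URL and the ENDPOINT_NAME, REGION_NAME, and CLOUD_NAME values concrete, here is a small illustrative Python helper (not part of any Data API client) that splits the endpoint hostname into those three parts:

```python
from urllib.parse import urlparse

# Illustrative helper (not part of the Data API clients): derive the
# ENDPOINT_NAME, REGION_NAME, and CLOUD_NAME values from a Hugging Face
# Dedicated endpoint URL of the form
#   https://<endpoint>.<region>.<cloud>.endpoints.huggingface.cloud
def parse_hf_dedicated_url(url: str) -> dict:
    host = urlparse(url).hostname
    endpoint_name, region_name, cloud_name = host.split(".")[:3]
    return {
        "endpointName": endpoint_name,
        "regionName": region_name,
        "cloudName": cloud_name,
    }

parts = parse_hf_dedicated_url(
    "https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud"
)
# parts["endpointName"] == "mtp1x7muf6qyn3yh"
```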
Hugging Face - Serverless
For more detailed instructions, see Integrate Hugging Face Serverless as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "huggingface",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur.
Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "huggingface",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Hugging Face Serverless user access token that you want to use. Must be the name of an existing Hugging Face Serverless user access token in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: sentence-transformers/all-MiniLM-L6-v2, intfloat/multilingual-e5-large, intfloat/multilingual-e5-large-instruct, BAAI/bge-small-en-v1.5, BAAI/bge-base-en-v1.5, BAAI/bge-large-en-v1.5.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Jina AI
For more detailed instructions, see Integrate Jina AI as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "jinaAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur.
Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "jinaAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Jina AI API key that you want to use. Must be the name of an existing Jina AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: jina-embeddings-v2-base-en, jina-embeddings-v2-base-de, jina-embeddings-v2-base-es, jina-embeddings-v2-base-code, jina-embeddings-v2-base-zh.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Mistral AI
For more detailed instructions, see Integrate Mistral AI as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "mistral",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur.
Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "mistral",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Mistral AI API key that you want to use. Must be the name of an existing Mistral AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are: mistral-embed.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
NVIDIA
For more detailed instructions, see Integrate NVIDIA as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
metric: "cosine",
service: {
provider: "nvidia",
modelName: "NV-Embed-QA",
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur.
Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
metric: "cosine",
service: {
provider: "nvidia",
modelName: "NV-Embed-QA",
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
OpenAI
For more detailed instructions, see Integrate OpenAI as an embedding provider.
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "openai",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
parameters: {
organizationId: "ORGANIZATION_ID",
projectId: "PROJECT_ID",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option. The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur. Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct types.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "openai",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
parameters: {
organizationId: "ORGANIZATION_ID",
projectId: "PROJECT_ID",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the OpenAI API key that you want to use. Must be the name of an existing OpenAI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.
- PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This cannot use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.
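For instance, a filled-in definition for the text-embedding-3-small model might look like the following sketch. The key name my-openai-key is a hypothetical placeholder for an API key stored in the Astra Portal, and 1536 is that model's default output dimension:

```typescript
// Hypothetical filled-in collection definition for OpenAI's text-embedding-3-small
const openAiDefinition = {
  vector: {
    dimension: 1536, // text-embedding-3-small's default output dimension
    metric: "cosine",
    service: {
      provider: "openai",
      modelName: "text-embedding-3-small",
      authentication: {
        providerKey: "my-openai-key", // placeholder key name
      },
    },
  },
};
```

Pass an object like this as the second argument to database.createCollection, as in the examples above.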
Upstage
For more detailed instructions, see Integrate Upstage as an embedding provider.
- Typed collections
- Untyped collections
You can manually define a client-side type for your collection to help statically catch errors. You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "upstageAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option. The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur. Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct types.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "upstageAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Upstage API key that you want to use. Must be the name of an existing Upstage API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The only available model is solar-embedding-1-large.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Voyage AI
For more detailed instructions, see Integrate Voyage AI as an embedding provider.
- Typed collections
- Untyped collections
You can manually define a client-side type for your collection to help statically catch errors. You can define $vector and $vectorize as inline fields in your interfaces, or you can extend the utility VectorizeDoc type provided by the client.
import { DataAPIClient, VectorizeDoc } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the type for the collection
interface User extends VectorizeDoc {
name: string,
age?: number,
}
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "voyageAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
collection_definition
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option. The $vector field must still be number[] or DataAPIVector, and the $vectorize field must still be a string, or type-related issues will occur. Consider using a type like VectorizeDoc & SomeDoc, which allows the documents to remain untyped but still statically requires the $vector and $vectorize fields to have the correct types.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Instantiate the client
const client = new DataAPIClient();
// Connect to a database
const database = client.db(process.env.API_ENDPOINT, {
token: process.env.APPLICATION_TOKEN,
});
// Define the collection
const collection_definition = {
vector: {
dimension: MODEL_DIMENSIONS,
metric: "SIMILARITY_METRIC",
service: {
provider: "voyageAI",
modelName: "MODEL_NAME",
authentication: {
providerKey: "API_KEY_NAME",
},
},
},
};
(async function () {
// Create the collection
const collection = await database.createCollection(
"COLLECTION_NAME",
collection_definition
);
})();
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Voyage AI API key that you want to use. Must be the name of an existing Voyage AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are voyage-2, voyage-code-2, voyage-finance-2, voyage-large-2, voyage-large-2-instruct, voyage-law-2, and voyage-multilingual-2.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
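As another hypothetical filled-in sketch, a definition for the voyage-2 model could look like the following. The key name my-voyage-key is a placeholder for a key stored in the Astra Portal, and 1024 is assumed here as voyage-2's output dimension:

```typescript
// Hypothetical filled-in collection definition for Voyage AI's voyage-2
const voyageDefinition = {
  vector: {
    dimension: 1024, // assumed output dimension for voyage-2
    metric: "cosine",
    service: {
      provider: "voyageAI",
      modelName: "voyage-2",
      authentication: {
        providerKey: "my-voyage-key", // placeholder key name
      },
    },
  },
};
```

As with the other providers, pass this object as the second argument to database.createCollection.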
Azure OpenAI
For more detailed instructions, see Integrate Azure OpenAI as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
import java.util.HashMap;
import java.util.Map;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
Map<String, Object> parameters = new HashMap<>();
parameters.put("resourceName", "RESOURCE_NAME");
parameters.put("deploymentId", "DEPLOYMENT_ID");
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"azureOpenAI",
"MODEL_NAME",
"API_KEY_NAME",
parameters);
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Azure OpenAI API key that you want to use. Must be the name of an existing Azure OpenAI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002. For Azure OpenAI, you must select the model that matches the one deployed to your DEPLOYMENT_ID in Azure.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- RESOURCE_NAME: The name of your Azure OpenAI Service resource, as defined in the resource’s Instance details. For more information, see the Azure OpenAI documentation.
- DEPLOYMENT_ID: Your Azure OpenAI resource’s Deployment name. For more information, see the Azure OpenAI documentation.
Hugging Face - Dedicated
For more detailed instructions, see Integrate Hugging Face Dedicated as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
import java.util.HashMap;
import java.util.Map;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
Map<String, Object> parameters = new HashMap<>();
parameters.put("endpointName", "ENDPOINT_NAME");
parameters.put("regionName", "REGION_NAME");
parameters.put("cloudName", "CLOUD_NAME");
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"huggingfaceDedicated",
"endpoint-defined-model",
"API_KEY_NAME",
parameters);
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Hugging Face Dedicated user access token that you want to use. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The only available value is endpoint-defined-model. For Hugging Face Dedicated, you must deploy the model as a text embeddings inference (TEI) container. You must set MODEL_NAME to endpoint-defined-model because this integration uses the model specified in your dedicated endpoint configuration.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- ENDPOINT_NAME: The programmatically-generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud, the endpoint name is mtp1x7muf6qyn3yh.
- REGION_NAME: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, us-east-2.
- CLOUD_NAME: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, aws.
Hugging Face - Serverless
For more detailed instructions, see Integrate Hugging Face Serverless as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"huggingface",
"MODEL_NAME",
"API_KEY_NAME");
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Hugging Face Serverless user access token that you want to use. Must be the name of an existing Hugging Face Serverless user access token in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are sentence-transformers/all-MiniLM-L6-v2, intfloat/multilingual-e5-large, intfloat/multilingual-e5-large-instruct, BAAI/bge-small-en-v1.5, BAAI/bge-base-en-v1.5, and BAAI/bge-large-en-v1.5.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Jina AI
For more detailed instructions, see Integrate Jina AI as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"jinaAI",
"MODEL_NAME",
"API_KEY_NAME");
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Jina AI API key that you want to use. Must be the name of an existing Jina AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are jina-embeddings-v2-base-en, jina-embeddings-v2-base-de, jina-embeddings-v2-base-es, jina-embeddings-v2-base-code, and jina-embeddings-v2-base-zh.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Mistral AI
For more detailed instructions, see Integrate Mistral AI as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"mistral",
"MODEL_NAME",
"API_KEY_NAME");
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Mistral AI API key that you want to use. Must be the name of an existing Mistral AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The only available model is mistral-embed.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
NVIDIA
For more detailed instructions, see Integrate NVIDIA as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorSimilarity(SimilarityMetric.COSINE)
.vectorize(
"nvidia",
"NV-Embed-QA");
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
OpenAI
For more detailed instructions, see Integrate OpenAI as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
import java.util.HashMap;
import java.util.Map;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
Map<String, Object> parameters = new HashMap<>();
parameters.put("organizationId", "ORGANIZATION_ID");
parameters.put("projectId", "PROJECT_ID");
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"openai",
"MODEL_NAME",
"API_KEY_NAME",
parameters);
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the OpenAI API key that you want to use. Must be the name of an existing OpenAI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.
- PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This cannot use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.
Upstage
For more detailed instructions, see Integrate Upstage as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"upstageAI",
"MODEL_NAME",
"API_KEY_NAME");
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Upstage API key that you want to use. Must be the name of an existing Upstage API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The only available model is solar-embedding-1-large.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Voyage AI
For more detailed instructions, see Integrate Voyage AI as an embedding provider.
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.databases.Database;
import com.datastax.astra.client.databases.DatabaseOptions;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.core.options.DataAPIClientOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
public class Example {
public static void main(String[] args) {
// Instantiate the client
DataAPIClient client = new DataAPIClient(new DataAPIClientOptions());
// Connect to a database
Database database =
client.getDatabase(
System.getenv("API_ENDPOINT"),
new DatabaseOptions(System.getenv("APPLICATION_TOKEN"), new DataAPIClientOptions()));
// Define the collection
CollectionDefinition collectionDefinition =
new CollectionDefinition()
.vectorDimension(MODEL_DIMENSIONS)
.vectorSimilarity(SimilarityMetric.SIMILARITY_METRIC)
.vectorize(
"voyageAI",
"MODEL_NAME",
"API_KEY_NAME");
// Create the collection
Collection<Document> collection = database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Voyage AI API key that you want to use. Must be the name of an existing Voyage AI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are voyage-2, voyage-code-2, voyage-finance-2, voyage-large-2, voyage-large-2-instruct, voyage-law-2, and voyage-multilingual-2.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Azure OpenAI
For more detailed instructions, see Integrate Azure OpenAI as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "azureOpenAI",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
},
"parameters": {
"resourceName": "RESOURCE_NAME",
"deploymentId": "DEPLOYMENT_ID"
}
}
}
}
}
}'
Replace the following:
- COLLECTION_NAME: The name for your collection.
- SIMILARITY_METRIC: The method you want to use to calculate vector similarity scores. The available metrics are COSINE (default), DOT_PRODUCT, and EUCLIDEAN.
- API_KEY_NAME: The name of the Azure OpenAI API key that you want to use. Must be the name of an existing Azure OpenAI API key in the Astra Portal.
- MODEL_NAME: The model that you want to use to generate embeddings. The available models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002. For Azure OpenAI, you must select the model that matches the one deployed to your DEPLOYMENT_ID in Azure.
- MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit dimension, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- RESOURCE_NAME: The name of your Azure OpenAI Service resource, as defined in the resource’s Instance details. For more information, see the Azure OpenAI documentation.
- DEPLOYMENT_ID: Your Azure OpenAI resource’s Deployment name. For more information, see the Azure OpenAI documentation.
Hugging Face - Dedicated
For more detailed instructions, see Integrate Hugging Face Dedicated as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"metric": "SIMILARITY_METRIC",
"dimension": MODEL_DIMENSIONS,
"service": {
"provider": "huggingfaceDedicated",
"modelName": "endpoint-defined-model",
"authentication": {
"providerKey": "API_KEY_NAME"
},
"parameters": {
"endpointName": "ENDPOINT_NAME",
"regionName": "REGION_NAME",
"cloudName": "CLOUD_NAME"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Hugging Face Dedicated user access token that you want to use. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The only available value is `endpoint-defined-model`. For Hugging Face Dedicated, you must deploy the model as a text embeddings inference (TEI) container. You must set `MODEL_NAME` to `endpoint-defined-model` because this integration uses the model specified in your dedicated endpoint configuration.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- `ENDPOINT_NAME`: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is `https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud`, the endpoint name is `mtp1x7muf6qyn3yh`.
- `REGION_NAME`: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, `us-east-2`.
- `CLOUD_NAME`: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, `aws`.
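Because the endpoint name, region, and cloud are all encoded in the endpoint hostname, you can read the three `parameters` values straight off the URL. The helper below is an illustrative sketch, not part of any client library:

```python
from urllib.parse import urlparse

def parse_hf_dedicated_url(url: str) -> dict:
    """Split a Hugging Face Dedicated endpoint URL into the values that the
    createCollection parameters expect: endpointName, regionName, cloudName."""
    # Hostname looks like: ENDPOINT_NAME.REGION_NAME.CLOUD_NAME.endpoints.huggingface.cloud
    host = urlparse(url).hostname
    endpoint_name, region_name, cloud_name = host.split(".")[:3]
    return {
        "endpointName": endpoint_name,
        "regionName": region_name,
        "cloudName": cloud_name,
    }

print(parse_hf_dedicated_url(
    "https://mtp1x7muf6qyn3yh.us-east-2.aws.endpoints.huggingface.cloud"
))
# → {'endpointName': 'mtp1x7muf6qyn3yh', 'regionName': 'us-east-2', 'cloudName': 'aws'}
```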
Hugging Face - Serverless
For more detailed instructions, see Integrate Hugging Face Serverless as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "huggingface",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Hugging Face Serverless user access token that you want to use. Must be the name of an existing Hugging Face Serverless user access token in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `sentence-transformers/all-MiniLM-L6-v2`, `intfloat/multilingual-e5-large`, `intfloat/multilingual-e5-large-instruct`, `BAAI/bge-small-en-v1.5`, `BAAI/bge-base-en-v1.5`, and `BAAI/bge-large-en-v1.5`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Jina AI
For more detailed instructions, see Integrate Jina AI as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "jinaAI",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Jina AI API key that you want to use. Must be the name of an existing Jina AI API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `jina-embeddings-v2-base-en`, `jina-embeddings-v2-base-de`, `jina-embeddings-v2-base-es`, `jina-embeddings-v2-base-code`, and `jina-embeddings-v2-base-zh`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Mistral AI
For more detailed instructions, see Integrate Mistral AI as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "mistral",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Mistral AI API key that you want to use. Must be the name of an existing Mistral AI API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The only available model is `mistral-embed`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
NVIDIA
For more detailed instructions, see Integrate NVIDIA as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"metric": "cosine",
"service": {
"provider": "nvidia",
"modelName": "NV-Embed-QA"
}
}
}
}
}'
OpenAI
For more detailed instructions, see Integrate OpenAI as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "openai",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
},
"parameters": {
"organizationId": "ORGANIZATION_ID",
"projectId": "PROJECT_ID"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the OpenAI API key that you want to use. Must be the name of an existing OpenAI API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `text-embedding-3-small`, `text-embedding-3-large`, and `text-embedding-ada-002`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
- `ORGANIZATION_ID`: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.
- `PROJECT_ID`: Optional. The ID of the OpenAI project that owns the API key. This cannot use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.
Upstage
For more detailed instructions, see Integrate Upstage as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "upstageAI",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Upstage API key that you want to use. Must be the name of an existing Upstage API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The only available model is `solar-embedding-1-large`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Voyage AI
For more detailed instructions, see Integrate Voyage AI as an embedding provider.
curl -sS -L -X POST "$API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"vector": {
"dimension": MODEL_DIMENSIONS,
"metric": "SIMILARITY_METRIC",
"service": {
"provider": "voyageAI",
"modelName": "MODEL_NAME",
"authentication": {
"providerKey": "API_KEY_NAME"
}
}
}
}
}
}'
Replace the following:

- `COLLECTION_NAME`: The name for your collection.
- `SIMILARITY_METRIC`: The method you want to use to calculate vector similarity scores. The available metrics are `COSINE` (default), `DOT_PRODUCT`, and `EUCLIDEAN`.
- `API_KEY_NAME`: The name of the Voyage AI API key that you want to use. Must be the name of an existing Voyage AI API key in the Astra Portal.
- `MODEL_NAME`: The model that you want to use to generate embeddings. The available models are `voyage-2`, `voyage-code-2`, `voyage-finance-2`, `voyage-large-2`, `voyage-large-2-instruct`, `voyage-law-2`, and `voyage-multilingual-2`.
- `MODEL_DIMENSIONS`: The number of dimensions that you want the generated vectors to have. Your chosen embedding model must support the specified number of dimensions. If you omit `dimension`, Astra can use a default dimension value. However, some models don’t have default dimensions. You can use the Data API to find supported embedding providers and their configuration parameters, including dimension ranges and default dimensions.
Create a collection that supports hybrid search
If you want to perform hybrid search on your collection, you must create a collection that has vector, lexical, and rerank enabled.
Your collection must also be in a database in the AWS `us-east-2` region.
Lexical and rerank are enabled by default when you create a collection in a database in the AWS `us-east-2` region, but you can optionally configure the lexical analyzer and the reranker model.
For configuration details about the lexical analyzer, see Find data with CQL analyzers. The following example uses a configuration suitable for English text.
For configuration details about the reranker model, inspect the available reranker models. Only the NVIDIA `llama-3.2-nv-rerankqa-1b-v2` reranker model is supported.
For configuration details about vector, see Create a collection that can store vector embeddings and Create a collection that can automatically generate vector embeddings.
-
Python
-
TypeScript
-
Java
-
curl
The Python client supports multiple ways to create a collection:
- You can define the collection parameters in a `CollectionDefinition` object and then create the collection from the `CollectionDefinition` object.
- You can use a fluent interface to build the collection definition and then create the collection from the definition.
-
CollectionDefinition object
-
Fluent interface
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import (
CollectionDefinition,
CollectionLexicalOptions,
CollectionRerankOptions,
CollectionVectorOptions,
RerankServiceOptions,
VectorServiceOptions,
)
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = CollectionDefinition(
vector=CollectionVectorOptions(
metric=VectorMetric.COSINE,
dimension=1024,
service=VectorServiceOptions(
provider="nvidia",
model_name="NV-Embed-QA",
),
),
lexical=CollectionLexicalOptions(
analyzer={
"tokenizer": {"name": "standard", "args": {}},
"filters": [
{"name": "lowercase"},
{"name": "stop"},
{"name": "porterstem"},
{"name": "asciifolding"},
],
"charFilters": [],
},
enabled=True,
),
rerank=CollectionRerankOptions(
enabled=True,
service=RerankServiceOptions(
provider="nvidia",
model_name="nvidia/llama-3.2-nv-rerankqa-1b-v2",
),
),
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
from astrapy.constants import VectorMetric
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = (
CollectionDefinition.builder()
.set_vector_dimension(1024)
.set_vector_metric(VectorMetric.COSINE)
.set_vector_service(
provider="nvidia",
model_name="NV-Embed-QA",
)
.set_lexical(
{
"tokenizer": {"name": "standard", "args": {}},
"filters": [
{"name": "lowercase"},
{"name": "stop"},
{"name": "porterstem"},
{"name": "asciifolding"},
],
"charFilters": [],
}
)
.set_rerank("nvidia", "nvidia/llama-3.2-nv-rerankqa-1b-v2")
.build()
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
You can define `$vector`, `$vectorize`, and `$lexical` as inline fields in your interfaces, or you can extend the utility `VectorDoc`, `VectorizeDoc`, and `LexicalDoc` types provided by the client.
import { DataAPIClient, LexicalDoc, VectorizeDoc } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
// Define the type for the collection
interface User extends VectorizeDoc, LexicalDoc {
name: string;
age?: number;
}
(async function () {
const collection = await database.createCollection<User>("COLLECTION_NAME", {
vector: {
dimension: 1024,
metric: "cosine",
service: {
provider: "nvidia",
modelName: "NV-Embed-QA",
},
},
lexical: {
enabled: true,
analyzer: {
tokenizer: {
name: "standard",
args: {},
},
filters: [
{
name: "lowercase",
},
{
name: "stop",
},
{
name: "porterstem",
},
{
name: "asciifolding",
},
],
charFilters: [],
},
},
rerank: {
enabled: true,
service: {
provider: "nvidia",
modelName: "nvidia/llama-3.2-nv-rerankqa-1b-v2",
},
},
});
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
The `$vector` field must still be `number[]` or `DataAPIVector`, and the `$vectorize` and `$lexical` fields must still be strings, or type-related issues will occur.
Consider using a type like `VectorDoc & LexicalDoc & SomeDoc` or `VectorizeDoc & LexicalDoc & SomeDoc`, which allows the documents to remain untyped but still statically requires the `$vector`, `$vectorize`, and `$lexical` fields to have the correct types.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
(async function () {
const collection = await database.createCollection("COLLECTION_NAME", {
vector: {
dimension: 1024,
metric: "cosine",
service: {
provider: "nvidia",
modelName: "NV-Embed-QA",
},
},
lexical: {
enabled: true,
analyzer: {
tokenizer: {
name: "standard",
args: {},
},
filters: [
{
name: "lowercase",
},
{
name: "stop",
},
{
name: "porterstem",
},
{
name: "asciifolding",
},
],
charFilters: [],
},
},
rerank: {
enabled: true,
service: {
provider: "nvidia",
modelName: "nvidia/llama-3.2-nv-rerankqa-1b-v2",
},
},
});
})();
The Java client supports multiple ways to create a collection:
- You can define the collection parameters in a `CollectionDefinition` object and then create the collection from the `CollectionDefinition` object.
- You can use a fluent interface to build the collection definition and then create the collection from the definition.
-
CollectionDefinition object
-
Fluent interface
import static com.datastax.astra.client.core.lexical.AnalyzerTypes.STANDARD;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.core.lexical.Analyzer;
import com.datastax.astra.client.core.lexical.LexicalOptions;
import com.datastax.astra.client.core.rerank.CollectionRerankOptions;
import com.datastax.astra.client.core.rerank.RerankServiceOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
import com.datastax.astra.client.core.vector.VectorOptions;
import com.datastax.astra.client.core.vectorize.VectorServiceOptions;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
// Create a collection
CollectionDefinition collectionDefinition = new CollectionDefinition();
// Vector Options
VectorServiceOptions vectorService =
new VectorServiceOptions().provider("nvidia").modelName("NV-Embed-QA");
VectorOptions vectorOptions =
new VectorOptions()
.dimension(1024)
.metric(SimilarityMetric.COSINE.getValue())
.service(vectorService);
collectionDefinition.vector(vectorOptions);
// Lexical Options
Analyzer analyzer =
new Analyzer()
.tokenizer(STANDARD.getValue())
.addFilter("lowercase")
.addFilter("stop")
.addFilter("porterstem")
.addFilter("asciifolding");
LexicalOptions lexicalOptions = new LexicalOptions().enabled(true).analyzer(analyzer);
collectionDefinition.lexical(lexicalOptions);
// Rerank Options
RerankServiceOptions rerankService =
new RerankServiceOptions()
.modelName("nvidia/llama-3.2-nv-rerankqa-1b-v2")
.provider("nvidia");
CollectionRerankOptions rerankOptions =
new CollectionRerankOptions().enabled(true).service(rerankService);
collectionDefinition.rerank(rerankOptions);
database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
import static com.datastax.astra.client.core.lexical.AnalyzerTypes.STANDARD;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.core.lexical.Analyzer;
import com.datastax.astra.client.core.lexical.LexicalOptions;
import com.datastax.astra.client.core.vector.SimilarityMetric;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
database.createCollection(
"COLLECTION_NAME",
new CollectionDefinition()
.vector(1024, SimilarityMetric.COSINE)
.vectorize("nvidia", "NV-Embed-QA")
.lexical(
new LexicalOptions()
.enabled(true)
.analyzer(
new Analyzer()
.tokenizer(STANDARD.getValue())
.addFilter("lowercase")
.addFilter("stop")
.addFilter("porterstem")
.addFilter("asciifolding")))
.rerank("nvidia", "nvidia/llama-3.2-nv-rerankqa-1b-v2"));
}
}
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"lexical": {
"analyzer": {
"tokenizer": {
"name": "standard",
"args": {}
},
"filters": [
{
"name": "lowercase"
},
{
"name": "stop"
},
{
"name": "porterstem"
},
{
"name": "asciifolding"
}
],
"charFilters": []
},
"enabled": true
},
"rerank": {
"enabled": true,
"service": {
"modelName": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
"provider": "nvidia"
}
},
"vector": {
"dimension": 1024,
"metric": "cosine",
"service": {
"provider": "nvidia",
"modelName": "NV-Embed-QA"
}
}
}
}
}'
Create a collection and specify the default ID format
For more information about the default ID format, see Document IDs. For allowed values, see the Parameters.
-
Python
-
TypeScript
-
Java
-
curl
The Python client supports multiple ways to create a collection:
- You can define the collection parameters in a `CollectionDefinition` object and then create the collection from the `CollectionDefinition` object.
- You can use a fluent interface to build the collection definition and then create the collection from the definition.
-
CollectionDefinition object
-
Fluent interface
from astrapy import DataAPIClient
from astrapy.info import (
CollectionDefinition,
CollectionDefaultIDOptions,
)
from astrapy.constants import DefaultIdType
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = CollectionDefinition(
default_id=CollectionDefaultIDOptions(DefaultIdType.OBJECTID),
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
from astrapy.constants import DefaultIdType
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = (
CollectionDefinition.builder().set_default_id(DefaultIdType.OBJECTID).build()
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
-
Typed collections
-
Untyped collections
You can manually define a client-side type for your collection to help statically catch errors.
The `_id` field type should match the `defaultId` type.
import { DataAPIClient, ObjectId } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
// Define the type for the collection
interface User {
_id: ObjectId;
name: string;
age?: number;
}
(async function () {
const collection = await database.createCollection<User>(
"COLLECTION_NAME",
{
defaultId: {
type: "objectId",
},
},
);
})();
If you don’t pass a type parameter, the collection remains untyped. This is a more flexible but less type-safe option.
However, if you later specify `_id` when you insert a document, DataStax recommends that it have the same type as the `defaultId`.
Consider using a type like `{ _id: ObjectId } & SomeDoc`, which allows the documents to remain untyped but still statically requires the `_id` field to have the correct type.
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
(async function () {
const collection = await database.createCollection("COLLECTION_NAME", {
defaultId: {
type: "objectId",
},
});
})();
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefaultIdTypes;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
// Create a collection
CollectionDefinition collectionDefinition =
new CollectionDefinition().defaultId(CollectionDefaultIdTypes.OBJECT_ID);
Collection<Document> collection =
database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"defaultId": {
"type": "uuidv7"
}
}
}
}'
Create a collection and specify which fields to index
For more information about selective indexing, see Indexes in collections.
-
Python
-
TypeScript
-
Java
-
curl
The Python client supports multiple ways to create a collection:
- You can define the collection parameters in a `CollectionDefinition` object and then create the collection from the `CollectionDefinition` object.
- You can use a fluent interface to build the collection definition and then create the collection from the definition.
-
CollectionDefinition object
-
Fluent interface
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = CollectionDefinition(
indexing={"allow": ["city", "country"]},
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = (
CollectionDefinition.builder().set_indexing("allow", ["city", "country"]).build()
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
(async function () {
const collection = await database.createCollection("COLLECTION_NAME", {
indexing: {
allow: ["city", "country"],
},
});
})();
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
// Create a collection
CollectionDefinition collectionDefinition =
new CollectionDefinition().indexingAllow("city", "country");
Collection<Document> collection =
database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"indexing": {
"allow": ["city", "country"]
}
}
}
}'
Create a collection and specify which fields shouldn’t be indexed
For more information about selective indexing, see Indexes in collections.
-
Python
-
TypeScript
-
Java
-
curl
The Python client supports multiple ways to create a collection:
- You can define the collection parameters in a `CollectionDefinition` object and then create the collection from the `CollectionDefinition` object.
- You can use a fluent interface to build the collection definition and then create the collection from the definition.
-
CollectionDefinition object
-
Fluent interface
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = CollectionDefinition(
indexing={"deny": ["city", "country"]},
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
from astrapy import DataAPIClient
from astrapy.info import CollectionDefinition
# Get an existing database
client = DataAPIClient()
database = client.get_database(
"API_ENDPOINT",
token="APPLICATION_TOKEN",
)
# Create a collection
collection_definition = (
CollectionDefinition.builder().set_indexing("deny", ["city", "country"]).build()
)
collection = database.create_collection(
"COLLECTION_NAME",
definition=collection_definition,
)
import { DataAPIClient } from "@datastax/astra-db-ts";
// Get a database
const client = new DataAPIClient("APPLICATION_TOKEN");
const database = client.db("API_ENDPOINT");
(async function () {
const collection = await database.createCollection("COLLECTION_NAME", {
indexing: {
deny: ["city", "country"],
},
});
})();
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.collections.Collection;
import com.datastax.astra.client.collections.definition.CollectionDefinition;
import com.datastax.astra.client.collections.definition.documents.Document;
import com.datastax.astra.client.databases.Database;
public class Example {
public static void main(String[] args) {
// Get a database
Database database = new DataAPIClient("APPLICATION_TOKEN").getDatabase("API_ENDPOINT");
// Create a collection
CollectionDefinition collectionDefinition =
new CollectionDefinition().indexingDeny("city", "country");
Collection<Document> collection =
database.createCollection("COLLECTION_NAME", collectionDefinition);
}
}
curl -sS -L -X POST "API_ENDPOINT/api/json/v1/KEYSPACE_NAME" \
--header "Token: APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
"createCollection": {
"name": "COLLECTION_NAME",
"options": {
"indexing": {
"deny": ["city", "country"]
}
}
}
}'
Client reference
-
Python
-
TypeScript
-
Java
-
curl
For more information, see the client reference.
For more information, see the client reference.
For more information, see the client reference.
Client reference documentation is not applicable for HTTP.