Integrate OpenAI as an embedding provider

Integrate OpenAI as an external embedding provider for Astra DB vectorize to leverage OpenAI’s embeddings API within Astra DB Serverless.

Prerequisites

To configure the OpenAI embedding provider integration, you need the following:

  • An active Astra account with the Organization Administrator role.

  • A Serverless (Vector) database.

    If this is your first time using Astra DB, follow the Quickstart to create a database and connect to it with an API client.

  • A paid OpenAI account.

  • Your OpenAI organization ID and project ID, if your OpenAI account belongs to multiple organizations or you use a legacy user API key to access projects. Additionally, you can’t use the default project for this integration. If necessary, create a project in your OpenAI account to use for this integration.

Create the OpenAI API key

Sign in to your OpenAI account and create a new API key with unrestricted access to the API. Make sure to copy the API key to a secure location.

Astra DB supports OpenAI user API keys, project API keys, and service account API keys. We recommend using a service account API key for better security and control in production environments. For more information about OpenAI API keys, see the OpenAI API reference.

Don’t modify or delete the API key in your OpenAI account after you’ve added it to Astra DB. This breaks the integration. For more information, see Embedding provider authentication.

Add the OpenAI integration to your organization

Use the Astra Portal to add the OpenAI embedding provider integration to your Astra DB organization:

  1. In the Astra Portal navigation menu, click Integrations.

  2. In the All Integrations section, select OpenAI Embedding provider.

  3. Click Add integration.

  4. In the Add Integration dialog, do the following:

    1. Enter a unique API key name.

      You can’t change API key names. Make sure the name is meaningful and that it helps you identify your OpenAI API key in Astra DB.

    2. Enter your OpenAI API key.

    3. In the Add databases to scope section, select each Serverless (Vector) database that you want to allow to use the OpenAI API key.

      When you create a collection in a scoped database, you can choose any of the API keys that are available to the database. Astra DB uses the API key to request embeddings from your embedding provider when you load data into the collection.

      You can add up to 10 databases at once, and you can add more databases later.

      For greater access control, you can add multiple API keys, and each API key can have different scoped databases. Additionally, you can add the same database to multiple API key scopes.

      For example, you can have a few broadly-scoped API keys or many narrowly-scoped API keys.

  5. Click Add Integration.

    The OpenAI integration status switches to ACTIVE, and your API key and its scoped databases appear in the API keys section. If you want to add more API keys for this integration, click Add API key.

When you create collections in the scoped databases, you can select the OpenAI integration, and then use it to generate embeddings.
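The relationship between API keys and scoped databases described above can be pictured as a mapping from key names to sets of databases. The following is only an illustrative sketch: the key names, database names, and the `keys_available_to` helper are hypothetical and are not part of Astra DB or its clients.

```python
from typing import Dict, List, Set

# Hypothetical API key names and database names, for illustration only.
key_scopes: Dict[str, Set[str]] = {
    "openai-prod": {"orders_db", "support_db"},
    "openai-dev": {"sandbox_db"},
    "openai-shared": {"orders_db", "sandbox_db"},
}


def keys_available_to(database: str, scopes: Dict[str, Set[str]]) -> List[str]:
    """Return the API key names whose scope includes the given database."""
    return sorted(name for name, dbs in scopes.items() if database in dbs)


# A database can appear in more than one API key's scope.
print(keys_available_to("orders_db", key_scopes))  # ['openai-prod', 'openai-shared']
```

In this model, a collection created in `orders_db` could choose either of the two listed keys, which mirrors the choice you make when creating a collection in a scoped database.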

Add the OpenAI integration to a new collection

Before you can use the OpenAI integration to generate embeddings, you must add the integration to a new collection.

You can’t change a collection’s embedding provider or embedding generation method after you create it. To use a different embedding provider, you must create a new collection with a different embedding provider integration.

  • Astra Portal

  • Python

  • TypeScript

  • Java

  • curl

  1. In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.

  2. Click Data Explorer.

  3. In the Namespace field, select the namespace where you want to create the collection, or use the default namespace, which is named default_keyspace.

  4. Click Create Collection.

  5. In the Create collection dialog, enter a name for the collection. Collection names can have no more than 50 characters.

  6. Turn on Vector-enabled collection.

  7. Under Embedding generation method, select the OpenAI embedding provider integration.

    If the integration isn’t listed, follow the steps in Add the OpenAI integration to your organization and Manage scoped databases to make sure the integration is active and that your database is scoped to at least one API key.

  8. Complete the following fields:

    • API key: The API key that you want the collection to use to access your embedding provider and generate embeddings. This field is only active if the database is scoped to multiple OpenAI API keys.

    • Organization ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.

    • Project ID: Optional. The ID of the OpenAI project that owns the API key. This can’t use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.

    • Embedding model: The model that you want to use to generate embeddings. The available models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.

    • Dimensions: The number of dimensions that you want the generated vectors to have. Most models automatically populate the Dimensions. You can edit this field if the model supports a range of dimensions or the embedding provider integration uses an endpoint-defined model. Your chosen embedding model must support the specified number of dimensions.

    • Similarity metric: The method you want to use to calculate vector similarity scores. The available metrics are Cosine, Dot Product, and Euclidean.

  9. Click Create collection.
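If you script collection creation instead of using the Astra Portal, it can help to check the same constraints before sending a request. The sketch below is a hypothetical helper, not part of any Astra DB client. It assumes the 50-character name limit and the model list stated above, and it uses the lowercase metric names cosine, dot_product, and euclidean that the Data API accepts for the Cosine, Dot Product, and Euclidean options.

```python
from typing import List

# Values taken from the collection settings described above.
SUPPORTED_MODELS = (
    "text-embedding-3-small",
    "text-embedding-3-large",
    "text-embedding-ada-002",
)
SUPPORTED_METRICS = ("cosine", "dot_product", "euclidean")
MAX_NAME_LENGTH = 50


def check_collection_settings(name: str, model: str, metric: str) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems: List[str] = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(
            f"collection name has {len(name)} characters; the limit is {MAX_NAME_LENGTH}"
        )
    if model not in SUPPORTED_MODELS:
        problems.append(f"unsupported model: {model}")
    if metric not in SUPPORTED_METRICS:
        problems.append(f"unsupported metric: {metric}")
    return problems


print(check_collection_settings("product_reviews", "text-embedding-3-small", "cosine"))  # []
```

This check doesn't validate dimensions, because the supported range depends on the model.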

Use the Python client to create a collection that uses the OpenAI integration.

Initialize the client

If you haven’t done so already, initialize the client before creating a collection:

import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.info import CollectionVectorServiceOptions

# Initialize the client and get a "Database" object
client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
database = client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])
print(f"* Database: {database.info().name}\n")

Create a collection integrated with OpenAI:

collection = database.create_collection(
    "COLLECTION_NAME",
    metric=VectorMetric.COSINE,
    dimension=MODEL_DIMENSIONS, # optional
    service=CollectionVectorServiceOptions(
        provider="openai",
        model_name="MODEL_NAME",
        authentication={
            "providerKey": "API_KEY_NAME",
        },
        parameters={
            "organizationId": "ORGANIZATION_ID",
            "projectId": "PROJECT_ID",
        },
    ),
)
print(f"* Collection: {collection.full_name}\n")

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the OpenAI API key that you want to use for your collection. Must be the name of an existing OpenAI API key in the Astra Portal.

  • MODEL_NAME: The desired model to use to generate embeddings. For OpenAI, the supported models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

    If your model has a default dimension value, you can omit dimension.

    You can use the Data API to find supported embedding providers and their configuration parameters, including dimensions ranges and default dimensions.

  • ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.

  • PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This can’t use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.
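Because ORGANIZATION_ID and PROJECT_ID are optional, you might prefer to include them only when they are set. The following sketch shows one way to do that. The `openai_parameters` helper is hypothetical, and it assumes that omitting the parameters is acceptable when neither value applies to your OpenAI account.

```python
from typing import Dict, Optional


def openai_parameters(
    organization_id: Optional[str] = None,
    project_id: Optional[str] = None,
) -> Optional[Dict[str, str]]:
    """Build the optional OpenAI parameters, leaving out any value that isn't set."""
    params: Dict[str, str] = {}
    if organization_id:
        params["organizationId"] = organization_id
    if project_id:
        params["projectId"] = project_id
    # Return None when there is nothing to send.
    return params or None


print(openai_parameters(project_id="PROJECT_ID"))  # {'projectId': 'PROJECT_ID'}
print(openai_parameters())  # None
```

The result could then be passed as the parameters argument in the create_collection example above.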

Use the TypeScript client to create a collection that uses the OpenAI integration.

Initialize the client

If you haven’t done so already, initialize the client before creating a collection:

import { DataAPIClient, VectorDoc, UUID } from '@datastax/astra-db-ts';

const { ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_API_ENDPOINT } = process.env;

// Initialize the client and get a 'Db' object
const client = new DataAPIClient(ASTRA_DB_APPLICATION_TOKEN);
const db = client.db(ASTRA_DB_API_ENDPOINT);

console.log(`* Connected to DB ${db.id}`);

Create a collection integrated with OpenAI:

(async function () {
  const collection = await db.createCollection('COLLECTION_NAME', {
    vector: {
      dimension: MODEL_DIMENSIONS, // optional
      service: {
        provider: 'openai',
        modelName: 'MODEL_NAME',
        authentication: {
          providerKey: 'API_KEY_NAME',
        },
        parameters: {
          organizationId: 'ORGANIZATION_ID',
          projectId: 'PROJECT_ID',
        },
      },
    },
  });
  console.log(`* Created collection ${collection.namespace}.${collection.collectionName}`);
})();

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the OpenAI API key that you want to use for your collection. Must be the name of an existing OpenAI API key in the Astra Portal.

  • MODEL_NAME: The desired model to use to generate embeddings. For OpenAI, the supported models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

    If your model has a default dimension value, you can omit dimension.

    You can use the Data API to find supported embedding providers and their configuration parameters, including dimensions ranges and default dimensions.

  • ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.

  • PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This can’t use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.

Use the Java client to create a collection that uses the OpenAI integration.

Initialize the client

If you haven’t done so already, initialize the client before creating a collection:

import com.datastax.astra.client.Collection;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.Database;
import com.datastax.astra.client.model.CollectionOptions;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.FindIterable;
import com.datastax.astra.client.model.FindOptions;
import com.datastax.astra.client.model.SimilarityMetric;

import static com.datastax.astra.client.model.SimilarityMetric.COSINE;

public class Quickstart {

  public static void main(String[] args) {
    // Loading Arguments
    String astraToken = System.getenv("ASTRA_DB_APPLICATION_TOKEN");
    String astraApiEndpoint = System.getenv("ASTRA_DB_API_ENDPOINT");

    // Initialize the client
    DataAPIClient client = new DataAPIClient(astraToken);
    System.out.println("Connected to AstraDB");

    Database db = client.getDatabase(astraApiEndpoint);
    System.out.println("Connected to Database.");
  }
}

Create a collection integrated with OpenAI:

// Requires java.util.Map and java.util.HashMap
Map<String, Object> params = new HashMap<>();
params.put("organizationId", "ORGANIZATION_ID");
params.put("projectId", "PROJECT_ID");
CollectionOptions.CollectionOptionsBuilder builder = CollectionOptions
       .builder()
       .vectorSimilarity(SimilarityMetric.COSINE)
       .vectorDimension(MODEL_DIMENSIONS) // optional
       .vectorize("openai", "MODEL_NAME", "API_KEY_NAME", params);
Collection<Document> collection = db
       .createCollection("COLLECTION_NAME", builder.build());

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the OpenAI API key that you want to use for your collection. Must be the name of an existing OpenAI API key in the Astra Portal.

  • MODEL_NAME: The desired model to use to generate embeddings. For OpenAI, the supported models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

    If your model has a default dimension value, you can omit dimension.

    You can use the Data API to find supported embedding providers and their configuration parameters, including dimensions ranges and default dimensions.

  • ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.

  • PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This can’t use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.

Use the Data API to create a collection that uses the OpenAI integration:

curl -sS --location -X POST "$ASTRA_DB_API_ENDPOINT/api/json/v1/default_keyspace" \
--header "Token: $ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "COLLECTION_NAME",
    "options": {
      "vector": {
        "dimension": MODEL_DIMENSIONS,
        "metric": "cosine",
        "service": {
          "provider": "openai",
          "modelName": "MODEL_NAME",
          "authentication": {
            "providerKey": "API_KEY_NAME"
          },
          "parameters": {
            "organizationId": "ORGANIZATION_ID",
            "projectId": "PROJECT_ID"
          }
        }
      }
    }
  }
}' | jq

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the OpenAI API key that you want to use for your collection. Must be the name of an existing OpenAI API key in the Astra Portal.

  • MODEL_NAME: The desired model to use to generate embeddings. For OpenAI, the supported models are text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

    If your model has a default dimension value, you can omit dimension.

    You can use the Data API to find supported embedding providers and their configuration parameters, including dimensions ranges and default dimensions.

  • ORGANIZATION_ID: Optional. The ID of the OpenAI organization that owns the API key. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about organization IDs, see the OpenAI API reference.

  • PROJECT_ID: Optional. The ID of the OpenAI project that owns the API key. This can’t use the default project. Only required if your OpenAI account belongs to multiple organizations or if you are using a legacy user API key to access projects. For more information about project IDs, see the OpenAI API reference.
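JSON doesn't allow comments or placeholder values, so optional fields such as dimension must be left out of the request body rather than annotated. The following sketch assembles the same createCollection body in Python and omits the optional fields when they aren't provided. The `create_collection_payload` helper and the example names are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def create_collection_payload(
    name: str,
    model_name: str,
    api_key_name: str,
    dimension: Optional[int] = None,
    metric: str = "cosine",
    parameters: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a createCollection request body for the OpenAI integration."""
    service: Dict[str, Any] = {
        "provider": "openai",
        "modelName": model_name,
        "authentication": {"providerKey": api_key_name},
    }
    if parameters:
        # organizationId and projectId are optional.
        service["parameters"] = dict(parameters)
    vector: Dict[str, Any] = {"metric": metric, "service": service}
    if dimension is not None:
        # Omit dimension to rely on the model's default, if it has one.
        vector["dimension"] = dimension
    return {"createCollection": {"name": name, "options": {"vector": vector}}}


payload = create_collection_payload(
    "my_collection", "text-embedding-3-small", "my-openai-key"
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be supplied as the --data value in the curl command above.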

If you get a Collection Limit Reached or TOO_MANY_INDEXES message, you must delete a collection before you can create a new one.

Serverless (Vector) databases created after June 24, 2024 can have up to 10 collections. Databases created before this date can have up to 5 collections. The collection limit is based on Storage Attached Indexing (SAI).
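The limit described above can be expressed as a small check. This is only a sketch: the helper names are hypothetical, and because the text doesn't say which limit applies to a database created on June 24, 2024 itself, the code assumes the lower limit for that day.

```python
from datetime import date

# Cutoff date from the collection limits described above.
CUTOFF = date(2024, 6, 24)


def collection_limit(created_on: date) -> int:
    """Return the maximum number of collections for a Serverless (Vector) database."""
    # Assumption: only databases created strictly after the cutoff get the higher limit.
    return 10 if created_on > CUTOFF else 5


def can_create_collection(existing_collections: int, created_on: date) -> bool:
    """Return True if the database is still below its collection limit."""
    return existing_collections < collection_limit(created_on)


print(collection_limit(date(2024, 7, 1)))  # 10
print(can_create_collection(5, date(2024, 1, 15)))  # False
```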

After you create a collection, load data into the collection.

Load and search data with vectorize

  1. Load vector data into your vectorize-integrated collection.

    When you load structured JSON or CSV data, the Vector Field specifies the field to use to generate embeddings with $vectorize.

    [Image: The Load Data dialog with the Vector Field dropdown expanded.]

  2. After loading data, you can perform a similarity search using text, rather than a vector.
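As a sketch of what these two steps look like through the Data API, the following builds request bodies that use the reserved $vectorize field: one to insert a document whose text is embedded by the integration, and one to run a similarity search from a text query. The helper names, the sample document, and the limit value are hypothetical. The bodies would be sent as POST requests to the collection's Data API URL, in the same way as the curl example for creating a collection.

```python
import json
from typing import Any, Dict


def insert_with_vectorize(text: str, **fields: Any) -> Dict[str, Any]:
    """Build an insertOne body; the text in $vectorize is embedded on insert."""
    document: Dict[str, Any] = dict(fields)
    document["$vectorize"] = text
    return {"insertOne": {"document": document}}


def search_by_text(query: str, limit: int = 5) -> Dict[str, Any]:
    """Build a find body that sorts by similarity to the query text."""
    return {"find": {"sort": {"$vectorize": query}, "options": {"limit": limit}}}


insert_body = insert_with_vectorize(
    "A lightweight tent for two people", category="camping"
)
search_body = search_by_text("shelter for backpacking", limit=3)

print(json.dumps(insert_body))
print(json.dumps(search_body))
```

In both cases you supply plain text, and Astra DB requests the embedding from OpenAI using the API key that the collection was created with.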

Manage scoped databases

For each API key, you select the databases that can use that API key. These are referred to as scoped databases.

To change the scoped databases for an existing OpenAI API key, do the following:

  1. In the Astra Portal navigation menu, click Integrations, and then select OpenAI Embedding provider.

  2. In the API keys section, expand each API key to show the list of scoped databases.

  3. Add or remove databases from each API key’s scope, as needed:

    • To remove a database from the API key’s scope, click Delete, enter the Database name, and then click Remove scope.

    • To add a database to the API key’s scope, click More, select Add database, select the Serverless (Vector) database that you want to add to the scope, and then click Add database.

Remove OpenAI API keys

Removing API keys immediately disables $vectorize embedding generation for any collections that used the removed API keys. Make sure the API key is not used by any active collections before you remove it.

Removing API keys from Astra DB Serverless does not delete them from your OpenAI account.
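One way to check whether collections in a database still reference an API key is to inspect the collection options returned by the Data API. The sketch below works on an already-parsed response and makes no network calls. It assumes the response shape of a findCollections command sent with "explain": true, where each entry carries the collection's name and options. Treat that shape, the `collections_using_key` helper, and the sample data as assumptions to verify against your own responses.

```python
from typing import Any, Dict, List


def collections_using_key(response: Dict[str, Any], api_key_name: str) -> List[str]:
    """Return the names of collections whose vectorize service uses the given key."""
    names: List[str] = []
    for entry in response.get("status", {}).get("collections", []):
        if not isinstance(entry, dict):
            # Entries without options (for example, plain names) can't be checked.
            continue
        service = entry.get("options", {}).get("vector", {}).get("service", {})
        if service.get("authentication", {}).get("providerKey") == api_key_name:
            names.append(entry["name"])
    return names


# Hypothetical response data, for illustration only.
sample_response = {
    "status": {
        "collections": [
            {
                "name": "reviews",
                "options": {
                    "vector": {
                        "metric": "cosine",
                        "service": {
                            "provider": "openai",
                            "modelName": "text-embedding-3-small",
                            "authentication": {"providerKey": "openai-prod"},
                        },
                    }
                },
            },
            {"name": "plain_docs", "options": {}},
        ]
    }
}

print(collections_using_key(sample_response, "openai-prod"))  # ['reviews']
```

Repeat the check for each database in the API key's scope before you remove the key.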

To remove API keys, do the following:

  1. In the Astra Portal navigation menu, click Integrations, and then select OpenAI Embedding provider.

  2. In the API keys section, locate the API key that you want to remove, click More, and then select Remove API key.

  3. In the confirmation dialog, enter the API key name, and then click Remove key.

  4. In your OpenAI account, delete the API key if you don’t plan to reuse it.

  5. If you no longer want to use this embedding provider or you are not rotating the API key, then you must recreate any collections that used the removed API key to generate embeddings. For more information, see Change providers or credentials.

Rotate OpenAI API keys

To rotate API keys, you must remove the API key, and then recreate it with the same name and scoped databases.

Removing the API key immediately disables $vectorize embedding generation for any collections that used that API key. Vectorize remains unavailable until you add the new API key to the OpenAI integration.

For more information, see Change providers or credentials.

To rotate API keys, do the following:

  1. In your OpenAI account, create a new API key.

  2. Remove the API key that you want to rotate. Make a note of the API key’s name and scoped databases. When you recreate the API key, it must have the exact same name and scope.

  3. In the Astra Portal navigation menu, click Integrations, and then select OpenAI Embedding provider.

  4. In the API keys section, add a new API key with the same name as the removed API key.

    If the name doesn’t match, any collections that used the removed API key can’t detect the replacement API key.

  5. Add all relevant databases to the new API key’s scoped databases.

    At minimum, you must add all databases that used the removed API key so that the collections in those databases can detect the replacement API key. To ensure that you don’t miss any databases, DataStax recommends adding all of the databases that were in the removed API key’s scope.
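The two requirements for a replacement API key, the same name and a scope that covers every database that used the removed key, can be written down as a simple check. The sketch below is a hypothetical record-keeping aid for the notes you take in step 2. It doesn't call Astra DB, and the `KeyRecord` type and its fields are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class KeyRecord:
    """Notes about an API key entry: its name in Astra DB and its scoped databases."""
    name: str
    scoped_databases: FrozenSet[str]


def rotation_problems(removed: KeyRecord, replacement: KeyRecord) -> List[str]:
    """Return reasons why collections might not detect the replacement key."""
    problems: List[str] = []
    if replacement.name != removed.name:
        problems.append(
            f"name mismatch: expected {removed.name!r}, got {replacement.name!r}"
        )
    missing = removed.scoped_databases - replacement.scoped_databases
    if missing:
        problems.append("missing scoped databases: " + ", ".join(sorted(missing)))
    return problems


old = KeyRecord("openai-prod", frozenset({"orders_db", "support_db"}))
new = KeyRecord("openai-prod", frozenset({"orders_db"}))
print(rotation_problems(old, new))  # ['missing scoped databases: support_db']
```

An empty list means the replacement matches the removed key's name and covers its scope. The replacement can also include additional databases.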

Remove the OpenAI integration from your organization

To remove the OpenAI embedding provider integration from your Astra organization, remove all existing OpenAI API keys, and then recreate any collections that used the integration to generate embeddings.

