Work with collections

Collections store documents in Serverless (Vector) databases. Collections are best for semi-structured data.

With the Data API clients, use the Database class to manage collections and the Collection class to work with the data in collections.

Serverless (Vector) databases created after June 24, 2024 can have approximately 10 collections. Databases created before this date can have approximately 5 collections. The collection limit is based on the number of indexes.

For more information about the Data API and clients, see Get started with the Data API.

Create a collection

Create a new collection in a Serverless (Vector) database.

The required and valid parameters depend on whether the collection will store vector data and your embedding generation method. For more information, see Manage collections and tables.

You can’t edit a collection’s parameters after you create the collection.

  • Python

  • TypeScript

  • Java

  • curl

The signature of this command changed in Python client version 2.0-preview.

If you are using client version 2.0-preview or later, see the description of this change in the Data API client upgrade guide.

For more information, see the Client reference.

Create a collection that is not vector-enabled:

collection = database.create_collection("COLLECTION_NAME")

Create a collection to store vector data and provide embeddings when you load data:

from astrapy.constants import VectorMetric

collection = database.create_collection(
    "COLLECTION_NAME",
    dimension=5,
    metric=VectorMetric.COSINE,
)

Create a new collection that generates vector embeddings automatically with vectorize.

To automatically generate embeddings, you must enable the corresponding embedding provider integration, add the embedding provider API key in the Astra KMS, and make sure your database can access the embedding provider service. You can use the Data API to find supported embedding providers and their configuration parameters.

As an alternative to Astra KMS authentication, you can do one of the following:

  • Use the Astra-hosted NVIDIA embedding provider integration, if your database meets the cloud provider and region requirements.

  • Use header authentication to manually provide the embedding provider credentials with every request that requires embedding generation, including loading data and vector search with vectorize. For more information, see Vector and vectorize and the explanation of the embedding_api_key parameter in this command’s Parameters.

from astrapy.info import CollectionVectorServiceOptions
from astrapy.constants import VectorMetric

collection = database.create_collection(
    "COLLECTION_NAME",
    metric=VectorMetric.DOT_PRODUCT,
    dimension=1536,
    service=CollectionVectorServiceOptions(
        provider="openai",
        model_name="text-embedding-3-small",
        authentication={
            "providerKey": "API_KEY_NAME",
        },
    ),
)

Create a new collection with default document IDs of type ObjectID:

from astrapy.constants import DefaultIdType

collection = database.create_collection(
    "COLLECTION_NAME",
    default_id_type=DefaultIdType.OBJECTID,
)

Create a new collection with selective indexing:

collection = database.create_collection(
    "COLLECTION_NAME",
    indexing={"allow": ["city", "country"]},
)

Parameters:

Name Type Summary

name

str

The name of the collection.

keyspace

Optional[str]

The keyspace where the collection is to be created. If not specified, the database’s working keyspace is used.

dimension

Optional[int]

For vector collections, the dimension of the vectors, which is the number of their components. If you’re not sure what dimension to set, use the dimension of the vectors that your embedding model produces.

metric

Optional[str]

The similarity metric used for vector searches. Allowed values are VectorMetric.DOT_PRODUCT, VectorMetric.EUCLIDEAN or VectorMetric.COSINE (default).

service

Optional[CollectionVectorServiceOptions]

The service definition for vector embeddings. Required for vector collections that generate embeddings automatically.

This is an instance of CollectionVectorServiceOptions, which defines the provider and model_name, and other optional settings, such as authentication. This parameter can also be a simple dictionary.

authentication is an object defining how to authenticate with the embedding provider. For example, {"providerKey": "API_KEY_NAME"}, where API_KEY_NAME is the name of your embedding provider key in the Astra DB KMS.

indexing

Optional[Dict[str, Any]]

Optional specification for selective indexing of the collection, in the form of a dictionary such as {"deny": [...]} or {"allow": [...]}.

default_id_type

Optional[str]

Set the default ID type that the API server will generate when inserting documents that don’t explicitly specify an _id field. Can be set to any of the values DefaultIdType.UUID, DefaultIdType.OBJECTID, DefaultIdType.UUIDV6, DefaultIdType.UUIDV7, DefaultIdType.DEFAULT.

additional_options

Optional[Dict[str, Any]]

Any further set of key-value pairs that will be added to the "options" part of the payload when sending the Data API command to create a collection.

max_time_ms

Optional[int]

A timeout, in milliseconds, for the underlying HTTP request.

embedding_api_key

Optional[str]

An alternative to authentication in CollectionVectorServiceOptions. Provide the API key directly instead of using an API key in the Astra DB KMS. The API key is passed to the Data API with each request in the form of an x-embedding-api-key HTTP header.

This parameter is not stored on the database, and it is used by the Collection instance only when issuing reads or writes on the collection.

This is useful for creating collections with an embedding service without specifying an authentication in the service configuration.

embedding_api_key overrides the Astra DB KMS API key if you set both.

collection_max_time_ms

Optional[int]

A default timeout, in milliseconds, for the duration of each operation on the collection. Individual timeouts can be provided to each collection method call and take precedence over this overall default. For methods that involve multiple API calls, such as delete_many and insert_many, provide a timeout long enough for the whole operation. This parameter is not stored on the database. It is used only by the Collection instance when issuing reads or writes on the collection.

Returns:

Collection - The created collection object that you can use to work with documents in the collection.

Example response
Collection(name="COLLECTION_NAME", keyspace="default_keyspace", database=Database(api_endpoint="ASTRA_DB_API_ENDPOINT", token="APPLICATION_TOKEN", keyspace="default_keyspace"))

Example:

from astrapy import DataAPIClient
import astrapy
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")

# Create a non-vector collection
collection_simple = database.create_collection("NON_VECTOR_COLLECTION_NAME")

# Create a vector collection
collection_vector = database.create_collection(
    "VECTOR_COLLECTION_NAME",
    dimension=3,
    metric=astrapy.constants.VectorMetric.COSINE,
)

# Create a collection with UUIDv6 as default IDs
from astrapy.constants import DefaultIdType, SortDocuments

collection_uuid6 = database.create_collection(
    "UUIDV6_COLLECTION_NAME",
    default_id_type=DefaultIdType.UUIDV6,
)

collection_uuid6.insert_one({"desc": "a document", "seq": 0})
collection_uuid6.insert_one({"_id": 123, "desc": "another", "seq": 1})
doc_ids = [
    doc["_id"]
    for doc in collection_uuid6.find({}, sort={"seq": SortDocuments.ASCENDING})
]
print(doc_ids)
#  Will print: [UUID('1eef29eb-d587-6779-adef-45b95ef13497'), 123]
print(doc_ids[0].version)
#  Will print: 6

For more information, see the Client reference.

Create a new collection that is not vector-enabled:

const collection = await db.createCollection('COLLECTION_NAME');

Create a new collection to store vector data.

const collection = await db.createCollection<Schema>('COLLECTION_NAME', {
  vector: {
    dimension: 5,
    metric: 'cosine',
  },
});

Create a new collection that generates vector embeddings automatically.

To automatically generate embeddings, you must enable the corresponding embedding provider integration, add the embedding provider API key in the Astra KMS, and make sure your database can access the embedding provider service. You can use the Data API to find supported embedding providers and their configuration parameters.

As an alternative to Astra KMS authentication, you can do one of the following:

  • Use the Astra-hosted NVIDIA embedding provider integration, if your database meets the cloud provider and region requirements.

  • Use header authentication to manually provide the embedding provider credentials with every request that requires embedding generation, including loading data and vector search with vectorize. For more information, see the explanation of the embeddingApiKey optional parameter in the Options table and Vector and vectorize.

const collection = await db.createCollection<Schema>('COLLECTION_NAME', {
  vector: {
    dimension: 1536,
    metric: 'dot_product',
    service: {
      provider: 'openai',
      modelName: 'text-embedding-3-small',
      authentication: {
        providerKey: 'API_KEY_NAME',
      },
    },
  },
});

A Collection is typed as Collection<Schema>, where Schema is the type of the documents in the collection. Operations on the collection are strongly typed if you provide a specific schema. If you don’t provide a type, operations remain largely weakly typed, which may be preferable for dynamic data access and operations. It’s your responsibility to ensure that the provided type truly represents the documents in the collection.

Parameters:

Name Type Summary

collectionName

string

The name of the collection to create.

vector?

CreateCollectionOptions<Schema>

The options for creating the collection.

  • dimension: The dimension for the vector in the collection.

  • metric: The similarity metric to use for vector search.

  • service.provider: The name of the embedding provider. Required for vector collections that generate embeddings automatically.

  • service.modelName: The model name for vector embeddings.

  • service.authentication: An object defining how to authenticate with the embedding provider. For example, {providerKey: 'API_KEY_NAME'}, where API_KEY_NAME is the name of your embedding provider key in the Astra DB KMS.

Options (CreateCollectionOptions<Schema>):

Name Type Summary

vector?

VectorOptions

The vector configuration for the collection, such as the vector dimension and similarity metric. If not set, the collection does not support vector search. If you’re not sure what dimension to set, use the dimension of the vectors that your embedding model produces.

indexing?

IndexingOptions<Schema>

The selective indexing configuration for the collection.

defaultId?

DefaultIdOptions

The defaultId configuration for the collection, for when a document does not specify an _id field.

keyspace?

string

Overrides the keyspace where the collection is created. If not set, the database’s working keyspace is used.

embeddingApiKey?

string

An alternative to service.authentication.providerKey for the embedding provider. Provide the API key directly instead of using an API key in the Astra DB KMS. embeddingApiKey overrides the Astra DB KMS API key if you set both.

defaultMaxTimeMS?

number

The default maxTimeMS for each operation on the Collection.

maxTimeMS?

number

Maximum time in milliseconds the client should wait for the operation to complete.

Returns:

Promise<Collection<Schema>> - A promise that resolves to the created collection object.

Example:

import { DataAPIClient, VectorDoc } from '@datastax/astra-db-ts';

// Get a new Db instance
const db = new DataAPIClient('TOKEN').db('API_ENDPOINT');

// Define the schema for the collection
interface User extends VectorDoc {
  name: string,
  age?: number,
}

(async function () {
  // Create a basic untyped non-vector collection
  const users1 = await db.createCollection('users');
  await users1.insertOne({ name: 'John' });

  // Typed collection with custom options in a non-default keyspace
  const users2 = await db.createCollection<User>('users', {
    keyspace: 'KEYSPACE_NAME',
    defaultId: {
      type: 'objectId',
    },
    vector: {
      dimension: 5,
      metric: 'cosine',
    },
  });
  await users2.insertOne({ name: 'John', $vector: [.12, .62, .87, .16, .72] });
})();


The signature of this command changed in Java client version 2.0-preview.

If you are using client version 2.0-preview or later, see the description of this change in the Data API client upgrade guide.

Create a collection to store vector data. For more information, see the Client reference.

Based on the collection parameters, you can provide embeddings when you load data or automatically generate embeddings with vectorize.

To automatically generate embeddings, you must enable the corresponding embedding provider integration, add the embedding provider API key in the Astra KMS, and make sure your database can access the embedding provider service. You can use the Data API to find supported embedding providers and their configuration parameters.

As an alternative to Astra KMS authentication, you can do one of the following:

  • Use the Astra-hosted NVIDIA embedding provider integration, if your database meets the cloud provider and region requirements.

  • Use header authentication to manually provide the embedding provider credentials with every request that requires embedding generation, including loading data and vector search with vectorize. For more information, see the explanation of the collectionOptions parameter in the Parameters table and Vector and vectorize.

// Given `db` Database object, create a new collection

// Create simple collection with given name.
Collection<Document> simple1 = db
  .createCollection(String collectionName);
Collection<MyBean> simple2 = db
  .createCollection(String collectionName, Class<MyBean> clazz);

// Create collections with vector options
Collection<Document> vector1 = db.createCollection(
  String collectionName,
  int dimension,
  SimilarityMetric metric);
Collection<MyBean> vector2 = db.createCollection(
  String collectionName,
  int dimension,
  SimilarityMetric metric,
  Class<MyBean> clazz);

// Full-fledged CollectionOptions with a builder
Collection<Document> full1 = db.createCollection(
   String collectionName,
   CollectionOptions collectionOptions);
Collection<MyBean> full2 = db.createCollection(
   String collectionName,
   CollectionOptions collectionOptions,
   Class<MyBean> clazz);

Parameters:

Name Type Summary

collectionName

String

The name of the collection.

dimension

int

The dimension for the vectors in the collection. If you’re not sure what dimension to set, use the dimension of the vectors that your embedding model produces.

metric

SimilarityMetric

The similarity metric to use for vector search: SimilarityMetric.COSINE (default), SimilarityMetric.DOT_PRODUCT, or SimilarityMetric.EUCLIDEAN.

collectionOptions

CollectionOptions

Fine-grained settings with vector, embedding provider, model name, authentication, selective indexing, and defaultId options.

clazz

Class<T>

The class of a custom bean to use for documents in the collection instead of the default Document type.

Example:

package com.datastax.astra.client.database;

import com.datastax.astra.client.Collection;
import com.datastax.astra.client.Database;
import com.datastax.astra.client.model.CollectionIdTypes;
import com.datastax.astra.client.model.CollectionOptions;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.SimilarityMetric;

public class CreateCollection {
  public static void main(String[] args) {

    Database db = new Database(
            System.getenv("ASTRA_DB_API_ENDPOINT"),
            System.getenv("ASTRA_DB_APPLICATION_TOKEN"));

    // Create a non-vector collection
    Collection<Document> simple1 = db.createCollection("col");

    // Default Id Collection
    Collection<Document> defaultId = db.createCollection("defaultId", CollectionOptions
            .builder()
            .defaultIdType(CollectionIdTypes.OBJECT_ID)
            .build());

    // -- Indexing
    Collection<Document> indexingDeny = db.createCollection("indexing1", CollectionOptions
              .builder()
              .indexingDeny("blob")
              .build());
    // Create a collection with indexing (allow) - cannot use allow and deny at the same time
    Collection<Document> indexingAllow = db.createCollection("allow1", CollectionOptions
            .builder()
            .indexingAllow("metadata")
            .build());

    // Vector
    Collection<Document> vector1 = db.createCollection("vector1", 14, SimilarityMetric.DOT_PRODUCT);

    // Create a vector collection
    Collection<Document> vector2 = db.createCollection("vector2", CollectionOptions
      .builder()
      .vectorDimension(1536)
      .vectorSimilarity(SimilarityMetric.EUCLIDEAN)
      .build());

    // Create a collection for the db
    Collection<Document> collection_vectorize_header = db.createCollection(
            "collection_vectorize_header",
            // Create collection with a Service in vectorize (No API KEY)
            CollectionOptions.builder()
                    .vectorDimension(1536)
                    .vectorSimilarity(SimilarityMetric.DOT_PRODUCT)
                    .vectorize("openai", "text-embedding-ada-002")
                    .build());

    // Create a collection for the db
    Collection<Document> collection_vectorize_shared_key = db.createCollection(
            "collection_vectorize_shared_key",
            // Create collection with a Service in vectorize (API key name stored in Astra KMS)
            CollectionOptions.builder()
                    .vectorDimension(1536)
                    .vectorSimilarity(SimilarityMetric.DOT_PRODUCT)
                    .vectorize("openai", "text-embedding-ada-002", "OPENAI_API_KEY" )
                    .build());



  }
}

Create a collection that isn’t vector-enabled:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "COLLECTION_NAME",
    "options": {}
  }
}' | jq

Create a vector-enabled collection where you plan to provide embeddings when you load data. This example also sets the defaultId type for documents loaded into the collection.

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "COLLECTION_NAME",
    "options": {
      "defaultId": {
        "type": "uuidv7"
      },
      "vector": {
        "dimension": 5,
        "metric": "cosine"
      }
    }
  }
}' | jq

Create a vector-enabled collection that automatically generates embeddings with vectorize.

To automatically generate embeddings, you must enable the corresponding embedding provider integration, add the embedding provider API key in the Astra KMS, and make sure your database can access the embedding provider service. You can use the Data API to find supported embedding providers and their configuration parameters.

As an alternative to Astra KMS authentication, you can do one of the following:

  • Use the Astra-hosted NVIDIA embedding provider integration, if your database meets the cloud provider and region requirements.

  • Use header authentication to manually provide the embedding provider credentials with every request that requires embedding generation, including loading data and vector search with vectorize. For more information, see the explanation for options.vector.service.authentication in the Parameters table and Vector and vectorize.

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "COLLECTION_NAME",
    "options": {
      "vector": {
        "dimension": 1536,
        "metric": "cosine",
        "service": {
          "provider": "openai",
          "modelName": "text-embedding-3-small",
          "authentication": {
            "providerKey": "ASTRA_KMS_API_KEY_NAME"
          }
        }
      }
    }
  }
}' | jq
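
The same request can also be assembled from Python with only the standard library. The following is a minimal sketch, not part of the official clients: the helper name build_create_collection_request and its arguments are hypothetical, the endpoint value is a placeholder with an assumed https scheme, and the request is built but never sent. The optional embedding_api_key argument illustrates where the x-embedding-api-key header would go if you use header authentication.

```python
import json
import urllib.request


def build_create_collection_request(endpoint, keyspace, token, name,
                                    options=None, embedding_api_key=None):
    """Build (but do not send) a createCollection request for the Data API.

    Hypothetical helper for illustration; it mirrors the curl examples.
    """
    headers = {
        "Token": token,
        "Content-Type": "application/json",
    }
    if embedding_api_key is not None:
        # Header authentication for the embedding provider.
        headers["x-embedding-api-key"] = embedding_api_key

    body = {"createCollection": {"name": name, "options": options or {}}}
    return urllib.request.Request(
        url=f"{endpoint}/api/json/v1/{keyspace}",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder values; replace them with your own before sending anything.
request = build_create_collection_request(
    "https://ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_KEYSPACE",
    "ASTRA_DB_APPLICATION_TOKEN",
    "COLLECTION_NAME",
    options={
        "vector": {
            "dimension": 1536,
            "metric": "cosine",
            "service": {
                "provider": "openai",
                "modelName": "text-embedding-3-small",
                "authentication": {"providerKey": "ASTRA_KMS_API_KEY_NAME"},
            },
        }
    },
)
print(request.get_method(), request.full_url)
```

To send the request, you could pass it to urllib.request.urlopen once the placeholders are replaced with real values.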

Parameters:

Name Type Summary

createCollection

command

The Data API command to create a collection in a Serverless (Vector) database. It acts as a container for all the attributes and settings required to create the collection.

name

string

The name of the new collection. This must be unique within the database specified in the request URL.

options.defaultId

object

(Optional) Controls how the Data API allocates an _id for each document that doesn’t specify an ID value in the request. For backwards compatibility with Data API releases before version 1.0.3, if you omit the defaultId option on createCollection, a document’s _id value is a plain string representation of a version 4 random UUID.

options.defaultId.type

string

If you include defaultId, you must set type to one of objectId, uuidv7, uuidv6, or uuid.

options.vector

object

(Optional, recommended) Creates a vector-enabled collection.

Vector-enabled collections can store either vector or non-vector data. Collections that aren’t vector-enabled can’t store vector data.

options.vector.dimension

int

The dimension for vector embeddings in the collection. If you’re not sure what dimension to set, use the dimension of the vectors that your embedding model produces. The dimension can be optional for vectorize if the specified vector.service.modelName has a default dimension value. For more information, see the documentation for your embedding provider integration.

options.vector.metric

string

The similarity metric to use for vector search: cosine (default), dot_product, or euclidean.

options.vector.service

object

(Optional) Configure a vectorize embedding provider integration.

options.vector.service.provider

string

The vectorize embedding provider name.

options.vector.service.modelName

string

A valid model name for the specified vectorize embedding provider.

options.vector.service.authentication

object

Use credentials stored in Astra DB KMS to authenticate with your vectorize embedding provider. In options.vector.service.authentication.providerKey, provide the credential’s API Key name as given in Astra DB KMS.

Alternatively, you can omit the authentication object, and then provide the authentication key in an x-embedding-api-key header instead. If you use header authentication, you must provide the x-embedding-api-key header with every command that requires vectorize for this collection, including loading data and vector search with vectorize.

options.vector.service.parameters

object

Your embedding provider might require additional parameters. Use findEmbeddingProviders or see the documentation for your embedding provider integration.

options.indexing

object

(Optional) Enable selective indexing for data loaded to the collection. If you specify indexing, you must also specify either an allow or deny clause.

options.indexing.allow

array

Either allow or deny is required if you specify indexing. Provide an array of one or more properties to index. Alternatively, you can enter a wildcard "allow": ["*"] to index all properties during an update operation. This is the same as the default behavior if you omit indexing.

options.indexing.deny

array

Either allow or deny is required if you specify indexing. Provide an array of one or more properties to not index. If you enter a wildcard "deny": ["*"], then no properties are indexed during an update operation.

Returns:

A well-formed request returns 200 OK.

Example response
{
  "status": {
    "ok": 1
  }
}
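
Because a collection’s parameters can’t be edited after creation, it can help to check the options object before you send it. The following is a minimal client-side sketch based only on the allowed values listed in the Parameters table. The function name check_collection_options is hypothetical, and the Data API performs its own validation regardless.

```python
ALLOWED_METRICS = ("cosine", "dot_product", "euclidean")
ALLOWED_DEFAULT_ID_TYPES = ("objectId", "uuidv6", "uuidv7", "uuid")


def check_collection_options(options):
    """Return a list of problems found in a createCollection options object.

    Hypothetical helper; it covers only defaultId and vector settings.
    """
    problems = []

    default_id = options.get("defaultId")
    if default_id is not None:
        id_type = default_id.get("type")
        # The ID type names are case-sensitive.
        if id_type not in ALLOWED_DEFAULT_ID_TYPES:
            problems.append(f"unsupported defaultId.type: {id_type!r}")

    vector = options.get("vector")
    if vector is not None:
        # dimension may be omitted for vectorize if the model has a default.
        dimension = vector.get("dimension")
        if dimension is not None:
            is_int = isinstance(dimension, int) and not isinstance(dimension, bool)
            if not is_int or dimension < 1:
                problems.append(f"invalid vector.dimension: {dimension!r}")
        # cosine is the default metric when none is given.
        metric = vector.get("metric", "cosine")
        if metric not in ALLOWED_METRICS:
            problems.append(f"unsupported vector.metric: {metric!r}")

    return problems


options = {
    "defaultId": {"type": "uuidv7"},
    "vector": {"dimension": 5, "metric": "cosine"},
}
print(check_collection_options(options))
```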

The defaultId option

After you create a collection, you can’t change the defaultId option.

The defaultId option controls how the Data API allocates an _id for any document that doesn’t otherwise specify an _id value when added to a collection.

If you omit the defaultId option on createCollection, the default type is uuid. This means that the server generates a random stringified UUIDv4 as the _id for any document without an explicit _id field. This enables backwards compatibility with Data API versions 1.0.2 and earlier.

If you include the defaultId option with createCollection, you must specify one of the following case-sensitive ID types:

  • objectId: Each document’s generated _id is an objectId.

  • uuidv6: Each document’s generated _id is a version 6 UUID. This is field-compatible with version 1 time UUIDs, and it supports lexicographical sorting.

  • uuidv7: Each document’s generated _id is a version 7 UUID. This is designed as a replacement for version 1 time UUIDs, and it is recommended for use in new systems.

  • uuid: Each document’s generated _id is a version 4 random UUID. This type is analogous to the uuid type and functions in Apache Cassandra®.
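
The ID types above can be told apart by the shape of the generated value. The following sketch uses only the Python standard library to guess which type a given _id string matches. The helper name guess_default_id_type is hypothetical, and it assumes that an objectId is rendered as 24 hexadecimal characters.

```python
import re
import uuid

# Assumption: an objectId is rendered as 24 hexadecimal characters.
_OBJECT_ID = re.compile(r"[0-9a-fA-F]{24}")

_UUID_VERSION_TO_TYPE = {4: "uuid", 6: "uuidv6", 7: "uuidv7"}


def guess_default_id_type(value):
    """Guess which defaultId type could have produced an _id string.

    Hypothetical helper. Returns "objectId", "uuid", "uuidv6", "uuidv7",
    or None if the value matches none of them.
    """
    if not isinstance(value, str):
        return None
    if _OBJECT_ID.fullmatch(value):
        return "objectId"
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return None
    return _UUID_VERSION_TO_TYPE.get(parsed.version)


print(guess_default_id_type("57f00cf47958af95dca29c0c"))
print(guess_default_id_type("1eef29eb-d587-6779-adef-45b95ef13497"))
print(guess_default_id_type(str(uuid.uuid4())))
```

A match only shows that the value has the right shape. It doesn’t prove that the server generated it, because clients can supply explicit _id values of any type.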

Example createCollection with defaultId

This example creates a vector-enabled collection with the default ID type set to objectId:

{
  "createCollection": {
    "name": "some_collection2",
    "options": {
      "defaultId": {
        "type": "objectId"
      },
      "vector": {
        "dimension": 1024,
        "metric": "cosine"
      }
    }
  }
}

When you use a command such as insertOne or insertMany to add documents to a collection, you don’t need to include an _id value in the request. Instead, the server generates a unique identifier for each document based on the collection’s default ID type. However, if you provide an explicit _id value, then the server uses this value instead of generating an ID. For more information about specifying document identifiers, see Work with document IDs.

Client apps can detect the use of $objectId or $uuid in the response document, and then return to the caller the built-in objects representing those types. In this way, client apps can use generated IDs in methods based on Data API operations like findOneAndUpdate, updateOne, and updateMany.

Example client usage

For example, in Python, the client can specify the detected value for a document’s $objectId or $uuid:

# API Response with $objectId
{
    "_id": {"$objectId": "57f00cf47958af95dca29c0c"},
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}

# Client returns Dict from collection.find_one()
my_doc = {
    "_id": astrapy.ids.ObjectId("57f00cf47958af95dca29c0c"),
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}

# API Response with $uuid
{
    "_id": {"$uuid": "ffd1196e-d770-11ee-bc0e-4ec105f276b8"},
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}

# Client returns Dict from collection.find_one()
my_doc = {
    "_id": astrapy.ids.UUID("ffd1196e-d770-11ee-bc0e-4ec105f276b8"),
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}
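
The conversion shown above can be sketched with the standard library alone. This is an illustration of the idea, not the astrapy implementation: ObjectIdValue is a hypothetical stand-in for a client’s ObjectId type, and decode_id and decode_document are hypothetical helper names.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectIdValue:
    """Hypothetical stand-in for a client's ObjectId type."""
    hex: str


def decode_id(raw):
    """Convert a wrapped _id from an API response into a Python object."""
    if isinstance(raw, dict):
        if "$objectId" in raw:
            return ObjectIdValue(raw["$objectId"])
        if "$uuid" in raw:
            return uuid.UUID(raw["$uuid"])
    # Plain strings, numbers, and other values pass through unchanged.
    return raw


def decode_document(document):
    """Return a copy of the document with its _id decoded, if present."""
    decoded = dict(document)
    if "_id" in decoded:
        decoded["_id"] = decode_id(decoded["_id"])
    return decoded


response_doc = {
    "_id": {"$uuid": "ffd1196e-d770-11ee-bc0e-4ec105f276b8"},
    "summary": "Retrieval-Augmented Generation is the process of ...",
}
print(decode_document(response_doc)["_id"])
```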

There are advantages to using generated document IDs instead of manual document IDs. For example, the advantages of generated UUIDv7 document IDs include the following:

  • Uniqueness across the database: A generated _id value is designed to be globally unique across the entire database. For UUIDv7, this uniqueness comes from combining a millisecond timestamp with random bits. Explicitly numbering documents might lead to clashes unless carefully managed, especially in distributed systems.

  • Automatic generation: The _id values are automatically generated by Astra DB Serverless. This means you won’t have to worry about creating and maintaining a unique ID system, reducing the complexity of the code and the risk of errors.

  • Timestamp information: A generated _id value includes a timestamp as its first component, representing the document’s creation time. This can be useful for tracking when a document was created without needing an additional field. In particular, type uuidv7 values provide a high degree of granularity (milliseconds) in timestamps.

  • Avoids manual sequence management: Managing sequential numeric IDs manually can be challenging, especially in environments with high concurrency or distributed systems. There’s a risk of ID collision or the need to lock tables or sequences to generate a new ID, which can affect performance. Generated _id values are designed to handle these issues automatically.
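
As an illustration of the timestamp point above, the creation time can be read back from a version 7 UUID because its most significant 48 bits hold the Unix time in milliseconds. The following is a minimal sketch with hypothetical helper names. The sample value is the UUIDv7 example from RFC 9562, not an ID generated by Astra DB.

```python
import uuid
from datetime import datetime, timezone


def uuid7_timestamp_ms(value):
    """Return the Unix timestamp, in milliseconds, embedded in a UUIDv7."""
    parsed = uuid.UUID(str(value))
    if parsed.version != 7:
        raise ValueError("not a version 7 UUID")
    # The most significant 48 of the 128 bits hold Unix time in milliseconds.
    return parsed.int >> 80


def uuid7_created_at(value):
    """Return the embedded timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(uuid7_timestamp_ms(value) / 1000, tz=timezone.utc)


# Sample UUIDv7 taken from RFC 9562 (not generated by Astra DB).
sample = "017f22e2-79b0-7cc3-98c4-dc0c0c07398f"
print(uuid7_timestamp_ms(sample))
print(uuid7_created_at(sample).isoformat())
```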

While numeric _id values might be simpler and more human-readable, the benefits of generated _id values make them a superior choice for most applications, especially those that have many documents.

The indexing option

By default, when you add or modify data within a collection, all properties in the added or modified documents are indexed. If you don’t want to index all properties, you can use the Data API to configure selective indexing.

Selective indexing is not recommended for all collections. Consider the advantages and disadvantages of selective indexing before applying it to any collection. DataStax recommends that you test your application in a development environment before applying selective indexing in production.

Indexes enable Data API queries that need to filter or sort data based on indexed properties.

There are index limits for collections and databases, and the index limit determines the collection limit. However, do not use selective indexing solely to bypass the collection limit. In most cases, selective indexing does not change a database’s collection limit because of the minimum required indexes for collections in Serverless (Vector) databases.

Carefully consider the advantages and disadvantages of selective indexing before applying it to your collections.

Considerations for selective indexing

The primary disadvantage of selective indexing is that sort and filter clauses can only use indexed fields. This means that you can’t perform these types of queries on fields that you do not index.

Non-indexed field error

The Data API returns an error if you attempt to sort or filter by a non-indexed property. For example:

UNINDEXED_FILTER_PATH("Unindexed filter path: The filter path ('*FILTER*') is not indexed")

UNINDEXED_SORT_PATH("Unindexed sort path")

ID_NOT_INDEXED("_id is not indexed")

If you apply selective indexing to a collection, consider which properties might be important in queries that rely on sort and filter clauses, and make sure that you index those fields.
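
One way to review a planned configuration is to list the fields that your queries sort or filter on and check them against the indexing clause before you create the collection. The following sketch is a hypothetical client-side check, not Data API behavior. It assumes that a listed property also covers its sub-properties (for example, that metadata covers metadata.author) and that the * wildcard matches every property.

```python
def is_indexed(path, indexing=None):
    """Return True if a field path is covered by an indexing configuration.

    Hypothetical helper. Assumes a listed property also covers its
    sub-properties and that "*" matches every property.
    """
    if not indexing:
        # Default behavior: all properties are indexed.
        return True

    def covers(entries):
        return any(
            entry == "*" or path == entry or path.startswith(entry + ".")
            for entry in entries
        )

    if "allow" in indexing:
        return covers(indexing["allow"])
    # With a deny list, everything except the listed properties is indexed.
    return not covers(indexing.get("deny", []))


def unindexed_paths(paths, indexing=None):
    """Return the paths that a sort or filter clause could not use."""
    return [path for path in paths if not is_indexed(path, indexing)]


indexing = {"allow": ["city", "country"]}
print(unindexed_paths(["city", "population"], indexing))
```

An empty result suggests that the planned sort and filter fields are covered. A non-empty result lists fields to add to the allow array or remove from the deny array.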

Potential advantages to selective indexing include the following:

  • Read/write performance: Selective indexing can increase write-time performance by reducing the amount of content that needs to be indexed. If certain properties are irrelevant to your application, you can save time by not indexing them.

  • Data capacity: Indexed properties are bound by lower maximum size limits to ensure efficient and performant read operations through the index. By comparison, non-indexed properties can support larger quantities of data, such as the body content of blog posts.

These outcomes are not guaranteed. The results of selective indexing depend on the specific characteristics and use of your applications and data.

DataStax recommends testing your application’s performance, under average and peak demand, in a non-production environment before deploying selective indexing to production. Make adjustments as necessary to optimize your application’s performance.

Configure indexing

You set the indexing behavior when you create a collection. The configuration applies to all data that you load into the collection.

Collections that you create directly in the Astra Portal use default indexing and index all fields. You can’t change the indexing behavior for these collections.

Collections that you create with the Data API can use the optional indexing clause on the createCollection command to set the indexing behavior.

To maintain the default behavior and index all properties, omit the indexing clause from createCollection.

To apply selective indexing, include the indexing clause and either an allow or deny array that determines the fields to index.

If you apply selective indexing, make sure that your indexed fields support your application’s needs and query requirements.

Evaluate the value of each property in your collection’s documents before you create your collection and decide which fields to index.

  • Allow array

  • Deny array

To use the allow array in the indexing clause, specify the fields that you want to index.

For example, the following curl command creates a collection where the index includes only the values of the property1 and property2 fields:

curl -sS -L -X POST ${ASTRA_DB_API_ENDPOINT}/api/json/v1/${ASTRA_DB_KEYSPACE} \
--header "Token: ${ASTRA_DB_APPLICATION_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "some_collection",
    "options": {
      "vector": {
        "dimension": 5,
        "metric": "cosine"
      },
      "indexing": {
        "allow": [
          "property1",
          "property2"
        ]
      }
    }
  }
}' | jq

If you add data to the collection that includes additional properties that weren’t present when you first created the collection, the index remains limited to property1 and property2.

When you use the allow array for selective indexing, subsequent Data API queries can sort and filter only on property1, property2, or both. Attempting to sort or filter on any other field returns an error.

If you use a wildcard (*) for the allow array, all properties are indexed. This is equivalent to the default indexing behavior.

{
  "indexing": {
    "allow": [ "*" ]
  }
}
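Because sort and filter clauses on non-indexed fields return an error, an application might compare the fields in a query against the allow array before sending the request. The following is a minimal client-side sketch, not a description of how the Data API evaluates queries. The function name unindexed_fields is hypothetical, it compares top-level field names as exact matches only, and it doesn’t model sub-properties or logical operators.

```python
from typing import Iterable, List


def unindexed_fields(query_fields: Iterable[str], allow: Iterable[str]) -> List[str]:
    """Return the query fields that an allow array doesn't cover.

    Hypothetical pre-check: exact, top-level name matching only.
    """
    allowed = set(allow)
    # A wildcard allow array indexes all properties.
    if "*" in allowed:
        return []
    return [field for field in query_fields if field not in allowed]


# Allow array from the createCollection example above.
allow = ["property1", "property2"]

# Example filter that touches one indexed and one non-indexed field.
filter_clause = {"property1": "value", "property3": 10}

missing = unindexed_fields(filter_clause.keys(), allow)
if missing:
    print(f"Not indexed, so the query would return an error: {missing}")
```

A check like this only catches mistakes early in application code. The Data API remains the authority on which paths are indexed.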

To use the deny array in the indexing clause, specify the fields that you do not want to index.

For example, the following curl command creates a collection where the index includes the values of all fields except property1, property3, property5.prop5b, and any sub-properties of property1 and property3:

curl -sS -L -X POST ${ASTRA_DB_API_ENDPOINT}/api/json/v1/${ASTRA_DB_KEYSPACE} \
--header "Token: ${ASTRA_DB_APPLICATION_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "some_collection",
    "options": {
      "vector": {
        "dimension": 5,
        "metric": "cosine"
      },
      "indexing": {
        "deny": [
          "property1",
          "property3",
          "property5.prop5b"
        ]
      }
    }
  }
}' | jq

If a property in the deny array has any sub-properties, those sub-properties are also inherently excluded from indexing. For example, if property3 has two sub-properties (property3.prop3a and property3.prop3b), those sub-properties are also excluded from indexing because the deny array includes only the parent property3.

If you want to exclude a parent property and some of its sub-properties, you must specify both the parent and the specific sub-properties that you want to exclude. For example, if you deny property3 and property3.prop3a, then property3.prop3b is still indexed.

To exclude specific sub-properties, but not the parent, you must specify those sub-properties in the deny array, as was done for property5.prop5b.

Furthermore, if you add data to the collection that includes additional properties or sub-properties that weren’t present when you first created the collection, those new properties are indexed if they are not named in the deny array, either explicitly or by inheritance.

When you use the deny array for selective indexing, subsequent Data API queries can sort and filter on any field except the denied (non-indexed) fields. Attempting to sort or filter on denied fields returns an error.

If you use a wildcard (*) for the deny array, no properties are indexed, not even $vector. However, the collection can still create a small number of indexes for minimal functionality.

{
  "indexing": {
    "deny": [ "*" ]
  }
}
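The configuration rules in this section (omit the indexing clause for default indexing, or supply either an allow array or a deny array) can be sketched as a small helper that builds the createCollection request body. This is an illustration that uses only the Python standard library. The function name build_create_collection_payload and its arguments are hypothetical and are not part of any Data API client. The vector options mirror the curl examples above.

```python
import json
from typing import Optional, Sequence


def build_create_collection_payload(
    name: str,
    dimension: int,
    metric: str,
    allow: Optional[Sequence[str]] = None,
    deny: Optional[Sequence[str]] = None,
) -> dict:
    """Build a createCollection body (hypothetical helper, not a client API)."""
    # The indexing clause takes either an allow array or a deny array.
    if allow is not None and deny is not None:
        raise ValueError("Specify either 'allow' or 'deny', not both.")

    options: dict = {"vector": {"dimension": dimension, "metric": metric}}
    if allow is not None:
        options["indexing"] = {"allow": list(allow)}
    elif deny is not None:
        options["indexing"] = {"deny": list(deny)}
    # With neither array, the indexing clause is omitted (default indexing).

    return {"createCollection": {"name": name, "options": options}}


payload = build_create_collection_payload(
    "some_collection",
    dimension=5,
    metric="cosine",
    deny=["property1", "property3", "property5.prop5b"],
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed as the --data value of a curl request like the ones shown above.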

Get a collection object

Get a reference to an existing collection for use with the Data API clients.

This command returns a Collection object even for collections that don’t exist. Make sure the collection exists before running this command because the command doesn’t check for you.

  • Python

  • TypeScript

  • Java

  • curl

For more information, see the Client reference.

collection = database.get_collection("COLLECTION_NAME")

The example above is equivalent to these two alternate notations:

collection1 = database["COLLECTION_NAME"]
collection2 = database.COLLECTION_NAME

Most astrapy objects have an asynchronous counterpart, for use within the asyncio framework. To get an AsyncCollection, use the get_collection method of instances of AsyncDatabase, or alternatively the to_async method of the synchronous Collection class.

See the AsyncCollection Client reference for details about the async API.

Parameters:

Name Type Summary

name

str

The name of the collection.

keyspace

Optional[str]

The keyspace containing the collection. If no keyspace is specified, the general setting for this database is used.

embedding_api_key

Optional[str]

An optional API key that is passed to the Data API with each request in the form of an x-embedding-api-key HTTP header.

If you instantiated the collection with embedding_api_key or specified authentication in the service configuration, then the client uses that key. You can use this optional parameter to pass a different key, if needed.

collection_max_time_ms

Optional[int]

A default timeout, in milliseconds, for the duration of each operation on the collection. Individual timeouts can be provided to each collection method call and will take precedence, with this value being an overall default. Note that for some methods involving multiple API calls (such as delete_many and insert_many), you should provide a timeout with sufficient duration for the operation you’re performing.

Returns:

Collection - An instance of the Collection class corresponding to the specified collection name.

Example response
Collection(name="COLLECTION_NAME", keyspace="default_keyspace", database=Database(api_endpoint="ASTRA_DB_API_ENDPOINT", token="APPLICATION_TOKEN", keyspace="default_keyspace"))

Example:

from astrapy import DataAPIClient
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")

collection = database.get_collection("COLLECTION_NAME")
collection.count_documents({}, upper_bound=100)  # will print e.g.: 41
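Because get_collection doesn’t verify that the collection exists, you might want to check the collection name first. The following sketch shows one possible approach. The helper name get_existing_collection is hypothetical, and it assumes only that the database object provides the list_collection_names and get_collection methods described in this guide.

```python
from typing import Any


def get_existing_collection(database: Any, name: str) -> Any:
    """Return a collection reference only if the name is listed in the database.

    Hypothetical helper: `database` is any object that provides
    list_collection_names() and get_collection().
    """
    # list_collection_names() inspects the database's working keyspace by default.
    if name not in database.list_collection_names():
        raise LookupError(f"Collection '{name}' was not found.")
    return database.get_collection(name)
```

For example, get_existing_collection(database, "COLLECTION_NAME") raises LookupError instead of returning a reference to a collection that doesn’t exist. The check costs one additional request, so it is best suited to startup code rather than hot paths.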

For more information, see the Client reference.

const collection = db.collection('COLLECTION_NAME');

A Collection is typed as Collection<Schema>, where Schema is the type of the documents in the collection. Operations on the collection are strongly typed if you provide a specific schema. Otherwise, they remain largely weakly typed, which may be preferable for dynamic data access and operations. It’s up to you to ensure that the provided type truly represents the documents in the collection.

Parameters:

Name Type Summary

collectionName

string

The name of the collection to get.

options?

CollectionSpawnOptions

The options for spawning the pre-existing collection.

Name Type Summary

embeddingApiKey?

string

An alternative to service.authentication.providerKey for the embedding provider. Provide the API key directly instead of using an API key in the Astra DB KMS. embeddingApiKey overrides the Astra DB KMS API key if you set both.

defaultMaxTimeMS?

number

The default maxTimeMS for each operation on the Collection.

keyspace?

string

Overrides the keyspace that contains the collection. If not set, the database’s working keyspace is used.

Returns:

Collection<Schema> - An unverified reference to the collection.

Example:

import { DataAPIClient } from '@datastax/astra-db-ts';

// Get a new Db instance
const db = new DataAPIClient('TOKEN').db('API_ENDPOINT');

// Define the schema for the collection
interface User {
  name: string,
  age?: number,
}

(async function () {
  // Basic untyped collection
  const users1 = db.collection('users');
  await users1.insertOne({ name: 'John' });

  // Typed collection from different keyspace with a specific embedding API key
  const users2 = db.collection<User>('users', {
    keyspace: 'KEYSPACE_NAME',
    embeddingApiKey: 'EMBEDDINGS_API_KEY',
  });
  await users2.insertOne({ name: 'John' });
})();

For more information, see the Client reference.

// Given `db` Database object, get a collection
Collection<Document> collection = db.getCollection("COLLECTION_NAME");

// Gather collection information
CollectionOptions options = collection.getOptions();

Returns:

CollectionOptions - The collection’s metadata, including the defaultId, vector, and indexing options.

Example:

package com.datastax.astra.client.database;

import com.datastax.astra.client.Collection;
import com.datastax.astra.client.Database;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.CollectionOptions;

public class FindCollection {
  public static void main(String[] args) {
    Database db = new Database("TOKEN", "API_ENDPOINT");

    // Find a collection
    Collection<Document> collection = db.getCollection("collection_vector1");

    // Gather collection information
    CollectionOptions options = collection.getOptions();

    // Check if a collection exists
    boolean collectionExists = db.getCollection("collection_vector2").exists();
  }
}

This operation is not required with HTTP because you specify the target collection in the path, if required.

To get information about collections in a database, see List collection metadata.

List collection metadata

Get information about the collections in a specific keyspace.

For the clients, this operation retrieves an iterable object over collections. Unless otherwise specified, this implementation refers to the collections in the database’s working keyspace.

  • Python

  • TypeScript

  • Java

  • curl

  • CLI

For more information, see the Client reference.

collection_iterable = database.list_collections()

Parameters:

Name Type Summary

keyspace

Optional[str]

The keyspace to be inspected. If not specified, the database’s working keyspace is used.

max_time_ms

Optional[int]

A timeout, in milliseconds, for the underlying HTTP request.

Returns:

CommandCursor[CollectionDescriptor] - An iterable over CollectionDescriptor objects.

Example response

For clarity, this example is limited to a single collection descriptor from the cursor, and it is reformatted with indentation.

[
    ...,
    CollectionDescriptor(
        name='my_collection',
        options=CollectionOptions(
            vector=CollectionVectorOptions(
                dimension=3,
                metric='dot_product'
            ),
            indexing={'allow': ['field']}
        )
    ),
    ...,
]

Example:

from astrapy import DataAPIClient
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")

coll_cursor = database.list_collections()
coll_cursor  # this looks like: CommandCursor("https://....astra.datastax.com", alive)
list(coll_cursor)  # [CollectionDescriptor(name='my_v_col', ...), ...]
for coll_desc in database.list_collections():
    print(coll_desc)
# will print:
#   CollectionDescriptor(name='my_v_col', options=CollectionOptions(vector=CollectionVectorOptions(dimension=3, metric='dot_product', service=None), raw_options=...), raw_descriptor=...)
#   ...
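The descriptors returned by list_collections expose each collection’s name and options, as shown in the printed output above. As a sketch, the following helper picks out the vector-enabled collections. The function name vector_collection_names is hypothetical, and the code assumes that options.vector is None for collections that are not vector-enabled.

```python
from typing import Any, Iterable, List


def vector_collection_names(descriptors: Iterable[Any]) -> List[str]:
    """Return the names of descriptors whose options include a vector configuration.

    Hypothetical helper. Assumes each descriptor has `name` and `options`
    attributes, and that `options.vector` is None when no vector is configured.
    """
    names = []
    for descriptor in descriptors:
        options = getattr(descriptor, "options", None)
        if getattr(options, "vector", None) is not None:
            names.append(descriptor.name)
    return names
```

With a Database object, a call such as vector_collection_names(database.list_collections()) consumes the cursor once and returns a plain list of names.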

For more information, see the Client reference.

const collections = await db.listCollections();

Parameters:

Name Type Summary

options

ListCollectionsOptions

Options regarding listing collections.

Name Type Summary

nameOnly?

false

If true, only the name of each collection is returned. If false or omitted, the full information for each collection is returned.

keyspace?

string

The keyspace to be inspected. If not specified, the database’s working keyspace is used.

maxTimeMS?

number

Maximum time in milliseconds the client should wait for the operation to complete.

Returns:

Promise<FullCollectionInfo[]> - A promise that resolves to an array of full collection information objects.

Example:

import { DataAPIClient } from '@datastax/astra-db-ts';

// Get a new Db instance
const db = new DataAPIClient('TOKEN').db('API_ENDPOINT');

(async function () {
  // Gets full info about all collections in db
  const collections = await db.listCollections();

  for (const collection of collections) {
    console.log(`Collection '${collection.name}' has default ID type '${collection.options.defaultId?.type}'`);
  }
})();

For more information, see the Client reference.

// Given `db` Database object, list all collections
Stream<CollectionInfo> collections = db.listCollections();

Returns:

Stream<CollectionInfo> - The definition elements of collections.

Example:

package com.datastax.astra.client.database;

import com.datastax.astra.client.Database;
import com.datastax.astra.client.model.CollectionInfo;

import java.util.stream.Stream;

public class ListCollections {
    public static void main(String[] args) {
        Database db = new Database("TOKEN", "API_ENDPOINT");

        // Get collection Names
        Stream<String> collectionNames = db.listCollectionNames();

        // Get Collection information (with options)
        Stream<CollectionInfo> collections = db.listCollections();
        collections.map(CollectionInfo::getOptions).forEach(System.out::println);
    }
}

Get an overview of collections in the specified database and keyspace that are available for query, insert, and other database commands:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "findCollections": {
    "options": {
      "explain": true
    }
  }
}' | jq

Parameters:

Name Type Summary

findCollections

command

The Data API command to find all collections in the specified database and keyspace. It acts as a container for all the attributes and settings required to find collections.

options.explain

boolean

If true, the response includes collection names and a brief explanation of metadata for each collection, such as vector, dimension, metric, defaultId, and indexing. If false or unset, the response includes only collection names.

Returns:

A successful request returns the collection details.

Example response

This example response contains information for one collection:

{
  "status" : {
    "collections" : [
      {
        "name" : "vector_collection",
        "options" : {
          "defaultId": {
            "type": "objectId"
          },
          "vector" : {
            "dimension" : 5,
            "metric" : "cosine"
          }
        }
      }
    ]
  }
}
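To work with this response in application code, you can parse the JSON and read the entries under status.collections. The following sketch uses only the Python standard library and the example response above. The function name summarize_collections is hypothetical, and the code assumes the request used "explain": true so that each entry is an object with name and options.

```python
import json

# Example response body copied from the findCollections example above.
response_text = """
{
  "status": {
    "collections": [
      {
        "name": "vector_collection",
        "options": {
          "defaultId": {"type": "objectId"},
          "vector": {"dimension": 5, "metric": "cosine"}
        }
      }
    ]
  }
}
"""


def summarize_collections(body: dict) -> list:
    """Return (name, dimension, metric) tuples from a findCollections response.

    Hypothetical helper. Collections without vector options yield None
    for dimension and metric.
    """
    summary = []
    for entry in body["status"]["collections"]:
        vector = entry.get("options", {}).get("vector") or {}
        summary.append((entry["name"], vector.get("dimension"), vector.get("metric")))
    return summary


print(summarize_collections(json.loads(response_text)))
```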

To list all collections in a database, use the following command:

astra db list-collections DATABASE_NAME

Parameters:

Name Type Summary

db_name

String

The name of the database

Result:

+---------------------+-----------+-------------+
| Name                | Dimension | Metric      |
+---------------------+-----------+-------------+
| collection_simple   |           |             |
| collection_vector   | 14        | cosine      |
| msp                 | 1536      | dot_product |
+---------------------+-----------+-------------+

List collection names

Get the names of the collections in a specific keyspace as a list of strings.

For the clients, unless otherwise specified, this implementation refers to the collections in the database’s working keyspace.

  • Python

  • TypeScript

  • Java

  • curl

  • CLI

For more information, see the Client reference.

database.list_collection_names()

Get the names of the collections in a specified keyspace of the database.

database.list_collection_names(keyspace="KEYSPACE_NAME")

Parameters:

Name Type Summary

keyspace

Optional[str]

The keyspace to be inspected. If not specified, the database’s working keyspace is used.

max_time_ms

Optional[int]

A timeout, in milliseconds, for the underlying HTTP request.

Returns:

List[str] - A list of the collection names, in no particular order.

Example response
['a_collection', 'another_col']

Example:

from astrapy import DataAPIClient
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")

database.list_collection_names()
# ['a_collection', 'another_col']

For more information, see the Client reference.

const collectionNames = await db.listCollections({ nameOnly: true });

Get the names of the collections in a specified keyspace of the database.

const collectionNames = await db.listCollections({ nameOnly: true, keyspace: 'KEYSPACE_NAME' });

Parameters:

Name Type Summary

options

ListCollectionsOptions

Options regarding listing collections.

Name Type Summary

nameOnly

true

If true, only the name of each collection is returned. Set this option to true to get collection names instead of the full information for each collection.

keyspace?

string

The keyspace to be inspected. If not specified, the database’s working keyspace is used.

maxTimeMS?

number

Maximum time in milliseconds the client should wait for the operation to complete.

Returns:

Promise<string[]> - A promise that resolves to an array of the collection names.

Example:

import { DataAPIClient } from '@datastax/astra-db-ts';

// Get a new Db instance
const db = new DataAPIClient('TOKEN').db('API_ENDPOINT');

(async function () {
  // Gets just names of all collections in db
  const collections = await db.listCollections({ nameOnly: true });

  for (const collectionName of collections) {
    console.log(`Collection '${collectionName}' exists`);
  }
})();

For more information, see the Client reference.

// Given `db` Database object, list all collection names
Stream<String> collectionNames = db.listCollectionNames();

Returns:

Stream<String> - The names of the collections.

Example:

package com.datastax.astra.client.database;

import com.datastax.astra.client.Database;
import com.datastax.astra.client.model.CollectionInfo;

import java.util.stream.Stream;

public class ListCollections {
    public static void main(String[] args) {
        Database db = new Database("TOKEN", "API_ENDPOINT");

        // Get collection Names
        Stream<String> collectionNames = db.listCollectionNames();

        // Get Collection information (with options)
        Stream<CollectionInfo> collections = db.listCollections();
        collections.map(CollectionInfo::getOptions).forEach(System.out::println);
    }
}

To get a list of collection names only, use List collection metadata with "explain": false.

To list only the names of the collections in a database, use the following command, which trims the output to the Name column:

astra db list-collections DATABASE_NAME | cut -b 1-23

Parameters:

Name Type Summary

db_name

String

The name of the database

Result:

+---------------------+
| Name                |
+---------------------+
| collection_simple   |
| collection_vector   |
| msp                 |
+---------------------+

Drop a collection

Delete a collection from a database and erase all data stored in it.

Attempting to use the collection object after dropping the collection returns an API error because it references a non-existent collection.

  • Python

  • TypeScript

  • Java

  • curl

This command is equivalent to the collection’s own method collection.drop().

For more information, see the Client reference.

result = database.drop_collection(name_or_collection="COLLECTION_NAME")

Parameters:

Name Type Summary

name_or_collection

Union[str, Collection]

Either the name of a collection or a Collection instance.

max_time_ms

Optional[int]

A timeout, in milliseconds, for the underlying HTTP request.

Returns:

Dict - A dictionary in the form {"ok": 1} if the method succeeds.

Example response
{'ok': 1}

Example:

from astrapy import DataAPIClient
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")

database.list_collection_names()
# prints: ['a_collection', 'my_v_col', 'another_col']
database.drop_collection("my_v_col")  # {'ok': 1}
database.list_collection_names()
# prints: ['a_collection', 'another_col']
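If your code can’t be sure that a collection exists, you might check the collection names before dropping it. The following sketch shows one way to do this. The helper name drop_collection_if_exists is hypothetical, and it assumes only that the database object provides the list_collection_names and drop_collection methods shown in this guide.

```python
from typing import Any


def drop_collection_if_exists(database: Any, name: str) -> bool:
    """Drop the named collection if it is listed.

    Hypothetical helper. Returns True if a drop was issued and False if the
    collection name wasn't found in the database's working keyspace.
    """
    if name not in database.list_collection_names():
        return False
    database.drop_collection(name_or_collection=name)
    return True
```

After the helper returns True, discard any Collection objects that reference the dropped collection, because further operations on them return an API error.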

For more information, see the Client reference.

const ok = await db.dropCollection('COLLECTION');

Parameters:

Name Type Summary

name

string

The name of the collection to delete.

options?

DropCollectionOptions

Allows you to override the keyspace and set a maxTimeMS.

Name Type Summary

keyspace?

string

The keyspace containing the collection. If not specified, the database’s working keyspace is used.

maxTimeMS?

number

Maximum time in milliseconds the client should wait for the operation to complete.

Returns:

Promise<boolean> - A promise that resolves to true if the collection was dropped successfully.

Example:

import { DataAPIClient } from '@datastax/astra-db-ts';

// Get a new Db instance
const db = new DataAPIClient('TOKEN').db('API_ENDPOINT');

(async function () {
  // Uses db's default keyspace
  const success1 = await db.dropCollection('COLLECTION_NAME');
  console.log(success1); // true

  // Overrides db's default keyspace
  const success2 = await db.dropCollection('COLLECTION_NAME', {
    keyspace: 'KEYSPACE_NAME'
  });
  console.log(success2); // true
})();

For more information, see the Client reference.

// Given `db` Database object, drop a collection
db.dropCollection("COLLECTION_NAME");

Parameters:

Name Type Summary

collectionName

String

The name of the collection to delete.

Example:

package com.datastax.astra.client.database;

import com.datastax.astra.client.Database;

public class DropCollection {
  public static void main(String[] args) {
    Database db = new Database("API_ENDPOINT", "TOKEN");

    // Delete an existing collection
    db.dropCollection("collection_vector2");
  }
}

To delete a collection and all data that it contains, send a POST request with the deleteCollection command:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "deleteCollection": {
    "name": "COLLECTION_NAME"
  }
}' | jq

Parameters:

Name Type Summary

deleteCollection

command

The command to delete a specified collection and all of its data.

name

string

The name of the collection to delete.

Returns:

A well-formed request returns 200 OK.

Example response
{
  "status": {
    "ok": 1
  }
}
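The same request can be assembled in Python with the standard library. The following sketch builds the deleteCollection request without sending it and checks a response body against the documented shape. The function names are hypothetical, and the endpoint https://example.invalid is a stand-in for your own API endpoint.

```python
import json
import urllib.request


def build_delete_collection_request(api_endpoint, keyspace, token, collection_name):
    """Build (but don't send) the POST request for the deleteCollection command."""
    url = f"{api_endpoint}/api/json/v1/{keyspace}"
    body = json.dumps({"deleteCollection": {"name": collection_name}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def drop_succeeded(response_body: dict) -> bool:
    """Return True if the body matches the documented {"status": {"ok": 1}} shape."""
    return response_body.get("status", {}).get("ok") == 1


# Placeholder values. Replace them with your own endpoint, keyspace, and token.
request = build_delete_collection_request(
    "https://example.invalid", "default_keyspace", "APPLICATION_TOKEN", "COLLECTION_NAME"
)
print(request.get_method(), request.full_url)
print(drop_succeeded({"status": {"ok": 1}}))
```

To send the request, pass it to urllib.request.urlopen and parse the response body with json.loads before calling drop_succeeded.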
