Work with collections

Collections store documents in Serverless (Vector) databases. Collections are best for semi-structured data.

With the Data API clients, use the Database class to manage collections and the Collection class to work with the data in collections.

Serverless (Vector) databases created after June 24, 2024 can have approximately 10 collections. Databases created before this date can have approximately 5 collections. The collection limit is based on the number of indexes.

For more information about the Data API and clients, see Get started with the Data API.

The defaultId option

After you create a collection, you can’t change the defaultId option.

The defaultId option controls how the Data API allocates an _id for any document that doesn’t otherwise specify an _id value when added to a collection.

If you omit the defaultId option on createCollection, the default type is uuid. This means that the server generates a random stringified UUIDv4 as the _id for any document without an explicit _id field. This enables backwards compatibility with Data API versions 1.0.2 and earlier.

If you include the defaultId option with createCollection, you must specify one of the following case-sensitive ID types:

  • objectId: Each document’s generated _id is an objectId.

  • uuidv6: Each document’s generated _id is a version 6 UUID. This is field-compatible with version 1 time UUIDs, and it supports lexicographical sorting.

  • uuidv7: Each document’s generated _id is a version 7 UUID. This is designed as a replacement for version 1 time UUIDs, and it is recommended for use in new systems.

  • uuid: Each document’s generated _id is a version 4 random UUID. This type is analogous to the uuid type and functions in Apache Cassandra®.

Example createCollection with defaultId

This example creates a vector-enabled collection with the default ID type set to objectId:

{
  "createCollection": {
    "name": "some_collection2",
    "options": {
      "defaultId": {
        "type": "objectId"
      },
      "vector": {
        "dimension": 1024,
        "metric": "cosine"
      }
    }
  }
}
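As a rough illustration, the request body above could be assembled in Python with only the standard library. The helper name build_create_collection and its client-side validation are assumptions made for this sketch and are not part of the Data API or its clients. The server remains the authority on which options it accepts.

```python
import json

# The four case-sensitive ID types listed above.
DEFAULT_ID_TYPES = {"objectId", "uuidv6", "uuidv7", "uuid"}


def build_create_collection(name, default_id_type=None, dimension=None, metric=None):
    """Return a createCollection request body as a dict (hypothetical helper)."""
    options = {}
    if default_id_type is not None:
        # The comparison is exact, so "ObjectId" or "UUIDV7" is rejected.
        if default_id_type not in DEFAULT_ID_TYPES:
            raise ValueError(f"unsupported defaultId type: {default_id_type!r}")
        options["defaultId"] = {"type": default_id_type}
    if dimension is not None:
        vector = {"dimension": dimension}
        if metric is not None:
            vector["metric"] = metric
        options["vector"] = vector
    command = {"name": name}
    if options:
        # Omitting defaultId entirely leaves the server default (uuid) in place.
        command["options"] = options
    return {"createCollection": command}


body = build_create_collection(
    "some_collection2", "objectId", dimension=1024, metric="cosine"
)
print(json.dumps(body, indent=2))
```

Because the defaultId option can’t be changed after the collection exists, validating the type before sending the request is a cheap way to catch a typo early.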

When you use a command such as insertOne or insertMany to add documents to a collection, you don’t need to include an _id value in the request. Instead, the server generates a unique identifier for each document based on the collection’s default ID type. However, if you provide an explicit _id value, then the server uses this value instead of generating an ID. For more information about specifying document identifiers, see Work with document IDs.

Client apps can detect $objectId or $uuid in a response document and return the corresponding built-in objects to the caller. In this way, client apps can use generated IDs in methods based on Data API operations like findOneAndUpdate, updateOne, and updateMany.

Example client usage

For example, in Python, the client can specify the detected value for a document’s $objectId or $uuid:

# API Response with $objectId
{
    "_id": {"$objectId": "57f00cf47958af95dca29c0c"},
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}

# Client returns Dict from collection.find_one()
my_doc = {
    "_id": astrapy.ids.ObjectId("57f00cf47958af95dca29c0c"),
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}

# API Response with $uuid
{
    "_id": {"$uuid": "ffd1196e-d770-11ee-bc0e-4ec105f276b8"},
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}

# Client returns Dict from collection.find_one()
my_doc = {
    "_id": astrapy.ids.UUID("ffd1196e-d770-11ee-bc0e-4ec105f276b8"),
    "summary": "Retrieval-Augmented Generation is the process of optimizing the output of a large language model..."
}
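The conversion shown above can be sketched with the standard library alone. This is not how astrapy is implemented. The decode_ids function and the ObjectIdStub class are hypothetical names used only to illustrate the idea of detecting the $uuid and $objectId wrappers and returning native objects.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectIdStub:
    """Hypothetical stand-in for a client's ObjectId class."""
    hex_value: str


def decode_ids(value):
    """Recursively replace {"$uuid": ...} and {"$objectId": ...} wrappers."""
    if isinstance(value, dict):
        if set(value) == {"$uuid"}:
            return uuid.UUID(value["$uuid"])
        if set(value) == {"$objectId"}:
            return ObjectIdStub(value["$objectId"])
        return {key: decode_ids(item) for key, item in value.items()}
    if isinstance(value, list):
        return [decode_ids(item) for item in value]
    return value  # strings, numbers, booleans, and None pass through unchanged


response_doc = {
    "_id": {"$uuid": "ffd1196e-d770-11ee-bc0e-4ec105f276b8"},
    "summary": "Retrieval-Augmented Generation ...",
}
my_doc = decode_ids(response_doc)
print(type(my_doc["_id"]).__name__, my_doc["_id"])
```

Only dictionaries whose single key is $uuid or $objectId are treated as wrappers, so ordinary nested documents are left intact.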

There are advantages to using generated document IDs instead of manual document IDs. For example, the advantages of generated UUIDv7 document IDs include the following:

  • Uniqueness across the database: A generated _id value is designed to be globally unique across the entire database. For a version 7 UUID, this uniqueness comes from combining a millisecond timestamp with random bits. Explicitly numbering documents might lead to clashes unless carefully managed, especially in distributed systems.

  • Automatic generation: The _id values are automatically generated by Astra DB Serverless. This means you won’t have to worry about creating and maintaining a unique ID system, reducing the complexity of the code and the risk of errors.

  • Timestamp information: A generated _id value includes a timestamp as its first component, representing the document’s creation time. This can be useful for tracking when a document was created without needing an additional field. In particular, type uuidv7 values provide a high degree of granularity (milliseconds) in timestamps.

  • Avoids manual sequence management: Managing sequential numeric IDs manually can be challenging, especially in environments with high concurrency or distributed systems. There’s a risk of ID collision or the need to lock tables or sequences to generate a new ID, which can affect performance. Generated _id values are designed to handle these issues automatically.

While numeric _id values might be simpler and more human-readable, the benefits of generated _id values make them a better choice for most applications, especially those that store many documents.
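To illustrate the timestamp advantage: a version 7 UUID, as defined in RFC 9562, stores a Unix timestamp in milliseconds in its first 48 bits, so the creation time can be recovered from the _id alone. The following is a minimal sketch using the standard library. The function name uuid7_unix_millis and the example UUID are made up for illustration.

```python
import uuid
from datetime import datetime, timezone


def uuid7_unix_millis(value):
    """Return the Unix timestamp in milliseconds stored in a version 7 UUID."""
    if value.version != 7:
        raise ValueError("not a version 7 UUID")
    # The top 48 of the 128 bits hold the timestamp.
    return value.int >> 80


# Illustrative value, not taken from a real collection.
example = uuid.UUID("018e2c5a-7b3c-7def-8abc-0123456789ab")
millis = uuid7_unix_millis(example)
created_at = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(created_at.isoformat())
```

Because the timestamp occupies the leading bits, sorting version 7 UUIDs by their string form also orders them by creation time to millisecond precision.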

The indexing option

By default, when you add or modify data within a collection, all properties in the added or modified documents are indexed. If you don’t want to index all properties, you can use the Data API to configure selective indexing.

Selective indexing is not recommended for all collections. Consider the advantages and disadvantages of selective indexing before applying it to any collection. DataStax recommends that you test your application in a development environment before applying selective indexing in production.

Indexes enable Data API queries that need to filter or sort data based on indexed properties.

There are index limits for collections and databases. Furthermore, the index limit informs the collection limit. However, do not use selective indexing exclusively to bypass the collection limit. In most cases, selective indexing does not change a database’s collection limit due to the minimum required indexes for collections in Serverless (Vector) databases.

Carefully consider the advantages and disadvantages of selective indexing before applying it to your collections.

Considerations for selective indexing

The primary disadvantage of selective indexing is that sort and filter clauses can only use indexed fields. This means that you can’t perform these types of queries on fields that you do not index.

Non-indexed field error

The Data API returns an error if you attempt to sort or filter by a non-indexed property. For example:

UNINDEXED_FILTER_PATH("Unindexed filter path: The filter path ('*FILTER*') is not indexed")

UNINDEXED_SORT_PATH("Unindexed sort path")

ID_NOT_INDEXED("_id is not indexed")

If you apply selective indexing to a collection, consider which properties might be important in queries that rely on sort and filter clauses, and make sure that you index those fields.

Potential advantages to selective indexing include the following:

  • Read/write performance: Selective indexing can increase write-time performance by reducing the amount of content that needs to be indexed. If certain properties are irrelevant to your application, you can save time by not indexing them.

  • Data capacity: Indexed properties are bound by lower maximum size limits to ensure efficient and performant read operations through the index. By comparison, non-indexed properties can support larger quantities of data, such as the body content of blog posts.

These outcomes are not guaranteed. The results of selective indexing depend on the specific characteristics and use of your applications and data.

DataStax recommends testing your application’s performance, under average and peak demand, in a non-production environment before deploying selective indexing to production. Make adjustments as necessary to optimize your application’s performance.

Configure indexing

You set the indexing behavior when you create a collection. The configuration applies to all data that you load into the collection.

Collections that you create directly in the Astra Portal use default indexing and index all fields. You can’t change the indexing behavior for these collections.

Collections that you create with the Data API can use the optional indexing clause on the createCollection command to set the indexing behavior.

To maintain the default behavior and index all properties, omit the indexing clause from createCollection.

To apply selective indexing, include the indexing clause and either an allow or deny array that determines the fields to index.

If you apply selective indexing, make sure that your indexed fields support your application’s needs and query requirements.

Evaluate the value of each property in your collection’s documents before you create your collection and decide which fields to index.
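The choice between default indexing, an allow array, and a deny array can be sketched as a small Python helper. The name build_indexing_clause is hypothetical. Because the clause is described as holding either an allow or a deny array, this sketch treats supplying both as a mistake and rejects it on the client side.

```python
def build_indexing_clause(allow=None, deny=None):
    """Return the indexing clause for createCollection, or None for default indexing."""
    if allow is not None and deny is not None:
        raise ValueError("specify either allow or deny, not both")
    if allow is not None:
        return {"allow": list(allow)}
    if deny is not None:
        return {"deny": list(deny)}
    return None  # omit the clause so that all properties are indexed


options = {"vector": {"dimension": 5, "metric": "cosine"}}
indexing = build_indexing_clause(allow=["property1", "property2"])
if indexing is not None:
    options["indexing"] = indexing
print(options)
```

Returning None for the default case keeps the indexing key out of the request entirely, which matches the instruction to omit the clause when all properties should be indexed.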

To use the allow array in the indexing clause, specify the fields that you want to index.

For example, the following curl command creates a collection where the index includes only the values of the property1 and property2 fields:

curl -sS -L -X POST "${ASTRA_DB_API_ENDPOINT}/api/json/v1/${ASTRA_DB_KEYSPACE}" \
--header "Token: ${ASTRA_DB_APPLICATION_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "some_collection",
    "options": {
      "vector": {
        "dimension": 5,
        "metric": "cosine"
      },
      "indexing": {
        "allow": [
          "property1",
          "property2"
        ]
      }
    }
  }
}' | jq

If you add data to the collection that includes additional properties that weren’t present when you first created the collection, the index remains limited to property1 and property2.

When you use the allow array for selective indexing, subsequent Data API queries can use sort and filter clauses only on property1, property2, or both. Sorting or filtering on any other field returns an error.
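An application can check a query against the allow array before sending it, rather than waiting for the error. The following sketch assumes a flat filter and sort (field names mapped directly to values, with no operators such as $and) and compares field names exactly. The function name unindexed_fields is hypothetical, and how the server treats sub-properties of an allowed field is not covered here.

```python
def unindexed_fields(allow, filter_clause=None, sort_clause=None):
    """List the top-level fields in a query that the allow array does not cover."""
    if "*" in allow:
        return []  # wildcard: everything is indexed
    requested = list(filter_clause or {}) + list(sort_clause or {})
    return [field for field in requested if field not in allow]


allow = ["property1", "property2"]

ok = unindexed_fields(allow, filter_clause={"property1": "a"}, sort_clause={"property2": 1})
bad = unindexed_fields(allow, filter_clause={"property3": "b"})
print(ok)
print(bad)
```

An empty list means every field in the query is indexed. Any names in the result would be rejected by the Data API.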

If you use a wildcard (*) for the allow array, all properties are indexed. This is equivalent to the default indexing behavior.

{
  "indexing": {
    "allow": [ "*" ]
  }
}

To use the deny array in the indexing clause, specify the fields that you do not want to index.

For example, the following curl command creates a collection where the index includes the values of all fields except property1, property3, property5.prop5b, and any sub-properties of property1 and property3:

curl -sS -L -X POST "${ASTRA_DB_API_ENDPOINT}/api/json/v1/${ASTRA_DB_KEYSPACE}" \
--header "Token: ${ASTRA_DB_APPLICATION_TOKEN}" \
--header "Content-Type: application/json" \
--data '{
  "createCollection": {
    "name": "some_collection",
    "options": {
      "vector": {
        "dimension": 5,
        "metric": "cosine"
      },
      "indexing": {
        "deny": [
          "property1",
          "property3",
          "property5.prop5b"
        ]
      }
    }
  }
}' | jq

If a property in the deny array has any sub-properties, those sub-properties are also inherently excluded from indexing. For example, if property3 has two sub-properties (property3.prop3a and property3.prop3b), those sub-properties are also excluded from indexing because the deny array includes the parent property3.

You don’t need to list the sub-properties of a denied parent separately. Denying property3 alone covers property3.prop3a, property3.prop3b, and any sub-properties added later.

To exclude specific sub-properties, but not the parent, you must specify those sub-properties in the deny array, as was done for property5.prop5b.

Furthermore, if you add data to the collection that includes additional properties or sub-properties that weren’t present when you first created the collection, those new properties are indexed if they are not named in the deny array, either explicitly or by inheritance.

When you use the deny array for selective indexing, subsequent Data API queries can use sort and filter clauses on any field except the denied (non-indexed) fields. Sorting or filtering on a denied field returns an error.
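The inheritance rule for the deny array can be modeled in a few lines of Python. This sketch is a client-side approximation of the behavior described above, not the server’s implementation, and the function name is_denied is hypothetical.

```python
def is_denied(path, deny):
    """Return True if a dotted property path is excluded by the deny array."""
    if "*" in deny:
        return True  # wildcard: nothing is indexed
    for entry in deny:
        # A path is denied if it is listed itself or sits under a listed parent.
        if path == entry or path.startswith(entry + "."):
            return True
    return False


deny = ["property1", "property3", "property5.prop5b"]

for path in ["property3.prop3a", "property5", "property5.prop5b", "property30", "new_property"]:
    print(path, is_denied(path, deny))
```

Matching on the parent name followed by a dot keeps a sibling such as property30 from being mistaken for a sub-property of property3. Properties that appear later and aren’t covered by the deny array, such as new_property, count as indexed.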

If you use a wildcard (*) for the deny array, no properties are indexed, not even $vector. However, the collection can still create a small number of indexes for minimal functionality.

{
  "indexing": {
    "deny": [ "*" ]
  }
}

© 2025 DataStax | Privacy policy | Terms of use | Manage Privacy Choices

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.