Work with documents

Documents represent a single row or record of data in Astra DB Serverless databases.

You use the Collection class to work with documents through the Data API clients. For instructions to get a Collection object, see Work with collections.

For more information about the Data API and clients, see Get started with the Data API.

Field data types

The Data API supports the following data types for document fields in collections:

  • String

  • Number

  • Object (JSON object)

  • Array

  • Boolean

  • Vector (through $vector)

  • Date (through $date)

  • Null

  • UUID (through $uuid)

  • ObjectId (through $objectId)

$vector and $vectorize

When working with documents in the Astra Portal or Data API, there are two reserved fields for vector data: $vector and $vectorize.

Which fields you can use depends on the collection configuration.

Embedding generation methods

When you create a collection, you decide whether the collection can store structured vector data. A collection that can store vectors is known as a vector-enabled collection. For vector-enabled collections, you also decide how to provide embeddings. You must make both choices when you create the collection:

  • For all vector-enabled collections, you can provide embeddings when you load data (also known as bring your own embeddings).

  • You can configure the collection to automatically generate embeddings with vectorize (the $vectorize reserved field).

    You can’t use $vectorize in a collection where you did not enable vectorize when you created the collection. If you want to use vectorize at all, then you must enable vectorize when you create the collection.

  • If you enable vectorize, you can use both options interchangeably but not simultaneously. For example, you can use vectorize to generate embeddings for a batch of documents, and then insert a few documents with pre-generated embeddings.

    To bring your own embeddings to a collection that uses vectorize, when you insert a document, include the document’s embedding in the $vector field.

    It is critical that all embeddings in a collection are generated by the same model with the same dimensions, regardless of whether you use vectorize, bring your own embeddings, or both.

    Astra DB only checks that the dimensions are the same; it does not produce an error if the embeddings are from different models. You must ensure that the embeddings are compatible. Using mismatched embeddings produces unreliable and incorrect results in similarity searches.

  • For all vector-enabled collections, you can insert non-vector data.
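Because Astra DB checks only the dimensions of an embedding, a client-side guard can catch embeddings from the wrong model before they are inserted. The following is a minimal sketch, not part of the Data API: `check_embedding`, the `embedding_model` metadata field, and the two expected values are hypothetical names introduced for illustration.

```python
import math

# Hypothetical settings for one collection; adjust to your own model.
EXPECTED_DIMENSION = 5
EXPECTED_MODEL = "my-embedding-model"


def check_embedding(document: dict) -> None:
    """Raise ValueError if the document's $vector looks incompatible.

    Documents without $vector pass, because non-vector data is allowed.
    """
    vector = document.get("$vector")
    if vector is None:
        return
    if len(vector) != EXPECTED_DIMENSION:
        raise ValueError(
            f"expected {EXPECTED_DIMENSION} dimensions, got {len(vector)}"
        )
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("vector must contain only finite numbers")
    # "embedding_model" is an illustrative metadata field, not a reserved one.
    model = document.get("embedding_model")
    if model != EXPECTED_MODEL:
        raise ValueError(f"embedding from unexpected model: {model!r}")


# Both calls pass silently: a compatible embedding and a non-vector document.
check_embedding(
    {"$vector": [0.1, 0.2, 0.3, 0.4, 0.5], "embedding_model": "my-embedding-model"}
)
check_embedding({"title": "no embedding here"})
```

Recording the model name next to each vector is one possible convention; any scheme that lets you tell models apart before insertion serves the same purpose.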

Reserved fields

$vector

The $vector parameter is a reserved field that stores vectors.

To bring your own embeddings when you insert documents, include $vector for each document that has an embedding.

If the collection uses vectorize, you have the option to omit $vector when you insert documents. You can use $vectorize to generate an embedding, and then Astra DB populates the document’s $vector field with the automatically generated embedding. Alternatively, if you want to bring your own embeddings to a collection that uses vectorize, you can include the $vector field when you insert documents.

Regardless of the embedding generation method, when you find, update, replace, or delete documents, you can use $vector to fetch documents by vector search. You can also use projections to include $vector in responses.

$vectorize

The $vectorize parameter is a reserved field that generates embeddings automatically based on a given text string.

You can’t use $vectorize in a collection where you did not enable vectorize when you created the collection. If you want to use vectorize at all, then you must enable vectorize when you create the collection.

If the collection uses vectorize, you have the option to include this parameter when you insert documents. The value of $vectorize is the text string from which you want to generate a document’s embedding. Make sure the vectorize text string is compliant with the embedding provider’s requirements, such as token size. Astra DB stores the resulting vector array in $vector.

When you find, update, replace, or delete documents in a collection that uses vectorize, you can use $vectorize to fetch documents by vector search with vectorize. You can also use projections to include $vectorize in responses.

For information about vectorize integrations and troubleshooting vectorize, see Auto-generate embeddings with vectorize.

$vector and $vectorize are excluded by default from Data API responses. You can use projections to include these properties in responses.

Insert non-vector data in a vector-enabled collection

To insert a document that doesn’t need an embedding, omit $vector and $vectorize. When using the Astra Portal to load JSON or CSV data into a collection that uses vectorize, make sure the Vector Field is set to None (no embeddings).
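The three kinds of documents described above (bring your own embedding, vectorize, and no embedding) can be told apart by which reserved field they carry. The following sketch classifies documents before insertion. It reads "not simultaneously" as meaning that a single document carries at most one of the two fields, which is an assumption; `embedding_mode` is a hypothetical helper, not a client method.

```python
def embedding_mode(document: dict) -> str:
    """Return how a document provides its embedding, if at all."""
    has_vector = "$vector" in document
    has_vectorize = "$vectorize" in document
    if has_vector and has_vectorize:
        # Assumption: one document should not carry both reserved fields.
        raise ValueError("use either $vector or $vectorize in a document, not both")
    if has_vector:
        return "bring-your-own"
    if has_vectorize:
        return "vectorize"
    return "none"


documents = [
    {"title": "precomputed", "$vector": [0.25, 0.25, 0.25, 0.25, 0.25]},
    {"title": "generated", "$vectorize": "Text to vectorize"},
    {"title": "plain"},
]
modes = [embedding_mode(doc) for doc in documents]
print(modes)  # ['bring-your-own', 'vectorize', 'none']
```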

$date

  • Python

  • TypeScript

  • Java

  • curl

The handling of datetime objects, with particular emphasis on usage of naive (i.e. timezone-unaware) datetimes, changed in the Python client version 2.0-preview.

If you are using client version 2.0-preview or later, see the description of this change in Data API client upgrade guide.

Date and datetime objects are instances of the Python standard library datetime.datetime and datetime.date classes that you can use anywhere in documents.

The following example uses dates in insert, update, and find commands. Read operations from a collection always return the datetime class, regardless of whether the original command used date or datetime.

import datetime

from astrapy import DataAPIClient

client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")
collection = database.my_collection

# Insert documents containing date and datetime values:
collection.insert_one({"when": datetime.datetime.now()})
collection.insert_one({"date_of_birth": datetime.date(2000, 1, 1)})
collection.insert_one({"registered_at": datetime.date(1999, 11, 14)})

# Update a document, using a date in the filter:
collection.update_one(
    {"registered_at": datetime.date(1999, 11, 14)},
    {"$set": {"message": "happy Sunday!"}},
)

# Update a document, setting "last_reviewed" to the current date:
collection.update_one(
    {"date_of_birth": {"$exists": True}},
    {"$currentDate": {"last_reviewed": True}},
)

# Find documents by inequality on a date value:
print(
    collection.find_one(
        {"date_of_birth": {"$lt": datetime.date(2001, 1, 1)}},
        projection={"_id": False},
    )
)
# will print:
# {'date_of_birth': datetime.datetime(2000, 1, 1, 0, 0), 'last_reviewed': datetime.datetime(...now...)}

You can use standard JS Date objects anywhere in documents to represent dates and times. Read operations also return Date objects for document fields stored using { $date: number }.

The following example uses dates in insert, update, and find commands:

import { DataAPIClient } from '@datastax/astra-db-ts';

// Reference an untyped collection
const client = new DataAPIClient('TOKEN');
const db = client.db('ENDPOINT', { keyspace: 'KEYSPACE' });

(async function () {
  // Create an untyped collection
  const collection = await db.createCollection('dates_test');

  // Insert documents with some dates
  await collection.insertOne({ dateOfBirth: new Date(1394104654000) });
  await collection.insertOne({ dateOfBirth: new Date('1863-05-28') });

  // Update a document with a date and setting lastModified to now
  await collection.updateOne(
    {
      dateOfBirth: new Date('1863-05-28'),
    },
    {
      $set: { message: 'Happy Birthday!' },
      $currentDate: { lastModified: true },
    },
  );

  // Will print around new Date()
  const found = await collection.findOne({ dateOfBirth: { $lt: new Date('1900-01-01') } });
  console.log(found?.lastModified);
})();

The Data API uses the EJSON standard to represent time-related objects. The Java client provides custom serializers for three types of objects: java.util.Date, java.util.Calendar, and java.time.Instant. You can use these objects in documents, filter clauses, and update clauses.

The following example uses dates in insert, update, and find commands:

package com.datastax.astra.client.collection;

import com.datastax.astra.client.Collection;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.FindOneOptions;
import com.datastax.astra.client.model.Projections;

import java.time.Instant;
import java.util.Calendar;
import java.util.Date;

import static com.datastax.astra.client.model.Filters.eq;
import static com.datastax.astra.client.model.Filters.lt;
import static com.datastax.astra.client.model.Updates.set;

public class WorkingWithDates {
    public static void main(String[] args) {
        // Given an existing collection
        Collection<Document> collection = new DataAPIClient("TOKEN")
                .getDatabase("API_ENDPOINT")
                .getCollection("COLLECTION_NAME");

        Calendar c = Calendar.getInstance();
        collection.insertOne(new Document().append("registered_at", c));
        collection.insertOne(new Document().append("date_of_birth", new Date()));
        collection.insertOne(new Document().append("just_a_date", Instant.now()));

        collection.updateOne(
                eq("registered_at", c), // filter clause
                set("message", "happy Sunday!")); // update clause

        collection.findOne(
                lt("date_of_birth", new Date(System.currentTimeMillis() - 1000 * 1000)),
                new FindOneOptions().projection(Projections.exclude("_id")));
    }
}

You can use $date to represent dates as Unix timestamps in the JSON payload of a Data API command:

"date_of_birth": { "$date": 1690045891 }

The following example includes a date in an insertOne command:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE/ASTRA_DB_COLLECTION" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "insertOne": {
    "document": {
      "$vector": [0.25, 0.25, 0.25, 0.25, 0.25],
      "date_of_birth": { "$date": 1690045891 }
    }
  }
}' | jq

The following example uses the date to find and update a document with the updateOne command:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE/ASTRA_DB_COLLECTION" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "updateOne": {
    "filter": {
      "date_of_birth": { "$date": 1690045891 }
    },
    "update": { "$set": { "message": "Happy birthday!" } }
  }
}' | jq

The following example uses the $currentDate update operator to set a property to the current date:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE/ASTRA_DB_COLLECTION" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "findOneAndUpdate": {
    "filter": { "_id": "doc1" },
    "update": {
      "$currentDate": {
        "createdAt": true
      }
    }
  }
}' | jq
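When you build raw JSON payloads yourself, a small helper can wrap datetime values in the $date form. This sketch assumes that $date counts milliseconds since the Unix epoch, which is consistent with the TypeScript example's new Date(1394104654000); confirm the unit against the Data API reference before relying on it. `to_ejson_date` and `from_ejson_date` are hypothetical helper names.

```python
import datetime


def to_ejson_date(value: datetime.datetime) -> dict:
    """Wrap a timezone-aware datetime as {"$date": <milliseconds since epoch>}."""
    if value.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid ambiguity")
    # Assumption: the Data API expects epoch milliseconds.
    return {"$date": round(value.timestamp() * 1000)}


def from_ejson_date(wrapped: dict) -> datetime.datetime:
    """Convert {"$date": <milliseconds>} back to an aware UTC datetime."""
    return datetime.datetime.fromtimestamp(
        wrapped["$date"] / 1000, tz=datetime.timezone.utc
    )


born = datetime.datetime(2000, 1, 1, tzinfo=datetime.timezone.utc)
payload = {"date_of_birth": to_ejson_date(born)}
print(payload)  # {'date_of_birth': {'$date': 946684800000}}
```

Rejecting naive datetimes is a design choice in this sketch: it avoids silently applying the local timezone of the machine that builds the payload.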

Document IDs

Documents in a collection are always identified by an ID that is unique within the collection. This identifier is stored in the reserved field _id. There are multiple types of document identifiers, such as string, integer, or datetime; however, the uuid and ObjectId types are recommended. The Data API supports uuid identifiers up to version 8 and ObjectId identifiers as provided by the bson library.

When you create a collection, you can set a default ID type that specifies how the Data API generates an _id for any document that doesn’t have an explicit _id field when you insert it into the collection.

If you explicitly define a document’s _id, such as "_id": "12", then the server uses this value instead of generating an ID. If explicitly defined, the _id field must be a top-level document property. _id cannot be nested within another property.

Regardless of the defaultId setting, the Data API honors document identifiers of any type, anywhere in a document, that you explicitly provide at any time:

  • You can include identifiers anywhere in a document, not only in the _id field.

  • You can include different types of identifiers in different parts of the same document.

  • You can define identifiers at any time, such as when inserting or updating a document.

  • You can use any of a document’s identifiers for filter clauses and update/replace operations, just like any other data type.

  • Python

  • TypeScript

  • Java

  • curl

The Python client recognizes uuid versions 1 and 3 through 8, as provided by the uuid and uuid6 Python libraries. The Python client also recognizes the ObjectId from the bson package. For convenience, these utilities are exposed in AstraPy directly:

from astrapy.ids import (
    ObjectId,
    uuid1,
    uuid3,
    uuid4,
    uuid5,
    uuid6,
    uuid7,
    uuid8,
    UUID,
)

You can generate new identifiers with statements such as new_id = uuid8() or new_obj_id = ObjectId():

from astrapy import DataAPIClient
from astrapy.ids import ObjectId, uuid8, UUID
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")
collection = database.my_collection

collection.insert_one({"_id": uuid8(), "tag": "new_id_v_8"})
collection.insert_one(
    {"_id": UUID("018e77bc-648d-8795-a0e2-1cad0fdd53f5"), "tag": "id_v_8"}
)
collection.insert_one({"_id": ObjectId(), "tag": "new_obj_id"})
collection.insert_one(
    {"_id": ObjectId("6601fb0f83ffc5f51ba22b88"), "tag": "obj_id"}
)
collection.find_one_and_update(
    {"_id": ObjectId("6601fb0f83ffc5f51ba22b88")},
    {"$set": {"item_inventory_id": UUID("1eeeaf80-e333-6613-b42f-f739b95106e6")}},
)

All uuid versions are instances of the UUID class, which exposes a version property, if you need to access it.
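To illustrate the version property without connecting to a database, the following sketch uses the standard library uuid module as a stand-in for the classes exposed in astrapy.ids. The identifier strings are the ones from the example above.

```python
import uuid

# The version is encoded in the first hex digit of the third group.
fixed = uuid.UUID("018e77bc-648d-8795-a0e2-1cad0fdd53f5")
inventory = uuid.UUID("1eeeaf80-e333-6613-b42f-f739b95106e6")
random_id = uuid.uuid4()

print(fixed.version)      # 8
print(inventory.version)  # 6
print(random_id.version)  # 4
```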

To use and generate identifiers, astra-db-ts provides the UUID and ObjectId classes. These are not the same as those exported from the bson or uuid libraries. Instead, these are custom classes that you must import from the astra-db-ts package:

import { UUID, ObjectId } from '@datastax/astra-db-ts';

To generate new identifiers, you can use UUID.v4(), UUID.v7(), or new ObjectId():

import { DataAPIClient, UUID, ObjectId } from '@datastax/astra-db-ts';

// Schema for the collection
interface Person {
  _id: UUID | ObjectId;
  name: string;
  friendId?: UUID;
}

// Reference the DB instance
const client = new DataAPIClient('TOKEN');
const db = client.db('ENDPOINT', { keyspace: 'KEYSPACE' });

(async function () {
  // Create the collection
  const collection = await db.createCollection<Person>('people');

  // Insert documents w/ various IDs
  await collection.insertOne({ name: 'John', _id: UUID.v4() });
  await collection.insertOne({ name: 'Jane', _id: new UUID('016b1cac-14ce-660e-8974-026c927b9b91') });

  await collection.insertOne({ name: 'Dan', _id: new ObjectId()});
  await collection.insertOne({ name: 'Tim', _id: new ObjectId('65fd9b52d7fabba03349d013') });

  // Update a document with a UUID in a non-_id field
  await collection.updateOne(
    { name: 'John' },
    { $set: { friendId: new UUID('016b1cac-14ce-660e-8974-026c927b9b91') } },
  );

  // Find a document by a UUID in a non-_id field
  const john = await collection.findOne({ name: 'John' });
  const jane = await collection.findOne({ _id: john!.friendId });

  // Prints 'Jane 016b1cac-14ce-660e-8974-026c927b9b91 6'
  console.log(jane?.name, jane?._id.toString(), (<UUID>jane?._id).version);
})();

All UUID methods return an instance of the same class, which exposes a version property, if you need to access it. UUIDs can also be constructed from a string representation of the IDs, if you want to use custom generation.

The Java client defines dedicated classes to support different implementations of UUID, particularly v6 and v7.

When a unique identifier is retrieved from the server, it is returned as a uuid, and then it is converted to the appropriate UUID class, based on the class definition in the defaultId option.

The ObjectId class comes from the BSON package and represents the ObjectId type. UUIDs generated by the standard Java UUID class, such as with UUID.randomUUID(), follow the UUID v4 standard.

To generate new identifiers, you can use methods like new UUIDv6(), new UUIDv7(), or new ObjectId():

package com.datastax.astra.client.collection;

import com.datastax.astra.client.Collection;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.ObjectId;
import com.datastax.astra.client.model.UUIDv6;
import com.datastax.astra.client.model.UUIDv7;

import java.time.Instant;
import java.util.UUID;

import static com.datastax.astra.client.model.Filters.eq;
import static com.datastax.astra.client.model.Updates.set;

public class WorkingWithDocumentIds {
    public static void main(String[] args) {
        // Given an existing collection
        Collection<Document> collection = new DataAPIClient("TOKEN")
                .getDatabase("API_ENDPOINT")
                .getCollection("COLLECTION_NAME");

        // Ids can be different Json scalar
        // ('defaultId' options NOT set for collection)
        new Document().id("abc");
        new Document().id(123);
        new Document().id(Instant.now());

        // Working with UUIDv4
        new Document().id(UUID.randomUUID());

        // Working with UUIDv6
        collection.insertOne(new Document().id(new UUIDv6()).append("tag", "new_id_v_6"));
        UUID uuidv4 = UUID.fromString("018e77bc-648d-8795-a0e2-1cad0fdd53f5");
        collection.insertOne(new Document().id(new UUIDv6(uuidv4)).append("tag", "id_v_8"));

        // Working with UUIDv7
        collection.insertOne(new Document().id(new UUIDv7()).append("tag", "new_id_v_7"));

        // Working with ObjectIds
        collection.insertOne(new Document().id(new ObjectId()).append("tag", "obj_id"));
        collection.insertOne(new Document().id(new ObjectId("6601fb0f83ffc5f51ba22b88")).append("tag", "obj_id"));

        collection.findOneAndUpdate(
                eq(new ObjectId("6601fb0f83ffc5f51ba22b88")),
                set("item_inventory_id", UUID.fromString("1eeeaf80-e333-6613-b42f-f739b95106e6")));
    }
}

When you insert a document, you can omit _id to automatically generate an ID or you can manually specify an _id, such as "_id": "12".

The following example inserts two documents with manually defined _id values. One document uses the objectId type, and the other uses the uuid type.

"insertMany": {
  "documents": [
    {
      "_id": { "$objectId": "6672e1cbd7fabb4e5493916f" },
      "$vector": [0.1, 0.15, 0.3, 0.12, 0.05],
      "key": "value",
      "amount": 53990
    },
    {
      "_id": { "$uuid": "1ef2e42c-1fdb-6ad6-aae4-e84679831739" },
      "$vector": [0.15, 0.1, 0.1, 0.35, 0.55],
      "key": "value",
      "amount": 4600
    }
  ]
}

When you add or update a document, you can include additional identifiers in any document property, other than _id, just as you would any other data type.
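If you assemble JSON payloads in code rather than by hand, you can wrap identifiers in the $uuid and $objectId forms shown above. The following is a minimal sketch using only the standard library; `wrap_uuid` and `wrap_object_id` are hypothetical helpers, and the ObjectId check relies on the fact that an ObjectId is written as 24 hexadecimal characters.

```python
import json
import re
import uuid


def wrap_uuid(value: str) -> dict:
    """Return {"$uuid": ...}; uuid.UUID raises ValueError for malformed input."""
    return {"$uuid": str(uuid.UUID(value))}


def wrap_object_id(value: str) -> dict:
    """Return {"$objectId": ...} after checking the 24-hex-character form."""
    if not re.fullmatch(r"[0-9a-fA-F]{24}", value):
        raise ValueError("an ObjectId is written as 24 hexadecimal characters")
    return {"$objectId": value}


command = {
    "insertMany": {
        "documents": [
            {"_id": wrap_object_id("6672e1cbd7fabb4e5493916f"), "key": "value"},
            {"_id": wrap_uuid("1ef2e42c-1fdb-6ad6-aae4-e84679831739"), "key": "value"},
        ]
    }
}
print(json.dumps(command, indent=2))
```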

Sort clauses

Sort and filter clauses can use only indexed fields.

If you apply selective indexing when you create a collection, you can’t reference non-indexed fields in sort or filter queries.

Data API commands, such as find, findOne, deleteOne, updateOne, and so on, can use sort clauses to organize results based on similarity, or dissimilarity, to the given filter, such as a vector or field.

Additionally, you can use a projection to include specific document properties in the response. A projection is required if you want to return certain reserved fields, like $vector and $vectorize, that are excluded by default.

  • Python

  • TypeScript

  • Java

  • curl

  • You can’t use the $vector and $vectorize sort clauses together.

  • Some combinations of arguments impose an implicit upper bound on the number of documents that are returned by the Data API:

    • Vector ANN searches can return no more than 1000 documents per search operation, regardless of the limit parameter.

    • When using an ascending or descending sort criterion, the Data API returns up to 20 documents at once. The returned documents are the top results across the whole collection based on the filter criteria.

      These provisions can also apply when running subsequent commands on cursors, such as .distinct().

      For ascending or descending sort clauses that do not automatically paginate, it is sometimes possible to use the limit and skip options to control the number of rows returned and the starting point of the results, as a form of manual pagination.

  • When you don’t specify sorting criteria (by vector or otherwise), the cursor can scroll through an arbitrary number of documents because the Data API and the client periodically exchange new chunks of documents.

    If documents are added or removed after starting a find operation, the cursor behavior depends on database internals. There is no guarantee as to whether or not the cursor will pick up such "real-time" changes in the data.
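The manual pagination mentioned above can be sketched as a loop that advances skip by the page size until a short page comes back. This is an illustration only: `iterate_pages` and `fake_fetch` are hypothetical names, and the in-memory list stands in for a sorted find. With a real collection, the fetch function might wrap a call such as collection.find(..., sort=..., skip=skip, limit=limit), subject to the limits described above.

```python
from typing import Callable, Iterator


def iterate_pages(
    fetch_page: Callable[[int, int], list],
    page_size: int = 20,
) -> Iterator[dict]:
    """Yield documents page by page until a page comes back short."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:
            break
        skip += page_size


# In-memory stand-in for a sorted find; a real fetch would query the collection.
_sorted_docs = [{"seq": n} for n in range(45)]


def fake_fetch(skip: int, limit: int) -> list:
    return _sorted_docs[skip : skip + limit]


results = list(iterate_pages(fake_fetch))
print(len(results))  # 45
```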

When no particular order is required:

sort={}  # (default when parameter not provided)

When sorting by a certain value in ascending/descending order:

from astrapy.constants import SortDocuments

# Ascending sort
sort={"field": SortDocuments.ASCENDING}

# Descending sort
sort={"field": SortDocuments.DESCENDING}

Be aware of the order when chaining multiple sorts. For example, when sorting first by a specific field and then by a specific subfield:

sort={
    "field": SortDocuments.ASCENDING,
    "subfield": SortDocuments.ASCENDING,
}

Modern Python versions (3.7 and later) preserve dictionary insertion order, so the sort keys are applied in the order written. For clarity, you can use a collections.OrderedDict for chained sorts.
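To see why key order matters in a chained sort, the following sketch emulates the sort locally on a small list. It is not how the Data API evaluates sorts: `apply_sort` is a hypothetical helper, the sample data is illustrative, and the literals 1 and -1 stand in for the ascending and descending constants.

```python
ASCENDING, DESCENDING = 1, -1  # stand-ins for the client constants


def apply_sort(documents: list, sort: dict) -> list:
    """Sort locally, giving precedence to keys in dictionary order."""
    ordered = list(documents)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last makes the first key the primary criterion.
    for field, direction in reversed(list(sort.items())):
        ordered.sort(key=lambda d: d[field], reverse=(direction == DESCENDING))
    return ordered


people = [
    {"name": "Jane", "age": 25},
    {"name": "Dave", "age": 40},
    {"name": "Jack", "age": 40},
]
by_age_then_name = apply_sort(people, {"age": ASCENDING, "name": ASCENDING})
by_name_then_age = apply_sort(people, {"name": ASCENDING, "age": ASCENDING})
print([d["name"] for d in by_age_then_name])  # ['Jane', 'Dave', 'Jack']
print([d["name"] for d in by_name_then_age])  # ['Dave', 'Jack', 'Jane']
```

The same two keys produce different results depending on which one comes first.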

You can use sort to perform a vector similarity (ANN) search:

# Use the specified vector,
# And then sort by similarity to the given vector.
sort={"$vector": [0.4, 0.15, -0.5]}

# Generate a vector from a string,
# Run a similarity search,
# And then sort by similarity to the given vector.
# Requires a valid vectorize integration.
sort={"$vectorize": "Text to vectorize"}

Sort example:

from astrapy import DataAPIClient
import astrapy
client = DataAPIClient("TOKEN")
database = client.get_database("API_ENDPOINT")
collection = database.my_collection

# Find up to five documents that have a "seq" field (no sort, arbitrary order):
seq_filter = {"seq": {"$exists": True}}
for doc in collection.find(seq_filter, projection={"seq": True}, limit=5):
    print(doc["seq"])
# will print e.g.:
#   37
#   35
#   10
#   36
#   27

# Sort by "seq" in descending order and collect the document IDs:
cursor1 = collection.find(
    {},
    limit=4,
    sort={"seq": astrapy.constants.SortDocuments.DESCENDING},
)
print([doc["_id"] for doc in cursor1])
# prints e.g.: ['97e85f81-...', '1581efe4-...', '...', '...']

# Get the distinct "seq" values from a cursor limited to three documents:
cursor2 = collection.find({}, limit=3)
print(cursor2.distinct("seq"))
# prints e.g.: [37, 35, 10]

# Insert documents with vectors, and then run a vector similarity (ANN) search:
collection.insert_many([
    {"tag": "A", "$vector": [4, 5]},
    {"tag": "B", "$vector": [3, 4]},
    {"tag": "C", "$vector": [3, 2]},
    {"tag": "D", "$vector": [4, 1]},
    {"tag": "E", "$vector": [2, 5]},
])
ann_tags = [
    document["tag"]
    for document in collection.find(
        {},
        sort={"$vector": [3, 3]},
        limit=3,
    )
]
print(ann_tags)
# prints: ['A', 'B', 'C']
# (assuming the collection has metric VectorMetric.COSINE)
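The ranking in the vector search above can be checked offline with an exact cosine similarity computation. This sketch is a brute-force stand-in, not the approximate (ANN) search that the database performs, and it uses only the five example vectors; `cosine_similarity` is a helper defined here.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


documents = [
    {"tag": "A", "$vector": [4, 5]},
    {"tag": "B", "$vector": [3, 4]},
    {"tag": "C", "$vector": [3, 2]},
    {"tag": "D", "$vector": [4, 1]},
    {"tag": "E", "$vector": [2, 5]},
]
query = [3, 3]

# Rank by decreasing similarity to the query vector and keep the top three.
ranked = sorted(
    documents,
    key=lambda d: cosine_similarity(d["$vector"], query),
    reverse=True,
)
top_tags = [d["tag"] for d in ranked[:3]]
print(top_tags)  # ['A', 'B', 'C']
```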

In the TypeScript client, sort is weakly typed by default. For a more strongly typed alternative that also provides full autocomplete, see StrictSort<Schema>.

When no particular order is required:

{ sort: {} }  // (default when parameter not provided)

When sorting by a certain value in ascending/descending order:

{ sort: { field: +1 } }  // ascending
{ sort: { field: -1 } }  // descending

Be aware of the order when chaining multiple sorts. ES2015+ guarantees that string keys are kept in insertion order, so the sort criteria are applied in the order written. For example, when sorting first by a field and then by a specific subfield:

{ sort: { field: 1, subfield: 1 } }

You can use sort to perform a vector similarity (ANN) search:

// Use the specified vector,
// And then sort by similarity to the given vector.
{ sort: { $vector: [0.4, 0.15, -0.5] } }

// Generate a vector from a string,
// Run a similarity search,
// And then sort by similarity to the given vector.
// Requires a valid vectorize integration
{ sort: { $vectorize: "Text to vectorize" } }

Example:

import { DataAPIClient } from '@datastax/astra-db-ts';

// Reference an untyped collection
const client = new DataAPIClient('TOKEN');
const db = client.db('ENDPOINT', { keyspace: 'KEYSPACE' });
const collection = db.collection('COLLECTION');

(async function () {
  // Insert some documents
  await collection.insertMany([
    { name: 'Jane', age: 25, $vector: [1.0, 1.0, 1.0, 1.0, 1.0] },
    { name: 'Dave', age: 40, $vector: [0.4, 0.5, 0.6, 0.7, 0.8] },
    { name: 'Jack', age: 40, $vector: [0.1, 0.9, 0.0, 0.5, 0.7] },
  ]);

  // Sort by age ascending, then by name descending (Jane, Jack, Dave)
  const sorted1 = await collection.find({}, { sort: { age: 1, name: -1 } }).toArray();
  console.log(sorted1.map(d => d.name));

  // Sort by vector distance (Jane, Dave, Jack)
  const sorted2 = await collection.find({}, { sort: { $vector: [1, 1, 1, 1, 1] } }).toArray();
  console.log(sorted2.map(d => d.name));
})();

In the Java client, sort() operations are optional. Use them only when needed.

Be aware of the order when chaining multiple sorts:

Sort s1 = Sorts.ascending("field1");
Sort s2 = Sorts.descending("field2");
FindOptions.Builder.sort(s1, s2);

You can use sort to perform a vector similarity (ANN) search:

// Use the specified vector,
// And then sort by similarity to the given vector.
FindOptions.Builder
 .sort(new float[] {0.4f, 0.15f, -0.5f});

// Generate a vector from a string,
// Run a similarity search,
// And then sort by similarity to the given vector.
// Requires a valid vectorize integration
FindOptions.Builder
 .sort("Text to vectorize");

Example:

package com.datastax.astra.client.collection;

import com.datastax.astra.client.Collection;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.FindOptions;
import com.datastax.astra.client.model.Sort;
import com.datastax.astra.client.model.Sorts;

public class WorkingWithSorts {
    public static void main(String[] args) {
        // Given an existing collection
        Collection<Document> collection = new DataAPIClient("TOKEN")
                .getDatabase("API_ENDPOINT")
                .getCollection("COLLECTION_NAME");

        // Sort Clause for a vector
        Sorts.vector(new float[] {0.25f, 0.25f, 0.25f, 0.25f, 0.25f});

        // Sort Clause for other fields
        Sort s1 = Sorts.ascending("field1");
        Sort s2 = Sorts.descending("field2");

        // Build the sort clause
        new FindOptions().sort(s1, s2);

        // Adding vector
        new FindOptions().sort(new float[] {0.25f, 0.25f, 0.25f, 0.25f, 0.25f}, s1, s2);

    }
}

  • You can’t use the $vector and $vectorize sort clauses together.

  • Some combinations of arguments impose an implicit upper bound on the number of documents that are returned by the Data API:

    • Vector ANN searches can return no more than 1000 documents per search operation, regardless of the limit parameter.

    • If sort is ascending, descending, or unspecified, the Data API returns up to 20 documents at once. The returned documents are the top results across the whole collection based on the filter criteria. Pagination can occur if there are more than 20 matching documents, but, in some cases, the nextPageState is null regardless of the actual presence of additional results.

  • The search type and upper limit impact the response:

    • Vector search returns a single page of up to 1000 documents, unless you set a lower limit.

    • Searches without $vector or $vectorize return matching documents in batches of 20. Pagination occurs if there are more than 20 matching documents. For information about handling pagination, see Find documents using filter clauses.

  • If documents are added or removed after starting a find operation, paging behavior depends on database internals. There is no guarantee as to whether or not pagination will pick up such "real-time" changes in the data.
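The paging behavior described above can be sketched as a loop that follows nextPageState until it is null. This is an illustration under stated assumptions: `find_all` and `fake_send` are hypothetical names, `send` stands for any function that posts a command and returns the parsed JSON response, and passing the state back as options.pageState is an assumption to verify against the Data API reference.

```python
from typing import Callable, Iterator, Optional


def find_all(send: Callable[[dict], dict], filter_clause: dict) -> Iterator[dict]:
    """Yield every matching document by following nextPageState."""
    page_state: Optional[str] = None
    while True:
        # Assumption: the previous nextPageState is sent back as options.pageState.
        options = {"pageState": page_state} if page_state else {}
        response = send({"find": {"filter": filter_clause, "options": options}})
        data = response["data"]
        yield from data["documents"]
        page_state = data.get("nextPageState")
        if not page_state:
            break


# In-memory stand-in that serves 45 documents in batches of 20.
_docs = [{"_id": str(n)} for n in range(45)]


def fake_send(command: dict) -> dict:
    start = int(command["find"]["options"].get("pageState", 0))
    end = start + 20
    next_state = str(end) if end < len(_docs) else None
    return {"data": {"documents": _docs[start:end], "nextPageState": next_state}}


all_docs = list(find_all(fake_send, {}))
print(len(all_docs))  # 45
```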

When you run a Find command, you can append nested JSON objects that define the search criteria (sort or filter), projection, and other options.

If no particular order is required, you can search with an empty filter:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE/ASTRA_DB_COLLECTION" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "find": {
    "filter": {}
  }
}' | jq

This example finds documents by performing a vector similarity search:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE/ASTRA_DB_COLLECTION" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "find": {
    "sort": { "$vector": [0.15, 0.1, 0.1, 0.35, 0.55] },
    "projection": { "$vector": 1 },
    "options": {
      "includeSimilarity": true,
      "includeSortVector": false,
      "limit": 100
    }
  }
}' | jq

This request does the following:

  • sort compares the given vector, [0.15, 0.1, 0.1, 0.35, 0.55], against the vectors for documents in the collection, and then returns results ranked by similarity. The $vector key is a reserved property name for storing vector data.

  • projection requests that the response return the $vector for each document.

  • options.includeSimilarity requests that the response include the $similarity key with the numeric similarity score, which represents the closeness of the sort vector and the document’s vector.

  • options.includeSortVector is set to false to exclude the sortVector from the response. This option is only relevant if sort includes either $vector or $vectorize. Set it to true if you want the response to include the sort vector, which is particularly useful with $vectorize because you don’t know the sort vector in advance.

  • options.limit specifies the maximum number of documents to return. This example limits the entire list of matching documents to 100 documents or fewer.

    Vector search returns a single page of up to 1000 documents, unless you set a lower limit. Other searches (without $vector or $vectorize) return matching documents in batches of 20. Pagination occurs if there are more than 20 matching documents. For information about handling pagination, see Find documents using filter clauses.

The projection and options settings can make the response more focused and potentially reduce the amount of data transferred.
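For readers who assemble the request body in code, the same find command can be built as a plain Python dictionary. This is a sketch only: build_vector_find and MAX_VECTOR_RESULTS are hypothetical names, and clamping the limit on the client side is an illustrative choice based on the 1000-document ceiling stated in this section.

```python
import json
from typing import Any, Dict, Optional, Sequence

# Ceiling for vector search results, as stated in this section.
MAX_VECTOR_RESULTS = 1000


def build_vector_find(
    vector: Sequence[float],
    limit: Optional[int] = None,
    include_similarity: bool = True,
    projection: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the JSON body for a find command sorted by vector similarity."""
    options: Dict[str, Any] = {"includeSimilarity": include_similarity}
    if limit is not None:
        # Requesting more than the ceiling has no effect, so clamp it here.
        options["limit"] = min(limit, MAX_VECTOR_RESULTS)
    find: Dict[str, Any] = {"sort": {"$vector": list(vector)}, "options": options}
    if projection:
        find["projection"] = projection
    return {"find": find}


payload = build_vector_find(
    [0.15, 0.1, 0.1, 0.35, 0.55],
    limit=100,
    projection={"$vector": 1},
)
print(json.dumps(payload, indent=2))
```

Serializing with json.dumps also guards against hand-written JSON mistakes such as trailing or missing commas.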

Response
{
  "data": {
    "documents": [
      {
        "$similarity": 1,
        "$vector": [
          0.15,
          0.1,
          0.1,
          0.35,
          0.55
        ],
        "_id": "3"
      },
      {
        "$similarity": 0.9953563,
        "$vector": [
          0.15,
          0.17,
          0.15,
          0.43,
          0.55
        ],
        "_id": "18"
      },
      {
        "$similarity": 0.9732053,
        "$vector": [
          0.21,
          0.22,
          0.33,
          0.44,
          0.53
        ],
        "_id": "21"
      }
    ],
    "nextPageState": null
  }
}
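A response of this shape can be reduced to the values an application usually needs. The following sketch parses a shortened copy of the response above (the $vector arrays are omitted for brevity) and extracts each document's _id and $similarity score. The variable names are illustrative.

```python
import json

# Shortened copy of the response shown above, without the $vector arrays.
response_text = """
{
  "data": {
    "documents": [
      { "$similarity": 1, "_id": "3" },
      { "$similarity": 0.9953563, "_id": "18" },
      { "$similarity": 0.9732053, "_id": "21" }
    ],
    "nextPageState": null
  }
}
"""

response = json.loads(response_text)
data = response["data"]

# Documents arrive ranked by similarity, so the list order is the ranking.
ranked = [(doc["_id"], doc["$similarity"]) for doc in data["documents"]]

# JSON null becomes Python None, so this is False for the response above.
has_next_page = data["nextPageState"] is not None

for doc_id, score in ranked:
    print(doc_id, score)
```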

Projection clauses

Certain document operations, such as findOne, find, findOneAndUpdate, findOneAndReplace, and findOneAndDelete, support a projection option that specifies which part of a document to return. Typically, the projection specifies which fields to include or exclude.

If projection is empty or unspecified, the Data API applies the default projection. For documents, the default projection includes, at minimum, the document identifier (_id) and all regular fields, which are fields not prefixed by a dollar sign ($).

If you specify a projection, all special fields, such as _id, $vector, and $vectorize, have specific inclusion and exclusion defaults that you can override individually. However, for regular fields, the projection must either include or exclude those fields. The projection can’t define a mix of included and excluded regular fields.

If a projection includes fields that don’t exist in a returned document, then those fields are ignored for that document.

In order to optimize the response size and improve read performance, DataStax recommends always providing an explicit projection tailored to the needs of the application.

If an application relies on the presence of $vector, or other special fields, in the returned documents, make sure the projection explicitly includes that field.

A quick, but possibly suboptimal, way to ensure the presence of special fields is to use the wildcard projection { "*": true }.

Projection syntax

A projection is expressed as a mapping of field names to boolean values.

Use a true mapping to include only the specified fields. For example, the following true mapping returns the document ID, field1, and field2:

{ "_id": true, "field1": true, "field2": true }

Alternatively, use a false mapping to exclude the specified fields. All other non-excluded fields are returned.

{ "field1": false, "field2": false }

The values in a projection map can be objects, booleans, decimals, or integers, but the Data API ultimately evaluates all of these as booleans.

For example, the following projection evaluates to true (include) for all four fields:

{ "field1": true, "field2": 1, "field3": 90.0, "field4": { "keep": "yes!" } }

Whereas this projection evaluates to false (exclude) for all four fields:

{ "field1": false, "field2": 0, "field3": 0.0, "field4": {} }

Passing null-like types (such as {}, null or 0) for the whole projection mapping is equivalent to omitting projection.
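For the value types listed above, the boolean evaluation happens to line up with Python's own truthiness rules. The sketch below emulates that evaluation locally for illustration. It is not the Data API's implementation, and is_included is a hypothetical helper name.

```python
from typing import Any


def is_included(value: Any) -> bool:
    """Emulate how a projection value is read as include (True) or exclude (False)."""
    # Python truthiness matches the examples above: non-zero numbers and
    # non-empty objects are true, while 0, 0.0, False, and {} are false.
    return bool(value)


include_all = {"field1": True, "field2": 1, "field3": 90.0, "field4": {"keep": "yes!"}}
exclude_all = {"field1": False, "field2": 0, "field3": 0.0, "field4": {}}

print(all(is_included(v) for v in include_all.values()))
print(any(is_included(v) for v in exclude_all.values()))
```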

Projecting regular and special fields

For regular fields, a projection can’t mix include and exclude projections. It can contain only true or only false values for regular fields. For example, {"field1": true, "field2": false} is an invalid projection that results in an API error.

However, the special fields _id, $vector, and $vectorize have individual default inclusion and exclusion rules, regardless of the projection mapping. Unlike regular fields, you can set the projection values for special fields independently of regular fields:

  • The _id field is included by default. You can opt to exclude it in a true mapping, such as { "_id": false, "field1": true }.

  • The $vector and $vectorize fields are excluded by default. You can opt to include these in a false mapping, such as { "field1": false, "$vector": true }.

  • The $similarity key isn’t a document field, and you can’t use this key in a projection. The $similarity value is the result of a vector ANN search operation with $vector or $vectorize. Use the includeSimilarity parameter to control the presence of $similarity in the response.

Therefore, the following are all valid projections for regular and special fields:

{ "_id": true, "field1": true, "field2": true }
{ "_id": false, "field1": true, "field2": true }
{ "_id": false, "field1": false, "field2": false }
{ "_id": true, "field1": false, "field2": false }
{ "_id": true, "field1": true, "field2": true, "$vector": true }
{ "_id": true, "field1": true, "field2": true, "$vector": false }
{ "_id": false, "field1": true, "field2": true, "$vector": true }
{ "_id": false, "field1": true, "field2": true, "$vector": false }
{ "_id": false, "field1": false, "field2": false, "$vector": true }
{ "_id": false, "field1": false, "field2": false, "$vector": false }
{ "_id": true, "field1": false, "field2": false, "$vector": true }
{ "_id": true, "field1": false, "field2": false, "$vector": false }

The wildcard projection "*" represents the whole of the document. If you use this projection, it must be the only key in the projection.

If set to true ({ "*": true }), all fields are returned.

If set to false ({ "*": false }), no fields are returned, and each document is empty ({}).
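The rules in this section can be checked on the client side before a request is sent. The following sketch is a local approximation under the rules stated above, not the server's validation logic. SPECIAL_FIELDS and projection_problem are hypothetical names.

```python
from typing import Any, Dict, Optional

# Special fields whose inclusion can be set independently of regular fields.
SPECIAL_FIELDS = {"_id", "$vector", "$vectorize"}


def projection_problem(projection: Dict[str, Any]) -> Optional[str]:
    """Return a description of the first rule violation, or None if none is found."""
    if "*" in projection:
        if len(projection) > 1:
            return "the wildcard must be the only key in the projection"
        return None
    # Collect the distinct include/exclude flags used for regular fields.
    regular_flags = {
        bool(flag) for key, flag in projection.items() if key not in SPECIAL_FIELDS
    }
    if len(regular_flags) > 1:
        return "regular fields can't mix inclusion and exclusion"
    return None


print(projection_problem({"_id": False, "field1": True, "field2": True}))
print(projection_problem({"field1": True, "field2": False}))
print(projection_problem({"*": True, "field1": True}))
```

The first call prints None because only the special _id field differs from the regular fields. The other two calls print the matching rule violation.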

Projecting arrays and nested objects

For array fields, you can use a $slice to specify which elements of the array to return. Use one of the following formats:

// Return the first two elements
{ "arr": { "$slice": 2 } }

// Return the last two elements
{ "arr": { "$slice": -2 } }

// Skip 4 elements (from 0th index), return the next 2
{ "arr": { "$slice": [4, 2] } }

// Skip backward 4 elements (from the end), return next 2 elements (forward)
{ "arr": { "$slice": [-4, 2] } }
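To make the four $slice formats concrete, the sketch below applies the same rules to a local Python list. It only illustrates the semantics described in the comments above: apply_slice is a hypothetical helper, and edge cases such as out-of-range values may behave differently on the server.

```python
from typing import Any, List, Sequence, Union


def apply_slice(items: List[Any], spec: Union[int, Sequence[int]]) -> List[Any]:
    """Emulate a $slice projection on a local list."""
    if isinstance(spec, int):
        # Positive: first n elements. Negative: last n elements.
        return items[:spec] if spec >= 0 else items[spec:]
    skip, count = spec
    # A negative skip counts backward from the end of the array.
    start = skip if skip >= 0 else max(len(items) + skip, 0)
    return items[start:start + count]


arr = list(range(10))  # [0, 1, 2, ..., 9]
print(apply_slice(arr, 2))        # [0, 1]
print(apply_slice(arr, -2))       # [8, 9]
print(apply_slice(arr, [4, 2]))   # [4, 5]
print(apply_slice(arr, [-4, 2]))  # [6, 7]
```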

If a projection refers to a nested field, the keys in the subdocument are included or excluded as requested. If you exclude all keys of an existing subdocument, then the document is returned with the subdocument present and an empty nested object.

Examples of nested document projections

Given the following document:

{
  "_id": "z",
  "a": {
    "a1": 10,
    "a2": 20
  }
}

The results of various projections are as follows:

  • { "a": true } returns { "_id": "z", "a": { "a1": 10, "a2": 20 } }

  • { "a.a1": false } returns { "_id": "z", "a": { "a2": 20 } }

  • { "a.a1": true } returns { "_id": "z", "a": { "a1": 10 } }

  • { "a.a1": false, "a.a2": false } returns { "_id": "z", "a": {} }

  • { "*": false } returns {}
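The results above can be reproduced with a small local emulation. This is a simplified sketch for illustration, not the Data API's implementation. It handles only regular fields, dotted paths, _id, and the wildcard, and it ignores $vector, $vectorize, and $slice. The function names are hypothetical.

```python
import copy
from typing import Any, Dict, List


def _copy_path(source: Any, target: Dict[str, Any], parts: List[str]) -> None:
    """Copy the value at a dotted path from source into target, if it exists."""
    head, rest = parts[0], parts[1:]
    if not isinstance(source, dict) or head not in source:
        return  # Fields missing from the document are ignored.
    if not rest:
        target[head] = copy.deepcopy(source[head])
    elif isinstance(source[head], dict):
        _copy_path(source[head], target.setdefault(head, {}), rest)


def _delete_path(target: Any, parts: List[str]) -> None:
    """Remove the value at a dotted path from target, if it exists."""
    head, rest = parts[0], parts[1:]
    if not isinstance(target, dict) or head not in target:
        return
    if not rest:
        del target[head]
    else:
        _delete_path(target[head], rest)


def project(doc: Dict[str, Any], projection: Dict[str, Any]) -> Dict[str, Any]:
    """Emulate a projection on one document (regular fields and _id only)."""
    if "*" in projection:
        return copy.deepcopy(doc) if projection["*"] else {}
    paths = {key: bool(flag) for key, flag in projection.items() if key != "_id"}
    if any(paths.values()):
        # Inclusion mode: start empty and copy the requested paths.
        result: Dict[str, Any] = {}
        for path in paths:
            _copy_path(doc, result, path.split("."))
    else:
        # Exclusion mode: start from a full copy and remove the listed paths.
        result = copy.deepcopy(doc)
        result.pop("_id", None)
        for path in paths:
            _delete_path(result, path.split("."))
    # _id is included by default unless the projection turns it off.
    if projection.get("_id", True) and "_id" in doc:
        result = {"_id": doc["_id"], **result}
    return result


document = {"_id": "z", "a": {"a1": 10, "a2": 20}}
print(project(document, {"a.a1": True}))
print(project(document, {"a.a1": False, "a.a2": False}))
```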

Referencing overlapping paths or subpaths in a projection can create conflicting clauses and return an API error. For example, this projection is invalid:

// Invalid:
{ "a.a1": true, "a": true }
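A client can detect this kind of conflict before sending the request. The sketch below reports every pair of keys where one path is a parent of another. It is a local check based on the rule described above, and overlapping_paths is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple


def overlapping_paths(projection: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs of projection keys that overlap."""
    keys = sorted(projection)
    conflicts = []
    for index, parent in enumerate(keys):
        for child in keys[index + 1:]:
            # "a" overlaps "a.a1", but not an unrelated key such as "ab".
            if child.startswith(parent + "."):
                conflicts.append((parent, child))
    return conflicts


print(overlapping_paths({"a.a1": True, "a": True}))     # [('a', 'a.a1')]
print(overlapping_paths({"a.a1": True, "a.a2": True}))  # []
```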

Projection examples by language

  • Python

  • TypeScript

  • Java

  • curl

For the Python client, the projection can be any of the following:

  • A dictionary (Dict[str, Any]) to include specific fields in the response, like {field_name: True}.

  • A dictionary (Dict[str, Any]) to exclude specific fields from the response, like {field_name: False}.

  • A list or other iterable over key names that are implied to be included in the projection.

The following two projections are equivalent:

document = collection.find_one(
   {"_id": 101},
   projection={"name": True, "city": True},
)

document = collection.find_one(
   {"_id": 101},
   projection=["name", "city"],
)

For information about default projections and handling for special fields, see the preceding explanation of projection clauses.

The TypeScript client takes in an untyped Plain Old JavaScript Object (POJO) for the projection parameter. The client also offers a StrictProjection<Schema> type that provides full autocomplete and type checking for your document schema.

When specifying a projection, make sure that you handle the return type carefully. Consider type-casting.

import { StrictProjection } from '@datastax/astra-db-ts';

const doc = await collection.findOne({}, {
  projection: {
    'name': true,
    'address.city': true,
  },
});

interface MySchema {
  name: string,
  address: {
    city: string,
    state: string,
  },
}

// Separate example: the projection is checked against MySchema
const typedDoc = await collection.findOne({}, {
  projection: {
    'name': 1,
    'address.city': 1,
    // @ts-expect-error - `'address.car'` does not exist in type `StrictProjection<MySchema>`
    'address.car': 0,
    // @ts-expect-error - Type `{ $slice: number }` is not assignable to type `boolean | 0 | 1 | undefined`
    'address.state': { $slice: 3 }
  } satisfies StrictProjection<MySchema>,
});

For information about default projections and handling for special fields, see the preceding explanation of projection clauses.

To support the projection mechanism, the Java client has different Options classes that provide the projection method in the helpers. This method takes an array of Projection classes with the field name and a boolean flag indicating inclusion or exclusion.

Projection p1 = new Projection("field1", true);
Projection p2 = new Projection("field2", true);
FindOptions options1 = FindOptions.Builder.projection(p1, p2);

To simplify this syntax, you can use the Projections syntactic sugar:

FindOptions options2 = FindOptions.Builder
  .projection(Projections.include("field1", "field2"));

FindOptions options3 = FindOptions.Builder
  .projection(Projections.exclude("field1", "field2"));

The Projections class also provides a method to support $slice for array fields:

// {"arr": {"$slice": 2}}
Projection sliceOnlyStart = Projections.slice("arr", 2, null);

// {"arr": {"$slice": [-4, 2]}}
Projection sliceOnlyRange = Projections.slice("arr", -4, 2);

// And you can use them freely in the different builders
FindOptions options4 = FindOptions.Builder
  .projection(sliceOnlyStart);

For information about default projections and handling for special fields, see the preceding explanation of projection clauses.

In an HTTP request, include projection as a find parameter:

curl -sS -L -X POST "ASTRA_DB_API_ENDPOINT/api/json/v1/ASTRA_DB_KEYSPACE/ASTRA_DB_COLLECTION" \
--header "Token: ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "find": {
    "sort": { "$vector": [0.15, 0.1, 0.1, 0.35, 0.55] },
    "projection": { "$vector": true, "name": true, "city": true },
    "options": {
      "includeSimilarity": true,
      "includeSortVector": false,
      "limit": 100
    }
  }
}' | jq

For information about default projections and handling for special fields, see the preceding explanation of projection clauses.

Operators

The Data API provides query operators that you can use in filters to find, update, replace, and delete documents, and update operators that you can use in update clauses to modify documents:

  • Logical query

    • $and: Joins query clauses with a logical AND, returning the documents that match the conditions of both clauses.

    • $or: Joins query clauses with a logical OR, returning the documents that match the conditions of either clause.

    • $not: Returns documents that do not match the conditions of the filter clause.

  • Range query

    • $gt: Matches documents where the given property is greater than the specified value.

    • $gte: Matches documents where the given property is greater than or equal to the specified value.

    • $lt: Matches documents where the given property is less than the specified value.

    • $lte: Matches documents where the given property is less than or equal to the specified value.

  • Comparison query

    • $eq: Matches documents where the value of a property equals the specified value. This is the default when you do not specify an operator.

    • $ne: Matches documents where the value of a property does not equal the specified value.

    • $in: Matches one or more of an array of specified values. For example, "filter": { "FIELD_NAME": { "$in": [ "VALUE", "VALUE" ] } }.

      If you have only one value to match, an array is not necessary, such as { "$in": "VALUE" }.

      The $in operator also functions as a $contains operator. For example, a field containing the array [ 1, 2, 3 ] will match filters like { "$in": [ 2, 6 ] } or { "$in": 1 }.

    • $nin: Matches documents where the value of a property is not any of the values in the specified array.

  • Element query

    • $exists: Matches documents that have the specified property.

  • Array query

    • $all: Matches arrays that contain all elements in the specified array.

    • $size: Selects documents where the array has the specified number of elements.

  • Property update

    • $currentDate: Used in an update operation to set a property to the current date.

    • $inc: Increments the value of the property by the specified amount.

    • $min: Updates the property only if the specified value is less than the existing property value.

    • $max: Updates the property only if the specified value is greater than the existing property value.

    • $mul: Multiplies the value of a property in the document.

    • $rename: Renames the specified property in each matching document.

    • $set: Sets the value of a property in each matching document.

    • $setOnInsert: Sets the value of a property in the document if an upsert is performed.

    • $unset: Removes the specified property from each matching document.

  • Array update

    • $addToSet: Adds elements to the array only if they do not already exist in the array.

      You can use $each to append multiple items.

    • $pop: Removes the first or last item of the array, depending on the value of the operator. Use -1 to remove the first item. Use 1 to remove the last item.

    • $push: Adds or appends data to the end of the property value. If the value is not yet an array and the property has no value, this operator creates a one-element array that contains the given item. If the value is not yet an array and the property has a non-array value, this operator creates a two-element array that has the existing value as the first entry and the given item as the second entry.

      You can use $each and $position to modify where and how the data is added to the array.

    • $each: Modifies the $push or $addToSet operators to append multiple items in array updates.

    • $position: Modifies the $push operator to specify the position in the array to add elements.

      Use $position to add an element to a specific position in an array. $position is only valid with $push, and $each is required, even if you want to insert a single item at the specified position.

      For an example, see the curl tab for Find and update a document.
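The array update operators are easier to follow with concrete inputs and outputs. The following sketch emulates $push (with $each and $position), $pop, and $addToSet on a local dictionary, following only the behavior described above. It is not the server's implementation. The function names are hypothetical, and non-negative $position values are assumed.

```python
from typing import Any, Dict, List


def _as_list(doc: Dict[str, Any], field: str) -> List[Any]:
    """Read a property as a list: missing is [], a non-array value is wrapped."""
    if field not in doc:
        return []
    value = doc[field]
    return list(value) if isinstance(value, list) else [value]


def apply_push(doc: Dict[str, Any], field: str, value: Any) -> Dict[str, Any]:
    """Emulate $push, including the $each and $position modifiers."""
    position = None
    if isinstance(value, dict) and "$position" in value and "$each" not in value:
        raise ValueError("$position requires $each")
    if isinstance(value, dict) and "$each" in value:
        items = list(value["$each"])
        position = value.get("$position")
    else:
        items = [value]
    current = _as_list(doc, field)
    if position is None:
        current.extend(items)  # Default: append to the end.
    else:
        current[position:position] = items  # Insert at the given index.
    doc[field] = current
    return doc


def apply_pop(doc: Dict[str, Any], field: str, direction: int) -> Dict[str, Any]:
    """Emulate $pop: -1 removes the first item, 1 removes the last item."""
    current = _as_list(doc, field)
    if current:
        current.pop(0 if direction == -1 else -1)
    doc[field] = current
    return doc


def apply_add_to_set(doc: Dict[str, Any], field: str, value: Any) -> Dict[str, Any]:
    """Emulate $addToSet: add items only if they are not already present."""
    is_each = isinstance(value, dict) and "$each" in value
    items = list(value["$each"]) if is_each else [value]
    current = _as_list(doc, field)
    for item in items:
        if item not in current:
            current.append(item)
    doc[field] = current
    return doc


print(apply_push({"tags": "x"}, "tags", "a"))
print(apply_push({"tags": ["a", "c"]}, "tags", {"$each": ["b"], "$position": 1}))
print(apply_add_to_set({"n": [1, 2]}, "n", {"$each": [2, 3]}))
```

The first call shows the two-element array that $push creates from an existing non-array value.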

© 2024 DataStax | Privacy policy | Terms of use
