Work with rows

This Astra DB Serverless feature is currently in public preview. Development is ongoing, and the features and functionality are subject to change. Astra DB Serverless, and the use of such, is subject to the DataStax Preview Terms.

The Data API tables commands are available through HTTP and the clients.

If you use a client, tables commands are available only in client versions 2.0-preview or later. For more information, see Data API client upgrade guide.

A row represents a single record of data in a table in an Astra DB Serverless database.

You use the Table class to work with rows through the Data API clients. For instructions to get a Table object, see Work with tables.

For more information about the Data API and clients, see Get started with the Data API.

Column data types

The Data API doesn’t support all CQL type literals and formats. For example, it doesn’t support all methods for representing dates, times, and timestamps with numerical values. Only specific formats are accepted and returned.

The following sections describe the data types that the Data API supports for table columns as well as guidance for working with certain data types.

Number types

The Data API supports the following column types for numbers that are not reserved for representing dates and times.

The Java client accepts and returns native Java types. The TypeScript client accepts and returns JavaScript numbers for all number types except varint and decimal. For HTTP and the Python client, use the formats noted in the following list.

  • bigint: A long number accepted as digits with an optional leading -, ranging from -9223372036854775808 to 9223372036854775807 (-2^63 to 2^63-1). Only numbers that can be converted to an integer value without loss of information are valid.

    The Python client accepts and returns int. The TypeScript client accepts and returns JavaScript numbers.

  • decimal: A variable-precision decimal returned as a number, including optional - and exponent.

    You can store whole-number values, and those values can be returned without decimal notation. For example, 23 is a valid decimal, and it can be returned as 23 or 23.0.

    decimal does not support NaN, Infinity, or -Infinity values.

    The Python client accepts decimal.Decimal, int, and float, and it returns decimal.Decimal.

  • double: A 64-bit IEEE-754 floating point value that can be a number (including an exponent and +/- 0) or a quoted string containing NaN, Infinity, or -Infinity.

    The Python client accepts int and float, including float("NaN"), float("Infinity"), and float("-Infinity"), and it returns float. For more information, see Python client usage: Serdes options and custom data types.

  • float: A 32-bit IEEE-754 floating point value that can be a number (including an exponent and +/- 0) or a quoted string containing NaN, Infinity, or -Infinity.

    The Python client accepts int and float, including float("NaN"), float("Infinity"), and float("-Infinity"), and it returns float. For more information, see Python client usage: Serdes options and custom data types.

  • int: A 32-bit signed integer returned and accepted as a number, including optional - and exponent. Only numbers that can be converted to a 4-byte integer without loss of information are valid. Exponents must express a whole number.

    The Python client accepts and returns int.

  • smallint: A 2-byte integer returned and accepted as a number, including optional - and exponent. Only numbers that can be converted to a 2-byte integer without loss of information are valid. Exponents must express a whole number.

    The Python client accepts and returns int.

  • tinyint: A 1-byte integer returned and accepted as a number, including optional - and exponent. Only numbers that can be converted to a 1-byte integer without loss of information are valid. Exponents must express a whole number.

    The Python client accepts and returns int.

  • varint: An arbitrary-precision integer returned and accepted as a number, including optional - and exponent.

    The Python client accepts and returns int.

In Data API requests, the limit on maximum number length applies to the number of characters sent in the command’s JSON payload: a value’s literal representation in JSON can’t exceed 100 characters, such as a number that is 100 digits long. However, you can use exponents to pass values that represent more than 100 digits, such as 1e120000. In Data API responses, the Data API may return the fully expressed number, beyond 100 characters.
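
For example, the following sketch shows these formats in an insertOne payload, written as a Python dictionary. The column names are placeholders.

# Placeholder column names; values use the formats described above.
numeric_row = {
    "bigint_col": 9223372036854775807,  # must fit in a 64-bit signed integer
    "decimal_col": 23,                  # whole numbers are valid decimals
    "double_col": "NaN",                # NaN, Infinity, -Infinity as quoted strings
    "float_col": 1.5e3,                 # exponents are accepted
    "int_col": -2147483648,             # must fit in a 4-byte signed integer
}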

Date and time types

The Data API supports date, duration, time, and timestamp types for representing dates and times.

For more information about reading and writing these types in the Python client, see Python client usage: Serdes options and custom data types.

Date

A date is a date with no corresponding time value represented in RFC 3339 full-date format (YYYY-MM-DD) where the year is at least 4 digits. If the year is positive and greater than 4 digits, it must be prefixed by +, such as +YYYYYY-MM-DD.
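
For example, the following date literals are valid:

valid_dates = [
    "2024-01-10",     # YYYY-MM-DD with a 4-digit year
    "+123456-01-10",  # positive year longer than 4 digits, prefixed by +
]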

Duration

A duration is a value representing an amount of time. The preferred format is ISO-8601 interval format, such as P3Y6M4DT12H30M5S.

ISO-8601 duration format rules
  • Prefixed by P (positive duration) or -P (negative duration).

  • Contains a non-empty sequence of one or more sets of quantities and units in strict decreasing order.

  • Quantities are non-negative integers, with the exception of S (seconds), which can be a non-negative decimal.

  • Units must use the following values in the order given: Y, M, D, H, M, S. Note that the first M is months and the second M is minutes.

  • If the duration includes sub-day units (hours, minutes, seconds), you must include T before the first sub-day quantity, such as P10DT2H30M or PT2H30M.

  • Can skip units, but the units must still follow the given order, such as P1Y22DT45M30S.

  • Cannot repeat units.

  • Cannot be empty. For zero duration, you must specify at least one unit with a 0 quantity, such as P0Y.

For lossless inserts down to nanosecond precision, use the Apache Cassandra® notation for durations, such as 12y3mo1d or 12y3mo1d12h30m5s.

Cassandra duration notation rules
  • A non-empty sequence of one or more sets of quantities and units in strict decreasing order.

  • Optionally prefixed by - (negative duration).

  • Quantities are non-negative integers of at least one digit.

  • Units must use the following case-insensitive values in the order given: y, mo, w, d, h, m, s, ms, us (or µs), ns.

  • Can skip units, but the units must still follow the given order, such as 1y22d45m30s.

  • Cannot repeat units.

  • Cannot be empty. For zero duration, you must specify at least one unit with a 0 quantity, such as 0s.
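
For example, the following sketch shows equivalent values in both notations. The variable names are placeholders.

# ISO-8601 interval format
iso_durations = ["P3Y6M4DT12H30M5S", "PT2H30M", "P0Y"]

# Cassandra duration notation (lossless down to nanosecond precision)
cassandra_durations = ["3y6mo4d12h30m5s", "2h30m", "0s"]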

Time

A time is a time of day without an associated date represented in RFC 3339 partial-time format including HH:MM:SS and optional nanoseconds (.nnnnnnnnn).

In responses, sub-seconds are expanded in groups of three digits as needed. For example, a time written as "12:34:56.7887" is subsequently returned as "12:34:56.788700".
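
For example, the following time literals are valid:

valid_times = [
    "12:34:56",         # HH:MM:SS
    "12:34:56.789",     # optional sub-second digits
    "12:34:56.788700",  # as returned for "12:34:56.7887"
]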

Timestamp

A timestamp is a date and time with millisecond precision as a string in complete RFC 3339 date-time format.

RFC 3339 date-time format rules
  • Include the following components in the order given: year, month, day, T, hour, minute, second, sub-seconds, offset. For example, 1984-01-10T12:01:23Z.

    Only the sub-seconds are optional. All other components are required. If a required component has no value, use zero-padding to achieve the required format.

  • The year must be at least 4 digits, optionally prefixed by -. If the year is positive and more than 4 digits, it must be prefixed by +.

  • Month, day, hour, and minute must be 2 digits. For values 0-9, you must include a leading 0, such as 01 for January.

  • T must separate the date from the time, as in 1984-01-10T12:01:23Z.

  • The offset can be Z (shorthand for +00:00) or an ISO-formatted offset consisting of + or - followed by HH:MM, such as +05:00 or -08:00. The offset hour and minute must be 2 digits.

  • Sub-seconds are optional and can include up to 3 digits (millisecond precision), defined as .nnn, such as 1984-01-10T12:01:23.456Z or 1984-01-10T12:01:23.4Z. Sub-seconds are always returned as 3 digits. For example, 1984-01-10T12:01:23.4Z is returned as 1984-01-10T12:01:23.400Z.

timestamp format examples
# Minimum expression
YYYY-MM-DDTHH:MM:SSZ
YYYY-MM-DDTHH:MM:SS+HH:MM
YYYY-MM-DDTHH:MM:SS-HH:MM

# Optional sub-seconds
YYYY-MM-DDTHH:MM:SS.nnnZ
YYYY-MM-DDTHH:MM:SS.nnn+HH:MM
YYYY-MM-DDTHH:MM:SS.nnn-HH:MM

# Optional negative sign before the year
-YYYY-MM-DDTHH:MM:SS.nnnZ
-YYYY-MM-DDTHH:MM:SS.nnn+HH:MM
-YYYY-MM-DDTHH:MM:SS.nnn-HH:MM

# Positive sign before the year if the year is positive and greater than 4 digits
+YYYYYY-MM-DDTHH:MM:SS.nnnZ
+YYYYYY-MM-DDTHH:MM:SS.nnn+HH:MM
+YYYYYY-MM-DDTHH:MM:SS.nnn-HH:MM

Limited timestamp access

The Data API supports the timestamp column type, but it doesn’t provide access to column timestamps. Additionally, the Data API doesn’t set timestamps for any mutation. If you require this timestamp information, DataStax recommends that you use CQL instead of the Data API.

UUID types

Unlike collections, which assign a default _id value to each document, the Data API does not assign default identifiers or default values to rows in tables. Each table’s primary key is the unique identifier for that table’s rows.

To store explicitly-typed UUID data in a table, the Data API fully supports the uuid type and partially supports the timeuuid type:

  • uuid: Fully supported by the Data API, and it accepts all UUID versions.

    uuid data is generally returned and accepted as quoted strings in UUID format, including dashes, such as "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx". The Python client accepts str and UUID (all of astrapy.ids.uuid*), and it returns astrapy.ids.UUID (alias for stdlib uuid.UUID). For an example, see the sketch after this list.

    objectId is not supported in tables. You can store such hexadecimal strings in tables, but they are not recognized or handled as a true objectId.

  • timeuuid: Partially supported by the Data API. It is a specific type of UUID that includes timestamp information, and timeuuid data is returned and accepted as a JSON string.

    With the Data API, you can read, write, update, and delete values in existing timeuuid columns, but you can’t use the Data API to create tables with timeuuid columns or add columns of type timeuuid to tables.

    If you need a timeuuid column in a table, then you must create or alter the table with CQL before using the Data API to manage the values in the timeuuid column. Alternatively, consider using the uuid type.
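
For example, here is a minimal sketch of writing and reading a uuid column with the Python client. The table and column names are hypothetical, and id_col is assumed to be the table’s primary key.

from uuid import uuid4

row_id = uuid4()

# Write a row keyed by a version-4 UUID.
my_table.insert_one({"id_col": row_id, "label": "example"})

# Read it back; the Python client returns the value as a UUID object.
row = my_table.find_one({"id_col": row_id})
print(row["id_col"])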

Vector type

The Data API supports the special vector type (vector<float,dimensions>) for storing vectors, such as embeddings, which you can then index and use for vector ANN searches.

In vector columns, you can store pre-generated vectors. You can also automatically generate vector embeddings from text strings.

When you create or alter a table, you can define multiple vector columns to store vectors with different dimensions or from different embedding models. However, you can run a vector search on only one column at a time.

Define a column to store pre-generated vectors

To store pre-generated vectors in a table, define a vector column with the dimension of the vectors. If you’re not sure what dimension to set, use the vector dimension your embeddings model produces.

The following examples are truncated to focus on the vector column.

  • Python

  • TypeScript

  • Java

  • curl

Define the vector column in your table definition using a fluent interface, CreateTableDefinition, or a plain dictionary.

  • With fluent interface

  • With CreateTableDefinition

  • With plain dictionary

table_definition = (
    CreateTableDefinition.builder()
    ...
    .add_vector_column("VECTOR_COLUMN_NAME", dimension=NUM_DIMENSIONS)
    ...
    .build()
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
table_definition = CreateTableDefinition(
    columns={
        ...
        "VECTOR_COLUMN_NAME": TableVectorColumnTypeDescriptor(
            dimension=NUM_DIMENSIONS
        ),
        ...
    },
    ...
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
table_definition = {
    "columns": {
        ...
        "VECTOR_COLUMN_NAME": {"type": "vector", "dimension": NUM_DIMENSIONS},
        ...
    },
    ...
}
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
const table = await db.createTable('TABLE_NAME', {
  definition: {
    columns: {
      VECTOR_COLUMN_NAME: {
        type: 'vector',
        dimension: NUM_DIMENSIONS,
      },
      ...
    },
    ...
  },
});
import com.datastax.astra.client.tables.Table;
import com.datastax.astra.client.tables.definition.TableDefinition;
import com.datastax.astra.client.tables.definition.columns.ColumnDefinitionVector;

int NUM_DIMENSIONS = 3;

TableDefinition tableDefinition = new TableDefinition()
 //...
 .addColumnVector("VECTOR_COLUMN_NAME", new ColumnDefinitionVector()
   .dimension(NUM_DIMENSIONS)
   .metric(COSINE))
 //...
;
Table myTable = database.createTable("TABLE_NAME", tableDefinition);
"columns": {
  "VECTOR_COLUMN_NAME": {
    "type": "vector",
    "dimension": NUM_DIMENSIONS
  }
}

Define a column to automatically generate vector embeddings

To automatically generate embeddings with vectorize, define a vector column with embedding provider service options. Astra DB stores the automatically-generated embeddings in this column.

  • Python

  • TypeScript

  • Java

  • curl

  • With fluent interface

  • With CreateTableDefinition

  • With plain dictionary

table_definition = (
    CreateTableDefinition.builder()
    ...
    .add_vector_column(
        "VECTOR_COLUMN_NAME",
        dimension=NUM_DIMENSIONS,
        service=VectorServiceOptions(
            provider="EMBEDDINGS_PROVIDER_NAME",
            model_name="MODEL_NAME",
            authentication={
                "providerKey": "ASTRA_KMS_API_KEY_NAME",
            },
        )
    )
    ...
    .build()
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
# CreateTableDefinition object:
table_definition = CreateTableDefinition(
    columns={
        ...
        "VECTOR_COLUMN_NAME": TableVectorColumnTypeDescriptor(
            dimension=NUM_DIMENSIONS,
            service=VectorServiceOptions(
                provider="EMBEDDINGS_PROVIDER_NAME",
                model_name="MODEL_NAME",
                authentication={
                    "providerKey": "ASTRA_KMS_API_KEY_NAME",
                },
            )
        ),
        ...
    },
    ...
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
table_definition = {
    "columns": {
        ...
        "VECTOR_COLUMN_NAME": {
            "type": "vector",
            "dimension": NUM_DIMENSIONS,
            "service": {
                "provider": "EMBEDDINGS_PROVIDER_NAME",
                "modelName": "MODEL_NAME",
                "authentication": {
                    "providerKey": "ASTRA_KMS_API_KEY_NAME",
                },
            },
        },
        ...
    },
    ...
}
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
const table = await db.createTable('TABLE_NAME', {
  definition: {
    columns: {
      VECTOR_COLUMN_NAME: {
        type: 'vector',
        dimension: NUM_DIMENSIONS,
        service: {
          provider: 'EMBEDDINGS_PROVIDER_NAME',
          modelName: 'MODEL_NAME',
          authentication: {
            providerKey: 'ASTRA_KMS_API_KEY_NAME',
          },
        },
      },
      ...
    },
    ...
  },
});
VectorServiceOptions vectorServiceOptions = new VectorServiceOptions()
  .provider("EMBEDDINGS_PROVIDER_NAME")
  .modelName("MODEL_NAME")
  .authentication(Map.of("providerKey", "ASTRA_KMS_API_KEY_NAME"));

ColumnDefinitionVector columnDefinition = new ColumnDefinitionVector()
  .dimension(3)
  .metric(COSINE)
  .service(vectorServiceOptions);

TableDefinition tableDefinition = new TableDefinition()
  //...
  .addColumnVector("VECTOR_COLUMN_NAME", columnDefinition);

Table myTable = database.createTable("TABLE_NAME", tableDefinition);
"columns": {
  "COLUMN_NAME": "DATA_TYPE",
  "COLUMN_NAME": "DATA_TYPE",
  "VECTOR_COLUMN_NAME": {
    "type": "vector",
    "dimension": NUM_DIMENSIONS,
    "service": {
      "provider": "EMBEDDINGS_PROVIDER_NAME",
      "modelName": "MODEL_NAME",
      "authentication": {
        "providerKey": "ASTRA_KMS_API_KEY_NAME"
      }
    }
  }
}
Vectorize parameters

At minimum, you must specify the embedding provider and model name. However, dimension is optional if the specified model name has a default dimension value, and some embedding providers require additional parameters. To find required parameters and accepted values, use findEmbeddingProviders or see the documentation for your embedding provider integration.

If a table has multiple vector columns with vectorize integrations, all columns must use the same embedding provider integration, including the same dimension, model name, credential, and other settings.

Vectorize authentication

Most embedding provider integrations require authentication through Astra DB KMS or header authentication:

  • Astra DB KMS (providerKey): Use credentials stored in Astra DB KMS to authenticate with your vectorize embedding provider. In your vector column definition, include an authentication object with providerKey set to the credential’s API Key name as given in Astra DB KMS. For more information, see Embedding provider authentication.

  • Header authentication: As an alternative to providerKey, you can omit the authentication object, and then provide the authentication key in an x-embedding-api-key header instead. Header authentication overrides Astra DB KMS if you set both.

    Over HTTP, specify this header literally, such as --header "x-embedding-api-key: $EMBEDDING_PROVIDER_API_KEY", and include it with every command that uses vectorize, including writes and vector search.

    With the clients, you include the respective header authentication parameter (embedding-api-key, embeddingApiKey, or embeddingAuthProvider) in the top-level create table or get table command parameters (not in the vector column definition). This attaches the embedding provider API key to the resulting Table instance, and then the client automatically attaches the x-embedding-api-key header with the specified key to any underlying HTTP request that requires vectorize authentication. For an example, see the sketch after this list.

  • No authentication: Astra-hosted embedding provider integrations don’t require authentication. However, your database must be in a supported region. For more information, see the documentation for the Astra-hosted embedding provider integrations.
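
For example, here is a minimal sketch of header authentication with the Python client. It assumes an existing table and an embedding provider API key stored in an environment variable.

import os

# The key is attached to the Table instance, and the client sends it in the
# x-embedding-api-key header on requests that use vectorize.
my_table = database.get_table(
    "TABLE_NAME",
    embedding_api_key=os.environ["EMBEDDING_PROVIDER_API_KEY"],
)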

You can use alterTable to add and drop entire vector columns or enable and disable embedding provider integrations on tables.

If you drop the embedding provider integration, the column retains the existing embeddings. You cannot automatically generate new embeddings unless you re-enable the embedding provider integration.

If you add an embedding provider integration to a vector column with existing embeddings, make sure the service options are compatible with the existing embeddings. This ensures accurate vector search results.
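
For illustration, the following sketch shows an assumed shape for an alterTable payload that enables an embedding provider integration on an existing vector column, written as a Python dict. Confirm the exact operation names and options in the alterTable reference.

# Assumed payload shape; confirm against the alterTable reference.
alter_table_payload = {
    "alterTable": {
        "operation": {
            "addVectorize": {
                "columns": {
                    "VECTOR_COLUMN_NAME": {
                        "provider": "EMBEDDINGS_PROVIDER_NAME",
                        "modelName": "MODEL_NAME",
                    }
                }
            }
        }
    }
}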

Write vector data

vector columns are required to perform vector searches and to create vector indexes.

  • Python

  • TypeScript

  • Java

  • curl

You can insert pre-generated vector data into a vector column by supplying either a list of float values or a DataAPIVector object. The end result on the database will be identical. However, the DataAPIVector values are binary-encoded before sending, which results in improved insertion performance, especially when writing multiple rows.

If the column has a vectorize integration, you can automatically generate an embedding from a string by including the string directly in the row.

# Insert a list of floats sending it as array:
my_table.insert_one({"vector_column": [0.1, -0.2, 0.3], ...})

# Insert a list of floats sending it as binary-encoded:
my_table.insert_one({"vector_column": DataAPIVector([0.1, -0.2, 0.3]), ...})

# Generate the embedding with vectorize
my_table.insert_one({"vector_column": "Text to vectorize", ...})

For additional control over the behavior of vector encoding, you can set the following attributes of the table APIOptions.serdes_options:

  • Setting unroll_iterables_to_lists to True means that any iterable over numbers is unrolled into a list prior to insertion. That means you can directly use numpy objects, generators, and similar in all cases where a vector is expected. This setting defaults to False as it incurs a slight performance cost.

  • Setting binary_encode_vectors to False disables binary-encoding of vectors altogether: all vectors are sent to the API as plain arrays of floats. This defaults to True for an improved insertion performance.

You can insert pre-generated vector data into a vector column by supplying either a list of numbers or a DataAPIVector object. The end result on the database will be identical. However, the DataAPIVector values are binary-encoded before sending, which results in improved insertion performance, especially when writing multiple rows.

Example using DataAPIVector (recommended):

// Using the explicit class constructor
await table.insertOne({ vectorColumn: new DataAPIVector([0.1, -0.2, 0.3]), ... });

// Using the vector function shorthand
await table.insertOne({ vectorColumn: vector([0.1, -0.2, 0.3]), ... });

You can create a vector from any DataAPIVectorLike object.

export type DataAPIVectorLike = DataAPIVector | number[] | { $binary: string } | Float32Array;

// Creating a DataAPIVector from a Float32Array
const vec = vector(Float32Array.of(.1, -.2, .3));

You can insert a number[] directly, but insertion performance will be decreased. You will also need to provide your own typing for the Table, since InferTableSchema assumes that you are using DataAPIVector objects.

When you read back the vector data, it is returned as a DataAPIVector object, unless you provide your own deserialization logic to convert it back to a number[].

// Define the table schema manually
interface TableSchema {
  vectorColumn?: number[] | null,
  ...
}
const table = await db.createTable<TableSchema>('TABLE_NAME', { ... });

// Insert as a list of floats
await table.insertOne({ vectorColumn: [0.1, -0.2, 0.3], ... });

If the column has a vectorize integration, you can automatically generate an embedding from a string by including the string directly in the row.

If you provide a service block in the vector column definition, InferTableSchema picks up the ability to pass a string.

// Insert using the vectorize feature
await table.insertOne({ vectorColumn: 'Text to vectorize', ... });

// You can still provide an explicit vector if you want
await table.insertOne({ vectorColumn: vector([0.1, -0.2, 0.3]), ... });

You can insert pre-generated vector data into a vector column by supplying either a list of floats or a DataAPIVector object. The end result on the database is the same. However, with the DataAPIVector object, the values are sent in binary-encoded format, which is more performant, especially when writing multiple rows.

Use a list of floats:

tableGames.insertOne(new Row()
  // primary key of the table
  .addText("match_id", "game2").addInt("round", 1)
  // add a vector
  .addVector("m_vector", new float[] {0.1f, 0.2f, 0.3f}));

(Recommended) Use DataAPIVector:

DataAPIVector vector
  = new DataAPIVector(new float[] {0.1f, 0.2f, 0.3f});

tableGames.insertOne(new Row()
  // primary key of the table
  .addText("match_id", "game1").addInt("round", 1)
  // add a vector with the DataAPIVector wrapper
  .addVector("m_vector", vector));
Underlying JSON representation for DataAPIVector

Because the DataAPIVector class uses binary-formatted serialization, the underlying HTTP request sends $binary in the JSON body. For example:

{
  "insertOne": {
    "document": {
      "match_id": "game1",
      "round": 1,
      "m_vector": {
        "$binary": "PczMzT5MzM0+mZma"
      }
    }
  }
}

For more information about DataAPIVector, see Java client usage.

If the column has a vectorize integration, you can automatically generate an embedding from a String by adding the string to the row directly:

// Assuming m_vector is a vector column with vectorize integration
tableGames.insertOne(new Row()
  // primary key of the table
  .addText("match_id", "game2").addInt("round", 1)
  // Add a vector as a string to generate vector embeddings from
  .addVectorize("m_vector", "Query text to vectorize"));

// Even if the column has a vectorize integration
// it is still valid to add embeddings directly
tableGames.insertOne(new Row()
  // primary key of the table
  .addText("match_id", "game3").addInt("round", 1)
  // add a vector
  .addVector("m_vector", new float[] {0.1f, 0.2f, 0.3f}));

You can insert pre-generated vector data into a vector column as an array or as a binary encoded value. Alternatively, if the column has a vectorize integration, you can automatically generate an embedding from a string.

# Insert vector data as an array
"vector_column": [ 0.1, -0.2, 0.3 ]

# Insert vector data with $binary
"vector_column": { "$binary": "PczMzb5MzM0+mZma" }

# Generate an embedding with vectorize
"vector_column": "Text to vectorize"

The $binary form is for writes only. Binary encoding can be more economical, in terms of space and resources, when sending vectors in JSON payloads. Once received, the Data API decodes the binary format, and then stores the resulting list of numbers in the table. When you read data from a vector column, the Data API returns the vector as a list of numbers, regardless of the initial write format.

Vector binary encodings specification

A d-dimensional vector is a list of d floating-point numbers that can be binary encoded.

To prepare for encoding, the list must be transformed into a sequence of bytes where each float is represented as four bytes in big-endian format. Then, the byte sequence is Base64-encoded, with = padding, if needed. For example, here are some vectors and their resulting Base64 encoded strings:

[0.1, -0.2, 0.3] = "PczMzb5MzM0+mZma"
[0.1, 0.2] = "PczMzT5MzM0="
[10, 10.5, 100, -91.19] = "QSAAAEEoAABCyAAAwrZhSA=="

Once encoded, you use $binary to pass the Base64 string to the Data API:

{ "$binary": "BASE64_STRING" }

You can use a script to encode your vectors, for example:

import base64
import struct

input_vector = [0.1, -0.2, 0.3]
d = len(input_vector)
pack_format = ">" + "f" * d
binary_encode = base64.b64encode(struct.pack(pack_format, *input_vector)).decode()

With vectorize, the table stores the resulting embedding only. The original text string is discarded, unless you manually write the value in another column in the same insert or update command.

If you enable vectorize on a vector column, you are not prevented from using pre-generated embeddings. When you write data to that column, you can use pre-generated embeddings or automatically generate embeddings. Likewise, when you run a vector search on that column, you can either provide a pre-generated query vector or automatically generate a query vector from a string.

Query and read vector data

A vector search retrieves rows that are most similar to a given vector. To run a vector search on a table, use a sort clause with an indexed vector column and a query vector. If your table has multiple vector columns, you can only sort on one vector column at a time.

You can provide a pre-generated query vector or, if the vector column has a vectorize integration, you can automatically generate a query vector from a string.

  • Python

  • TypeScript

  • Java

  • curl

# Run a vector search with the query vector encoded as array
my_table.find(sort={"vector_column": [0.1, -0.2, 0.3]}, ...)

# Run a vector search with the query vector binary-encoded
my_table.find(sort={"vector_column": DataAPIVector([0.1, -0.2, 0.3])}, ...)

# Run a vector search by generating the query embedding with vectorize:
my_table.find(sort={"vector_column": "Query text to vectorize"}, ...)

By default, the results are returned with DataAPIVector objects in their vector columns. If you set the table APIOptions.serdes_options.custom_datatypes_in_reading to False, all vectors are returned as plain lists of floats, regardless of the format in the underlying HTTP response. This setting also controls how the query vector itself is returned when you invoke find with include_sort_vector=True. For more information, see Python client usage: Serdes options and custom data types.

You can provide sort vectors in the same way that you write vectors:

// Run a vector search using a DataAPIVector
await table.find({}, { sort: { vectorColumn: vector([0.1, -0.2, 0.3]) } });

// Run a vector search using a plain list of floats
// You must provide your own typing or use an untyped table
await table.find({}, { sort: { vectorColumn: [0.1, -0.2, 0.3] } });

// Run a vector search using vectorize
await table.find({}, { sort: { vectorColumn: 'Query text to vectorize' } });

Regardless of how you provide vectors, the vector results are always returned as DataAPIVector objects. They can be converted to arrays lazily using vector.asArray().

If you want to avoid using DataAPIVector objects, you must provide your own custom serialization/deserialization logic for the Table instance. See Custom Ser/Des for more information.

This also applies to find operations that include the includeSortVector option:

// Run a vector search using vectorize
const cursor = await table.find({})
  .sort({ vectorColumn: 'Query text to vectorize' })
  .includeSortVector();

// The query vector is returned as a DataAPIVector object
const queryVector = await cursor.getSortVector();

To perform a vector search, use the Sort.vector(..,..) and Sort.vectorize(..,..) methods.

These methods take two arguments: a column name and a query vector.

If you omit the column name, the client uses the default of $vector or $vectorize, depending on the method. These are reserved fields for collections. With tables, it is important that you specify the name of your vector column in the first argument. This is because a table can contain multiple vector columns, and vector columns can have any valid name.

You can provide sort vectors to these methods in the same way that you write vectors:

// This example uses TableFindOptions, but you can use any sortable object

// Sort with DataAPIVector
DataAPIVector vector = new DataAPIVector(new float[] {0.1f, 0.2f, 0.3f});
new TableFindOptions()
  .sort(Sort.vector("m_vector", vector));

// Sort with float[]
new TableFindOptions()
  .sort(Sort.vector("m_vector", new float[] {0.1f, 0.2f, 0.3f}));

// Sort with vectorize
new TableFindOptions()
  .sort(Sort.vectorize("m_vector", "Text To Serialize"));
# Provide the query vector as an array
"sort": { "vector_column": [ 0.1, -0.2, 0.3 ] }

# Provide the query vector with $binary
"sort": { "vector_column": { "$binary": "PczMzb5MzM0+mZma" } }

# Generate a query vector with vectorize
"sort": { "vector_column": "Text to vectorize" }

The $binary form can be more economical, in terms of space and resources, when sending vectors in JSON payloads. Once received, the Data API decodes the binary format, and then uses the resulting list of numbers for the vector search.

The includeSortVector option returns the query vector as a list of numbers, regardless of the format supplied in the original command.
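
For example, here is a sketch of a find payload that requests the query vector back. It is written as a Python dict, and the sort value is a placeholder.

find_payload = {
    "find": {
        "sort": {"vector_column": "Query text to vectorize"},
        "options": {"includeSortVector": True},
    }
}
# The response returns the query vector as a list of numbers.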

For more information and examples, see Find rows reference.

BLOB type

The Data API supports the blob type.

Due to a known issue with filtering on blob, DataStax does not recommend using blob in primary keys.

  • Python

  • TypeScript

  • Java

  • curl

The Python client automatically handles binary data for tables.

Any bytes values sent to the Data API, such as rows to insert or search filter values, are automatically converted to the format expected by the Data API. Likewise, binary data received from the Data API is returned as a bytes value.

# Assume the table has primary key "primary_column" and a "bytes_column" BLOB column
bytes_val = b"=\xfb\xe7m>\xe9x\xd5?I\xfb\xe7"

my_table.insert_one({"primary_column": "row0", "bytes_column": bytes_val})
# Returns TableInsertOneResult(inserted_id={'primary_column': 'row0'}, ...) (shortened)

my_table.find_one({"primary_column": "row0"})["bytes_column"]
# Returns b'=\xfb\xe7m>\xe9x\xd5?I\xfb\xe7'

In the previous example, bytes_val, where bytes_val.hex() == "3dfbe76d3ee978d53f49fbe7", translates to the JSON representation {"$binary": "PfvnbT7peNU/Sfvn"}. For more information about the JSON representation, see the curl tab for BLOB.

The TypeScript client automatically handles binary data in a runtime-independent way, through the DataAPIBlob class:

// Using the explicit class constructor
await table.insertOne({ blobColumn: new DataAPIBlob([0x0, 0x1, 0x2]), ... });

// Using the `blob` function shorthand
await table.insertOne({ blobColumn: blob(Buffer.from([0x0, 0x1, 0x2])), ... });

You can create a blob from any DataAPIBlobLike object:

export type DataAPIBlobLike = DataAPIBlob | ArrayBuffer | Buffer | { $binary: string };

// Creating a DataAPIBlob from a base64 encoded string
const binary = blob({ $binary: 'ZXZlciBkcmVhbQ==' });

The same class is also returned when reading back the BLOB data:

const row = await table.findOne({ pkey: 'row0' });
console.log(row.blobColumn.asArrayBuffer());

The Java client automatically handles binary data for tables.

Arrays of bytes (byte[]) sent to the Data API, such as rows to insert or search filters, are automatically converted to the format expected by the Data API. Likewise, binary data received from the Data API is returned as a byte[].

byte[] bytesVal = {
 (byte) 0x3D, (byte) 0xFB, (byte) 0xE7, (byte) 0x6D,
 (byte) 0x3E, (byte) 0xE9, (byte) 0x78, (byte) 0xD5,
 (byte) 0x3F, (byte) 0x49, (byte) 0xFB, (byte) 0xE7
};

// Assume 'myTable' has TEXT-type primary key "primary_column" and "bytes_column" BLOB column
myTable.insertOne(new Row()
 .addText("primary_column", "row0")
 .addBlob("bytes_column", bytesVal));

myTable
 .findOne(Filters.eq("primary_column", "row0"))
 .ifPresent(row -> {
   byte[] bytesColumn = row.getBlob("bytes_column");
   System.out.println(Arrays.toString(bytesColumn));
   // prints: [61, -5, -25, 109, 62, -23, 120, -43, 63, 73, -5, -25]
 });

In the previous example, the byte array bytesVal translates to the JSON representation {"$binary": "PfvnbT7peNU/Sfvn"}. For more information about the JSON representation, see the curl tab for BLOB.

To write data to a blob column, use $binary with a Base64-encoded string representing the byte sequence you want to store:

"blob_column": { "$binary": "PfvnbT7peNU/Sfvn" }

Over HTTP, you can use $binary to write binary-encoded vectors to vector columns. However, writing binary-encoded vectors to blob columns does not treat that data as true vectors, such as for vector search. For more information about storing vector data and running vector search on tables, see Vector type.

For information and examples for encoding arrays and writing vectors with $binary, see the curl tab for Write vector data.

Other scalar types

  • ascii: A string of US-ASCII characters.

    The Python client accepts and returns str.

  • boolean

    The Python client accepts and returns bool.

  • inet: Accepts literal IPv4 and IPv6 values. Domain names are rejected for security reasons.

    The Python client accepts and returns ipaddress.IPv4Address and ipaddress.IPv6Address.

  • text: A UTF-8 encoded string.

    The Python client accepts and returns str.

  • varchar: A UTF-8 encoded string. This is an alias for text. However, varchar is partially supported by the Data API, whereas text is fully supported.

    With the Data API, you can read, write, update, and delete values in existing varchar columns, but you can’t use the Data API to create tables with varchar columns or add columns of type varchar to tables.

    If you need a varchar column in a table, then you must create or alter the table with CQL before using the Data API to manage the values in the varchar column. Alternatively, consider using the text type.

Map, list, and set types

The Data API supports map, list, and set types for table columns with the following limitations:

  • No nested maps, lists, or sets, such as map<string,map<string,string>>.

  • Maps, lists, and sets are not supported in primary keys, indexes, or filter clauses.

  • Maps can have only text or ascii keys.

  • Frozen maps, lists, and sets created in CQL (outside the Data API) aren’t accessible through the Data API. For more information, see Support for existing tables.

  • Map

  • List

  • Set

map<keyType,valueType> is a typed set of key-value pairs represented as a JSON object where the keys are unique. When written to a table, maps are internally ordered by their keys, and they are returned in that order.

To define a map column in a table, you specify the map type, as well as the keyType and valueType of the items in the map. The keyType can be only text or ascii. Other key types are not supported.

  • Python

  • TypeScript

  • Java

  • curl

Example using the fluent interface:

table_definition = (
    CreateTableDefinition.builder()
    ...
    .add_map_column(
        "map_column",
        ColumnType.TEXT,  # or "text"
        ColumnType.INT,  # or "int"
    )
    ...
    .build()
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)

Example using CreateTableDefinition:

table_definition = CreateTableDefinition(
    columns={
        ...
        "map_column": TableKeyValuedColumnTypeDescriptor(
            column_type="map",  # or TableKeyValuedColumnType.MAP
            key_type=ColumnType.TEXT,  # or "text"
            value_type=ColumnType.INT,  # or "int"
        ),
        ...
    },
    ...
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)

Example using a plain dictionary:

table_definition = {
    "columns": {
        ...
        "map_column": {
            "type": "map", "keyType": "text", "valueType": "int"
        },
        ...
    },
    ...
}
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
const table = await db.createTable('TABLE_NAME', {
  definition: {
    columns: {
      MAP_COLUMN_NAME: {
        type: 'map',
        keyType: 'text',
        valueType: 'int',
      },
      ...
    },
    ...
  },
});
TableDefinition def = new TableDefinition()
  .addColumnMap("map_column", ColumnTypes.TEXT, ColumnTypes.INT);
database
  .createTable("TABLE_NAME", def);
"map_column": {
  "type": "map",
  "keyType": "text",
  "valueType": "int"
}

When you write to a map column, you provide key-value pairs representing items in the map. Keys must match the defined keyType format, and values must use the defined valueType format.

  • Python

  • TypeScript

  • Java

  • curl

The Python client accepts both simple dictionaries and instances of DataAPIMap for writes.

# Insert a map as a plain dictionary
my_table.insert_one({
    "map_column": {"key1": 42, "key2": 298},
    ...
})

# Insert a map using a DataAPIMap, initialized with a dictionary
my_table.insert_one({
    "map_column": DataAPIMap({"key1": 42, "key2": 298}),
    ...
})

# Insert a map using a DataAPIMap, initialized with a list of pairs
my_table.insert_one({
    "map_column": DataAPIMap([("key1", 42), ("key2", 298)]),
    ...
})

When reading rows, the Python client returns maps as DataAPIMap objects. To return standard Python dictionaries, set the table APIOptions.serdes_options.custom_datatypes_in_reading to False. The default for this setting is True.

Map data is represented by native JavaScript Maps.

await table.insertOne({ mapColumn: new Map([['key1', 42], ['key2', 298]]), ... });

While you can use plain JavaScript objects when you insert map data, the data is read back as a native Map unless you use custom serialization/deserialization logic.

Map<String, Integer> myMap = new HashMap<>();
myMap.put("key1", 42);
myMap.put("key2", 298);
table.insertOne(new Row().addMap("map_column", myMap));
"map_column": {
  "key1": 42,
  "key2": 298
}

list<valueType> is a typed collection of ordered, non-unique values represented as a JSON array. When written to a table, list elements are ordered by their position in the list.

To define a list column in a table, you specify the list type and the valueType of the items in the list.

  • Python

  • TypeScript

  • Java

  • curl

Example using the fluent interface:

table_definition = (
    CreateTableDefinition.builder()
    ...
    .add_list_column(
        "list_column",
        ColumnType.TEXT,  # or "text"
    )
    ...
    .build()
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)

Example using CreateTableDefinition:

table_definition = CreateTableDefinition(
    columns={
        ...
        "list_column": TableValuedColumnTypeDescriptor(
            column_type="list",  # or TableValuedColumnType.LIST
            value_type=ColumnType.TEXT,  # or "text"
        ),
        ...
    },
    ...
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)

Example using a plain dictionary:

table_definition = {
    "columns": {
        ...
        "list_column": {
            "type": "list", "valueType": "text"
        },
        ...
    },
    ...
}
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
const table = await db.createTable('TABLE_NAME', {
  definition: {
    columns: {
      LIST_COLUMN_NAME: {
        type: 'list',
        valueType: 'text',
      },
      ...
    },
    ...
  },
});
TableDefinition def = new TableDefinition()
  .addColumnList("list_column", ColumnTypes.TEXT);
Table<Row> table = db.createTable("TABLE_NAME", def);
"list_column": {
  "type": "list",
  "valueType": "text"
}

When you write to a list column, you provide an array of values in the defined valueType format.

  • Python

  • TypeScript

  • Java

  • curl

my_table.insert_one({
    "list_column": ["Hello", "World", "!"],
    ...
})

List data is represented by native JavaScript arrays.

await table.insertOne({ listColumn: ['Hello', 'World', '!'], ... });
table.insertOne(new Row()
  .addList("list_column", List.of("Hello", "World", "!"))
);
"list_column": [ "Hello", "World", "!" ]

set<valueType> is a typed collection of unique values represented as a JSON array. When written to a table, sets are ordered by their values.

To define a set column in a table, you specify the set type and the valueType of the items in the set.

  • Python

  • TypeScript

  • Java

  • curl

Example using the fluent interface:

table_definition = (
    CreateTableDefinition.builder()
    ...
    .add_set_column(
        "set_column",
        ColumnType.INT,  # or "int"
    )
    ...
    .build()
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)

Example using CreateTableDefinition:

table_definition = CreateTableDefinition(
    columns={
        ...
        "set_column": TableValuedColumnTypeDescriptor(
            column_type="set",  # or TableValuedColumnType.SET
            value_type=ColumnType.INT,  # or "int"
        ),
        ...
    },
    ...
)
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)

Example using a plain dictionary:

table_definition = {
    "columns": {
        ...
        "set_column": {
            "type": "set", "valueType": "int"
        },
        ...
    },
    ...
}
my_table = database.create_table(
    "TABLE_NAME",
    definition=table_definition,
    ...
)
const table = await db.createTable('TABLE_NAME', {
  definition: {
    columns: {
      SET_COLUMN_NAME: {
        type: 'set',
        valueType: 'int',
      },
      ...
    },
    ...
  },
});
type TableSchema = InferTableSchema<typeof table>;
TableDefinition def = new TableDefinition()
  .addColumnSet("set_column", ColumnTypes.INT);
Table<Row> table = db.createTable("TABLE_NAME", def);
"set_column": {
  "type": "set",
  "valueType": "int"
}

When you write to a set column, you provide a set of values in the defined valueType format.

  • Python

  • TypeScript

  • Java

  • curl

The Python client accepts both Python sets and instances of DataAPISet.

# Insert a set as a plain Python set
my_table.insert_one({
    "set_column": {9, 8, 7},
    ...
})

# Insert a set using a DataAPISet, initialized with a plain set
my_table.insert_one({
    "set_column": DataAPISet({9, 8, 7}),
    ...
})

# Insert a set using a DataAPISet, initialized with a list of elements
my_table.insert_one({
    "set_column": DataAPISet([9, 8, 7]),
    ...
})

When reading rows, the Python client returns sets as DataAPISet objects, unless the table APIOptions.serdes_options.custom_datatypes_in_reading is False, in which case standard Python sets are used. The default for this setting is True.

Set data is represented by native JavaScript Sets.

await table.insertOne({ setColumn: new Set([9, 8, 8, 7]), ... });

While you can use plain JavaScript arrays when you insert set data, the data is read back as a native Set, unless you use custom serialization/deserialization logic.

table.insertOne(new Row()
  .addSet("set_column", Set.of(9,8,7))
);
"set_column": [ 9, 8, 7 ]

Unsupported types

Unsupported types, formats, and modifiers include:

  • User-defined types (UDTs)

  • Counters

  • Time To Live (TTL)

  • Column timestamps

  • Frozen

  • Static

  • Tuple

The following types are partially supported or supported with limitations: timeuuid and varchar. For details, see UUID types and Other scalar types.

When working with existing CQL tables that you created outside the Data API, some of these types are partially supported. For more information, see Migrate from CQL.

Mismatched and unsupported type errors

The Data API returns warnings and errors if you use an unsupported type or attempt to insert a value that doesn’t match the column’s defined type.

For example, if you attempt to insert a value that doesn’t match the column’s type, the Data API returns "INVALID_COLUMN_VALUES".

Similarly, the Data API returns an error if you attempt to perform an operation on a column that isn’t compatible with the specified operation. For example, if you attempt to filter on a vector column, the Data API returns "INVALID_FILTER_EXPRESSION".

Operators

The Data API provides query operators that you can use in filters to find rows, and update operators that you can use to update rows.

For examples, see the references for commands that use these operators, such as Find rows reference and Update rows reference.

Query (filter) operators

You can use filter operators on all supported types except map, list, set, and vector.

Comparison query (range) operators:

  • $gt: Matches rows where the given column’s value is greater than the specified value.

  • $gte: Matches rows where the given column’s value is greater than or equal to the specified value.

  • $lt: Matches rows where the given column’s value is less than the specified value.

  • $lte: Matches rows where the given column’s value is less than or equal to the specified value.

Comparison query (exact) operators:

  • $eq: Matches rows where a given column’s value equals the specified value. This is the default when you do not specify an operator. This is the only filter operator allowed in updateOne and deleteOne.

  • $ne: Matches rows where a given column’s value does not equal the specified value.

  • $in: Matches one or more of an array of specified values. For example, "filter": { "COLUMN_NAME": { "$in": [ "VALUE", "VALUE" ] } }. If you have only one value to match, an array is not necessary, such as { "$in": "VALUE" }.

  • $nin: Matches rows where a given column’s value is not in a given array of values.
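
For example, here is a sketch of a filter that combines these operators with the Python client. The column names and values are placeholders.

row_filter = {
    "age_col": {"$gte": 21, "$lt": 65},        # range match on one column
    "status_col": {"$in": ["active", "new"]},  # match any of the listed values
}
my_table.find(filter=row_filter)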

Update operators

You can use update operators on all supported types.

  • $set: Sets a given column’s value to the specified value.

  • $unset: Sets a given column’s value to null or the equivalent empty form, such as [] or {} for map, list, and set types.

$rename isn’t supported. Instead, use alterTable to add and drop columns.
