Integrate Hugging Face Dedicated as an embedding provider

Integrate Hugging Face Dedicated as an external embedding provider for Astra DB vectorize so that Astra DB Serverless can generate embeddings through Hugging Face Dedicated’s embeddings API.

Prerequisites

To integrate Hugging Face Dedicated as an external embedding provider, you need the following:

Create the Hugging Face Dedicated user access token

Log into your Hugging Face Dedicated account and create a new user access token with unrestricted access to the API. Make sure to copy the user access token to a secure location.

Don’t modify or delete the user access token in your Hugging Face Dedicated account after you’ve added it to Astra. This breaks the integration.

Add the Hugging Face Dedicated integration to your organization

Use the Astra Portal to add the Hugging Face Dedicated embedding provider integration to your Astra organization:

  1. In the Astra Portal navigation menu, click Integrations.

  2. In the All Integrations section, select Hugging Face Dedicated Embedding provider.

  3. Click Add integration.

  4. In the Add Integration dialog, do the following:

    1. Enter a unique User access token name.

      You can’t change user access token names. Make sure the name is meaningful and that it helps you identify your Hugging Face Dedicated user access token in Astra.

    2. Enter your Hugging Face Dedicated user access token.

    3. Under Add databases to scope, use the dropdown menu to select a Serverless (Vector) database that you want scoped to your Hugging Face Dedicated user access token.

      You can select up to 10 databases at a time. You can add more databases later if needed.

  5. Click Add Integration.

    The Hugging Face Dedicated integration switches to ACTIVE, and your user access token and scoped databases appear in the API keys section.

You can now select the Hugging Face Dedicated integration as an embedding generation method when creating new collections in a scoped database.

You can scope the same database to multiple user access tokens. This lets you select the most appropriate user access token for each collection.

To add another user access token with additional scopes, click Add API key and repeat the previous steps.

Add the Hugging Face Dedicated integration to a new collection

Before you can use the Hugging Face Dedicated integration to generate embeddings, you must add the integration to a new collection.

  • Astra Portal

  • Python

  • TypeScript

  • Java

Use the Astra Portal to add the Hugging Face Dedicated integration to a new collection:

  1. In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.

  2. Click Data Explorer.

  3. Optional: Use the Namespace dropdown to select the namespace where you want to create the collection. Otherwise, leave default_keyspace selected to create the collection in the default namespace.

  4. Click Create Collection.

  5. In the Create collection dialog, enter a name for the new collection in the Collection name field.

  6. Under Embedding generation method, select the Hugging Face Dedicated embedding provider integration.

    If the integration isn’t listed, go to Integrations > Hugging Face Dedicated and check that it shows ACTIVE and that your database is scoped to the user access token that you want to use for your collection.

  7. Complete the following fields:

    • User access token: The user access token that you want to use for your collection. This dropdown menu is only active if you’ve scoped your database to multiple Hugging Face Dedicated user access tokens.

    • Endpoint name: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is https://mtp1x7muf6qyn3yh.us-east-1.aws.endpoints.huggingface.cloud, the endpoint name is mtp1x7muf6qyn3yh.

    • Region name: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, us-east-1.

    • Cloud name: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, aws.

    • Embedding model: The model that you want to use to generate embeddings. The available models are: endpoint-defined-model. For Hugging Face Dedicated, the integration uses the model that you defined in your dedicated endpoint configuration.

    • Dimensions: The number of dimensions that you want the generated vectors to have. Most models automatically populate the Dimensions. You can edit this field if the model supports a range of dimensions or the embedding provider integration uses an endpoint-defined model. Your chosen embedding model must support the specified number of dimensions.

    • Similarity metric: The method you want to use to calculate vector similarities.

      The available metrics are:

  8. Click Create collection.

    If you get a Collection Limit Reached message, you’ll need to delete a collection before you can create a new one.

An empty collection appears in the list of collections. You can now load data into this collection.

Use the Python client to create a collection that uses the Hugging Face Dedicated integration.

Initialize the client

If you haven’t done so already, initialize the client before creating a collection:

import os
from astrapy import DataAPIClient
from astrapy.constants import VectorMetric
from astrapy.ids import UUID
from astrapy.info import CollectionVectorServiceOptions

# Initialize the client and get a "Database" object
client = DataAPIClient(os.environ["ASTRA_DB_APPLICATION_TOKEN"])
database = client.get_database(os.environ["ASTRA_DB_API_ENDPOINT"])
print(f"* Database: {database.info().name}\n")

Create a collection integrated with Hugging Face Dedicated:

collection = database.create_collection(
    "COLLECTION_NAME",
    metric=VectorMetric.COSINE,
    dimension=MODEL_DIMENSIONS,
    service=CollectionVectorServiceOptions(
        provider="huggingfaceDedicated",
        model_name="endpoint-defined-model",
        authentication={
            "providerKey": "API_KEY_NAME",
        },
        parameters={
            "endpointName": "ENDPOINT_NAME",
            "regionName": "REGION_NAME",
            "cloudName": "CLOUD_NAME",
        },
    ),
)
print(f"* Collection: {collection.full_name}\n")

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the Hugging Face Dedicated user access token that you want to use for your collection. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.

  • MODEL_NAME: The model to use to generate embeddings. For Hugging Face Dedicated, the only available model is endpoint-defined-model, which the example above already sets.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

  • ENDPOINT_NAME: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is https://mtp1x7muf6qyn3yh.us-east-1.aws.endpoints.huggingface.cloud, the endpoint name is mtp1x7muf6qyn3yh.

  • REGION_NAME: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, us-east-1.

  • CLOUD_NAME: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, aws.

The model name must be set to endpoint-defined-model because this integration uses the model specified in your dedicated endpoint configuration.
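
The three endpoint parameters can also be read off the endpoint URL instead of being copied by hand. The following is a minimal sketch, not part of the Astra or Hugging Face tooling: the helper name endpoint_parameters is hypothetical, and it assumes that every endpoint URL follows the <endpoint>.<region>.<cloud>.endpoints.huggingface.cloud pattern shown in the example above.

```python
from urllib.parse import urlparse

# Assumption: dedicated endpoint hosts always end with this suffix,
# as in the example URL from the list above.
EXPECTED_SUFFIX = ".endpoints.huggingface.cloud"


def endpoint_parameters(endpoint_url: str) -> dict:
    """Split an endpoint URL into endpointName, regionName, and cloudName.

    Hypothetical helper for illustration only.
    """
    host = urlparse(endpoint_url).hostname or ""
    if not host.endswith(EXPECTED_SUFFIX):
        raise ValueError(f"Unexpected endpoint host: {host!r}")

    # What remains looks like "<endpoint>.<region>.<cloud>".
    parts = host[: -len(EXPECTED_SUFFIX)].split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected <endpoint>.<region>.<cloud>, got {host!r}")

    endpoint_name, region_name, cloud_name = parts
    return {
        "endpointName": endpoint_name,
        "regionName": region_name,
        "cloudName": cloud_name,
    }


if __name__ == "__main__":
    url = "https://mtp1x7muf6qyn3yh.us-east-1.aws.endpoints.huggingface.cloud"
    print(endpoint_parameters(url))
```

The returned dictionary uses the same keys as the parameters argument in the create_collection call, so it could be passed there directly if the URL pattern assumption holds for your endpoint.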

Use the TypeScript client to create a collection that uses the Hugging Face Dedicated integration.

Initialize the client

If you haven’t done so already, initialize the client before creating a collection:

import { DataAPIClient, VectorDoc, UUID } from '@datastax/astra-db-ts';

const { ASTRA_DB_APPLICATION_TOKEN, ASTRA_DB_API_ENDPOINT } = process.env;

// Initialize the client and get a 'Db' object
const client = new DataAPIClient(ASTRA_DB_APPLICATION_TOKEN);
const db = client.db(ASTRA_DB_API_ENDPOINT);

console.log(`* Connected to DB ${db.id}`);

Create a collection integrated with Hugging Face Dedicated:

(async function () {
  const collection = await db.createCollection('COLLECTION_NAME', {
    vector: {
      dimension: MODEL_DIMENSIONS,
      service: {
        provider: 'huggingfaceDedicated',
        modelName: 'endpoint-defined-model',
        authentication: {
          providerKey: 'API_KEY_NAME',
        },
        parameters: {
          endpointName: 'ENDPOINT_NAME',
          regionName: 'REGION_NAME',
          cloudName: 'CLOUD_NAME',
        },
      },
    },
  });
  console.log(`* Created collection ${collection.namespace}.${collection.collectionName}`);

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the Hugging Face Dedicated user access token that you want to use for your collection. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.

  • MODEL_NAME: The model to use to generate embeddings. For Hugging Face Dedicated, the only available model is endpoint-defined-model, which the example above already sets.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

  • ENDPOINT_NAME: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is https://mtp1x7muf6qyn3yh.us-east-1.aws.endpoints.huggingface.cloud, the endpoint name is mtp1x7muf6qyn3yh.

  • REGION_NAME: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, us-east-1.

  • CLOUD_NAME: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, aws.

The model name must be set to endpoint-defined-model because this integration uses the model specified in your dedicated endpoint configuration.

Use the Java client to create a collection that uses the Hugging Face Dedicated integration.

Initialize the client

If you haven’t done so already, initialize the client before creating a collection:

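import com.datastax.astra.client.Collection;
import com.datastax.astra.client.DataAPIClient;
import com.datastax.astra.client.Database;
import com.datastax.astra.client.model.CollectionIdTypes;
import com.datastax.astra.client.model.CollectionOptions;
import com.datastax.astra.client.model.Document;
import com.datastax.astra.client.model.FindIterable;
import com.datastax.astra.client.model.FindOptions;
import com.datastax.astra.client.model.InsertManyResult;
import com.datastax.astra.client.model.SimilarityMetric;

// Standard-library types used by the snippets that follow
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import static com.datastax.astra.client.model.SimilarityMetric.COSINE;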

public class Quickstart {

  public static void main(String[] args) {
    // Loading Arguments
    String astraToken = System.getenv("ASTRA_DB_APPLICATION_TOKEN");
    String astraApiEndpoint = System.getenv("ASTRA_DB_API_ENDPOINT");

    // Initialize the client
    DataAPIClient client = new DataAPIClient(astraToken);
    System.out.println("Connected to AstraDB");

    Database db = client.getDatabase(astraApiEndpoint);
    System.out.println("Connected to Database.");

Create a collection integrated with Hugging Face Dedicated:

Map<String, Object> params = new HashMap<>();
params.put("endpointName", "ENDPOINT_NAME");
params.put("regionName", "REGION_NAME");
params.put("cloudName", "CLOUD_NAME");
CollectionOptions.CollectionOptionsBuilder builder = CollectionOptions
       .builder()
       .vectorSimilarity(SimilarityMetric.COSINE)
       .vectorDimension(MODEL_DIMENSIONS)
       .defaultIdType(CollectionIdTypes.UUID)
       .vectorize("huggingfaceDedicated", "endpoint-defined-model", "API_KEY_NAME", params);
Collection<Document> collection = db
       .createCollection("COLLECTION_NAME", builder.build());

Replace the following:

  • COLLECTION_NAME: The name for your collection.

  • API_KEY_NAME: The name of the Hugging Face Dedicated user access token that you want to use for your collection. Must be the name of an existing Hugging Face Dedicated user access token in the Astra Portal.

  • MODEL_NAME: The model to use to generate embeddings. For Hugging Face Dedicated, the only available model is endpoint-defined-model, which the example above already sets.

  • MODEL_DIMENSIONS: The number of dimensions that you want the generated vectors to have. Your embedding model must support the specified number of dimensions.

  • ENDPOINT_NAME: The programmatically generated name of your Hugging Face Dedicated endpoint. This is the first part of the endpoint URL. For example, if your endpoint URL is https://mtp1x7muf6qyn3yh.us-east-1.aws.endpoints.huggingface.cloud, the endpoint name is mtp1x7muf6qyn3yh.

  • REGION_NAME: The cloud provider region your Hugging Face Dedicated endpoint is deployed to. For example, us-east-1.

  • CLOUD_NAME: The cloud provider your Hugging Face Dedicated endpoint is deployed to. For example, aws.

The model name must be set to endpoint-defined-model because this integration uses the model specified in your dedicated endpoint configuration.

You can’t change your collection’s embedding provider after you’ve created it. To use a different embedding provider, you must create a new collection with a different embedding provider integration.

Load data using vectorize to auto-generate embeddings

Use the following methods to load vector data into a collection and use $vectorize to auto-generate embeddings.

  • Astra Portal

  • Python

  • TypeScript

  • Java

Use the Astra Portal to load a dataset from a JSON or a CSV file.

  1. In the Astra Portal, go to Databases, and then select a database that contains a collection that uses the Hugging Face Dedicated integration.

  2. Click Data Explorer.

  3. Select the collection that uses the Hugging Face Dedicated integration.

  4. Click Load Data.

  5. In the Load Data dialog, click Select File.

  6. Select the file on your computer that contains your dataset.

    Once the file upload is complete, the first ten rows of your data appear in the Data Preview section.

    If you get a Selected embedding does not match collection dimensions error, you need to create a new collection with vector dimensions that match your dataset.

  7. Use the Vector Field dropdown to select the field that you want to auto-generate embeddings for.

    The data importer will apply the top-level $vectorize key to the Vector Field, and automatically generate an embedding vector from its contents. The resulting documents in the collection will have the actual text stored in the special $vectorize field, and the resulting embedding stored in the $vector field. The original field name (such as reviewtext) isn’t preserved in the documents in the database.

  8. Optional: Configure field data types.

    In the Data Preview section, use the drop-down controls to change the data type for each field or column.

    The options are:

    • String

    • Number

    • Array

    • Object

    • Vector

    Data type selections you make in the Data Preview section only apply to the initial data that you load (with the exception of Vector, which permanently maps the field to the reserved key $vector). These selections aren’t fixed in the schema, and don’t apply to documents inserted later on. The same field can be a string in one document, and a number in another. You can also have different sets of fields in different documents in the same collection.

  9. Click Load Data.

Once your dataset has loaded, you can interact with it and do a vector search using the Data Explorer and the client APIs.
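
The field mapping described in the Vector Field step can be pictured with a few lines of Python. This is only an illustration of the resulting document shape, not the data importer's actual implementation: the function name to_vectorize_document and the sample row are hypothetical.

```python
def to_vectorize_document(row: dict, vector_field: str) -> dict:
    """Return a copy of row with the chosen field's text moved to "$vectorize".

    Hypothetical helper: it mirrors the documented outcome, where the
    original field name is not preserved in the stored document.
    """
    document = {key: value for key, value in row.items() if key != vector_field}
    document["$vectorize"] = row[vector_field]
    return document


if __name__ == "__main__":
    # Made-up row, as it might appear in a CSV or JSON dataset.
    row = {"reviewid": 1, "reviewtext": "Comfortable and stylish", "rating": 5}
    print(to_vectorize_document(row, "reviewtext"))
    # {'reviewid': 1, 'rating': 5, '$vectorize': 'Comfortable and stylish'}
```

The $vector field is absent from the sketch because the embedding is generated by the database when the document is stored.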

Use the Python client to load data and auto-generate embeddings with $vectorize:

# Insert documents into the collection.
# (UUIDs here are version 7.)
documents = [
    {
        "_id": UUID("018e65c9-df45-7913-89f8-175f28bd7f74"),
        "$vectorize": "Chat bot integrated sneakers that talk to you",
    },
    {
        "_id": UUID("018e65c9-e1b7-7048-a593-db452be1e4c2"),
        "$vectorize": "An AI quilt to help you sleep forever",
    },
    {
        "_id": UUID("018e65c9-e33d-749b-9386-e848739582f0"),
        "$vectorize": "A deep learning display that controls your mood",
    },
]
insertion_result = collection.insert_many(documents)
print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")
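
The comment in the Python example notes that the document IDs are version 7 UUIDs. As a rough sketch using only the standard-library uuid module, the following checks the version field and decodes the millisecond Unix timestamp that a version 7 UUID carries in its first 48 bits. The helper name uuid7_timestamp is hypothetical and isn't part of the client library.

```python
import uuid
from datetime import datetime, timezone


def uuid7_timestamp(value: str) -> datetime:
    """Decode the creation time embedded in a version 7 UUID string.

    Hypothetical helper for illustration only.
    """
    parsed = uuid.UUID(value)
    if parsed.version != 7:
        raise ValueError(f"Not a version 7 UUID: {value}")
    # The top 48 of the 128 bits hold milliseconds since the Unix epoch.
    millis = parsed.int >> 80
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    ids = [
        "018e65c9-df45-7913-89f8-175f28bd7f74",
        "018e65c9-e1b7-7048-a593-db452be1e4c2",
        "018e65c9-e33d-749b-9386-e848739582f0",
    ]
    for value in ids:
        print(value, uuid7_timestamp(value).isoformat())
```

Because the timestamp occupies the leading bits, version 7 UUIDs created later sort after earlier ones.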

Use the TypeScript client to load data and auto-generate embeddings with $vectorize:

  // Insert documents into the collection (using UUIDv7s)
  const documents = [
    {
      _id: new UUID('018e65c9-df45-7913-89f8-175f28bd7f74'),
      $vectorize: 'Chat bot integrated sneakers that talk to you',
    },
    {
      _id: new UUID('018e65c9-e1b7-7048-a593-db452be1e4c2'),
      $vectorize: 'An AI quilt to help you sleep forever',
    },
    {
      _id: new UUID('018e65c9-e33d-749b-9386-e848739582f0'),
      $vectorize: 'A deep learning display that controls your mood',
    },
  ];

  try {
    const inserted = await collection.insertMany(documents);
    console.log(`* Inserted ${inserted.insertedCount} items.`);
  } catch (e) {
    console.log('* Documents found on DB already. Let\'s move on!');
  }

Use the Java client to load data and auto-generate embeddings with $vectorize:

// Insert documents into the collection
InsertManyResult insertResult = collection.insertMany(
  new Document()
   .id(UUID.fromString("018e65c9-df45-7913-89f8-175f28bd7f74"))
   .vectorize("Chat bot integrated sneakers that talk to you"),
  new Document()
   .id(UUID.fromString("018e65c9-e1b7-7048-a593-db452be1e4c2"))
   .vectorize("An AI quilt to help you sleep forever"),
  new Document()
   .id(UUID.fromString("018e65c9-e33d-749b-9386-e848739582f0"))
   .vectorize("A deep learning display that controls your mood")
);
System.out.println("Insert " + insertResult.getInsertedIds().size() + " items.");

Search your data with vectorize

Perform a similarity search using text, rather than a vector.

  • Astra Portal

  • Python

  • TypeScript

  • Java

Use the Astra Portal to perform a search with vectorize:

  1. In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.

  2. Click Data Explorer.

  3. Select the Namespace and Collection that contain the data you want to view.

    Your data is displayed in the Collection Data section. The field you configured to auto-generate embeddings is notated with ($vectorize) in the column title. The $vector field contains the generated embeddings.

  4. Enter a text query into the Hybrid Search field, and then click Apply.

    Astra DB auto-generates a vector from the text query and performs a similarity search. The search uses the similarity metric that you chose when you created the collection.

  5. Optional: Use Add Filter to filter your search results by the other fields in the collection. For more information about using filters, see Add a metadata filter.

The Collection Data section updates to show the rows that match your search criteria.

Use the Python client to perform a search with vectorize:

# Perform a similarity search
query = "I'd like some talking shoes"
results = collection.find(
    sort={"$vectorize": query},
    limit=2,
    projection={"$vectorize": True},
    include_similarity=True,
)
print(f"Vector search results for '{query}':")
for document in results:
    print("    ", document)
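
Conceptually, a vectorize search embeds the query text and then ranks stored vectors by the collection's similarity metric. The following toy sketch shows that ranking step for the cosine metric in plain Python. It is not how Astra DB implements search: the three-dimensional vectors are made up for illustration, the function names are hypothetical, and the similarity score that the Data API reports may be scaled differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in documents.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Made-up vectors standing in for the embeddings a real model would produce.
toy_documents = {
    "Chat bot integrated sneakers that talk to you": [0.9, 0.1, 0.0],
    "An AI quilt to help you sleep forever": [0.1, 0.9, 0.1],
    "A deep learning display that controls your mood": [0.2, 0.2, 0.9],
}
# Made-up vector standing in for the embedded query text.
toy_query = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    for score, text in top_k(toy_query, toy_documents):
        print(f"{score:.3f}  {text}")
```

With limit=2 in the find call, only the two highest-ranked documents are returned, which corresponds to k=2 in the sketch.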

Use the TypeScript client to perform a search with vectorize:

  // Perform a similarity search
  const cursor = await collection.find({}, {
    sort: { $vectorize: 'shoes' },
    limit: 2,
    // Return the original text; the sample documents have no `text` field
    projection: { $vectorize: 1 },
    includeSimilarity: true,
  });

  console.log('* Search results:');
  for await (const doc of cursor) {
    console.log('  ', doc.$vectorize, doc.$similarity);
  }

Use the Java client to perform a search with vectorize:

// Perform a similarity search
FindOptions findOptions = new FindOptions()
       .limit(2)
       .includeSimilarity()
       .sort("I'd like some talking shoes");
FindIterable<Document> results = collection.find(findOptions);
for (Document document : results) {
   System.out.println("Document: " + document);
}

Manage database scopes

To manage database scopes for an existing Hugging Face Dedicated user access token:

  1. In the Astra Portal navigation menu, click Integrations, and then select Hugging Face Dedicated Embedding provider.

  2. In the API keys section, locate your user access token and then click the expander arrow to show the list of scoped databases.

  3. To remove a scoped database, click Delete.

    In the confirmation dialog, enter the Database name, and then click Remove scope.

  4. To add a database scope, click More, and then select Add database.

    Under Add databases to API key scope, use the dropdown menu to select a Serverless (Vector) database, and then click Add database.

Remove an existing user access token from the Hugging Face Dedicated integration

Before you remove a user access token, make sure that no active collections use it. Removing the token immediately disables $vectorize embedding generation for those collections.

You cannot assign a new user access token to an existing collection.

To remove an existing Hugging Face Dedicated user access token:

  1. In the Astra Portal navigation menu, click Integrations, and then select Hugging Face Dedicated Embedding provider.

  2. In the API keys section, locate the user access token you want to remove, click More, and then select Remove API key.

  3. In the confirmation dialog, enter the API key name, and then click Remove key.

Remove the Hugging Face Dedicated integration from your organization

To remove the Hugging Face Dedicated embedding provider integration from your Astra organization, remove all existing Hugging Face Dedicated user access tokens.


© 2024 DataStax | Privacy policy | Terms of use
