Integrate Glean with Astra DB Serverless


With the Glean platform integration, you can use information from Astra DB Serverless databases as a datasource for your Glean searches.

To do this, you use the Data API to push data from Astra DB through the Glean Indexing API to a custom datasource in Glean.

Use this integration if you need to query your Astra DB data alongside your other Glean datasources. This is recommended for use cases where your Astra DB data contains human-readable information that you want users to find in Glean search results. This integration is best for non-vector data, such as non-vector CSV data, that you can’t ingest through other Glean datasources.

You can build and run a script locally or use this guide’s Colab notebook.

For a complete script example, including the Prepare Glean and Prepare Astra DB steps, see the mini-demo-astradb-glean GitHub repository.

Prerequisites

This integration requires the following:

  • An Astra account with an active Serverless (Vector) database, or permission to create one.

  • A Glean account. Some steps require Glean administrator access.

  • Python 3.9 or later.

Prepare Astra DB

  1. If you don’t already have one, create a Serverless (Vector) database.

  2. Generate an application token with the Database Administrator role, and then get your database’s API endpoint in the form of https://DATABASE_ID-REGION.apps.astra.datastax.com. For more information, see Generate an application token for a database.
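    The endpoint's hostname encodes the database ID (a 36-character UUID) and the region. As a quick sanity check on the value you copied, you can parse them back out — a minimal sketch, assuming the standard endpoint format shown above:

    ```python
    from urllib.parse import urlparse

    def parse_astra_endpoint(api_endpoint: str) -> tuple[str, str]:
        """Split an Astra DB API endpoint into (database_id, region)."""
        # Hostname looks like DATABASE_ID-REGION.apps.astra.datastax.com
        subdomain = urlparse(api_endpoint).hostname.split(".")[0]
        # UUIDs are exactly 36 characters; the region follows the next hyphen.
        database_id, region = subdomain[:36], subdomain[37:]
        return database_id, region

    endpoint = "https://01234567-89ab-cdef-0123-456789abcdef-us-east-2.apps.astra.datastax.com"
    print(parse_astra_endpoint(endpoint))
    # → ('01234567-89ab-cdef-0123-456789abcdef', 'us-east-2')
    ```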

  3. Set the Astra DB environment variables for your integration script:

    .env
    ASTRA_DB_APPLICATION_TOKEN=TOKEN
    ASTRA_DB_API_ENDPOINT=API_ENDPOINT
    ASTRA_DB_COLLECTION_NAME=COLLECTION_NAME
    ASTRA_DB_KEYSPACE="default_keyspace"  # Optional

    Replace the following:

    • TOKEN: The application token that you generated in the previous step.

    • API_ENDPOINT: Your database's API endpoint that you retrieved in the previous step.

    • COLLECTION_NAME: The name of the collection in your database where you store the data that you want to push to Glean. This can be an existing collection or a collection that you create in the next step.

    • ASTRA_DB_KEYSPACE: Optional. If unset, the Data API uses the default keyspace.

  4. Create a collection in your database or use an existing collection. Make sure the collection’s name matches your ASTRA_DB_COLLECTION_NAME environment variable.

    The Data API requires a collection in a Serverless (Vector) database. Both vector and non-vector data can be saved in collections.

    The integration script shown on this page creates a collection and populates it with sample data. If you want to use an existing collection instead, adjust the script.

    Example: Replace collection creation

    To read from an existing collection, replace the following part of the script:

    # Create collection
    source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)
    print(
        f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
        f"{Fore.CYAN} is ready.{Style.RESET_ALL}"
    )

    with this:

    # Get an existing collection
    source_collection = database.get_collection(ASTRA_DB_COLLECTION_NAME)
    print(
        f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
        f"{Fore.CYAN} is ready.{Style.RESET_ALL}"
    )
  5. Insert data that you want Glean to index. This can be vector or non-vector data.

    If your collection contains vector data, Glean does not differentiate embeddings from other data. In the same way that Glean processes your other datasources, Glean indexes your Astra DB data, including embeddings, as text data.

    In the integration script shown on this page, the collection is populated with sample data from a standard dataset (of philosophical quotes).
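    Because Glean indexes whatever you push as text, it can help to control which fields end up in the indexed body. The following helper is a sketch (the function name and sample document are ours, not part of the integration script) that flattens a document's human-readable fields into plain text, skipping `_id` and reserved fields such as `$vector`:

    ```python
    def document_to_text(doc: dict) -> str:
        """Concatenate a document's readable fields into one plain-text body,
        skipping _id and Data API reserved fields (keys starting with "$")."""
        parts = []
        for key, value in doc.items():
            if key == "_id" or key.startswith("$"):
                continue
            if isinstance(value, list):
                value = ", ".join(str(v) for v in value)
            parts.append(f"{key}: {value}")
        return "\n".join(parts)

    quote_doc = {
        "_id": 16,
        "author": "aristotle",
        "quote": "Love is composed of a single soul inhabiting two bodies.",
        "tags": ["love", "ethics"],
        "$vector": [0.1, 0.2],
    }
    print(document_to_text(quote_doc))
    ```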

Prepare Glean

You must be a Glean administrator to complete some of the following steps. If you are not, ask your Glean administrator for assistance.

  1. Create a Glean custom datasource for your Astra DB Serverless database, and give it a distinctive name. You can do this by creating a custom app or by calling the Indexing API /adddatasource endpoint.

    In the integration script shown on this page, the datasource is created programmatically as part of the script. If you create the datasource another way, comment out that step in the script.
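    If you prefer to create the datasource with a plain HTTP call instead of the Python client, a minimal standard-library sketch follows. The camelCase JSON field names mirror the CustomDatasourceConfig fields used later in the script, but verify them against your Glean Indexing API documentation:

    ```python
    import json
    import urllib.request

    def build_datasource_payload(name: str, display_name: str, url_regex: str) -> dict:
        """Build the /adddatasource request body (camelCase field names,
        per the Glean Indexing API's JSON convention)."""
        return {
            "name": name,
            "displayName": display_name,
            "datasourceCategory": "PUBLISHED_CONTENT",
            "urlRegex": url_regex,
            "objectDefinitions": [
                {"name": "AstraVectorEntry", "docCategory": "PUBLISHED_CONTENT"}
            ],
        }

    def add_datasource(api_endpoint: str, api_token: str, payload: dict) -> int:
        """POST the datasource definition to the Glean Indexing API."""
        request = urllib.request.Request(
            f"{api_endpoint}/adddatasource",
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {api_token}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return response.status
    ```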

  2. Create a Glean Indexing API token.

  3. Get your customer name from your Glean API endpoint. The endpoint follows the format https://GLEAN_CUSTOMER_NAME-be.glean.com/api/index/v1.

  4. Set the Glean environment variables for your integration script:

    .env
    GLEAN_CUSTOMER=GLEAN_CUSTOMER_NAME
    GLEAN_DATASOURCE_NAME=GLEAN_DATASOURCE_NAME
    GLEAN_API_TOKEN=GLEAN_INDEXING_API_TOKEN

    You can append these lines to the .env file that you already prepared for the Astra DB credentials.

Create a Glean indexing script

The script shown on this page uses the Data API Python client, astrapy. If you prefer to use another client or plain HTTP requests to interact with the Data API, modify the script accordingly. See Astra DB Data API for more information.

  1. Ensure your Python version is 3.9 or higher and prepare a virtual environment:

    python3 -m venv my_virtual_env
  2. Activate the virtual environment and install the required packages:

    source my_virtual_env/bin/activate  # on Windows run: my_virtual_env\Scripts\activate
    
    pip install \
        "astrapy>=2.0,<3.0" \
        "datasets>=3.5,<4.0" \
        "python-dotenv>=1.0,<2.0" \
        "colorama>=0.4,<1.0" \
        "https://app.glean.com/meta/indexing_api_client.zip"
  3. Import the required dependencies into the script:

    This and the following steps show the parts of the indexing procedure. Assemble all of the parts into a single Python script before you run it.

    astra-glean-import-job.py
    import os
    
    from astrapy import DataAPIClient
    from colorama import Fore, Style
    from datasets import load_dataset
    from dotenv import load_dotenv
    
    import glean_indexing_api_client as indexing_api
    from glean_indexing_api_client.api import datasources_api, documents_api
    from glean_indexing_api_client.model.custom_datasource_config import (
        CustomDatasourceConfig,
    )
    from glean_indexing_api_client.model.object_definition import ObjectDefinition
    from glean_indexing_api_client.model.index_document_request import IndexDocumentRequest
    from glean_indexing_api_client.model.document_definition import DocumentDefinition
    from glean_indexing_api_client.model.content_definition import ContentDefinition
    from glean_indexing_api_client.model.document_permissions_definition import (
        DocumentPermissionsDefinition,
    )
  4. Load the required environment variables:

    astra-glean-import-job.py
    # Load environment variables from .env
    load_dotenv()
    
    ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
    ASTRA_DB_API_ENDPOINT = os.environ["ASTRA_DB_API_ENDPOINT"]
    ASTRA_DB_COLLECTION_NAME = os.environ["ASTRA_DB_COLLECTION_NAME"]
    ASTRA_DB_KEYSPACE = os.getenv("ASTRA_DB_KEYSPACE")
    
    GLEAN_API_TOKEN = os.environ["GLEAN_API_TOKEN"]
    GLEAN_CUSTOMER = os.environ["GLEAN_CUSTOMER"]
    GLEAN_DATASOURCE_NAME = os.environ["GLEAN_DATASOURCE_NAME"]
    
    
    print(f"{Fore.GREEN}============================={Style.RESET_ALL}")
    print(f"{Fore.GREEN} ASTRADB - GLEAN INTEGRATION {Style.RESET_ALL}")
    print(f"{Fore.GREEN}============================={Style.RESET_ALL}\n")
  5. Initialize Astra DB client and database:

    astra-glean-import-job.py
    # Initialize Astra DB client
    client = DataAPIClient(callers=[("glean", "1.0")])
    database = client.get_database(
        ASTRA_DB_API_ENDPOINT,
        token=ASTRA_DB_APPLICATION_TOKEN,
        keyspace=ASTRA_DB_KEYSPACE,
    )
    print(
        f"{Fore.CYAN}[ OK ] - Credentials are OK, your database name is "
        f"{Style.RESET_ALL}{database.name()}{Fore.CYAN}."
    )
  6. Create Astra DB collection:

    astra-glean-import-job.py
    # Create collection
    source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)
    print(
        f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
        f"{Fore.CYAN} is ready.{Style.RESET_ALL}"
    )
  7. Load a sample dataset and populate the Astra DB collection:

    astra-glean-import-job.py
    # Load philosophers dataset
    print(f"{Fore.CYAN}[INFO] - Downloading data from Hugging Face 🤗.{Style.RESET_ALL}")
    philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
    print(f"{Fore.CYAN}[ OK ] - Dataset loaded in memory.{Style.RESET_ALL}")
    print(f"{Fore.CYAN}[INFO] - Sample record: {Style.RESET_ALL}{philo_dataset[16]}")
    
    
    def load_to_astra_db(data_to_insert, collection):
        """Load all of the provided data into a collection."""
        def split_tags(t):
            return [tag for tag in (t or "").split(";") if tag]
    
        documents_to_insert = [
            {**item, "_id": index, "tags": split_tags(item["tags"])}
            for index, item in enumerate(data_to_insert)
        ]
        collection.insert_many(documents_to_insert)
    
    # Insert documents into Astra DB
    philo_count = len(philo_dataset)
    print(
        f"{Fore.CYAN}[INFO] - Inserting {philo_count} documents into Astra DB..."
        f"{Style.RESET_ALL}"
    )
    load_to_astra_db(philo_dataset, source_collection)
    print(f"{Fore.CYAN}[ OK ] - Insertion finished.{Style.RESET_ALL}")
  8. Initialize Glean API client:

    astra-glean-import-job.py
    # Setup Glean API
    GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1"
    print(
        f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:"
        f"{Style.RESET_ALL} {GLEAN_API_ENDPOINT}"
    )
    
    # Initialize Glean client
    configuration = indexing_api.Configuration(
        host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN
    )
    api_client = indexing_api.ApiClient(configuration)
    datasource_api = datasources_api.DatasourcesApi(api_client)
    print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")
  9. Create and register a new Glean datasource:

    astra-glean-import-job.py
    # Create and register datasource in Glean
    datasource_config = CustomDatasourceConfig(
        name=GLEAN_DATASOURCE_NAME,
        display_name="Astra DB Collection DataSource",
        datasource_category="PUBLISHED_CONTENT",
        url_regex=f"^{ASTRA_DB_API_ENDPOINT}",
        object_definitions=[
            ObjectDefinition(doc_category="PUBLISHED_CONTENT", name="AstraVectorEntry")
        ],
    )
    
    try:
        datasource_api.adddatasource_post(datasource_config)
        print(f"{Fore.GREEN}[ OK ] - Datasource has been created.{Style.RESET_ALL}")
    except indexing_api.ApiException as e:
        print(
            f"{Fore.RED}[ ERROR ] - Error creating datasource: {e}{Style.RESET_ALL}"
        )
  10. Index Astra DB documents in Glean:

    astra-glean-import-job.py
    def index_astra_db_document_into_glean(astra_document):
        """Index one Astra DB document into Glean."""
        document_id = str(astra_document["_id"])
        title = f"{astra_document['author']} quote_{astra_document['_id']}"
        body_text = astra_document["quote"]
        datasource_name = GLEAN_DATASOURCE_NAME
        request = IndexDocumentRequest(
            document=DocumentDefinition(
                datasource=datasource_name,
                title=title,
                id=document_id,
                view_url=ASTRA_DB_API_ENDPOINT,
                body=ContentDefinition(mime_type="text/plain", text_content=body_text),
                permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
            )
        )
        documents_api_client = documents_api.DocumentsApi(api_client)
        try:
            documents_api_client.indexdocument_post(request)
        except indexing_api.ApiException as e:
            print(f"{Fore.RED}Error indexing document {document_id}: {e}{Style.RESET_ALL}")
    
    
    def index_documents_to_glean(collection):
        """Index all documents from an Astra DB collection to Glean."""
        total_docs = collection.count_documents({}, upper_bound=1000)
        print(
            f"{Fore.CYAN}[INFO] - Indexing {total_docs} "
            f"documents into Glean...{Style.RESET_ALL}"
        )
        for doc in collection.find():
            try:
                index_astra_db_document_into_glean(doc)
            except Exception as error:
                print(
                    f"{Fore.RED}Error indexing document "
                    f"{doc['_id']}: {error}{Style.RESET_ALL}"
                )
        print(f"{Fore.CYAN}[ OK ] - Indexing finished.{Style.RESET_ALL}")
    
    
    # Use the function to index documents into Glean
    index_documents_to_glean(source_collection)
    
    print(f"{Fore.GREEN}Import job completed successfully!{Style.RESET_ALL}")
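    The script above sends one /indexdocument request per document, which is fine for small collections. For larger collections, you would typically batch the documents, for example to send each batch through the Indexing API's /bulkindexdocuments endpoint (check your client version for the exact request model). A sketch of a batching helper:

    ```python
    from itertools import islice

    def batched(iterable, batch_size):
        """Yield successive lists of at most batch_size items each."""
        iterator = iter(iterable)
        while batch := list(islice(iterator, batch_size)):
            yield batch

    # Group documents into batches of 100. Each batch could then be sent
    # in a single bulk request instead of one request per document.
    docs = [{"_id": i} for i in range(250)]
    batch_sizes = [len(batch) for batch in batched(docs, 100)]
    print(batch_sizes)  # → [100, 100, 50]
    ```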

Run and test the Glean integration

  1. Run the script:

    python3 astra-glean-import-job.py
  2. After the script finishes, try searching Glean for the content indexed from Astra DB.

    If the results reflect your Astra DB data, the integration has completed successfully.

    If there are no results, or the results don't reflect your Astra DB data, check the following:

    • The script ran without error.

    • The datasource category is accurate in the datasource configuration.

    • Indexing is complete, and the datasource is populated in Glean. For more information, see the Glean documentation on Debugging.

    • Your Astra DB collection contains only text data. Currently, Glean can index text data only. If your database contains other types of data, the content might not be fully supported in your Glean searches.
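    To check the last point programmatically, you can inspect your documents and flag non-text values before indexing. A small sketch (the helper name is ours, not part of the integration script):

    ```python
    def non_text_fields(doc: dict) -> list[str]:
        """Return the names of fields whose values are not plain text
        (strings, or lists of strings)."""
        flagged = []
        for key, value in doc.items():
            if key == "_id":
                continue
            if isinstance(value, str):
                continue
            if isinstance(value, list) and all(isinstance(v, str) for v in value):
                continue
            flagged.append(key)
        return flagged

    doc = {"_id": 1, "quote": "Know thyself.", "tags": ["wisdom"], "$vector": [0.1, 0.2]}
    print(non_text_fields(doc))  # → ['$vector']
    ```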

Next steps

Try these next steps to expand this integration:

  • Use a cron job to automatically run the script at regular intervals.

  • Extend the existing script or create additional scripts to pull from other databases and collections.
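For example, a crontab entry to re-run the import job nightly could look like this (the schedule and paths are placeholders for your environment):

```
# Run the import job every day at 02:00 (example paths)
0 2 * * * cd /path/to/project && ./my_virtual_env/bin/python astra-glean-import-job.py >> import.log 2>&1
```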


© 2025 DataStax