Integrate Glean with Astra DB Serverless


With the Glean platform integration, you can use information from Astra DB Serverless databases as a datasource for your Glean searches.

To do this, you use the Data API to push data from Astra DB through the Glean Indexing API to a custom datasource in Glean.

Use this integration if you need to query your Astra DB data alongside your other Glean datasources, particularly when that data contains human-readable information that you want users to find in Glean search results. It is best suited to non-vector data, such as non-vector CSV data, that you can't ingest through other Glean datasources.

You can build and run a script locally or use this guide’s Colab notebook.

For a complete script example, including the Prepare Glean and Prepare Astra DB steps, see the mini-demo-astradb-glean GitHub repository.

Prerequisites

This integration requires an Astra DB Serverless database, a Glean account with administrator access, and a Python environment with the astrapy and glean_indexing_api_client packages.

Prepare Astra DB

  1. If you don’t already have one, create a Serverless (Vector) database.

  2. Generate an application token with the Database Administrator role, and then get your database’s API endpoint in the form of https://DATABASE_ID-REGION.apps.astra.datastax.com. For more information, see Generate an application token for a database.
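    As a quick sanity check, you can verify that the endpoint you copied matches the expected format. This is an optional sketch with a hypothetical endpoint value; the regex reflects the DATABASE_ID (a UUID) and REGION pattern shown above:

```python
import re

# Hypothetical endpoint for illustration only
endpoint = "https://12345678-1234-1234-1234-123456789abc-us-east1.apps.astra.datastax.com"

# DATABASE_ID is a UUID; REGION is a lowercase cloud region identifier
ENDPOINT_PATTERN = re.compile(
    r"^https://[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    r"-[a-z0-9-]+\.apps\.astra\.datastax\.com$"
)

if ENDPOINT_PATTERN.match(endpoint):
    print("Endpoint format looks valid")
else:
    print("Endpoint format looks wrong; copy it from the Astra Portal")
```
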

  3. Set Astra DB environment variables for your integration script:

    .env
    ASTRA_DB_APPLICATION_TOKEN=TOKEN
    ASTRA_DB_API_ENDPOINT=API_ENDPOINT
    ASTRA_DB_COLLECTION_NAME=COLLECTION_NAME

    Replace the following:

    • TOKEN: the application token that you generated in the previous step.

    • API_ENDPOINT: your database's API endpoint from the previous step.

    • COLLECTION_NAME: the name of the collection in your database where you will store the data that you want to push to Glean. This can be an existing collection or a collection that you create in the next step.

  4. Create a collection in your database or use an existing collection. Make sure the collection’s name matches your ASTRA_DB_COLLECTION_NAME environment variable.

    The Data API requires a collection in a Serverless (Vector) database, but you can store both vector and non-vector data in collections.

    Example: Create collection script

    You can create a collection in the Astra Portal or with the Data API, either in your integration script or before running the script.

    For example:

    # Import dependencies
    import os
    from dotenv import load_dotenv
    from astrapy import DataAPIClient
    from colorama import Fore, Style
    
    # Load environment variables from .env
    load_dotenv()
    
    # Initialize the Data API client and connect to the database
    client = DataAPIClient(os.getenv("ASTRA_DB_APPLICATION_TOKEN"))
    database = client.get_database(os.getenv("ASTRA_DB_API_ENDPOINT"))
    
    # Set the collection name from the environment variable
    astra_db_collection_name = os.getenv("ASTRA_DB_COLLECTION_NAME", "default_collection_name")
    
    # Create the collection if it doesn't already exist
    collection = database.create_collection(astra_db_collection_name, check_exists=False)
    print(f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}'{collection.full_name}'{Fore.CYAN} is ready{Style.RESET_ALL}")
  5. Load data that you want Glean to index. This can be vector or non-vector data.

    If your collection contains vector data, Glean does not differentiate embeddings from other data. In the same way that Glean processes your other datasources, Glean indexes your Astra DB data, including embeddings, as text data.

    Example: Load data script

    You can load data in the Astra Portal or programmatically, either in your integration script or before running the script.

    For example, the following script downloads a philosopher-quotes dataset from Hugging Face, ensures that the collection is empty before loading the data, and then loads the data into the collection with a progress bar. Much of this script is optional. For more examples, options, and information about loading data, see Load your data.

    # Import dependencies
    import pandas as pd
    from astrapy import DataAPIClient
    from datasets import load_dataset
    from tqdm import tqdm
    from colorama import Fore, Style
    
    # Download philosopher-quotes dataset
    print(f"{Fore.CYAN}[INFO] - Downloading Data from Hugging Face 🤗{Style.RESET_ALL}")
    philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
    print(f"{Fore.CYAN}[ OK ] - Dataset loaded in memory{Style.RESET_ALL}")
    print(f"{Fore.CYAN}[INFO] - Record: {Style.RESET_ALL}{philo_dataset[16]}")
    philo_dataframe = pd.DataFrame.from_dict(philo_dataset)
    
    # Load data to Astra DB with a progress bar
    def load_to_astra(df, collection):
        len_df = len(df)
        print(f"{Fore.CYAN}[INFO] - Starting data insertion to Astra DB...{Style.RESET_ALL}")
    
        for i in tqdm(range(len_df), desc="Inserting documents", colour="green"):
            try:
                collection.insert_one({
                    "_id": i,
                    "author": df.loc[i, "author"],
                    "quote": df.loc[i, "quote"],
                    "tags": df.loc[i, "tags"].split(";") if pd.notna(df.loc[i, "tags"]) else []
                })
            except Exception as error:
                print(f"{Fore.RED}Error while inserting document {i}: {error}{Style.RESET_ALL}")
    
    # Flush the collection (created in the previous step) before inserting new data
    collection.delete_many({})
    print(f"{Fore.CYAN}[ OK ] - Collection flushed{Style.RESET_ALL}")
    
    # Insert documents into Astra DB
    load_to_astra(philo_dataframe, collection)
    print(f"{Fore.CYAN}[ OK ] - Finished loading data{Style.RESET_ALL}")

Prepare Glean

You must be a Glean administrator to complete the following steps. If you are not a Glean administrator, ask your Glean administrator to complete these steps for you.

  1. In Glean, create a custom datasource for your Astra DB Serverless database. You can create a custom app or use the Indexing API /adddatasource endpoint.

    Example: Create datasource script

    The following script sets up the Glean Indexing API and client, and then creates an Astra DB datasource in Glean. It requires the Glean environment variables described in step 4 of this section and the glean_indexing_api_client Python package.

    # Import dependencies
    import os
    from dotenv import load_dotenv
    import glean_indexing_api_client as indexing_api
    from glean_indexing_api_client.api import datasources_api
    from glean_indexing_api_client.model.custom_datasource_config import CustomDatasourceConfig
    from glean_indexing_api_client.model.object_definition import ObjectDefinition
    from colorama import Fore, Style
    
    # Load Glean environment variables from .env
    load_dotenv()
    GLEAN_CUSTOMER = os.getenv("GLEAN_CUSTOMER")
    GLEAN_API_TOKEN = os.getenv("GLEAN_API_TOKEN")
    GLEAN_DATASOURCE_NAME = os.getenv("GLEAN_DATASOURCE_NAME")
    
    # Setup Glean API
    GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1"
    print(f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:{Style.RESET_ALL} {GLEAN_API_ENDPOINT}")
    
    # Initialize Glean client
    configuration = indexing_api.Configuration(host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN)
    api_client = indexing_api.ApiClient(configuration)
    datasource_api = datasources_api.DatasourcesApi(api_client)
    print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")
    
    # Create and register datasource in Glean
    datasource_config = CustomDatasourceConfig(
        name=GLEAN_DATASOURCE_NAME,
        display_name='AstraDB Collection Datasource',
        datasource_category='PUBLISHED_CONTENT',
        url_regex='^https://ASTRA_DB_URL',  # Replace with actual regex
        object_definitions=[
            ObjectDefinition(
                doc_category='PUBLISHED_CONTENT',
                name='AstraVectorEntry'
            )
        ]
    )
    
    try:
        datasource_api.adddatasource_post(datasource_config)
        print(f"{Fore.GREEN}[ OK ] - DataSource has been created!{Style.RESET_ALL}")
    except indexing_api.ApiException as e:
        print(f"{Fore.RED}[ ERROR ] - Error creating datasource: {e}{Style.RESET_ALL}")
  2. Create a Glean Indexing API token.

  3. Get your customer name for your Glean API endpoint, such as https://CUSTOMER-be.glean.com/api/index/v1.

  4. Set Glean environment variables for your integration script:

    .env
    GLEAN_CUSTOMER=GLEAN_CUSTOMER_NAME
    GLEAN_DATASOURCE_NAME=ASTRA_DB_DATASOURCE_NAME
    GLEAN_API_TOKEN=GLEAN_INDEXING_API_TOKEN

Create a Glean indexing script

  1. Install Python, and then prepare a virtual environment.

  2. Install a Data API client or a utility to make API calls, such as curl.

    This guide uses the Data API Python client, astrapy. If you choose to use HTTP or a different client, you must adapt the Data API commands for your chosen tool.
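    If you want to see what the equivalent HTTP call looks like, the following sketch builds (but does not send) a Data API find request. The /api/json/v1 path and Token header reflect the Data API's HTTP interface; default_keyspace, the collection name, and the placeholder values are assumptions to replace with your own:

```python
# Build the raw HTTP request behind a Data API find command (not sent here)
api_endpoint = "https://DATABASE_ID-REGION.apps.astra.datastax.com"
collection_name = "my_collection"

# The Data API exposes collections at /api/json/v1/KEYSPACE/COLLECTION
url = f"{api_endpoint}/api/json/v1/default_keyspace/{collection_name}"
headers = {
    "Token": "APPLICATION_TOKEN",  # your Astra DB application token
    "Content-Type": "application/json",
}
payload = {"find": {"options": {"limit": 5}}}

print(url)
# With the requests package installed, you would send it as:
# requests.post(url, headers=headers, json=payload)
```
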

  3. Import dependencies:

    astra-glean-import-job.py
    import os
    from dotenv import load_dotenv
    from getpass import getpass
    import pandas as pd
    from astrapy import DataAPIClient
    from tqdm import tqdm  # Optional: Progress bar with tqdm
    import json
    import glean_indexing_api_client as indexing_api
    from glean_indexing_api_client.api import datasources_api, documents_api
    from glean_indexing_api_client.model.custom_datasource_config import CustomDatasourceConfig
    from glean_indexing_api_client.model.object_definition import ObjectDefinition
    from glean_indexing_api_client.model.index_document_request import IndexDocumentRequest
    from glean_indexing_api_client.model.document_definition import DocumentDefinition
    from glean_indexing_api_client.model.content_definition import ContentDefinition
    from glean_indexing_api_client.model.document_permissions_definition import DocumentPermissionsDefinition
    from datasets import load_dataset
    from colorama import Fore, Style  # Optional: Output color coding
  4. Load environment variables from .env:

    astra-glean-import-job.py
    # Load environment variables from .env
    load_dotenv()
    
    ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
    ASTRA_DB_API_ENDPOINT = os.getenv("ASTRA_DB_API_ENDPOINT")
    ASTRA_DB_COLLECTION_NAME = os.getenv("ASTRA_DB_COLLECTION_NAME")
    GLEAN_API_TOKEN = os.getenv("GLEAN_API_TOKEN")
    GLEAN_CUSTOMER = os.getenv("GLEAN_CUSTOMER")
    GLEAN_DATASOURCE_NAME = os.getenv("GLEAN_DATASOURCE_NAME")
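    Optionally, you can fail fast if any variable is unset before making API calls. This is a minimal sketch; the variable names match the .env keys above, and the helper function name is illustrative:

```python
import os

REQUIRED_VARS = [
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_COLLECTION_NAME",
    "GLEAN_API_TOKEN",
    "GLEAN_CUSTOMER",
    "GLEAN_DATASOURCE_NAME",
]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```
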
  5. Initialize the Data API client:

    astra-glean-import-job.py
    # Initialize astrapy client
    client = DataAPIClient(ASTRA_DB_APPLICATION_TOKEN, caller_name="glean", caller_version="1.0")
    database = client.get_database(ASTRA_DB_API_ENDPOINT)
    print(f"{Fore.CYAN}[ OK ] - Credentials are OK, your database name is {Style.RESET_ALL}{database.info().name}")

    The caller_name and caller_version parameters are optional. These parameters help you track the source of the API calls in your logs.

  6. Initialize the Glean Indexing API client:

    astra-glean-import-job.py
    # Setup Glean API
    GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1"
    print(f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:{Style.RESET_ALL} {GLEAN_API_ENDPOINT}")
    
    # Initialize Glean client
    configuration = indexing_api.Configuration(host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN)
    api_client = indexing_api.ApiClient(configuration)
    datasource_api = datasources_api.DatasourcesApi(api_client)
    print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")
  7. Index documents into Glean:

    astra-glean-import-job.py
    # Function to index Astra DB documents into Glean
    def index_astra_document_into_glean(astra_document):
        document_id = str(astra_document['_id'])
        title = astra_document['author'] + ' quote ' + str(astra_document['_id'])
        body_text = astra_document['quote']
        datasource_name = GLEAN_DATASOURCE_NAME
        request = IndexDocumentRequest(
            document=DocumentDefinition(
                datasource=datasource_name,
                title=title,
                id=document_id,
                view_url="https://ASTRA_DB_URL",
                body=ContentDefinition(mime_type="text/plain", text_content=body_text),
                permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
            )
        )
        documents_api_client = documents_api.DocumentsApi(api_client)
        try:
            documents_api_client.indexdocument_post(request)
        except indexing_api.ApiException as e:
            print(f"{Fore.RED}Error indexing document {document_id}: {e}{Style.RESET_ALL}")
    
    # Index documents from Astra DB to Glean
    def index_documents_to_glean(collection):
        total_docs = collection.estimated_document_count()
        with tqdm(total=total_docs, desc="Indexing documents to Glean", unit="doc", colour="blue") as pbar:
            for doc in collection.find():
                try:
                    index_astra_document_into_glean(doc)
                    pbar.set_postfix({"Status": f"Indexed {doc['_id']}"})
                except Exception as error:
                    pbar.set_postfix({"Status": f"Error with {doc['_id']}"})
                    print(f"{Fore.RED}Error indexing document {doc['_id']}: {error}{Style.RESET_ALL}")
                pbar.update(1)  # Update progress bar after each document
    
    # Use the function to index documents into Glean
    collection = database.get_collection(ASTRA_DB_COLLECTION_NAME)
    index_documents_to_glean(collection)
    
    print(f"{Fore.GREEN}Batch Ended Successfully!{Style.RESET_ALL}")

Run and test the Glean integration

  1. Run the script:

    python3 astra-glean-import-job.py
  2. After indexing completes, search Glean for the content indexed from Astra DB.

    If the results reflect your Astra DB data, the integration is successful.

    If there are no results or no evidence of correct matches, check the following:

    • The script ran without error.

    • The datasource category is accurate in the datasource configuration.

    • Indexing is complete, and the datasource is populated in Glean. For more information, see the Glean documentation on Debugging/Troubleshooting using the Indexing API.

    • Your Astra DB collection contains only text data. Currently, Glean can index text data only. If your database contains other types of data, the content might not be fully supported in your Glean searches.
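    To spot documents that may not index cleanly, you can flag non-string fields before pushing them to Glean. This is a minimal sketch over a plain document dict; the function name and sample document are illustrative:

```python
def non_text_fields(doc):
    """Return the names of fields whose values are not plain strings."""
    return [key for key, value in doc.items() if not isinstance(value, str)]

# Example document with a numeric _id and an embedding vector
doc = {"_id": 1, "quote": "I think, therefore I am.", "$vector": [0.1, 0.2]}
print(non_text_fields(doc))  # _id and $vector are not text
```
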

Next steps

Try these next steps to expand this integration:

  • Use a cron job to automatically run the script at regular intervals.
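    For example, a crontab entry that runs the script nightly at 02:00 might look like this (the paths are hypothetical; adjust them to where your project and virtual environment live):

```shell
# Run the Glean indexing script every day at 02:00
0 2 * * * cd /path/to/project && /path/to/venv/bin/python3 astra-glean-import-job.py >> glean-sync.log 2>&1
```
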

  • Extend the existing script or create additional scripts to pull from other databases and collections.


© 2024 DataStax | Privacy policy | Terms of use
