Integrate Glean with Astra DB Serverless

Estimated time: 30 min

With the Glean platform integration, you can use information from Astra DB Serverless databases as a data source for your Glean searches.

To do this, you use the Data API to read data from Astra DB, and then you push that data through the Glean Indexing API to a custom data source in Glean.

Use this integration if you need to query your Astra DB data alongside your other Glean data sources. This is recommended for use cases where your Astra DB data contains human-readable information that you want users to find in Glean search results. This integration is best for non-vector data, such as non-vector CSV data, that you cannot ingest through other Glean data sources.

For a complete script example, including the Prepare Glean and Prepare Astra DB steps, see the mini-demo-astradb-glean GitHub repository.

Prerequisites

This integration requires the following:

Glean administrator privileges or access to your Glean administrator
Familiarity with the Astra DB Data API and the Glean Indexing API

Prepare Astra DB

If you don’t already have one, create a Serverless (vector) database.
Generate an application token with the Database Administrator role, and then get your database’s API endpoint in the form of https://DATABASE_ID-REGION.apps.astra.datastax.com. For more information, see Generate an application token for a database.
Set the Astra DB environment variables for your integration script:
.env
```
APPLICATION_TOKEN=APPLICATION_TOKEN
API_ENDPOINT=API_ENDPOINT
ASTRA_DB_COLLECTION_NAME=COLLECTION_NAME
KEYSPACE_NAME="default_keyspace" # Optional
```
Replace the following:
- APPLICATION_TOKEN and API_ENDPOINT are the application token and API endpoint that you retrieved in the previous step.
- COLLECTION_NAME is the name of the collection in your database where you will store the data you want to push to Glean. This can be the name of an existing collection or a collection that you will create in the next step.
- KEYSPACE_NAME is optional. Leave it unset to use the Data API default keyspace.
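Before running the integration script, it can help to check that these variables are present and that the endpoint has the expected form. The following is a minimal sketch, not part of the integration script: the function name `find_astra_env_problems` and the sample values are hypothetical, and the regular expression only checks the general shape `https://DATABASE_ID-REGION.apps.astra.datastax.com` described above.

```python
import re

# Variables the integration script requires; KEYSPACE_NAME is optional and not checked.
REQUIRED_ASTRA_VARS = ("APPLICATION_TOKEN", "API_ENDPOINT", "ASTRA_DB_COLLECTION_NAME")

# General endpoint shape: https://DATABASE_ID-REGION.apps.astra.datastax.com
ENDPOINT_PATTERN = re.compile(r"^https://[^/\s]+\.apps\.astra\.datastax\.com/?$")


def find_astra_env_problems(env):
    """Return a list of readable problems; an empty list means the values look usable."""
    problems = [
        f"{name} is missing or empty"
        for name in REQUIRED_ASTRA_VARS
        if not env.get(name)
    ]
    endpoint = env.get("API_ENDPOINT")
    if endpoint and not ENDPOINT_PATTERN.match(endpoint):
        problems.append("API_ENDPOINT does not look like an Astra DB API endpoint")
    return problems


# Hypothetical values for illustration only.
sample_env = {
    "APPLICATION_TOKEN": "AstraCS:example",
    "API_ENDPOINT": "https://DATABASE_ID-REGION.apps.astra.datastax.com",
    "ASTRA_DB_COLLECTION_NAME": "glean_source",
}
print(find_astra_env_problems(sample_env))
```

To check a real environment, pass `os.environ` (after loading the `.env` file) instead of `sample_env`.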
Create a collection in your database or use an existing collection. Make sure the collection’s name matches your ASTRA_DB_COLLECTION_NAME environment variable.

The Data API requires a collection in a Serverless (vector) database. Both vector and non-vector data can be saved in collections.

The integration script shown on this page creates a collection and populates the collection with sample data. Adjust the script to use an existing collection if you desire.
Example: Replace collection creation
In order to read from an existing collection, replace the following part of the script:
# Create collection
source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)
print(
    f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
    f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}."
)
with this:
# Get an existing collection
source_collection = database.get_collection(ASTRA_DB_COLLECTION_NAME)
print(
    f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
    f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}."
)
Insert data that you want Glean to index. This can be vector or non-vector data.

If your collection contains vector data, Glean does not differentiate embeddings from other data. In the same way that Glean processes your other data sources, Glean indexes your Astra DB data, including embeddings, as text data.

In the integration script shown on this page, the collection is populated with sample data from a standard dataset (of philosophical quotes).

Prepare Glean

You must be a Glean administrator to complete some of the following steps. If you are not, ask your Glean administrator for assistance.

Create a Glean custom data source for your Astra DB Serverless database and give it a distinctive name. You can do so by creating a custom app or by using the Indexing API /adddatasource endpoint.

In the integration script shown on this page, the data source is created programmatically as part of the script. If you create the data source another way, comment out that step in the script.
Create a Glean Indexing API token.
Get your customer name from your Glean API endpoint. The endpoint follows the format https://GLEAN_CUSTOMER_NAME-be.glean.com/api/index/v1.
Set the Glean environment variables for your integration script:
.env
```
GLEAN_CUSTOMER=GLEAN_CUSTOMER_NAME
GLEAN_DATASOURCE_NAME=GLEAN_DATASOURCE_NAME
GLEAN_API_TOKEN=GLEAN_INDEXING_API_TOKEN
```
You can append these lines to the .env file that you already prepared for the Astra DB credentials.
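If you only have the full Glean endpoint at hand, the customer name can be derived from it. The sketch below assumes the endpoint follows the `https://GLEAN_CUSTOMER_NAME-be.glean.com/api/index/v1` format described above. The function names and the customer name `acme` are made up for illustration.

```python
from urllib.parse import urlparse

GLEAN_HOST_SUFFIX = "-be.glean.com"


def glean_customer_from_endpoint(endpoint):
    """Extract the customer name from a Glean Indexing API endpoint URL."""
    host = urlparse(endpoint).hostname or ""
    if not host.endswith(GLEAN_HOST_SUFFIX) or host == GLEAN_HOST_SUFFIX:
        raise ValueError(f"Unexpected Glean endpoint host: {host!r}")
    return host[: -len(GLEAN_HOST_SUFFIX)]


def glean_endpoint_for(customer):
    """Rebuild the endpoint the same way the script on this page does."""
    return f"https://{customer}-be.glean.com/api/index/v1"


# "acme" is a hypothetical customer name.
customer = glean_customer_from_endpoint("https://acme-be.glean.com/api/index/v1")
print(customer)
print(glean_endpoint_for(customer))
```

The value returned by `glean_customer_from_endpoint` is what goes into `GLEAN_CUSTOMER` in the .env file.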

Create a Glean indexing script

The script shown on this page uses the Data API Python client. If you use a different Data API client or direct Data API HTTP requests, you must modify the example script. See Get started with the Data API for more information.

Ensure your Python version is 3.9 or higher and prepare a virtual environment:
```
python3 -m venv my_virtual_env
```

Activate the virtual environment and install the required packages:

source my_virtual_env/bin/activate # on Windows run: my_virtual_env\Scripts\activate

pip install \
    "astrapy>=2.0,<3.0" \
    "datasets>=3.5,<4.0" \
    colorama \
    python-dotenv \
    "https://app.glean.com/meta/indexing_api_client.zip"

Import the required dependencies into the script:

This and the following steps show the various parts of the indexing procedure. You will be able to run it successfully once you have assembled it into a complete Python script.

astra-glean-import-job.py

import os

from astrapy import DataAPIClient
from colorama import Fore, Style
from datasets import load_dataset
from dotenv import load_dotenv

import glean_indexing_api_client as indexing_api
from glean_indexing_api_client.api import datasources_api, documents_api
from glean_indexing_api_client.model.custom_datasource_config import (
    CustomDatasourceConfig,
)
from glean_indexing_api_client.model.object_definition import ObjectDefinition
from glean_indexing_api_client.model.index_document_request import IndexDocumentRequest
from glean_indexing_api_client.model.document_definition import DocumentDefinition
from glean_indexing_api_client.model.content_definition import ContentDefinition
from glean_indexing_api_client.model.document_permissions_definition import (
    DocumentPermissionsDefinition,
)

Load the required environment variables:

astra-glean-import-job.py

# Load environment variables from .env
load_dotenv()

APPLICATION_TOKEN = os.environ["APPLICATION_TOKEN"]
API_ENDPOINT = os.environ["API_ENDPOINT"]
ASTRA_DB_COLLECTION_NAME = os.environ["ASTRA_DB_COLLECTION_NAME"]
KEYSPACE_NAME = os.getenv("KEYSPACE_NAME")

GLEAN_API_TOKEN = os.environ["GLEAN_API_TOKEN"]
GLEAN_CUSTOMER = os.environ["GLEAN_CUSTOMER"]
GLEAN_DATASOURCE_NAME = os.environ["GLEAN_DATASOURCE_NAME"]


print(f"{Fore.GREEN}============================={Style.RESET_ALL}")
print(f"{Fore.GREEN} ASTRADB - GLEAN INTEGRATION {Style.RESET_ALL}")
print(f"{Fore.GREEN}============================={Style.RESET_ALL}\n")

Initialize Astra DB client and database:

astra-glean-import-job.py

# Initialize Astra DB client
client = DataAPIClient(callers=[("glean", "1.0")])
database = client.get_database(
    API_ENDPOINT,
    token=APPLICATION_TOKEN,
    keyspace=KEYSPACE_NAME,
)
print(
    f"{Fore.CYAN}[ OK ] - Credentials are OK, your database name is "
    f"{Style.RESET_ALL}{database.name()}{Fore.CYAN}."
)

Create Astra DB collection:

astra-glean-import-job.py

# Create collection
source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)
print(
    f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
    f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}."
)

Load a sample dataset and populate the Astra DB collection:

astra-glean-import-job.py

# Load philosophers dataset
print(f"{Fore.CYAN}[INFO] - Downloading data from Hugging Face 🤗.{Style.RESET_ALL}")
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
print(f"{Fore.CYAN}[ OK ] - Dataset loaded in memory.{Style.RESET_ALL}")
print(f"{Fore.CYAN}[INFO] - Sample record: {Style.RESET_ALL}{philo_dataset[16]}")


def load_to_astra_db(data_to_insert, collection):
    """Load all of the provided data into a collection."""
    def split_tags(t):
        return [tag for tag in (t or "").split(";") if tag]

    documents_to_insert = [
        {
            **item,
            **{"_id": index, "tags": split_tags(item["tags"])},
        }
        for index, item in enumerate(data_to_insert)
    ]
    collection.insert_many(documents_to_insert)

# Insert documents into Astra DB
philo_count = len(philo_dataset)
print(
    f"{Fore.CYAN}[INFO] - Inserting {philo_count} documents into Astra DB..."
    f"{Style.RESET_ALL}"
)
load_to_astra_db(philo_dataset, source_collection)
print(f"{Fore.CYAN}[ OK ] - Insertion finished.{Style.RESET_ALL}")
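
To see what the transformation in `load_to_astra_db` does to each record before insertion, here is a standalone sketch that runs without Astra DB or the dataset. It is an illustration only, not an additional script step. The sample records are made up and only mimic the `author`, `quote`, and `tags` fields that the script reads, and `to_astra_documents` is a hypothetical helper name.

```python
def split_tags(raw_tags):
    """Turn a semicolon-separated string into a list, ignoring empty pieces and None."""
    return [tag for tag in (raw_tags or "").split(";") if tag]


def to_astra_documents(records):
    """Add an integer _id to each record and replace its tags string with a list."""
    return [
        {**record, "_id": index, "tags": split_tags(record["tags"])}
        for index, record in enumerate(records)
    ]


# Made-up records for illustration.
sample_records = [
    {"author": "aristotle", "quote": "Example quote one.", "tags": "ethics;knowledge"},
    {"author": "plato", "quote": "Example quote two.", "tags": None},
]

for document in to_astra_documents(sample_records):
    print(document)
```

Using the enumeration index as `_id` means that rerunning the insertion against a collection that already holds these documents would produce duplicate-ID conflicts, which is worth keeping in mind when adapting the script to your own data.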

Initialize Glean API client:

astra-glean-import-job.py

# Setup Glean API
GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1"
print(
    f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:"
    f"{Style.RESET_ALL} {GLEAN_API_ENDPOINT}"
)

# Initialize Glean client
configuration = indexing_api.Configuration(
    host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN
)
api_client = indexing_api.ApiClient(configuration)
datasource_api = datasources_api.DatasourcesApi(api_client)
print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")

Create and register a new Glean data source:

astra-glean-import-job.py

# Create and register data source in Glean
datasource_config = CustomDatasourceConfig(
    name=GLEAN_DATASOURCE_NAME,
    display_name="Astra DB Collection Data Source",
    datasource_category="PUBLISHED_CONTENT",
    url_regex=f"^{API_ENDPOINT}",
    object_definitions=[
        ObjectDefinition(doc_category="PUBLISHED_CONTENT", name="AstraVectorEntry")
    ],
)

try:
    datasource_api.adddatasource_post(datasource_config)
    print(f"{Fore.GREEN}[ OK ] - Data source has been created.{Style.RESET_ALL}")
except indexing_api.ApiException as e:
    print(f"{Fore.RED}[ ERROR ] - Error creating data source: {e}{Style.RESET_ALL}")

Index Astra DB documents in Glean:

astra-glean-import-job.py

def index_astra_db_document_into_glean(astra_document):
    """Index one Astra DB document into Glean."""
    document_id = str(astra_document["_id"])
    title = f"{astra_document['author']} quote_{astra_document['_id']}"
    body_text = astra_document["quote"]
    datasource_name = GLEAN_DATASOURCE_NAME
    request = IndexDocumentRequest(
        document=DocumentDefinition(
            datasource=datasource_name,
            title=title,
            id=document_id,
            view_url=API_ENDPOINT,
            body=ContentDefinition(mime_type="text/plain", text_content=body_text),
            permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
        )
    )
    documents_api_client = documents_api.DocumentsApi(api_client)
    try:
        documents_api_client.indexdocument_post(request)
    except indexing_api.ApiException as e:
        print(f"{Fore.RED}Error indexing document {document_id}: {e}{Style.RESET_ALL}")


def index_documents_to_glean(collection):
    """Index all documents from an Astra DB collection to Glean."""
    total_docs = collection.count_documents({}, upper_bound=1000)
    print(
        f"{Fore.CYAN}[INFO] - Indexing {total_docs} "
        f"documents into Glean...{Style.RESET_ALL}"
    )
    for doc in collection.find():
        try:
            index_astra_db_document_into_glean(doc)
        except Exception as error:
            print(
                f"{Fore.RED}Error indexing document "
                f"{doc['_id']}: {error}{Style.RESET_ALL}"
            )
    print(f"{Fore.CYAN}[ OK ] - Indexing finished.{Style.RESET_ALL}")


# Use the function to index documents into Glean
index_documents_to_glean(source_collection)

print(f"{Fore.GREEN}Import job completed successfully!{Style.RESET_ALL}")
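
The script above prints a success message even if some documents fail to index. One possible refinement, sketched below under stated assumptions, is to count successes and failures and report both. The helper `index_all` and the stand-in `fake_index_one` are hypothetical. The sketch also assumes that the per-document function raises on failure, whereas `index_astra_db_document_into_glean` as written catches `ApiException` itself and would need to let it propagate.

```python
def index_all(documents, index_one):
    """Call index_one for each document and return (indexed_count, failed_ids)."""
    indexed_count = 0
    failed_ids = []
    for document in documents:
        try:
            index_one(document)
            indexed_count += 1
        except Exception:
            # Broad on purpose: one bad document should not stop the whole run.
            failed_ids.append(document.get("_id"))
    return indexed_count, failed_ids


def fake_index_one(document):
    """Hypothetical stand-in for the Glean call; rejects documents with no quote."""
    if not document.get("quote"):
        raise ValueError("empty body")


# Made-up documents for illustration.
sample_documents = [
    {"_id": 0, "quote": "Example quote."},
    {"_id": 1, "quote": ""},
]
indexed, failed = index_all(sample_documents, fake_index_one)
print(f"Indexed {indexed} documents, {len(failed)} failed: {failed}")
```

In the real script, `documents` would be the cursor returned by `collection.find()` and `index_one` the Glean indexing function.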

Run and test the Glean integration

Run the script:
```
python3 astra-glean-import-job.py
```
After the script finishes, try searching Glean for the content indexed from Astra DB.

If the results reflect your Astra DB data, the integration has completed successfully.

If there are no results or no evidence of the expected response, check the following:
- The script ran without error.
- The data source category is accurate in the data source configuration.
- Indexing is complete, and the data source is populated in Glean. For more information, see the Glean documentation on Debugging.
- Your Astra DB collection contains only text data. Currently, Glean can index text data only. If your database contains other types of data, the content might not be fully supported in your Glean searches.
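
If your documents mix text with other kinds of values, one option is to build the Glean body from the text fields only. The following is a minimal sketch of that idea, not something the integration script does. The helper name `extract_text_fields`, the choice to skip fields whose names start with `_` or `$` (such as `_id` or `$vector`), and the sample document are all assumptions made for illustration.

```python
def extract_text_fields(document):
    """Join the string and list-of-string values of a document into one text body."""
    parts = []
    for key, value in document.items():
        # Skip identifiers and reserved fields such as _id or $vector.
        if key.startswith(("_", "$")):
            continue
        if isinstance(value, str) and value:
            parts.append(f"{key}: {value}")
        elif isinstance(value, list) and value and all(isinstance(v, str) for v in value):
            parts.append(f"{key}: {', '.join(value)}")
        # Numbers, nested objects, and other values are left out.
    return "\n".join(parts)


# Made-up document for illustration.
sample_document = {
    "_id": 7,
    "author": "aristotle",
    "quote": "Example quote.",
    "tags": ["ethics", "knowledge"],
    "$vector": [0.1, 0.2],
    "year": 350,
}
print(extract_text_fields(sample_document))
```

The returned string could then be passed as the text content of the document body in place of the single `quote` field.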

Next steps

Try these next steps to expand this integration:

Use a cron job to automatically run the script at regular intervals.
Extend the existing script or create additional scripts to pull from other databases and collections.
