Integrate Glean with Astra DB Serverless
With the Glean platform integration, you can use information from Astra DB Serverless databases as a datasource for your Glean searches.
To do this, you use the Data API to push data from Astra DB through the Glean Indexing API to a custom datasource in Glean.
Use this integration if you need to query your Astra DB data alongside your other Glean datasources. This is recommended for use cases where your Astra DB data contains human-readable information that you want users to find in Glean search results. This integration is best for non-vector data, such as non-vector CSV data, that you can’t ingest through other Glean datasources.
You can build and run a script locally or use this guide's Colab notebook. For a complete script example, including the Prepare Glean and Prepare Astra DB steps, see the Colab notebook.
Prerequisites
This integration requires the following:
- Glean administrator privileges or access to your Glean administrator.
- An active Astra account.
- Familiarity with the Astra DB Data API and the Glean Indexing API.
Prepare Astra DB
- If you don't already have one, create a Serverless (Vector) database.
- Generate an application token with the Database Administrator role, and then get your database's API endpoint in the form of `https://DATABASE_ID-REGION.apps.astra.datastax.com`. For more information, see Generate an application token for a database.
- Set Astra DB environment variables for your integration script:

`.env`

```
ASTRA_DB_APPLICATION_TOKEN=TOKEN
ASTRA_DB_API_ENDPOINT=API_ENDPOINT
ASTRA_DB_COLLECTION_NAME=COLLECTION_NAME
```
Replace the following:

  - `TOKEN` and `API_ENDPOINT`: The application token and API endpoint that you retrieved in the previous step.
  - `COLLECTION_NAME`: The name of the collection in your database where you will store the data you want to push to Glean. This can be the name of an existing collection or a collection that you will create in the next step.
- Create a collection in your database or use an existing collection. Make sure the collection's name matches your `ASTRA_DB_COLLECTION_NAME` environment variable.

The Data API requires a collection in a Serverless (Vector) database, but you can store both vector and non-vector data in collections.
Example: Create collection script
You can create a collection in the Astra Portal or with the Data API. You can create a collection in your integration script or before running the script.
For example:
```python
# Import dependencies
import os
from dotenv import load_dotenv
from colorama import Fore, Style

# Load environment variables from .env
load_dotenv()

# Set collection name from environment variable
astra_db_collection_name = os.getenv("ASTRA_DB_COLLECTION_NAME", "default_collection_name")

# Create collection
# `database` is the Database object from the "Initialize the Data API client" step
collection = database.create_collection(astra_db_collection_name)
print(f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}'{collection.full_name}'{Fore.CYAN} is ready{Style.RESET_ALL}")
```
- Load data that you want Glean to index. This can be vector or non-vector data.

If your collection contains vector data, Glean does not differentiate embeddings from other data. In the same way that Glean processes your other datasources, Glean indexes your Astra DB data, including embeddings, as text data.
Example: Load data script
You can load data in the Astra Portal or programmatically. You can load data in your integration script or before running the script.
For example, the following script downloads a `philosopher-quotes` dataset from Hugging Face, ensures that the collection is empty before loading the data, and then loads the data into the collection with a progress bar. Much of this script is optional. For more examples, options, and information about loading data, see Load your data.

```python
# Import dependencies
import pandas as pd
from tqdm import tqdm
from colorama import Fore, Style
from datasets import load_dataset

# Download philosopher-quotes dataset
print(f"{Fore.CYAN}[INFO] - Downloading Data from Hugging Face 🤗{Style.RESET_ALL}")
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
print(f"{Fore.CYAN}[ OK ] - Dataset loaded in memory{Style.RESET_ALL}")
print(f"{Fore.CYAN}[INFO] - Record: {Style.RESET_ALL}{philo_dataset[16]}")
philo_dataframe = pd.DataFrame.from_dict(philo_dataset)

# Load data to Astra DB with a progress bar
def load_to_astra(df, collection):
    len_df = len(df)
    print(f"{Fore.CYAN}[INFO] - Starting data insertion to Astra DB...{Style.RESET_ALL}")
    for i in tqdm(range(len_df), desc="Inserting documents", colour="green"):
        try:
            collection.insert_one({
                "_id": i,
                "author": df.loc[i, "author"],
                "quote": df.loc[i, "quote"],
                "tags": df.loc[i, "tags"].split(";") if pd.notna(df.loc[i, "tags"]) else []
            })
        except Exception as error:
            print(f"{Fore.RED}Error while inserting document {i}: {error}{Style.RESET_ALL}")

# Flush the collection before inserting new data
collection.delete_many({})
print(f"{Fore.CYAN}[ OK ] - Collection flushed{Style.RESET_ALL}")

# Insert documents into Astra DB
load_to_astra(philo_dataframe, collection)
print(f"{Fore.CYAN}[ OK ] - Finished loading data{Style.RESET_ALL}")
```
Prepare Glean
You must be a Glean administrator to complete the following steps. If you are not a Glean administrator, ask your Glean administrator to complete these steps for you.
- In Glean, create a custom datasource for your Astra DB Serverless database. You can create a custom app or use the Indexing API `/adddatasource` endpoint.

Example: Create datasource script
The following script sets up the Glean Indexing API and client, and then creates an Astra DB datasource in Glean. It requires certain environment variables and the Glean Indexing API client.
```python
# Setup Glean API
GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1"
print(f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:{Style.RESET_ALL} {GLEAN_API_ENDPOINT}")

# Initialize Glean client
configuration = indexing_api.Configuration(host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN)
api_client = indexing_api.ApiClient(configuration)
datasource_api = datasources_api.DatasourcesApi(api_client)
print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")

# Create and register datasource in Glean
datasource_config = CustomDatasourceConfig(
    name=GLEAN_DATASOURCE_NAME,
    display_name='AstraDB Collection Datasource',
    datasource_category='PUBLISHED_CONTENT',
    url_regex='^https://ASTRA_DB_URL',  # Replace with actual regex
    object_definitions=[
        ObjectDefinition(
            doc_category='PUBLISHED_CONTENT',
            name='AstraVectorEntry'
        )
    ]
)

try:
    datasource_api.adddatasource_post(datasource_config)
    print(f"{Fore.GREEN}[ OK ] - DataSource has been created!{Style.RESET_ALL}")
except indexing_api.ApiException as e:
    print(f"{Fore.RED}[ ERROR ] - Error creating datasource: {e}{Style.RESET_ALL}")
```
- Create a Glean Indexing API token.
- Get your customer name for your Glean API endpoint, such as `https://CUSTOMER-be.glean.com/api/index/v1`.
- Set Glean environment variables for your integration script:

`.env`

```
GLEAN_CUSTOMER=GLEAN_CUSTOMER_NAME
GLEAN_DATASOURCE_NAME=ASTRA_DB_DATASOURCE_NAME
GLEAN_API_TOKEN=GLEAN_INDEXING_API_TOKEN
```
Create a Glean indexing script
- Install Python, and then prepare a virtual environment.
- Install a Data API client or a utility to make API calls, such as curl.

This guide uses the Data API Python client, astrapy. If you choose to use HTTP or a different client, you must adapt the Data API commands for your chosen tool.
- Import dependencies:

`astra-glean-import-job.py`

```python
import os
from dotenv import load_dotenv
from getpass import getpass
import pandas as pd
from astrapy import DataAPIClient
from tqdm import tqdm  # Optional: Progress bar with tqdm
import json

import glean_indexing_api_client as indexing_api
from glean_indexing_api_client.api import datasources_api, documents_api
from glean_indexing_api_client.model.custom_datasource_config import CustomDatasourceConfig
from glean_indexing_api_client.model.object_definition import ObjectDefinition
from glean_indexing_api_client.model.index_document_request import IndexDocumentRequest
from glean_indexing_api_client.model.document_definition import DocumentDefinition
from glean_indexing_api_client.model.content_definition import ContentDefinition
from glean_indexing_api_client.model.document_permissions_definition import DocumentPermissionsDefinition

from datasets import load_dataset
from colorama import Fore, Style  # Optional: Output color coding
```
- Load environment variables from `.env`:

`astra-glean-import-job.py`

```python
# Load environment variables from .env
load_dotenv()
ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_API_ENDPOINT = os.getenv("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_COLLECTION_NAME = os.getenv("ASTRA_DB_COLLECTION_NAME")
GLEAN_API_TOKEN = os.getenv("GLEAN_API_TOKEN")
GLEAN_CUSTOMER = os.getenv("GLEAN_CUSTOMER")
GLEAN_DATASOURCE_NAME = os.getenv("GLEAN_DATASOURCE_NAME")
```
- Initialize the Data API client:

`astra-glean-import-job.py`

```python
# Initialize astrapy client
client = DataAPIClient(ASTRA_DB_APPLICATION_TOKEN, callers=[("glean", "1.0")])
database = client.get_database(ASTRA_DB_API_ENDPOINT)
print(f"{Fore.CYAN}[ OK ] - Credentials are OK, your database name is {Style.RESET_ALL}{database.info().name}")
```

`callers` is optional.
Initialize the Glean Indexing API client:
astra-glean-import-job.py# Setup Glean API GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1" print(f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:{Style.RESET_ALL} {GLEAN_API_ENDPOINT}") # Initialize Glean client configuration = indexing_api.Configuration(host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN) api_client = indexing_api.ApiClient(configuration) datasource_api = datasources_api.DatasourcesApi(api_client) print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")
- Index documents into Glean:

`astra-glean-import-job.py`

```python
# Function to index a single Astra DB document into Glean
def index_astra_document_into_glean(astra_document):
    document_id = str(astra_document['_id'])
    title = astra_document['author'] + ' quote ' + str(astra_document['_id'])
    body_text = astra_document['quote']
    datasource_name = GLEAN_DATASOURCE_NAME

    request = IndexDocumentRequest(
        document=DocumentDefinition(
            datasource=datasource_name,
            title=title,
            id=document_id,
            view_url="https://ASTRA_DB_URL",
            body=ContentDefinition(mime_type="text/plain", text_content=body_text),
            permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
        )
    )

    documents_api_client = documents_api.DocumentsApi(api_client)
    try:
        documents_api_client.indexdocument_post(request)
    except indexing_api.ApiException as e:
        print(f"{Fore.RED}Error indexing document {document_id}: {e}{Style.RESET_ALL}")

# Index all documents from Astra DB to Glean with a progress bar
def index_documents_to_glean(collection):
    total_docs = collection.estimated_document_count()
    with tqdm(total=total_docs, desc="Indexing documents to Glean", unit="doc", colour="blue") as pbar:
        for doc in collection.find():
            try:
                index_astra_document_into_glean(doc)
                pbar.set_postfix({"Status": f"Indexed {doc['_id']}"})
            except Exception as error:
                pbar.set_postfix({"Status": f"Error with {doc['_id']}"})
                print(f"{Fore.RED}Error indexing document {doc['_id']}: {error}{Style.RESET_ALL}")
            pbar.update(1)  # Update progress bar after each document

# Get the collection and index its documents into Glean
collection = database.get_collection(ASTRA_DB_COLLECTION_NAME)
index_documents_to_glean(collection)
print(f"{Fore.GREEN}Batch Ended Successfully!{Style.RESET_ALL}")
```
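Indexing calls to a remote API can fail transiently, for example on rate limits or network hiccups. A small retry helper with exponential backoff, shown here as an illustrative addition rather than part of the Glean client, can make long indexing runs more robust:

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    """Run call(), retrying with exponential backoff on any exception.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical usage around the indexing call:
# with_retries(lambda: documents_api_client.indexdocument_post(request))
```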
Run and test the Glean integration
- Run the script:

```bash
python3 astra-glean-import-job.py
```
- After indexing completes, search Glean for the content indexed from Astra DB.

If the results reflect your Astra DB data, the integration is successful.

If there are no results or no evidence of correct matches, check the following:
  - The script ran without error.
  - The datasource category is accurate in the datasource configuration.
  - Indexing is complete, and the datasource is populated in Glean. For more information, see the Glean documentation on Debugging/Troubleshooting using the Indexing API.
  - Your Astra DB collection contains only text data. Currently, Glean can index text data only. If your database contains other types of data, the content might not be fully supported in your Glean searches.
Next steps
Try these next steps to expand this integration:
- Use a cron job to automatically run the script at regular intervals.
- Extend the existing script or create additional scripts to pull from other databases and collections.
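For the cron job suggestion above, a crontab entry could look like the following sketch. The paths are placeholders for your environment:

```shell
# Run the import job every day at 02:00; adjust paths for your environment
0 2 * * * cd /path/to/project && /path/to/venv/bin/python astra-glean-import-job.py >> glean-sync.log 2>&1
```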