Integrate Glean with Astra DB Serverless
With the Glean platform integration, you can use information from Astra DB Serverless databases as a datasource for your Glean searches.
To do this, you use the Data API to push data from Astra DB through the Glean Indexing API to a custom datasource in Glean.
Use this integration if you need to query your Astra DB data alongside your other Glean datasources. This is recommended for use cases where your Astra DB data contains human-readable information that you want users to find in Glean search results. This integration is best for non-vector data, such as non-vector CSV data, that you can’t ingest through other Glean datasources.
You can build and run a script locally or use this guide’s Colab notebook, which contains a complete script example, including the Prepare Glean and Prepare Astra DB steps.
Prerequisites
This integration requires the following:
- Glean administrator privileges or access to your Glean administrator.
- An active Astra account.
- Familiarity with the Astra DB Data API and the Glean Indexing API.
Prepare Astra DB
- If you don’t already have one, create a Serverless (Vector) database.

- Generate an application token with the Database Administrator role, and then get your database’s API endpoint, which is in the form `https://DATABASE_ID-REGION.apps.astra.datastax.com`. For more information, see Generate an application token for a database.

- Set the Astra DB environment variables for your integration script:
`.env`

```bash
ASTRA_DB_APPLICATION_TOKEN=TOKEN
ASTRA_DB_API_ENDPOINT=API_ENDPOINT
ASTRA_DB_COLLECTION_NAME=COLLECTION_NAME
ASTRA_DB_KEYSPACE="default_keyspace" # Optional
```
Replace the following:

- `TOKEN` and `API_ENDPOINT`: The application token and API endpoint that you retrieved in the previous step.
- `COLLECTION_NAME`: The name of the collection in your database where you will store the data that you want to push to Glean. This can be the name of an existing collection or a collection that you will create in the next step.
- `ASTRA_DB_KEYSPACE`: Optional. If you omit this variable, the Data API uses the default keyspace.
- Create a collection in your database or use an existing collection. Make sure the collection’s name matches your `ASTRA_DB_COLLECTION_NAME` environment variable.

The Data API requires a collection in a Serverless (Vector) database. Both vector and non-vector data can be stored in collections.

The integration script shown on this page creates a collection and populates it with sample data. If you want to use an existing collection instead, adjust the script.
Example: Replace collection creation

To read from an existing collection, replace the following part of the script:

```python
# Create collection
source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)
print(
    f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
    f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}."
)
```

with the following:

```python
# Get an existing collection
source_collection = database.get_collection(ASTRA_DB_COLLECTION_NAME)
print(
    f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
    f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}."
)
```
- Insert the data that you want Glean to index. This can be vector or non-vector data.

If your collection contains vector data, Glean does not differentiate embeddings from other data. In the same way that Glean processes your other datasources, Glean indexes your Astra DB data, including embeddings, as text.

In the integration script shown on this page, the collection is populated with sample data from a standard dataset of philosophical quotes.
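The insertion step above can be sketched as a small helper that shapes raw rows into Data API documents before calling `insert_many`. The field names (`author`, `quote`, `tags`) follow the sample dataset used on this page, and the `insert_many` call is left commented out because it requires a live collection; everything else is plain Python:

```python
def rows_to_documents(rows):
    """Shape raw rows into Data API documents.

    Assigns a deterministic _id and normalizes the semicolon-separated
    tags field into a list, mirroring the sample script on this page.
    """
    documents = []
    for index, row in enumerate(rows):
        doc = dict(row)
        doc["_id"] = index
        doc["tags"] = [t for t in (row.get("tags") or "").split(";") if t]
        documents.append(doc)
    return documents


rows = [
    {"author": "aristotle", "quote": "Quality is not an act.", "tags": "ethics;habit"},
    {"author": "plato", "quote": "Courage is knowing what not to fear.", "tags": None},
]
docs = rows_to_documents(rows)
# To actually insert (requires a live collection):
# source_collection.insert_many(docs)
```

Normalizing documents before insertion keeps the collection schema predictable, which simplifies the Glean indexing step later.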
Prepare Glean
You must be a Glean administrator to complete some of the following steps. If you are not, ask your Glean administrator for assistance.
- Create a Glean custom datasource for your Astra DB Serverless database, and give it a distinctive name. You can do this by creating a custom app or by calling the Indexing API `/adddatasource` endpoint.

In the integration script shown on this page, the datasource is created programmatically. If you create the datasource another way, comment out that part of the script.

- Create a Glean Indexing API token.

- Get your customer name from your Glean API endpoint. The endpoint follows the format `https://GLEAN_CUSTOMER_NAME-be.glean.com/api/index/v1`.

- Set the Glean environment variables for your integration script:

`.env`

```bash
GLEAN_CUSTOMER=GLEAN_CUSTOMER_NAME
GLEAN_DATASOURCE_NAME=GLEAN_DATASOURCE_NAME
GLEAN_API_TOKEN=GLEAN_INDEXING_API_TOKEN
```
You can append these lines to the `.env` file that you already prepared for the Astra DB credentials.
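Before running the script, it can help to confirm that all six variables are set, so that a missing credential fails fast rather than partway through indexing. The `missing_env_vars` helper below is a hypothetical addition, not part of the integration script on this page:

```python
import os

# The six variables that the integration script reads
REQUIRED_VARS = [
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_COLLECTION_NAME",
    "GLEAN_CUSTOMER",
    "GLEAN_DATASOURCE_NAME",
    "GLEAN_API_TOKEN",
]


def missing_env_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

For example, `missing_env_vars()` returns an empty list when the `.env` file is complete, and lists the missing names otherwise.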
Create a Glean indexing script
The script shown on this page uses the Data API Python client. If you prefer to use another client, or to issue plain HTTP requests to the Data API, modify the script accordingly. See Astra DB Data API for more information.
- Ensure that your Python version is 3.9 or later, and then prepare a virtual environment:

```bash
python3 -m venv my_virtual_env
```

- Activate the virtual environment, and then install the required packages, including `python-dotenv` and `colorama`, which the script imports:

```bash
source my_virtual_env/bin/activate # On Windows, run: my_virtual_env\Scripts\activate
pip install \
  "astrapy>=2.0,<3.0" \
  "datasets>=3.5,<4.0" \
  "python-dotenv" \
  "colorama" \
  "https://app.glean.com/meta/indexing_api_client.zip"
```
- Import the required dependencies into the script:

This and the following steps show the various parts of the indexing procedure. You can run the script successfully once you have assembled all of the parts into a complete Python script.

`astra-glean-import-job.py`

```python
import os

from astrapy import DataAPIClient
from colorama import Fore, Style
from datasets import load_dataset
from dotenv import load_dotenv

import glean_indexing_api_client as indexing_api
from glean_indexing_api_client.api import datasources_api, documents_api
from glean_indexing_api_client.model.custom_datasource_config import (
    CustomDatasourceConfig,
)
from glean_indexing_api_client.model.object_definition import ObjectDefinition
from glean_indexing_api_client.model.index_document_request import IndexDocumentRequest
from glean_indexing_api_client.model.document_definition import DocumentDefinition
from glean_indexing_api_client.model.content_definition import ContentDefinition
from glean_indexing_api_client.model.document_permissions_definition import (
    DocumentPermissionsDefinition,
)
```
- Load the required environment variables:

`astra-glean-import-job.py`

```python
# Load environment variables from .env
load_dotenv()

ASTRA_DB_APPLICATION_TOKEN = os.environ["ASTRA_DB_APPLICATION_TOKEN"]
ASTRA_DB_API_ENDPOINT = os.environ["ASTRA_DB_API_ENDPOINT"]
ASTRA_DB_COLLECTION_NAME = os.environ["ASTRA_DB_COLLECTION_NAME"]
ASTRA_DB_KEYSPACE = os.getenv("ASTRA_DB_KEYSPACE")
GLEAN_API_TOKEN = os.environ["GLEAN_API_TOKEN"]
GLEAN_CUSTOMER = os.environ["GLEAN_CUSTOMER"]
GLEAN_DATASOURCE_NAME = os.environ["GLEAN_DATASOURCE_NAME"]

print(f"{Fore.GREEN}============================={Style.RESET_ALL}")
print(f"{Fore.GREEN} ASTRADB - GLEAN INTEGRATION {Style.RESET_ALL}")
print(f"{Fore.GREEN}============================={Style.RESET_ALL}\n")
```
- Initialize the Astra DB client and database:

`astra-glean-import-job.py`

```python
# Initialize Astra DB client
client = DataAPIClient(callers=[("glean", "1.0")])
database = client.get_database(
    ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    keyspace=ASTRA_DB_KEYSPACE,
)
print(
    f"{Fore.CYAN}[ OK ] - Credentials are OK, your database name is "
    f"{Style.RESET_ALL}{database.name()}{Fore.CYAN}."
)
```
- Create the Astra DB collection:

`astra-glean-import-job.py`

```python
# Create collection
source_collection = database.create_collection(ASTRA_DB_COLLECTION_NAME)
print(
    f"{Fore.CYAN}[ OK ] - Collection {Style.RESET_ALL}{source_collection.name}"
    f"{Fore.CYAN} is ready{Style.RESET_ALL}{Fore.CYAN}."
)
```
- Load a sample dataset, and then populate the Astra DB collection:

`astra-glean-import-job.py`

```python
# Load philosophers dataset
print(f"{Fore.CYAN}[INFO] - Downloading data from Hugging Face 🤗.{Style.RESET_ALL}")
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
print(f"{Fore.CYAN}[ OK ] - Dataset loaded in memory.{Style.RESET_ALL}")
print(f"{Fore.CYAN}[INFO] - Sample record: {Style.RESET_ALL}{philo_dataset[16]}")


def load_to_astra_db(data_to_insert, collection):
    """Load all of the provided data into a collection."""

    def split_tags(t):
        return [tag for tag in (t or "").split(";") if tag]

    documents_to_insert = [
        {
            **item,
            **{"_id": index, "tags": split_tags(item["tags"])},
        }
        for index, item in enumerate(data_to_insert)
    ]
    collection.insert_many(documents_to_insert)


# Insert documents into Astra DB
philo_count = len(philo_dataset)
print(
    f"{Fore.CYAN}[INFO] - Inserting {philo_count} documents into Astra DB..."
    f"{Style.RESET_ALL}"
)
load_to_astra_db(philo_dataset, source_collection)
print(f"{Fore.CYAN}[ OK ] - Insertion finished.{Style.RESET_ALL}")
```
- Initialize the Glean API client:

`astra-glean-import-job.py`

```python
# Set up Glean API
GLEAN_API_ENDPOINT = f"https://{GLEAN_CUSTOMER}-be.glean.com/api/index/v1"
print(
    f"{Fore.CYAN}[INFO] - Glean API setup, endpoint is:"
    f"{Style.RESET_ALL} {GLEAN_API_ENDPOINT}"
)

# Initialize Glean client
configuration = indexing_api.Configuration(
    host=GLEAN_API_ENDPOINT, access_token=GLEAN_API_TOKEN
)
api_client = indexing_api.ApiClient(configuration)
datasource_api = datasources_api.DatasourcesApi(api_client)
print(f"{Fore.CYAN}[ OK ] - Glean client initialized{Style.RESET_ALL}")
```
- Create and register a new Glean datasource:

`astra-glean-import-job.py`

```python
# Create and register datasource in Glean
datasource_config = CustomDatasourceConfig(
    name=GLEAN_DATASOURCE_NAME,
    display_name="Astra DB Collection DataSource",
    datasource_category="PUBLISHED_CONTENT",
    url_regex=f"^{ASTRA_DB_API_ENDPOINT}",
    object_definitions=[
        ObjectDefinition(doc_category="PUBLISHED_CONTENT", name="AstraVectorEntry")
    ],
)
try:
    datasource_api.adddatasource_post(datasource_config)
    print(
        f"{Fore.GREEN}[ OK ] - DataSource has been created!"
        f"{Style.RESET_ALL}{Fore.GREEN}."
    )
except indexing_api.ApiException as e:
    print(
        f"{Fore.RED}[ ERROR ] - Error creating datasource: "
        f"{e}{Style.RESET_ALL}{Fore.GREEN}."
    )
```
- Index the Astra DB documents in Glean:

`astra-glean-import-job.py`

```python
def index_astra_db_document_into_glean(astra_document):
    """Index one Astra DB document into Glean."""
    document_id = str(astra_document["_id"])
    title = f"{astra_document['author']} quote_{astra_document['_id']}"
    body_text = astra_document["quote"]
    datasource_name = GLEAN_DATASOURCE_NAME
    request = IndexDocumentRequest(
        document=DocumentDefinition(
            datasource=datasource_name,
            title=title,
            id=document_id,
            view_url=ASTRA_DB_API_ENDPOINT,
            body=ContentDefinition(mime_type="text/plain", text_content=body_text),
            permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
        )
    )
    documents_api_client = documents_api.DocumentsApi(api_client)
    try:
        documents_api_client.indexdocument_post(request)
    except indexing_api.ApiException as e:
        print(f"{Fore.RED}Error indexing document {document_id}: {e}{Style.RESET_ALL}")


def index_documents_to_glean(collection):
    """Index all documents from an Astra DB collection to Glean."""
    total_docs = collection.count_documents({}, upper_bound=1000)
    print(
        f"{Fore.CYAN}[INFO] - Indexing {total_docs} "
        f"documents into Glean...{Style.RESET_ALL}"
    )
    for doc in collection.find():
        try:
            index_astra_db_document_into_glean(doc)
        except Exception as error:
            print(
                f"{Fore.RED}Error indexing document "
                f"{doc['_id']}: {error}{Style.RESET_ALL}"
            )
    print(f"{Fore.CYAN}[ OK ] - Indexing finished.{Style.RESET_ALL}")


# Use the function to index documents into Glean
index_documents_to_glean(source_collection)
print(f"{Fore.GREEN}Import job completed successfully!{Style.RESET_ALL}")
```
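Per-document indexing calls over the network can fail transiently. As a sketch, a generic retry wrapper in plain Python (not part of the Glean client) could wrap the per-document call, for example `with_retries(index_astra_db_document_into_glean, doc)`:

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call func, retrying with exponential backoff on any exception.

    A generic sketch: the Glean Indexing API client raises ApiException
    on failures, which this wrapper catches like any other error. The
    last failure is re-raised so the caller can log and move on.
    """
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Keeping the retry logic outside the indexing function leaves the script's error reporting unchanged while making one-off network hiccups invisible.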
Run and test the Glean integration
- Run the script:

```bash
python3 astra-glean-import-job.py
```
- After the script finishes, search Glean for the content indexed from Astra DB. If the results reflect your Astra DB data, the integration completed successfully.

If there are no results, or the results don't include the expected content, check the following:

- The script ran without errors.
- The datasource category is accurate in the datasource configuration.
- Indexing is complete, and the datasource is populated in Glean. For more information, see the Glean documentation on Debugging.
- Your Astra DB collection contains only text data. Currently, Glean can index text data only. If your database contains other types of data, the content might not be fully supported in your Glean searches.
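If your collection mixes text with other types, one approach is to build the Glean document body from the string-valued fields only. The `text_fields_only` helper below is a hypothetical sketch, not part of the script on this page:

```python
def text_fields_only(astra_document):
    """Concatenate the string-valued fields of a Data API document.

    Skips _id, vectors, numbers, and nested structures so that only
    human-readable text reaches the Glean document body.
    """
    parts = [
        value
        for key, value in astra_document.items()
        if key != "_id" and isinstance(value, str)
    ]
    return "\n".join(parts)
```

You could call a helper like this in place of reading the `quote` field directly when constructing the `ContentDefinition` body.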
Next steps

Try these next steps to expand this integration:

- Use a cron job to automatically run the script at regular intervals.
- Extend the existing script, or create additional scripts, to pull data from other databases and collections.
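For example, a crontab entry could run the import job nightly. The paths below are placeholders for wherever you keep the script and its virtual environment:

```
# Run the import job every day at 02:00; adjust paths to your setup
0 2 * * * cd /path/to/project && ./my_virtual_env/bin/python astra-glean-import-job.py >> import.log 2>&1
```

Appending stdout and stderr to a log file keeps a record of each run's `[ OK ]` and error messages for later inspection.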