Integrate AWS Glue with Astra DB Serverless
AWS Glue is a serverless data integration service you can use to create jobs that run extract, transform, and load (ETL) scripts.
This guide explains how to connect your Astra DB Serverless database to AWS Glue. After you import your non-vector data from Astra DB into AWS Glue, you can push the data to Redshift, SageMaker, and other AWS services.
Prerequisites
You need the following:
-
An active Astra account.
-
An active Serverless (Non-Vector) database.
Alternatively, you can use a Serverless (Vector) database with a collection that isn’t vector-enabled.
-
An application token with the Organization Administrator role.
-
An AWS account with permissions for Glue, S3, IAM, and AWS Secrets Manager.
Load data
Load non-vector data into Astra DB so that you can import it into AWS Glue.
-
Download the sample CSV file.
-
In the Astra Portal navigation menu, select your database.
-
For Partition key, select country_name.
-
Make a note of the Target keyspace. You need it for the connection to AWS Glue.
Create an IAM role
-
In the AWS IAM console navigation pane, select Roles.
-
Click Create Role.
-
For Trusted entity type, select AWS service.
-
For Use case, select Glue.
-
Click Next, and then add the following permissions to the role:
-
AmazonS3FullAccess
-
AWSGlueServiceRole
-
AWSGlueConsoleFullAccess
-
SecretsManagerReadWrite
-
AWSGlueJobExecutor
-
Click Next, and then enter a role name, such as AstraGlueRole.
-
Click Create Role.
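The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and your AWS credentials are already configured. The function name `create_glue_role` and its `policy_arns` parameter are illustrative, not part of the original guide. Look up the ARN of each managed policy in the IAM console and pass it in, rather than relying on a hard-coded list.

```python
import json

# Trust policy that lets the AWS Glue service assume the role.
# This mirrors selecting "AWS service" and "Glue" in the console.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_glue_role(role_name="AstraGlueRole", policy_arns=()):
    """Create the role, attach each managed policy ARN, and return the role ARN.

    policy_arns is an assumption of this sketch: supply the ARNs of the
    policies listed in the steps above.
    """
    import boto3  # imported lazily so the policy document can be inspected offline

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    for arn in policy_arns:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role["Role"]["Arn"]
```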
Configure the JDBC driver in S3
The Astra JDBC driver supports the connection between AWS Glue and your Astra DB database.
-
Download the latest astra-jdbc-driver jar file.
-
In the Amazon S3 console navigation pane, click Create Bucket.
-
Make sure the AWS Region matches the region of your Astra DB database.
-
Enter a unique bucket name.
-
In the Object Ownership section, select a permission model.
DataStax recommends using Access Control Lists (ACLs).
For information about S3 bucket settings, see Creating a bucket.
-
Click Create bucket.
-
On the list of buckets, click your new bucket to open it.
-
Click Upload, and then upload the driver jar file.
-
On your bucket’s Objects page, click the driver file, and then copy the S3 URI. You will use this when you create the Glue connector.
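If you prefer to upload the driver from a script, the sketch below shows one way to do it, assuming boto3 is installed and the bucket already exists. The helper names `s3_uri` and `upload_driver` are hypothetical. The returned string has the same form as the S3 URI that you copy from the console.

```python
import os


def s3_uri(bucket, key):
    """Build the S3 URI for an object, for example s3://my-bucket/driver.jar."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def upload_driver(local_path, bucket, key=None):
    """Upload the driver jar and return its S3 URI.

    If no key is given, the file name of local_path is used as the object key.
    """
    import boto3  # imported lazily so s3_uri can be used without boto3

    key = key or os.path.basename(local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)
```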
Create secrets
-
In the AWS Secrets Manager console, click Store a new secret.
-
For Secret type, select Other type of secret.
-
Enter the following key/value pairs:
-
user: token (the literal, all-lowercase string token)
-
password: Your application token
-
Click Next.
-
Enter a name for the secret, such as AstraGlueCreds, and then click Next.
-
On the Configure Rotation page, click Next.
-
On the Review page, click Store.
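The same secret can be created programmatically. This is a minimal sketch, assuming boto3 and configured AWS credentials. The function names are illustrative. The key/value pairs match the ones entered in the console: `user` is always the literal string `token`, and `password` is your application token.

```python
import json


def astra_secret_string(application_token):
    """Serialize the key/value pairs that the Glue connection reads."""
    return json.dumps({"user": "token", "password": application_token})


def store_astra_secret(application_token, name="AstraGlueCreds"):
    """Store the secret in AWS Secrets Manager and return its ARN."""
    import boto3  # imported lazily so the payload can be built offline

    client = boto3.client("secretsmanager")
    response = client.create_secret(
        Name=name,
        SecretString=astra_secret_string(application_token),
    )
    return response["ARN"]
```

Avoid hard-coding the application token in source files. Read it from an environment variable or another secure source before you call `store_astra_secret`.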
Create an AWS Glue connector
-
In the AWS Glue Studio console navigation pane, click Data connections.
-
In the Custom connectors section, click Create custom connector.
-
For Connector S3 URL, enter your JDBC driver’s S3 URI.
-
Enter a connector name.
-
For Connector type, select JDBC.
-
For Class name, enter com.datastax.astra.jdbc.AstraJdbcDriver.
The JDBC base URL has the following format:
jdbc:astra://ASTRA_DB_NAME/ASTRA_KEYSPACE_NAME?user=token&password=APPLICATION_TOKEN
For example:
jdbc:astra://astraglue_db/astraglue_ks?user=token&password=AstraCS:…
-
For URL parameter delimiter, enter '&'.
-
Click Create connector.
-
On the Data connections page, in the Connectors section, click your JDBC connector.
-
Click Create connection.
-
Enter a connection name.
-
For Connection credential type, select default.
-
For AWS Secret, select your Astra DB secret.
-
Click Create connection.
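To avoid typos when you assemble the JDBC base URL for the connector, you can generate it from its parts. The helper below is a small illustrative sketch. The function name is hypothetical, and the format follows the pattern shown in the connector steps.

```python
def astra_jdbc_url(db_name, keyspace, application_token):
    """Build a JDBC base URL in the jdbc:astra://DB/KEYSPACE?user=token&password=... form."""
    for label, value in (("db_name", db_name), ("keyspace", keyspace),
                         ("application_token", application_token)):
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"jdbc:astra://{db_name}/{keyspace}?user=token&password={application_token}"
```

For example, `astra_jdbc_url("astraglue_db", "astraglue_ks", token)` produces a URL for the database and keyspace names used in the example above. The resulting URL contains your application token, so treat it as a secret.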
Create an AWS Glue database and table
-
Download the sample schema.
-
In the Amazon S3 Console, create a new, empty bucket for AWS Glue. Make sure the bucket is in the same region as your Glue instance.
-
In the AWS Glue console navigation pane, click Data Catalog/Databases.
-
Click Add database.
-
Enter a database name, and then click Create database.
-
Click Tables, and then click Add table.
-
Enter a table name.
-
Select your new database.
-
Set the Data store to S3, and then select your AWS Glue S3 bucket.
-
Set the Data format to CSV with comma (,) delimiters.
-
Click Next, and then click Choose or define schema.
-
Click Edit schema as JSON.
-
Click Choose file, and then select the sample schema.txt file.
-
Click Save, and then click Next.
-
Review the table attributes, and then click Create.
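The database and table can also be defined through the AWS Glue API. The following sketch assumes boto3 and configured credentials. The helper names are hypothetical, and the column list is a placeholder: take the real column names and types from the sample schema file rather than from this example. The storage settings describe a comma-delimited CSV table stored in S3.

```python
def csv_table_input(table_name, bucket_uri, columns):
    """Build a Glue TableInput for a comma-delimited CSV table in S3.

    columns is a list of (name, type) pairs. The values you pass are
    up to you; this sketch does not know the sample schema.
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": col_type} for name, col_type in columns],
            "Location": bucket_uri,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def create_catalog_table(database_name, table_input):
    """Create the Glue database, and then create the table inside it."""
    import boto3  # imported lazily so the table definition can be built offline

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": database_name})
    glue.create_table(DatabaseName=database_name, TableInput=table_input)
```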
Create an ETL job
Create an ETL job in AWS Glue that ingests data from Astra DB, transforms it to a defined schema, and then pushes it to your AWS Glue database and table.
-
In the AWS Glue Studio console navigation pane, click Data connections.
-
In the Connections section, click your JDBC connection.
-
Click Create job.
-
Enter a job name.
-
On the Visual tab, click the Data source - Connection node for your JDBC connection.
-
For Table name, enter demographics.
If you changed the table name when you loaded data in Astra DB, use that name instead.
-
On the Data preview tab, select your AWS Glue IAM role.
-
To start the data transfer, click Start data preview session, and then wait for the job to run.
If data extraction succeeded, the preview shows your data from Astra DB.
-
On the Output schema tab, select Use data preview schema. This applies the data schema from Astra DB to your AWS Glue job.
-
On the Visual tab, click Add, click the Transform category, and then select Change Schema.
-
Make sure the Transform - Change Schema node is connected to the Data source - Connection node.
If they are not connected, click the Change Schema node, and then set the Parent node to the Connection node.
-
Click the new Transform - Change Schema node, and then inspect the Transform settings:
-
Output schema: The schema received from the Data source - Connection node
-
Data preview: A preview of the data after transformation
-
On the Visual tab, click Add, click the Targets category, and then select Glue Data Catalog.
-
Click the Target - Glue Data Catalog node, and then edit the following settings:
-
Set the Parent node to Change Schema.
-
Select your AWS Glue database and table.
-
Click Save.
Test the job
-
In the AWS Glue Studio console navigation pane, click ETL jobs, and then select your ETL job.
-
Click Run.
-
After the job runs, go to the Amazon S3 console, and then click your AWS Glue database bucket.
-
Click Download to download a CSV file of the data imported to your AWS Glue database from Astra DB.
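You can also start the job and wait for it to finish from a script instead of the console. This is a minimal sketch, assuming boto3 and configured credentials. The function names and the polling interval are illustrative choices. The job name is whatever you entered when you created the ETL job.

```python
import time

# Job run states after which the run no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def is_terminal(state):
    """Return True if a Glue job run state means the run has ended."""
    return state in TERMINAL_STATES


def run_job_and_wait(job_name, poll_seconds=30):
    """Start the Glue job, poll until the run ends, and return the final state."""
    import boto3  # imported lazily so is_terminal can be used without boto3

    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if is_terminal(state):
            return state
        time.sleep(poll_seconds)
```

A return value of `SUCCEEDED` indicates that the run completed. For any other final state, check the run details in the AWS Glue Studio console.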