Loading data from AWS S3
How to use the DSE Graph Loader to load data from AWS S3.
The data mapping script for loading from AWS S3 is shown with explanation. The full script is found at the bottom of the page.
Procedure
- If desired, add configuration to the mapping script.
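For example, the full script at the bottom of the page includes a configuration line that directs the loader to create the schema, treat the loaded data as new, and enable the preparation phase:
// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: true, load_new: true, preparation: true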
- A sample of the CSV data residing on AWS S3:
// SAMPLE INPUT
// For the author.csv file:
// name|gender
// Julia Child|F
// For the book.csv file:
// name|year|ISBN
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:
// bname|aname
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck
Because the property key name is used for both the vertex labels author and book, the authorBook file uses the variables aname and bname for the author name and book name, respectively. These variables are used in the mapping logic that creates the edges between author and book vertices, as shown in the excerpt below.
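This excerpt from the full script at the bottom of the page shows how aname and bname identify the out-vertex and in-vertex of each authored edge:
load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}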
- Specify the data inputs using an AWS S3 reference dfs_uri that defines s3://[bucket] and the filenames:
// DATA INPUT
// Define the data input sources
// dfs_uri specifies the URI to the S3 bucket in which the files are stored
dfs_uri = 's3://food/'
authorInput = File.csv(dfs_uri + 'author.csv.gz').gzip().delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv.gz').gzip().delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').gzip().delimiter('|')
This example uses compressed files, which require the additional step gzip().
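If the files were stored uncompressed, the gzip() step would simply be omitted; a minimal sketch, assuming a plain author.csv file in the same bucket:
authorInput = File.csv(dfs_uri + 'author.csv').delimiter('|')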
- Create the main body of the mapping script, shown below. This part of the mapping script is the same regardless of the file format.
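As it appears in the full script at the bottom of the page, the main body maps each input to vertices or edges:
// Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}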
- To run DSE Graph Loader with this mapping script as a dry run, use the following command:
graphloader authorBookMappingS3.groovy -graph testS3 -address localhost -dryrun true
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader. The -dryrun true option runs the command without loading data.
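After a successful dry run, the same command without the -dryrun option performs the actual load (-dryrun defaults to false):
graphloader authorBookMappingS3.groovy -graph testS3 -address localhost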
- The full loading script is shown below.
// SAMPLE INPUT
// For the author.csv file:
// name|gender
// Julia Child|F
// For the book.csv file:
// name|year|ISBN
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:
// bname|aname
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck

// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: true, load_new: true, preparation: true

// DATA INPUT
// Define the data input sources
// dfs_uri specifies the URI to the S3 bucket in which the files are stored
dfs_uri = 's3://food/'
authorInput = File.csv(dfs_uri + 'author.csv.gz').gzip().delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv.gz').gzip().delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').gzip().delimiter('|')

// Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}