Loading data from Hadoop (HDFS)
The data mapping script for loading from HDFS is explained step by step below. The full script appears at the bottom of the page.
- If desired, add configuration to the mapping script.
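For example, the full script at the bottom of this page uses the following configuration, which creates the graph schema during loading:

// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: true, load_new: true, preparation: true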
- A sample of the CSV data residing on HDFS:

// SAMPLE INPUT
// For the author.csv file:
// name|gender
// Julia Child|F
// For the book.csv file:
// name|year|ISBN
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:
// bname|aname
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck
Because the property key name is used for both vertex labels author and book, the authorBook file uses the variables aname and bname for the author name and book name, respectively. These variables are used in the mapping logic that creates the edges between author and book vertices.
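The edge mapping that consumes these variables, repeated here from the full script below, looks like this:

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}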
- Specify the data inputs using an HDFS reference dfs_uri and the filenames:

// DATA INPUT
// Define the data input sources
// dfs_uri specifies the URI to the HDFS directory in which the files are stored
dfs_uri = 'hdfs://hadoopNode:9000/food/'

authorInput = File.csv(dfs_uri + 'author.csv.gz').
    gzip().
    delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv.gz').
    gzip().
    delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').
    gzip().
    delimiter('|')
This example uses compressed files and the additional step gzip().
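If the files were not compressed, the gzip() step would simply be omitted; a hypothetical input definition for a plain author.csv:

// Hypothetical variant for an uncompressed file (not part of this example)
authorInput = File.csv(dfs_uri + 'author.csv').
    delimiter('|')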
- Create the main body of the mapping script. This part of the mapping script is the same regardless of the file format.
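The main body, as it appears in the full script below, maps each input to vertices or edges:

// Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}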
- To run DSE Graph Loader as a dry run, use the following command:

graphloader authorBookMappingHDFS.groovy -graph testHDFS -address localhost -dryrun true
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader. The -dryrun true option runs the command without loading data.
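To perform the actual load once the dry run looks correct, drop the -dryrun true option:

graphloader authorBookMappingHDFS.groovy -graph testHDFS -address localhost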
- The full loading script is shown:
// SAMPLE INPUT
// For the author.csv file:
// name|gender
// Julia Child|F
// For the book.csv file:
// name|year|ISBN
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:
// bname|aname
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck

// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: true, load_new: true, preparation: true

// DATA INPUT
// Define the data input sources
// dfs_uri specifies the URI to the HDFS directory in which the files are stored
dfs_uri = 'hdfs://hadoopNode:9000/food/'

authorInput = File.csv(dfs_uri + 'author.csv.gz').
    gzip().
    delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv.gz').
    gzip().
    delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').
    gzip().
    delimiter('|')

// Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
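As a quick sanity check after loading, the vertex and edge counts can be queried from the Gremlin console; a minimal sketch, assuming the graph was loaded as testHDFS and that only the sample rows above were present:

// Hypothetical verification session in the Gremlin console
gremlin> :remote config alias g testHDFS.g
gremlin> g.V().count()    // the sample data above would yield 2 (one author, one book)
gremlin> g.E().count()    // the sample data above would yield 1 'authored' edge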