Loading data from Hadoop (HDFS)

The data mapping script for loading CSV data from HDFS is shown step by step, with explanation. The full script appears at the bottom of the page.

  • If desired, add configuration to the mapping script.
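
    For example, the full script at the bottom of the page includes the following configuration, which directs the loader to create the schema:

    // CONFIGURATION
    // Configures the data loader to create the schema

    config create_schema: true, load_new: true, preparation: true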

  • A sample of the CSV data residing on HDFS:

    // SAMPLE INPUT
    // For the author.csv file:
    // name|gender
    // Julia Child|F
    // For the book.csv file:
    // name|year|ISBN
    // Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
    // For the authorBook.csv file:
    // bname|aname
    // Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck

    Because the property key name is used for both the author and book vertex labels, the authorBook file uses the variables aname and bname for the author name and book name, respectively. These variables are referenced in the edge mapping logic, shown in the main body below, that creates the edges between author and book vertices.

  • Specify the data inputs using an HDFS URI, dfs_uri, and the filenames:

    // DATA INPUT
    // Define the data input sources
    // dfs_uri specifies the URI to the HDFS directory in which the files are stored
    
    dfs_uri = 'hdfs://hadoopNode:9000/food/'
    authorInput = File.csv(dfs_uri + 'author.csv.gz').
        gzip().
        delimiter('|')
    bookInput = File.csv(dfs_uri + 'book.csv.gz').
        gzip().
        delimiter('|')
    authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').
        gzip().
        delimiter('|')

    Because this example reads gzip-compressed files, each input includes the additional step gzip().

  • Create the main body of the mapping script, as shown below. This part of the mapping script is the same regardless of the file format.
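
    // Specifies what data source to load using which mapper (as defined inline)

    load(authorInput).asVertices {
        label "author"
        key "name"
    }

    load(bookInput).asVertices {
        label "book"
        key "name"
    }

    load(authorBookInput).asEdges {
        label "authored"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }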

  • To run DSE Graph Loader for this CSV loading example as a dry run, use the following command:

    graphloader authorBookMappingHDFS.groovy -graph testHDFS -address localhost -dryrun true

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader. The -dryrun true option runs the command without loading data.
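
    Once the dry run completes without errors, run the same command without the -dryrun true option to load the data:

    graphloader authorBookMappingHDFS.groovy -graph testHDFS -address localhost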

  • The full loading script is shown below.

    // SAMPLE INPUT
    // For the author.csv file:
    // name|gender
    // Julia Child|F
    // For the book.csv file:
    // name|year|ISBN
    // Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
    // For the authorBook.csv file:
    // bname|aname
    // Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck
    
    // CONFIGURATION
    // Configures the data loader to create the schema
    
    config create_schema: true, load_new: true, preparation: true
    
    // DATA INPUT
    // Define the data input sources
    // dfs_uri specifies the URI to the HDFS directory in which the files are stored
    
    dfs_uri = 'hdfs://hadoopNode:9000/food/'
    authorInput = File.csv(dfs_uri + 'author.csv.gz').
        gzip().
        delimiter('|')
    bookInput = File.csv(dfs_uri + 'book.csv.gz').
        gzip().
        delimiter('|')
    authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').
        gzip().
        delimiter('|')
    
    
    // Specifies what data source to load using which mapper (as defined inline)
    
    load(authorInput).asVertices {
        label "author"
        key "name"
    }
    
    load(bookInput).asVertices {
        label "book"
        key "name"
    }
    
    load(authorBookInput).asEdges {
        label "authored"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }
