Loading TEXT data

This page walks through the data mapping script for delimited text data step by step. The full script follows the step-by-step explanation.

  • If desired, add configuration to the mapping script.
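
    As a minimal sketch, the configuration line used by the full script on this page is shown here on its own. The values are the ones from that script, not recommendations, and the per-option comments are a reading of the option names that should be checked against the Graph Loader reference:

    ```groovy
    // CONFIGURATION
    // create_schema: let the loader create schema elements it finds in the data
    // load_new: treat the loaded vertices as new to the graph
    // load_vertex_threads: number of threads used to load vertices
    config create_schema: true, load_new: true, load_vertex_threads: 3
    ```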

  • The data to load looks like the following sample:

    SAMPLE INPUT
    // For the author.dat file:
    Julia Child|F
    // For the book.dat file:
    Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
    // For the authorBook.dat file:
    Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck

  • Specify the data input files. The variable inputfiledir names the directory that holds the input files. Each file identified below is loaded.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/TEXT/'
    authorInput = File.text(inputfiledir + "author.dat").
        delimiter("|").
        header('name', 'gender')
    bookInput = File.text(inputfiledir + "book.dat").
        delimiter("|").
        header('name', 'year', 'ISBN')
    authorBookInput = File.text(inputfiledir + "authorBook.dat").
        delimiter("|").
        header('bname', 'aname')

    Because both vertex labels, author and book, use the property key name, the authorBook file uses the variables aname and bname for the author name and the book name, respectively. The mapping logic uses these variables to create the edges between author and book vertices.
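
    To make the field naming concrete, the following is a small sketch in plain Groovy. It is not Graph Loader API and does not describe how the loader represents records internally; it only pairs the header names with the fields of the sample authorBook.dat line:

    ```groovy
    // Plain Groovy illustration, run outside of graphloader.
    def header = ['bname', 'aname']
    def line = "Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck"

    // tokenize splits the line on the "|" delimiter;
    // transpose pairs each header name with the field in the same position
    def record = [header, line.tokenize('|')].transpose().collectEntries()

    assert record.bname == "Simca's Cuisine: 100 Classic French Recipes for Every Occasion"
    assert record.aname == 'Simone Beck'
    ```

    The book name comes first in the file, which is why the header lists bname before aname.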

  • In each statement, the input is declared as a text file, the file name is given, a delimiter is set, and a header can be supplied to name the fields that are read. Alternatively, the header can be given on the first line of the data file. The result, authorInput in the example below, is a map that is used to process the data. The map can be manipulated with transforms before loading.

    authorInput = File.text(inputfiledir + "author.dat").delimiter("|").header('name', 'gender')
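
    As a hedged sketch of a transform, the following lowercases the gender field before loading. It assumes that transform accepts a closure which receives one record as a map and returns the modified record; check the Graph Loader reference for the exact behavior:

    ```groovy
    inputfiledir = '/tmp/TEXT/'
    authorInput = File.text(inputfiledir + "author.dat").
        delimiter("|").
        header('name', 'gender')

    // Assumption: the closure is called once per record, and the value it
    // returns (the trailing "it") is the record that gets loaded.
    authorInput = authorInput.transform { it['gender'] = it['gender'].toLowerCase(); it }
    ```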

    If the mapping script uses header() and the data file also has a header line, the two must match. One of the two is required.
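
    The two ways of supplying the header can be sketched as follows. The second variant is hypothetical for this data set: it assumes author.dat begins with the line name|gender, which the sample shown earlier does not have.

    ```groovy
    inputfiledir = '/tmp/TEXT/'

    // Variant 1: field names supplied in the mapping script.
    // author.dat contains only data lines such as: Julia Child|F
    authorInput = File.text(inputfiledir + "author.dat").
        delimiter("|").
        header('name', 'gender')

    // Variant 2 (assumes the first line of author.dat is: name|gender).
    // The header() call is omitted and the field names come from the file.
    // authorInput = File.text(inputfiledir + "author.dat").
    //     delimiter("|")
    ```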

  • Create the main body of the mapping script. This part of the mapping script is the same regardless of the file format.
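
    For reference, the main body of the full script on this page consists of one load statement per input. The statements are copied from that script; the comments are added here as a reading of what each line does:

    ```groovy
    // Each line of author.dat becomes an author vertex, identified by name
    load(authorInput).asVertices {
        label "author"
        key "name"
    }

    // Each line of book.dat becomes a book vertex, identified by name
    load(bookInput).asVertices {
        label "book"
        key "name"
    }

    // Each line of authorBook.dat becomes an authored edge
    load(authorBookInput).asEdges {
        label "authored"
        // outgoing vertex: the author whose name matches the aname field
        outV "aname", {
            label "author"
            key "name"
        }
        // incoming vertex: the book whose name matches the bname field
        inV "bname", {
            label "book"
            key "name"
        }
    }
    ```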

  • To run DSE Graph Loader for text loading as a dry run, use the following command:

    $ graphloader authorBookMappingTEXT.groovy -graph testTEXT -address localhost -dryrun true

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
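
    As an alternative sketch, the dry run could be requested in the mapping script instead of on the command line. This assumes that dryrun is accepted as a config option in the same way as the options in the full script, which this page does not confirm:

    ```groovy
    // CONFIGURATION
    // Assumption: dryrun can be set here as well as with "-dryrun true"
    config dryrun: true, create_schema: true, load_new: true
    ```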

  • The full loading script follows:

    /** SAMPLE INPUT
    author: Julia Child|F
    book : Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
    authorBook: Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck
     */
    
    // CONFIGURATION
    // Configures the data loader to create the schema
    config create_schema: true, load_new: true, load_vertex_threads: 3
    
    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/TEXT/'
    authorInput = File.text(inputfiledir + "author.dat").
        delimiter("|").
        header('name', 'gender')
    bookInput = File.text(inputfiledir + "book.dat").
        delimiter("|").
        header('name', 'year', 'ISBN')
    authorBookInput = File.text(inputfiledir + "authorBook.dat").
        delimiter("|").
        header('bname', 'aname')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(authorInput).asVertices {
        label "author"
        key "name"
    }
    
    load(bookInput).asVertices {
        label "book"
        key "name"
    }
    
    load(authorBookInput).asEdges {
        label "authored"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }

Mapping several files with the same format from a directory

  • The data to load looks like the following sample:

    SAMPLE INPUT
    // For the author.text file:
    name|gender
    Julia Child|F
    Simone Beck|F
    
    // For the knows.text file:
    aname|bname
    Julia Child|James Beard

    All files in a directory are expected to share the same format. If the files differ, graphloader issues an error and stops:

    java.lang.IllegalArgumentException: /tmp/dirSource/data has more than 1 input type.
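
    One way to inspect a directory before loading is sketched below in standalone Groovy (run with the groovy command, not inside a mapping script). It groups files by their first line, assuming every file starts with a header line as in the sample above. What graphloader itself compares when it reports more than one input type is not stated on this page, so treat the check as an aid, not a guarantee:

    ```groovy
    // Standalone Groovy; here File is java.io.File, not the loader's File helper.
    def dir = new File('/tmp/dirSource/data')

    // Group the file names in the directory by their first (header) line
    def byHeader = (dir.listFiles() ?: []).
        findAll { it.isFile() }.
        groupBy { f -> f.withReader { it.readLine() } }.
        collectEntries { hdr, files -> [(hdr): files*.name] }

    byHeader.each { hdr, names -> println "${hdr}: ${names}" }
    if (byHeader.size() > 1) {
        println "More than one header found; consider one directory per format."
    }
    ```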

  • Specify the data input directory. The variable inputfiledir names the directory that holds the input files. Every file in that directory is loaded.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/dirSource/data'
    personInput = File.directory(inputfiledir).delimiter('|').header('name','gender')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "author"
        key "name"
    }

    The key element is File.directory(), which names the directory where the input files are stored.

  • Two directories can be used to load vertices and edges separately:

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/dirSource/data'
    vertexfiledir = inputfiledir+'/vertices'
    edgefiledir = inputfiledir+'/edges'
    personInput = File.directory(vertexfiledir).delimiter('|').header('name','gender')
    personEdgeInput = File.directory(edgefiledir).delimiter('|').header('aname','bname')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "author"
        key "name"
    }
    
    load(personEdgeInput).asEdges {
        label "knows"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "author"
            key "name"
        }
    }

  • To run DSE Graph Loader for text file loading from a directory, use the following command:

    $ graphloader dirSourceMapping.groovy -graph testdirSource -address localhost

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
