Loading JSON data

A common file format for loading graph data is JSON. An input JSON file holds all key and value information in a nested structure.

Mapping several different JSON files

Mapping several different JSON files with DSE Graph Loader.

DSE Graph Loader can load several different JSON files that exist in a directory using the following steps. Sample input data:

SAMPLE INPUT
// For the author.json file:
{"name":"Julia Child","gender":"F"}
// For the book.json file:
{"name":"The Art of French Cooking, Vol. 1","year":"1961","ISBN":"none"}
// For the authorBook.json file:
{"bname":"The Art of French Cooking, Vol. 1","aname":"Julia Child"}

Because the property key name is used by both the author and book vertex labels, the authorBook file uses the fields aname and bname for the author name and book name, respectively. The mapping logic uses these fields to create the edges between author and book vertices.

  1. If desired, add configuration to the mapping script.

  2. Specify the data input files. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/JSON/'
    authorInput = File.json(inputfiledir + 'author.json')
    bookInput = File.json(inputfiledir + 'book.json')
    authorBookInput = File.json(inputfiledir + 'authorBook.json')

  3. Each line specifies a JSON input file by name using File.json. File.json expects one JSON object per line. Each call creates a map, such as authorInput, that is used to process the data. The map can be manipulated before loading using transforms.

    authorInput = File.json(inputfiledir + 'author.json')

  4. Create the main body of the mapping script. This part of the mapping script is the same regardless of the file format.

  5. To run DSE Graph Loader for JSON loading as a dry run, use the following command:

    graphloader authorBookMappingJSON.groovy -graph testJSON -address localhost -dryrun true

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
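Because File.json expects one JSON object per line, it can help to check an input file before a load. The following is a minimal standalone Groovy sketch, not part of a mapping script and not a DSE Graph Loader API: it uses the standard groovy.json.JsonSlurper class, and the second sample record is an illustrative assumption.

```groovy
import groovy.json.JsonSlurper

// Illustrative sample content: one complete JSON object per line,
// the layout that File.json is described as expecting.
// The second record is a made-up example, not data from this page.
def sample = '''{"name":"Julia Child","gender":"F"}
{"name":"Simone Beck","gender":"F"}
'''

def slurper = new JsonSlurper()

// Parse every non-blank line on its own. A line that is not a complete
// JSON object fails here rather than during a load.
def records = sample.readLines()
        .findAll { it.trim() }
        .collect { line ->
            def parsed = slurper.parseText(line)
            assert parsed instanceof Map : "not a JSON object: ${line}"
            parsed
        }

records.each { println it }
```

To check a real file, the sample string could be replaced with the text of the file, for example the contents of author.json.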

The full mapping script is shown:

/* SAMPLE INPUT
author: {"name":"Julia Child","gender":"F"}
book : {"name":"The Art of French Cooking, Vol. 1","year":"1961","ISBN":"none"}
authorBook: {"bname":"The Art of French Cooking, Vol. 1","aname":"Julia Child"}
 */

// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: true, load_new: true, load_vertex_threads: 3

// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files that is given in the commandline
// as the "-filename" option

inputfiledir = '/tmp/JSON/'
authorInput = File.json(inputfiledir + 'author.json')
bookInput = File.json(inputfiledir + 'book.json')
authorBookInput = File.json(inputfiledir + 'authorBook.json')

//Specifies what data source to load using which mapper (as defined inline)

load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
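
To illustrate what the asEdges mapping expresses, the following conceptual sketch resolves the aname and bname fields against vertices keyed by name. It uses plain Groovy maps as a stand-in for the graph and says nothing about how DSE Graph Loader is implemented; all variable names are hypothetical.

```groovy
// Conceptual sketch only: plain Groovy maps stand in for the graph.
// Records mirror the SAMPLE INPUT comment in the script above.
def authors = [[name: 'Julia Child', gender: 'F']]
def books = [[name: 'The Art of French Cooking, Vol. 1', year: '1961', ISBN: 'none']]
def authorBook = [[bname: 'The Art of French Cooking, Vol. 1', aname: 'Julia Child']]

// Index vertices by their key property ("name" for both labels).
def authorByName = authors.collectEntries { [(it.name): it] }
def bookByName = books.collectEntries { [(it.name): it] }

// For each edge record, aname selects the outgoing (author) vertex and
// bname selects the incoming (book) vertex.
def edges = authorBook.collect { rec ->
    [label: 'authored', outV: authorByName[rec.aname], inV: bookByName[rec.bname]]
}

edges.each { e -> println "${e.outV.name} -[${e.label}]-> ${e.inV.name}" }
```

The separate field names matter because both vertex labels use name as their key: the edge record needs one field per endpoint to say which vertex is which.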

Mapping several files with same format from a directory

DSE Graph Loader can load several JSON files with the same format that exist in a directory using the following steps. Sample input data:

SAMPLE INPUT
// For the author.json file:
{"name":"Julia Child","gender":"F"}
// For the book.json file:
{"name":"The Art of French Cooking, Vol. 1","year":"1961","ISBN":"none"}
// For the authorBook.json file:
{"bname":"The Art of French Cooking, Vol. 1","aname":"Julia Child"}

A number of files with the same format exist in a directory. If the files differ, the graphloader will issue an error and stop:

java.lang.IllegalArgumentException: /tmp/dirSource/data has more than 1 input type.

  1. If desired, add configuration to the mapping script.

  2. Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/dirSource/data'
    personInput = File.directory(inputfiledir)
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "author"
        key "name"
    }

    The important element is File.directory(); this defines the directory where the files are stored.

  3. Note that two directories could be used to load vertices and edges:

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/dirSource/data'
    vertexfiledir = inputfiledir+'/vertices'
    edgefiledir = inputfiledir+'/edges'
    personInput = File.directory(vertexfiledir)
    personEdgeInput = File.directory(edgefiledir)
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "author"
        key "name"
    }
    
    load(personEdgeInput).asEdges {
        label "knows"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }

  4. To run DSE Graph Loader for JSON loading from a directory, use the following command:

    graphloader dirSourceJSONMapping.groovy -graph testdirSource -address localhost

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
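Since a directory with differing files stops the load, a quick check before running graphloader may save a failed run. The sketch below is standalone Groovy using only java.nio.file. It assumes that differing file extensions are a reasonable proxy for "more than 1 input type"; this page does not say how DSE Graph Loader actually decides the input type. The directory and file names are illustrative.

```groovy
import java.nio.file.Files

// Hypothetical stand-in for the input directory, with illustrative file names.
def dir = Files.createTempDirectory('dirSource')
['author1.json', 'author2.json', 'notes.csv'].each { name ->
    Files.createFile(dir.resolve(name))
}

// Group file names by extension; more than one group suggests mixed inputs.
def byExtension = [:].withDefault { [] }
Files.newDirectoryStream(dir).withCloseable { stream ->
    stream.each { path ->
        def name = path.fileName.toString()
        def dot = name.lastIndexOf('.')
        byExtension[dot < 0 ? '' : name.substring(dot + 1)] << name
    }
}

if (byExtension.size() > 1) {
    println "Mixed file types in ${dir}: ${byExtension.keySet().sort()}"
}
```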

Mapping files from a directory using a file pattern

DSE Graph Loader can load several files from a directory using file pattern matching. Sample input files:

ls data
badOne.csv	person1.csv	person2.csv

A number of files with the same format exist in a directory. If the files differ, DSE Graph Loader loads only the files that match the pattern in the mapping script.

Several file patterns are defined for use:

Usable file patterns

| Pattern | Description | Example |
|---------|-------------|---------|
| * | Matches zero or more characters. While matching, it does not cross directory boundaries. | *.csv matches all files ending in .csv. |
| ** | Same as *, but crosses directory boundaries. | Matches CSV files in more than one directory. |
| ? | Matches exactly one character. | person?.csv matches person1.csv or personA.csv, but not person11.csv. |
| \ | Escapes characters that would otherwise be interpreted as special characters. | \\ matches a single \. |
| [ ] | Matches a single character from a set of designated characters. | [efg] matches "e", "f", or "g", so person[efg] matches persone, personf, or persong. [1-9] matches any single digit from 1 to 9, so person[1-9].csv matches person1.csv through person9.csv. |
| { } | Matches a group of sub-patterns. | *.{csv,json} matches all CSV and JSON files in the directory. |
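
The pattern descriptions above closely resemble the glob syntax of Java's java.nio.file.FileSystem.getPathMatcher. Assuming fileMatches follows the same rules (this page does not confirm it), patterns can be tried out in standalone Groovy before they go into a mapping script. The helper name globMatches and the test paths are hypothetical.

```groovy
import java.nio.file.FileSystems
import java.nio.file.Path
import java.nio.file.Paths

// Returns true when the path matches the pattern under Java NIO glob rules.
boolean globMatches(String pattern, Path path) {
    FileSystems.getDefault().getPathMatcher('glob:' + pattern).matches(path)
}

// Pattern and candidate path pairs covering rows of the table.
def cases = [
    ['*.csv',           Paths.get('person1.csv')],
    ['*.csv',           Paths.get('sub', 'person1.csv')],  // * stays within one directory
    ['**.csv',          Paths.get('sub', 'person1.csv')],  // ** crosses directories
    ['person?.csv',     Paths.get('personA.csv')],
    ['person?.csv',     Paths.get('person11.csv')],        // ? is exactly one character
    ['person[efg]',     Paths.get('persone')],
    ['person[1-9].csv', Paths.get('person15.csv')],
    ['*.{csv,json}',    Paths.get('author.json')],
]

cases.each { pattern, path ->
    println "${pattern} vs ${path}: ${globMatches(pattern, path)}"
}
```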

  • Mapping using *

  • If desired, add configuration to the mapping script.

  • Sample input file:

    /* SAMPLE CSV INPUT:
    id|name|gender
    001|Julia Child|F
    */

  • Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/filePattern'
    inputfileCSV = inputfiledir+'/data'
    personInput = File.directory(inputfileCSV).fileMatches("person*.csv").delimiter('|').header('id','name','gender')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "person"
        key "name"
    }
    
    /* RESULT:
       person1.csv and person2.csv will be loaded, but not badOne.csv
    */

    The important element is fileMatches("person*.csv"); this defines the pattern that will be matched for loaded files. The file badOne.csv will not be loaded, because the pattern does not match. Note that a file personExtra.csv would also be loaded, as it would match the pattern.

    This same pattern matching can be used for JSON input files, by substituting person*.json for person*.csv and using JSON input file parameters.

  • To run DSE Graph Loader for CSV loading from a directory, use the following command:

    graphloader filePatternCSV.groovy -graph testPattCSV -address localhost

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.

  • Mapping using [ ]

  • If desired, add configuration to the mapping script.

  • Sample input file:

    /* SAMPLE CSV INPUT:
    id|name|gender
    001|Julia Child|F
    */

  • Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/filePattern'
    inputfileCSV = inputfiledir+'/data'
    personInput = File.directory(inputfileCSV).fileMatches("person[1-9].csv").delimiter('|').header('id','name','gender')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "person"
        key "name"
    }
    
    /* RESULT:
       person1.csv and person2.csv will be loaded, but not badOne.csv
    */

    The important element is fileMatches("person[1-9].csv"); this defines the pattern that will be matched for loaded files. Any files named person1.csv through person9.csv will be loaded, but person15.csv and badOne.csv do not match the pattern and will not be loaded. Note that fileMatches("person?.csv") would achieve the same result for this directory, although ? matches any single character, so it would also load a file such as personA.csv.

    This same pattern matching can be used for JSON input files, by substituting person[1-9].json for person[1-9].csv and using JSON input file parameters.

  • To run DSE Graph Loader for this example, use the following command:

    graphloader filePatternRANGE.groovy -graph testPattRANGE -address localhost

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.

  • Mapping using { } with multiple patterns

  • If desired, add configuration to the mapping script.

  • Sample input file:

    /* SAMPLE CSV INPUT:
    id|name|gender
    001|Julia Child|F
    */

  • Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.

    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/filePattern/data'
    personInput = File.directory(inputfiledir).fileMatches("{person*,badOne}.csv").delimiter('|').header('id','name','gender')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(personInput).asVertices {
        label "person"
        key "name"
    }
    
    /* RESULT:
       person1.csv, person2.csv and badOne.csv will all be loaded
    */

    The important element is fileMatches("{person*,badOne}.csv"); this defines the pattern that will be matched for loaded files. The files person1.csv, person2.csv, and badOne.csv will be loaded, because the pattern matches all three files. This same pattern matching can be used for JSON input files, by substituting person*.json for person*.csv and using JSON input file parameters.

  • To run DSE Graph Loader for this example, use the following command:

    graphloader filePatternMULT.groovy -graph testPattMULT -address localhost

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
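The RESULT comments in the three examples above can be reproduced outside DSE Graph Loader. This standalone Groovy sketch recreates the sample directory listing in a temporary directory and applies each pattern with java.nio.file.Files.newDirectoryStream, which accepts a glob. It assumes fileMatches selects files the same way Java NIO globbing does, which this page does not state; the closure name selected is hypothetical.

```groovy
import java.nio.file.Files

// Recreate the sample listing (badOne.csv, person1.csv, person2.csv) as empty files.
def dir = Files.createTempDirectory('filePattern')
['badOne.csv', 'person1.csv', 'person2.csv'].each { Files.createFile(dir.resolve(it)) }

// Lists the file names in dir that match the given glob, sorted by name.
def selected = { String glob ->
    Files.newDirectoryStream(dir, glob).withCloseable { stream ->
        stream.collect { it.fileName.toString() }.sort()
    }
}

// The three patterns used in the fileMatches examples.
['person*.csv', 'person[1-9].csv', '{person*,badOne}.csv'].each { glob ->
    println "${glob} -> ${selected(glob)}"
}
```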
