Mapping script

Explain the main body of the mapping script.

Regardless of the file format selected, the main body of the mapping script is the same. After setting configuration and adding a data input source, the mapping commands are specified.

Procedure

  • Vertices are loaded from authorInput, with the vertex label author and the property key name which uniquely identifies a vertex listed for the key. Note that, in this example, if gender were chosen for the key, it would not be unique enough to load each record from the data file. Using the configuration setting load_new: true can significantly speed up the loading process, but a duplicate vertex will be created if the record already exists. All other property keys will be loaded, but do not have to be identified in the loading script. For author vertices, gender will also be loaded.
    load(authorInput).asVertices {
        label "author"
        key "name"
    }
    Note: If more than 256 property key values are present in the input file, see important information on the max_query_params value in the dse.yaml file.

    One load statement must be created for each vertex loaded, even if the same file is reused for one or more vertices. When using the same input file for multiple vertices, sometimes a field exists in the input file that should be ignored for a particular vertex. See the instructions for ignoring a field. If an input file includes multiple types of lines, for instance, authors and reviewers, that should be read into different vertex labels, see the instructions for labelField.

  • Loading the book vertices follows a similar pattern. Note that both vertex labels author and book use name as the unique key for identifying a vertex. This declares that the vertex record does not yet exist in the graph at the beginning of the loading process.
    load(bookInput).asVertices {
        label "book"
        key "name"
    }
  • After vertices are loaded, edges are loaded. Similar to the vertex mapping, an edge label is specified. In addition, the outgoing vertex (outV) and incoming vertex (inV) for the edge must be identified. For each vertex in outV or inV, the vertex label is specified with label, and the unique key is specified with key.
    load(authorBookInput).asEdges {
        label "authored"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }
    Note the naming convention used for the outV and inV designations. Because both the outgoing vertex and the incoming vertex keys are listed as name, the designators aname and bname are used to distinguish between the author name and the book name as the field names in the input file.
  • An alternative to the definitions shown above is to specify the mapping logic with variables, and then list the load statements separately.
    authorMapper = {
        label "author"
        key "name"}
    bookMapper = {
        label "book"
        key "name"
    }
    authorBookMapper = {
        label "authored"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }
    
    load(authorInput).asVertices(authorMapper)
    load(bookInput).asVertices(bookMapper)
    load(authorBookInput).asEdges(authorBookMapper)

Ignoring a field in input file

Mapping data while ignoring a field with DSE Graph Loader.

If the input file includes a field that should be ignored for a particular vertex load, use ignore.

Procedure

  1. Create a map script that ignores the field restaurant:
    // authorInput includes name, gender, and restaurant
    // but restaurant is not loaded
    /* Sample input:
    name|gender|restaurant
    Alice Waters|F|Chez Panisse
    */
    
    load(authorInput).asVertices {
        label "author"
        key "name"
        ignore "restaurant"
    }
  2. An additional example shows the use of ignore where two different types of vertices are created, book and author, using the same input file.
    /* Sample input:
    name|gender|bname
    Julia Child|F|The French Chef Cookbook
    Simone Beck|F|The Art of French Cooking, Vol. 1
    */
    
    //inputfiledir = '/tmp/TEXT/'
    authorInput = File.text("author.dat").
        delimiter("|").
        header('name', 'gender','bname')
    
    //Specifies what data source to load using which mapper (as defined inline)
      
    load(authorInput).asVertices {
        label "book"
        key "bname"
        ignore "name"
        ignore "gender"
    }
    
    load(authorInput).asVertices {
        label "author"
        key "name"
        outV "book", "authored", {
            label "book"
            key "bname"
       }
    }

Using labelField to parse input into different vertex labels

Mapping data using labelField to parse input into different vertex labels with DSE Graph Loader.

Oftentimes, an input file includes a field that is used to identify the vertex label. In order to load the file and create different vertex labels on-the-fly, labelField is used to identify that particular field.

Procedure

Create a map to input both authors and reviewers from the same file using labelField:
/* SAMPLE INPUT
The input personInput includes type of person, name, gender; type can be either author or reviewer.
type::name::gender 
author::Julia Child::F
reviewer::Jane Doe::F
*/

personInput = File.text('people.dat').
    delimiter("::").
    header('type','name','gender')


load(personInput).asVertices{
        labelField "type"
        key "name"
}
Running this map script using the sample data results in two different vertex labels, with one record for each.
g.V().hasLabel('author').valueMap()
{gender=[F], name=[Julia Child]}
g.V().hasLabel('reviewer').valueMap()
{gender=[F], name=[Jane Doe]}

Using compressed files to load data

Mapping compressed data with DSE Graph Loader.

Compressed files can be loaded using DSE Graph Loader to load both vertices and edges. This example loads vertices and edges, as well as edge properties, using gzipped files.

Procedure

Create a map script that specifies the input files as compressed *.gz files:
/* SAMPLE INPUT
rev_name|recipe_name|timestamp|stars|comment
John Doe|Beef Bourguignon|2014-01-01|5|comment
 */

// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: false, load_new: false

// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files that is given in the commandline
// as the "-filename" option
inputfiledir = '/tmp/CSV/'
// This next file is not required if the reviewers already exist
reviewerInput = File.csv(inputfiledir + "reviewers.csv.gz").
    gzip().
    delimiter('|')
// This next file is not required if the recipes already exist
recipeInput = File.csv(inputfiledir +"recipes.csv.gz").
    gzip().
    delimiter('|')
// This is the file that is used to create the edges with edge properties
reviewerRatingInput = File.csv(inputfiledir + "reviewerRatings.csv.gz").
    gzip().
    delimiter('|')

//Specifies what data source to load using which mapper (as defined inline)

load(reviewerInput).asVertices {
    label "reviewer"
    key "name"
}

load(recipeInput).asVertices {
    label "recipe"
    key "name"
}

load(reviewerRatingInput).asEdges {
    label "rated"
    outV "rev_name", {
        label "reviewer"
        key "name"
    }
    inV "recipe_name", {
        label "recipe"
        key "name"
    }
   // properties are automatically added from the file, using the header line as property keys
   // from previously created schema
}

The compressed files are designated as .gz files, followed by a gzip() step for processing. Edge properties are loaded from one of the input files based on the header identifying the property keys to use for the values listed in each line of the CSV file. The edge properties populate a rated edge between a reviewer vertex and a recipe vertex with the properties timestamp, stars, and comment.

Mapping data with a composite custom id

Mapping data with a composite custom id with DSE Graph Loader.

Data with a composite primary key requires some additional definition when specifying the key for loading, if the custom id uses multiple keys for definition (either partitionKeys and/or clusteringKeys).

Procedure

  1. Inserting data for vertices with a composite custom id requires the declaration of two or more keys:
    /* SAMPLE INPUT
    cityId|sensorId|fridgeItem
    santaCruz|93c4ec9b-68ff-455e-8668-1056ebc3689f|asparagus
     */
      
    load(fridgeItemInput).asVertices {
        label "fridgeSensor"
        // The vertexLabel schema for fridgeSensor includes two keys:
        // partition key: cityId and clustering key: sensorId 
        key cityId: "cityId", sensorId: "sensorId"
    }
    Tip: The schema for the composite custom id must be created prior to using DSE Graph Loader, and cannot be inferred from the data. In addition, create a search index that includes all properties in the composite key. The search index is required to use DSE Graph Loader for inserting composite custom id data.
    Check the vertex id results with id() to retrieve the full primary key definition:
    gremlin> g.V().hasLabel('fridgeSensor').id()
    ==>{~label=fridgeSensor, sensorId=93c4ec9b-68ff-455e-8668-1056ebc3689f, cityId=santaCruz}
    ==>{~label=fridgeSensor, sensorId=9c23b683-1de2-4c97-a26a-277b3733732a, cityId=sacramento}
    ==>{~label=fridgeSensor, sensorId=eff4a8af-2b0d-4ba9-a063-c170130e2d84, cityId=sacramento}
    Each vertex stores fridgeItem as data:
    gremlin> g.V().valueMap()
    ==>{fridgeItem=[asparagus]}
    ==>{fridgeItem=[ham]}
    ==>{fridgeItem=[eggs]}
  2. To load edges based on a composite key, a transformation is required:
    /* SAMPLE EDGE DATA
    cityId|sensorId|name
    santaCruz|93c4ec9b-68ff-455e-8668-1056ebc3689f|asparagus
    */
    the_edges = File.csv(inputfiledir + "fridgeItemEdges.csv").delimiter('|')
    
    the_edges = the_edges.transform {
        it['fridgeSensor'] = [
                'cityId' : it['cityId'],
                'sensorId' : it['sensorId'] ];
        it['ingredient'] = [
                'name' : it['name'] ];
        it
    }
    load(the_edges).asEdges  {
        label "contains"
        outV "ingredient", {
            label "ingredient"
            key "name"
        }
        inV "fridgeSensor", {
            label "fridgeSensor"
            key cityId:"cityId", sensorId:"sensorId"
        }
    }
    

    The edge file transforms the partition key and clustering key into a map of cityId and sensorId. This map can then be used to designate the key for a fridgeSensor vertex when the edges are loaded.

    The resulting map shows the edges created between ingredient and fridgeSensor vertices.
  3. For DSE 5.1.3 and later, an alternative method of loading edge data from CSV files can be used:
    /* SAMPLE EDGE DATA
    cityId|sensorId|homeId
    100|001|9001
    */
    
    isLocatedAt_fridgeSensor = File.csv(/tmp/data/edges/" + "isLocatedAt_fridgeSensor.csv").delimiter('|')
    
    load(isLocatedAt_fridgeSensor).asEdges {
        label "isLocatedAt"
        outV {
            label "fridgeSensor"
            key cityId: "cityId", sensorId: "sensorId"
            exists()
            ignore "homeId"
        }
        inV {
            label "home"
            key "homeId"
            exists()
            ignore "cityId"
            ignore "sensorId"
        }
        ignore "cityId"
        ignore "sensorId"
        ignore "homeId"
    }
    In this example, no transform is required, but ignore statements are required in both the inV and outV declarations, as well as the edge properties section. Removing the exists() statement in the incoming and outgoing vertex declarations can enable loading the vertices as well as the edges in this mapping script.
    Important: There is a new subtle change in the inV and outV declarations. An input field name is no longer used, such as inV "home", {, due to the requirement to support multiple-key custom ids.
    The resulting map:

Mapping multi-cardinality edges

Mapping multi-cardinality edges data with DSE Graph Loader.

Multiple cardinality edges are a common type of data that is inserted into graphs. Often, the input file has both vertex and edge information for loading.

Procedure

Inserting vertices and multi-cardinality edges can be accomplished from one file with judicious use of ignore while loading vertices:
/* SAMPLE INPUT
authorCity: 
author|city|dateStart|dateEnd
Julia Child|Paris|1961-01-01|1967-02-10
*/

// CONFIGURATION
// Configures the data loader to create the schema
config dryrun: false, preparation: true, create_schema: false, load_new: true, schema_output: 'loader_output.txt'

// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/multiCard/'
authorCityInput = File.csv(inputfiledir + "authorCity.csv").delimiter('|')

//Specifies what data source to load using which mapper (as defined inline)

// Ignore city, dateStart, and dateEnd when creating author vertices
load(authorCityInput).asVertices {
    label "author"
    key "author"
    ignore "city"
    ignore "dateStart"
    ignore "dateEnd"
}

// Ignore author, dateStart, and dateEnd when creating city vertices

load(authorCityInput).asVertices {
    label "city"
    key "city"
    ignore "author"
    ignore "dateStart"
    ignore "dateEnd"
}

// create edges from author -> city and include the edge properties dateStart and dateEnd
load(authorCityInput).asEdges {
    label "livedIn"
    outV "author", {
        label "author"
        key "author"
    }
    inV "city", {
        label "city"
        key "city"
    }
}

Mapping meta-properties

Mapping meta-property data with DSE Graph Loader.

If the input file includes meta-properties, or properties that have properties, use vertexProperty.

The schema for this data load should be created prior to running graphloader
// PROPERTY KEYS
schema.propertyKey('name').Text().single().create()
schema.propertyKey('gender').Text().single().create()
schema.propertyKey('badge').Text().single().create()
schema.propertyKey('since').Int().single().create()
// Create the meta-property since on the property badge
schema.propertyKey('badge').properties('since').add()
// VERTEX LABELS
schema.vertexLabel('reviewer').properties('name','gender','badge').create()
// INDEXES
schema.vertexLabel('reviewer').index('byname').materialized().by('name').add()

Procedure

The mapping script uses vertexProperty to identify badge as a vertex property. Note the structure of the nested fields for badge in the JSON file.
* SAMPLE INPUT
reviewer: { "name":"Jon Doe", "gender":"M", "badge" : { "value": "Gold Badge","since" : 2012 } }
 */

// CONFIGURATION
// Configures the data loader to create the schema
config dryrun: false, preparation: true, create_schema: true, load_new: true, load_vertex_threads: 3, schema_output: 'loader_output.txt'

// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/'
reviewerInput = File.json(inputfiledir + "reviewer.json")

//Specifies what data source to load using which mapper (as defined inline)
  
load(reviewerInput).asVertices{
        label "reviewer"
        key "name"
        vertexProperty "badge", {
                value "value"
        }
}
Running this mapping script using the sample data results in a reviewer vertex where the property badge has a meta-property since.
g.V().valueMap()
{badge=[Gold Badge], gender=[M], name=[Jane Doe]}
g.V().properties('badge').valueMap()
{since=2012}

Mapping multiple meta-properties

Mapping multiple meta-property data with DSE Graph Loader.

If the input file includes multiple meta-properties, or properties that have multiple properties, use vertexProperty.

The schema for this data load should be created prior to running graphloader
// PROPERTY KEYS
schema.propertyKey('badge').Text().multiple().create()
schema.propertyKey('gender').Text().single().create()
schema.propertyKey('name').Text().single().create()
schema.propertyKey('since').Int().single().create()

// VERTEX LABELS
schema.vertexLabel('reviewer').properties('name', 'gender', 'badge').create()
schema.propertyKey('badge').properties('since').add()

// INDEXES
schema.vertexLabel('reviewer').index('byname').materialized().by('name').add()

Procedure

The mapping script uses vertexProperty to identify badge as a vertex property. Note the structure of the nested fields for badge in the JSON file.
/* SAMPLE INPUT
reviewer: { "name":"Jane Doe", "gender":"F", 
    "badge" : [{ "value": "Gold Badge", "since" : 2012 },
               { "value": "Silver Badge", "since" : 2005 }] }
 */

// CONFIGURATION
// Configures the data loader to create the schema
config dryrun: false, preparation: true, create_schema: false, load_new: true, load_vertex_threads: 3, schema_output: 'loader_output.txt'

// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/'
reviewerInput = File.json(inputfiledir + "reviewerMultiMeta.json")

//Specifies what data source to load using which mapper (as defined inline)
  
load(reviewerInput).asVertices{
        label "reviewer"
        key "name"
        vertexProperty "badge", {
                value "value"
        }
}
Optionally, the data can be loaded from a CSV file if a transform is used before loading:
/* SAMPLE INPUT
name|gender|value|since
Jane Doe|F|Gold Badge|2011
Jane Doe|F|Silver Badge|2005
Jon Doe|M|Gold Badge|2012
 */

// CONFIGURATION
// Configures the data loader to create the schema
config dryrun: false, preparation: true, create_schema: false, load_new: true, load_vertex_threads: 3, schema_output: 'loader_output.txt'

// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/'
reviewerInput = File.csv(inputfiledir + "reviewerMultiMeta.csv").delimiter('|')

//Specifies what data source to load using which mapper (as defined inline)
reviewerInput = reviewerInput.transform {
  badge1 = [
    "value": it.remove("value"),
    "since": it.remove("since") ]
  it["badge"] = [badge1]
  it
}

load(reviewerInput).asVertices{
	label "reviewer"
	key "name"
	vertexProperty "badge", {
		value "value"
	}
}
Running this mapping script using the sample data results in a reviewer vertex where the property badge has multiple values.
Choosing the pop-up link for badge reveals the meta-property values:

Mapping geospatial and Cartesian data

Mapping geospatial and Cartesian data with DSE Graph Loader.

Geospatial and Cartesian data can be loaded with DSE Graph Loader. The DSE Graph Loader is not capable of creating schema for geospatial and Cartesian data, so schema must be created before loading and the create_schema configuration must be set to false.

An example of geospatial schema for the example:
//SCHEMA
schema.propertyKey('name').Text().create()
schema.propertyKey('point').Point().withGeoBounds().create()
schema.vertexLabel('location').properties('name','point').create()
schema.propertyKey('line').Linestring().withGeoBounds().create()
schema.vertexLabel('lineLocation').properties('name','line').create()
schema.propertyKey('polygon').Polygon().withGeoBounds().create()
schema.vertexLabel('polyLocation').properties('name','polygon').create()

schema.vertexLabel('location').index('byname').materialized().by('name').add()
schema.vertexLabel('lineLocation').index('byname').materialized().by('name').add()
schema.vertexLabel('polyLocation').index('byname').materialized().by('name').add()
schema.vertexLabel('location').index('search').search().by('point').add()
schema.vertexLabel('lineLocation').index('search').search().by('line').add()
schema.vertexLabel('polyLocation').index('search).search().by('polygon').add()
Search indexes must be used for geospatial and Cartesian points, linestrings or polygons in graph queries. DSE Graph uses one index per query, and because geospatial data consists of latitude and longitude (two parameters), only search indexes can be used to optimize query performance.

Procedure

  1. If desired, add configuration to the mapping script.
  2. Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
    /* SAMPLE DATA
    name|point
    New York|POINT(74.0059 40.7128)
    Paris|POINT(2.3522 48.8566)
    */
    
    // DATA INPUT
    // Define the data input source (a file which can be specified via command line arguments)
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/geo_dgl/data/'
    ptsInput = File.csv(inputfiledir + "vertices/place.csv").delimiter('|')
    linesInput = File.csv(inputfiledir + "vertices/place_lines.csv").delimiter('|')
    polysInput = File.csv(inputfiledir + "vertices/place_polys.csv").delimiter('|')
    
    //Specifies what data source to load using which mapper (as defined inline)
    
    load(ptsInput).asVertices {
        label "location"
        key "name"
    }
    
    import com.datastax.driver.dse.geometry.Point
    ptsInput = ptsInput.transform {
      it['point'] = Point.fromWellKnownText(it['point']);
      return it;
    }
    
    load(linesInput).asVertices {
        label "lineLocation"
        key "name"
    }
    import com.datastax.driver.dse.geometry.LineString
    linesInput = linesInput.transform {
      it['line'] = LineString.fromWellKnownText(it['line']);
      return it;
    }
    
    load(polysInput).asVertices {
        label "polyLocation"
        key "name"
    }
    
    import com.datastax.driver.dse.geometry.Polygon
    polysInput = polysInput.transform {
      it['polygon'] = Polygon.fromWellKnownText(it['polygon']);
      return it;
    }
    A transformation of the input data is required, converting the point from the WKT format into the format DSE Graph stores. For a point, the transformation imports a Point library and uses the fromWellKnownText method:
    
    import com.datastax.driver.dse.geometry.Point
    ptsInput = ptsInput.transform {
      it['point'] = Point.fromWellKnownText(it['point']);
      return it;
    }
    Linestrings and polygons use the same library and method, respectively.
  3. To run DSE Graph Loader for CSV loading from a directory, use the following command:
    graphloader geoMap.groovy -graph testGeo -address localhost

Mapping Gryo data generated from DSE Graph

Inserting Gryo binary data requires a slightly modified map script. To load Gryo data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.

Procedure

Create a map script for DSE Graph generated Gryo input:
inputfiledir = '/tmp/Gryo/'
recipeInput = com.datastax.dsegraphloader.api.Graph.file(inputfiledir + 'recipesDSEG.gryo').gryo().dse()  
load(recipeInput.vertices()).asVertices {
    labelField '~label'
    key 'name'
}

load(recipeInput.edges()).asEdges {
    labelField '~label'
    outV 'outV', {
        labelField '~label'
        key 'name' : 'name', 'personId' : 'personId'
    }
    inV 'inV', {
        labelField '~label'
        key 'name' : 'name', 'bookId' : 'bookId'
    }
}

The Gryo data format will include ~label and name field values that must be used to create the vertices. For instance, a record that is an author will have a ~label of person and property name. For the edges, notice that a user-defined vertex ID consisting of both name and bookId is used to identify the vertex to use as the incoming vertex for the edge.

Mapping Gryo data generated with TinkerGraph

Inserting Gryo binary data requires a slightly modified map script. To load Gryo data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.

Procedure

Create a map script for TinkerGraph generated Gryo input:
//Specifies what data source to load using which mapper (as defined inline)
  
load(recipeInput.vertices()).asVertices {
    labelField "~label"
    key "~id", "id"
}

load(recipeInput.edges()).asEdges {
    labelField "~label"
    outV "outV", {
        labelField "~label"
        key "~id", "id"
    }
    inV "inV", {
        labelField "~label"
        key "~id", "id"
    }
}

The Gryo data format will include ~label and name field values that must be used to create the vertices and edges. For instance, a record that is an author will have a ~label of author and property name. The vertexKeyMap creates a map of each vertex label to a unique property. This map is used to create unique keys used while loading vertices from the binary file.

Mapping GraphML binary data

Mapping GraphML data with DSE Graph Loader.

Inserting GraphML binary data requires a slightly modified map script. To load GraphML data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.

Procedure

Create a map script for GraphML data:
//Specifies what data source to load using which mapper (as defined inline)
  
load(recipeInput.vertices()).asVertices {
    labelField "~label"
    key "~id", "id"
}

load(recipeInput.edges()).asEdges {
    labelField "~label"
    outV "outV", {
        labelField "~label"
        key "~id", "id"
    }
    inV "inV", {
        labelField "~label"
        key "~id", "id"
    }
}
The GraphML data format will include ~label and ~id field values that must be used to create the label and key for each record loaded. For instance, a record that is an author will have a ~label of author. The ~id will similarly be set in the record, a difference from other data. The difference can be seen by looking at a record and noting the presence of the id field, based on the second item in each key setting in the mapping script:
g.V().hasLabel('author').valueMap()
{gender=[F], name=[Julia Child], id=[0]}
{gender=[F], name=[Simone Beck], id=[3]}

Mapping GraphSON binary data

Mapping GraphSON data with DSE Graph Loader.

Inserting GraphSON data requires a slightly modified map script. To load GraphSON data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.

Procedure

Create a map script for GraphML data:
//Specifies what data source to load using which mapper (as defined inline)
  
load(recipeInput.vertices()).asVertices {
    labelField "~label"
    key "~id", "id"
}

load(recipeInput.edges()).asEdges {
    labelField "~label"
    outV "outV", {
        labelField "~label"
        key "~id", "id"
    }
    inV "inV", {
        labelField "~label"
        key "~id", "id"
    }
}
The GraphSON data format will include ~label and ~id field values that must be used to create the label and key for each record loaded. For instance, a record that is an author will have a ~label of author. The ~id will similarly be set in the record, a difference from other data. The difference can be seen by looking at a record and noting the presence of the id field, based on the second item in each key setting in the mapping script:
g.V().hasLabel('author').valueMap()
{gender=[F], name=[Julia Child], id=[0]}
{gender=[F], name=[Simone Beck], id=[3]}