Mapping script
Explain the main body of the mapping script.
Regardless of the file format selected, the main body of the mapping script is the same. After setting configuration and adding a data input source, the mapping commands are specified.
Procedure
-
Vertices are loaded from authorInput, with the vertex label author and the property key name, which uniquely identifies a vertex, listed for the key. Note that, in this example, if gender were chosen for the key, it would not be unique enough to load each record from the data file. Using the configuration setting load_new: true can significantly speed up the loading process, but a duplicate vertex will be created if the record already exists. All other property keys will be loaded, but do not have to be identified in the loading script. For author vertices, gender will also be loaded.
load(authorInput).asVertices {
    label "author"
    key "name"
}
Note: If more than 256 property key values are present in the input file, see important information on the max_query_params value in the dse.yaml file.
One load statement must be created for each vertex loaded, even if the same file is reused for one or more vertices. When using the same input file for multiple vertices, sometimes a field exists in the input file that should be ignored for a particular vertex. See the instructions for ignoring a field. If an input file includes multiple types of lines, for instance, authors and reviewers, that should be read into different vertex labels, see the instructions for labelField.
-
Loading the book vertices follows a similar pattern. Note that both vertex labels author and book use name as the unique key for identifying a vertex. This declares that the vertex record does not yet exist in the graph at the beginning of the loading process.
load(bookInput).asVertices {
    label "book"
    key "name"
}
-
After vertices are loaded, edges are loaded. Similar to the vertex mapping, an edge label is specified. In addition, the outgoing vertex (outV) and incoming vertex (inV) for the edge must be identified. For each vertex in outV or inV, the vertex label is specified with label, and the unique key is specified with key.
load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
Note the naming convention used for the outV and inV designations. Because both the outgoing vertex and the incoming vertex keys are listed as name, the designators aname and bname are used to distinguish between the author name and the book name as the field names in the input file.
-
An alternative to the definitions shown above is to specify the mapping logic with variables, and then list the load statements separately.
authorMapper = {
    label "author"
    key "name"
}
bookMapper = {
    label "book"
    key "name"
}
authorBookMapper = {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
load(authorInput).asVertices(authorMapper)
load(bookInput).asVertices(bookMapper)
load(authorBookInput).asEdges(authorBookMapper)
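Putting the pieces together, a complete mapping script might look like the following sketch: configuration first, then the data input sources, then the mapping commands. The directory and file names are hypothetical, and the config values are only one possible choice; the load statements are the ones described in this procedure. The script is meant to be run by graphloader, not as standalone Groovy.

```groovy
// CONFIGURATION
// One possible choice; load_new: true assumes none of the records exist in the graph yet.
config create_schema: true, load_new: true

// DATA INPUT
// Hypothetical directory and file names; replace with the real input files.
inputfiledir = '/tmp/CSV/'
authorInput = File.csv(inputfiledir + 'authors.csv').delimiter('|')
bookInput = File.csv(inputfiledir + 'books.csv').delimiter('|')
// This file is assumed to carry the fields aname and bname.
authorBookInput = File.csv(inputfiledir + 'authorBook.csv').delimiter('|')

// MAPPING
// Vertices are loaded before the edges that connect them.
load(authorInput).asVertices {
    label "author"
    key "name"
}
load(bookInput).asVertices {
    label "book"
    key "name"
}
load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
```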
Ignoring a field in input file
Mapping data while ignoring a field with DSE Graph Loader.
If the input file includes a field that should be ignored for a particular vertex load, use ignore.
Procedure
-
Create a map script that ignores the field restaurant:
// authorInput includes name, gender, and restaurant
// but restaurant is not loaded
/* Sample input:
name|gender|restaurant
Alice Waters|F|Chez Panisse
*/
load(authorInput).asVertices {
    label "author"
    key "name"
    ignore "restaurant"
}
-
An additional example shows the use of ignore where two different types of vertices are created, book and author, using the same input file.
/* Sample input:
name|gender|bname
Julia Child|F|The French Chef Cookbook
Simone Beck|F|The Art of French Cooking, Vol. 1
*/
//inputfiledir = '/tmp/TEXT/'
authorInput = File.text("author.dat").
    delimiter("|").
    header('name', 'gender','bname')
//Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "book"
    key "bname"
    ignore "name"
    ignore "gender"
}
load(authorInput).asVertices {
    label "author"
    key "name"
    outV "book", "authored", {
        label "book"
        key "bname"
    }
}
Using labelField to parse input into different vertex labels
Mapping data using labelField to parse input into different vertex labels with DSE Graph Loader.
Often, an input file includes a field that identifies the vertex label. To load the file and create different vertex labels on the fly, labelField is used to identify that particular field.
Procedure
Use labelField to identify the field that holds the vertex label:
/* SAMPLE INPUT
The input personInput includes type of person, name, gender; type can be either author or reviewer.
type::name::gender
author::Julia Child::F
reviewer::Jane Doe::F
*/
personInput = File.text('people.dat').
delimiter("::").
header('type','name','gender')
load(personInput).asVertices{
labelField "type"
key "name"
}
g.V().hasLabel('author').valueMap()
{gender=[F], name=[Julia Child]}
g.V().hasLabel('reviewer').valueMap()
{gender=[F], name=[Jane Doe]}
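To make the effect of labelField concrete, the following plain-Groovy sketch mimics the idea outside of DSE Graph Loader: it parses the sample lines, groups the records by the type field, and keeps the remaining fields as properties. It only illustrates the concept and is not how the loader is implemented; the variable names are hypothetical.

```groovy
// Plain-Groovy illustration only; this is not DSE Graph Loader code.
def lines = [
    'type::name::gender',
    'author::Julia Child::F',
    'reviewer::Jane Doe::F'
]

// The first line is the header; the remaining lines are records.
def header = lines[0].split('::') as List
def records = lines.drop(1).collect { line ->
    // Pair each header name with the value in the same position.
    [header, line.split('::') as List].transpose().collectEntries()
}

// Group the records by the label field, then drop that field from the
// properties, mirroring the query results above (only name and gender remain).
def labelField = 'type'
def byLabel = records.groupBy { it[labelField] }.collectEntries { label, recs ->
    [(label): recs.collect { it.findAll { k, v -> k != labelField } }]
}

println byLabel
```

Each key of byLabel corresponds to one vertex label, and each entry under it to one vertex with that label.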
Using compressed files to load data
Mapping compressed data with DSE Graph Loader.
Compressed files can be loaded using DSE Graph Loader to load both vertices and edges. This example loads vertices and edges, as well as edge properties, using gzipped files.
Procedure
/* SAMPLE INPUT
rev_name|recipe_name|timestamp|stars|comment
John Doe|Beef Bourguignon|2014-01-01|5|comment
*/
// CONFIGURATION
// Configures the data loader; with create_schema: false, the schema must already exist
config create_schema: false, load_new: false
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files that is given in the commandline
// as the "-filename" option
inputfiledir = '/tmp/CSV/'
// This next file is not required if the reviewers already exist
reviewerInput = File.csv(inputfiledir + "reviewers.csv.gz").
gzip().
delimiter('|')
// This next file is not required if the recipes already exist
recipeInput = File.csv(inputfiledir +"recipes.csv.gz").
gzip().
delimiter('|')
// This is the file that is used to create the edges with edge properties
reviewerRatingInput = File.csv(inputfiledir + "reviewerRatings.csv.gz").
gzip().
delimiter('|')
//Specifies what data source to load using which mapper (as defined inline)
load(reviewerInput).asVertices {
label "reviewer"
key "name"
}
load(recipeInput).asVertices {
label "recipe"
key "name"
}
load(reviewerRatingInput).asEdges {
label "rated"
outV "rev_name", {
label "reviewer"
key "name"
}
inV "recipe_name", {
label "recipe"
key "name"
}
// properties are automatically added from the file, using the header line as property keys
// from previously created schema
}
The compressed files are designated as .gz files, followed by a gzip() step for processing. Edge properties are loaded from one of the input files based on the header identifying the property keys to use for the values listed in each line of the CSV file. The edge properties populate a rated edge between a reviewer vertex and a recipe vertex with the properties timestamp, stars, and comment.
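As a rough illustration of how the header line determines the edge properties, the plain-Groovy sketch below writes the sample line to a gzipped temporary file, reads it back, and separates the two vertex key fields from the remaining fields. It uses only the JDK gzip streams and is not DSE Graph Loader code; note that File here is java.io.File, not the loader's File.csv input source, and the variable names are hypothetical.

```groovy
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Write a small pipe-delimited, gzip-compressed file to a temporary location.
def gz = File.createTempFile('reviewerRatings', '.csv.gz')
gz.deleteOnExit()
new GZIPOutputStream(new FileOutputStream(gz)).withWriter('UTF-8') { w ->
    w << 'rev_name|recipe_name|timestamp|stars|comment\n'
    w << 'John Doe|Beef Bourguignon|2014-01-01|5|comment\n'
}

// Read the file back through a gzip stream, as a gzip() step would need to do.
def rows = new GZIPInputStream(new FileInputStream(gz)).withReader('UTF-8') { r ->
    r.readLines()
}

// The header supplies the field names for every following line.
def header = rows[0].split('\\|') as List
def record = [header, rows[1].split('\\|') as List].transpose().collectEntries()

// rev_name and recipe_name identify the two vertices; the rest become edge properties.
def edgeProperties = record.findAll { k, v -> !(k in ['rev_name', 'recipe_name']) }

println edgeProperties
```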
Mapping data with a composite custom id
Mapping data with a composite custom id with DSE Graph Loader.
Data with a composite custom id, that is, a custom id defined with multiple keys (partitionKeys, clusteringKeys, or both), requires some additional definition when specifying the key for loading.
Procedure
-
Inserting data for vertices with a composite custom id requires the declaration of two or more keys:
/* SAMPLE INPUT
cityId|sensorId|fridgeItem
santaCruz|93c4ec9b-68ff-455e-8668-1056ebc3689f|asparagus
*/
load(fridgeItemInput).asVertices {
    label "fridgeSensor"
    // The vertexLabel schema for fridgeSensor includes two keys:
    // partition key: cityId and clustering key: sensorId
    key cityId: "cityId", sensorId: "sensorId"
}
Tip: The schema for the composite custom id must be created prior to using DSE Graph Loader, and cannot be inferred from the data. In addition, create a search index that includes all properties in the composite key. The search index is required to use DSE Graph Loader for inserting composite custom id data.
Check the vertex id results with id() to retrieve the full primary key definition:
gremlin> g.V().hasLabel('fridgeSensor').id()
==>{~label=fridgeSensor, sensorId=93c4ec9b-68ff-455e-8668-1056ebc3689f, cityId=santaCruz}
==>{~label=fridgeSensor, sensorId=9c23b683-1de2-4c97-a26a-277b3733732a, cityId=sacramento}
==>{~label=fridgeSensor, sensorId=eff4a8af-2b0d-4ba9-a063-c170130e2d84, cityId=sacramento}
Each vertex stores fridgeItem as data:
gremlin> g.V().valueMap()
==>{fridgeItem=[asparagus]}
==>{fridgeItem=[ham]}
==>{fridgeItem=[eggs]}
-
To load edges based on a composite key, a transformation is required:
/* SAMPLE EDGE DATA
cityId|sensorId|name
santaCruz|93c4ec9b-68ff-455e-8668-1056ebc3689f|asparagus
*/
the_edges = File.csv(inputfiledir + "fridgeItemEdges.csv").delimiter('|')
the_edges = the_edges.transform {
    it['fridgeSensor'] = [
        'cityId' : it['cityId'],
        'sensorId' : it['sensorId'] ];
    it['ingredient'] = [
        'name' : it['name'] ];
    it
}
load(the_edges).asEdges {
    label "contains"
    outV "ingredient", {
        label "ingredient"
        key "name"
    }
    inV "fridgeSensor", {
        label "fridgeSensor"
        key cityId:"cityId", sensorId:"sensorId"
    }
}
The edge file transforms the partition key and clustering key into a map of cityId and sensorId. This map can then be used to designate the key for a fridgeSensor vertex when the edges are loaded.
The resulting map shows the edges created between ingredient and fridgeSensor vertices.
-
For DSE 5.1.3 and later, an alternative method of loading edge data from CSV files can be used:
/* SAMPLE EDGE DATA
cityId|sensorId|homeId
100|001|9001
*/
isLocatedAt_fridgeSensor = File.csv("/tmp/data/edges/" + "isLocatedAt_fridgeSensor.csv").delimiter('|')
load(isLocatedAt_fridgeSensor).asEdges {
    label "isLocatedAt"
    outV {
        label "fridgeSensor"
        key cityId: "cityId", sensorId: "sensorId"
        exists()
        ignore "homeId"
    }
    inV {
        label "home"
        key "homeId"
        exists()
        ignore "cityId"
        ignore "sensorId"
    }
    ignore "cityId"
    ignore "sensorId"
    ignore "homeId"
}
In this example, no transform is required, but ignore statements are required in both the inV and outV declarations, as well as the edge properties section. Removing the exists() statement in the incoming and outgoing vertex declarations can enable loading the vertices as well as the edges in this mapping script.
Important: There is a subtle change in the inV and outV declarations. An input field name, such as inV "home", {, is no longer used, due to the requirement to support multiple-key custom ids.
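For the transform-based approach to composite-key edges described earlier in this procedure, it can help to see what the closure does to a single record. The plain-Groovy sketch below applies the same closure body to a hand-built map standing in for one parsed input line. It is an illustration outside of DSE Graph Loader, and the names record and addVertexKeys are hypothetical.

```groovy
// Plain-Groovy illustration; the map stands in for one parsed line of the edge file.
def record = [
    cityId  : 'santaCruz',
    sensorId: '93c4ec9b-68ff-455e-8668-1056ebc3689f',
    name    : 'asparagus'
]

// Same closure body as the transform in the mapping script.
def addVertexKeys = {
    // Nested map holding both parts of the composite key for the fridgeSensor vertex.
    it['fridgeSensor'] = ['cityId': it['cityId'], 'sensorId': it['sensorId']]
    // Nested map holding the single key for the ingredient vertex.
    it['ingredient'] = ['name': it['name']]
    it
}

def transformed = addVertexKeys(record)
println transformed
```

After the transform, the record carries two extra fields, fridgeSensor and ingredient, which are the field names that the inV and outV declarations refer to.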
Mapping multi-cardinality edges
Mapping multi-cardinality edges data with DSE Graph Loader.
Multiple cardinality edges are a common type of data that is inserted into graphs. Often, the input file has both vertex and edge information for loading.
Procedure
Use ignore while loading vertices:
/* SAMPLE INPUT
authorCity:
author|city|dateStart|dateEnd
Julia Child|Paris|1961-01-01|1967-02-10
*/
// CONFIGURATION
// Configures the data loader; with create_schema: false, the schema must already exist
config dryrun: false, preparation: true, create_schema: false, load_new: true, schema_output: 'loader_output.txt'
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/multiCard/'
authorCityInput = File.csv(inputfiledir + "authorCity.csv").delimiter('|')
//Specifies what data source to load using which mapper (as defined inline)
// Ignore city, dateStart, and dateEnd when creating author vertices
load(authorCityInput).asVertices {
label "author"
key "author"
ignore "city"
ignore "dateStart"
ignore "dateEnd"
}
// Ignore author, dateStart, and dateEnd when creating city vertices
load(authorCityInput).asVertices {
label "city"
key "city"
ignore "author"
ignore "dateStart"
ignore "dateEnd"
}
// create edges from author -> city and include the edge properties dateStart and dateEnd
load(authorCityInput).asEdges {
label "livedIn"
outV "author", {
label "author"
key "author"
}
inV "city", {
label "city"
key "city"
}
}
Mapping meta-properties
Mapping meta-property data with DSE Graph Loader.
If the input file includes meta-properties, or properties that have properties, use vertexProperty.
graphloader
// PROPERTY KEYS
schema.propertyKey('name').Text().single().create()
schema.propertyKey('gender').Text().single().create()
schema.propertyKey('badge').Text().single().create()
schema.propertyKey('since').Int().single().create()
// Create the meta-property since on the property badge
schema.propertyKey('badge').properties('since').add()
// VERTEX LABELS
schema.vertexLabel('reviewer').properties('name','gender','badge').create()
// INDEXES
schema.vertexLabel('reviewer').index('byname').materialized().by('name').add()
Procedure
Use vertexProperty to identify badge as a vertex property. Note the structure of the nested fields for badge in the JSON file.
/* SAMPLE INPUT
reviewer: { "name":"Jon Doe", "gender":"M", "badge" : { "value": "Gold Badge","since" : 2012 } }
*/
// CONFIGURATION
// Configures the data loader to create the schema
config dryrun: false, preparation: true, create_schema: true, load_new: true, load_vertex_threads: 3, schema_output: 'loader_output.txt'
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/'
reviewerInput = File.json(inputfiledir + "reviewer.json")
//Specifies what data source to load using which mapper (as defined inline)
load(reviewerInput).asVertices{
label "reviewer"
key "name"
vertexProperty "badge", {
value "value"
}
}
The result is a reviewer vertex where the property badge has a meta-property since.
g.V().valueMap()
{badge=[Gold Badge], gender=[M], name=[Jon Doe]}
g.V().properties('badge').valueMap()
{since=2012}
Mapping multiple meta-properties
Mapping multiple meta-property data with DSE Graph Loader.
If the input file includes multiple meta-properties, or properties that have multiple properties, use vertexProperty.
graphloader
// PROPERTY KEYS
schema.propertyKey('badge').Text().multiple().create()
schema.propertyKey('gender').Text().single().create()
schema.propertyKey('name').Text().single().create()
schema.propertyKey('since').Int().single().create()
// VERTEX LABELS
schema.vertexLabel('reviewer').properties('name', 'gender', 'badge').create()
schema.propertyKey('badge').properties('since').add()
// INDEXES
schema.vertexLabel('reviewer').index('byname').materialized().by('name').add()
Procedure
Use vertexProperty to identify badge as a vertex property. Note the structure of the nested fields for badge in the JSON file.
/* SAMPLE INPUT
reviewer: { "name":"Jane Doe", "gender":"F",
"badge" : [{ "value": "Gold Badge", "since" : 2012 },
{ "value": "Silver Badge", "since" : 2005 }] }
*/
// CONFIGURATION
// Configures the data loader; with create_schema: false, the schema must already exist
config dryrun: false, preparation: true, create_schema: false, load_new: true, load_vertex_threads: 3, schema_output: 'loader_output.txt'
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/'
reviewerInput = File.json(inputfiledir + "reviewerMultiMeta.json")
//Specifies what data source to load using which mapper (as defined inline)
load(reviewerInput).asVertices{
label "reviewer"
key "name"
vertexProperty "badge", {
value "value"
}
}
/* SAMPLE INPUT
name|gender|value|since
Jane Doe|F|Gold Badge|2011
Jane Doe|F|Silver Badge|2005
Jon Doe|M|Gold Badge|2012
*/
// CONFIGURATION
// Configures the data loader; with create_schema: false, the schema must already exist
config dryrun: false, preparation: true, create_schema: false, load_new: true, load_vertex_threads: 3, schema_output: 'loader_output.txt'
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/'
reviewerInput = File.csv(inputfiledir + "reviewerMultiMeta.csv").delimiter('|')
//Specifies what data source to load using which mapper (as defined inline)
reviewerInput = reviewerInput.transform {
badge1 = [
"value": it.remove("value"),
"since": it.remove("since") ]
it["badge"] = [badge1]
it
}
load(reviewerInput).asVertices{
label "reviewer"
key "name"
vertexProperty "badge", {
value "value"
}
}
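The transform step in the CSV script reshapes each flat record so that it resembles the nested JSON input: the value and since fields are removed and wrapped in a one-element badge list. The plain-Groovy sketch below applies the same closure body to hand-built maps standing in for the three sample lines. It runs outside of DSE Graph Loader, and the names records and nestBadge are hypothetical.

```groovy
// Plain-Groovy illustration; each map stands in for one parsed CSV line.
def records = [
    [name: 'Jane Doe', gender: 'F', value: 'Gold Badge', since: '2011'],
    [name: 'Jane Doe', gender: 'F', value: 'Silver Badge', since: '2005'],
    [name: 'Jon Doe', gender: 'M', value: 'Gold Badge', since: '2012']
]

// Same closure body as the transform in the mapping script.
def nestBadge = {
    // remove() returns the removed value, so the flat fields move into the nested map.
    def badge1 = [
        "value": it.remove("value"),
        "since": it.remove("since")
    ]
    it["badge"] = [badge1]
    it
}

def transformed = records.collect(nestBadge)
transformed.each { println it }
```

Each input line still yields its own record. In this sketch the two Jane Doe lines stay separate, each with a single badge entry.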
The result is a reviewer vertex where the property badge has multiple values. Choosing the pop-up link for badge reveals the meta-property values.
Mapping geospatial and Cartesian data
Mapping geospatial and Cartesian data with DSE Graph Loader.
Geospatial and Cartesian data can be loaded with DSE Graph Loader. The DSE Graph Loader is not capable of creating schema for geospatial and Cartesian data, so schema must be created before loading and the create_schema configuration must be set to false.
//SCHEMA
schema.propertyKey('name').Text().create()
schema.propertyKey('point').Point().withGeoBounds().create()
schema.vertexLabel('location').properties('name','point').create()
schema.propertyKey('line').Linestring().withGeoBounds().create()
schema.vertexLabel('lineLocation').properties('name','line').create()
schema.propertyKey('polygon').Polygon().withGeoBounds().create()
schema.vertexLabel('polyLocation').properties('name','polygon').create()
schema.vertexLabel('location').index('byname').materialized().by('name').add()
schema.vertexLabel('lineLocation').index('byname').materialized().by('name').add()
schema.vertexLabel('polyLocation').index('byname').materialized().by('name').add()
schema.vertexLabel('location').index('search').search().by('point').add()
schema.vertexLabel('lineLocation').index('search').search().by('line').add()
schema.vertexLabel('polyLocation').index('search').search().by('polygon').add()
Search indexes must be used for geospatial and Cartesian points, linestrings or polygons in graph queries. DSE Graph uses one index per query, and because geospatial data consists of latitude and longitude (two parameters), only search indexes can be used to optimize query performance.
Procedure
- If desired, add configuration to the mapping script.
-
Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
/* SAMPLE DATA
name|point
New York|POINT(74.0059 40.7128)
Paris|POINT(2.3522 48.8566)
*/
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/geo_dgl/data/'
ptsInput = File.csv(inputfiledir + "vertices/place.csv").delimiter('|')
linesInput = File.csv(inputfiledir + "vertices/place_lines.csv").delimiter('|')
polysInput = File.csv(inputfiledir + "vertices/place_polys.csv").delimiter('|')
//Specifies what data source to load using which mapper (as defined inline)
// Each input is transformed before it is passed to load(), so that the transformed records are loaded
import com.datastax.driver.dse.geometry.Point
ptsInput = ptsInput.transform { it['point'] = Point.fromWellKnownText(it['point']); return it; }
load(ptsInput).asVertices {
    label "location"
    key "name"
}
import com.datastax.driver.dse.geometry.LineString
linesInput = linesInput.transform { it['line'] = LineString.fromWellKnownText(it['line']); return it; }
load(linesInput).asVertices {
    label "lineLocation"
    key "name"
}
import com.datastax.driver.dse.geometry.Polygon
polysInput = polysInput.transform { it['polygon'] = Polygon.fromWellKnownText(it['polygon']); return it; }
load(polysInput).asVertices {
    label "polyLocation"
    key "name"
}
A transformation of the input data is required, converting the point from the WKT format into the format DSE Graph stores. For a point, the transformation imports a Point library and uses the fromWellKnownText method:
import com.datastax.driver.dse.geometry.Point
ptsInput = ptsInput.transform { it['point'] = Point.fromWellKnownText(it['point']); return it; }
Linestrings and polygons use the corresponding LineString and Polygon classes from the same library, with the same method.
-
To run DSE Graph Loader for CSV loading from a directory, use the following
command:
graphloader geoMap.groovy -graph testGeo -address localhost
Mapping Gryo data generated from DSE Graph
Inserting Gryo binary data requires a slightly modified map script. To load Gryo data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.
Procedure
inputfiledir = '/tmp/Gryo/'
recipeInput = com.datastax.dsegraphloader.api.Graph.file(inputfiledir + 'recipesDSEG.gryo').gryo().dse()
load(recipeInput.vertices()).asVertices {
labelField '~label'
key 'name'
}
load(recipeInput.edges()).asEdges {
labelField '~label'
outV 'outV', {
labelField '~label'
key 'name' : 'name', 'personId' : 'personId'
}
inV 'inV', {
labelField '~label'
key 'name' : 'name', 'bookId' : 'bookId'
}
}
The Gryo data format will include ~label and name field values that must be used to create the vertices. For instance, a record that is an author will have a ~label of person and property name. For the edges, notice that a user-defined vertex ID consisting of both name and bookId is used to identify the vertex to use as the incoming vertex for the edge.
Mapping Gryo data generated with TinkerGraph
Inserting Gryo binary data requires a slightly modified map script. To load Gryo data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.
Procedure
//Specifies what data source to load using which mapper (as defined inline)
load(recipeInput.vertices()).asVertices {
labelField "~label"
key "~id", "id"
}
load(recipeInput.edges()).asEdges {
labelField "~label"
outV "outV", {
labelField "~label"
key "~id", "id"
}
inV "inV", {
labelField "~label"
key "~id", "id"
}
}
The Gryo data format will include ~label and name field values that must be used to create the vertices and edges. For instance, a record that is an author will have a ~label of author and property name. The vertexKeyMap creates a map of each vertex label to a unique property. This map is used to create unique keys used while loading vertices from the binary file.
Mapping GraphML data
Mapping GraphML data with DSE Graph Loader.
Inserting GraphML data requires a slightly modified map script. To load GraphML data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.
Procedure
//Specifies what data source to load using which mapper (as defined inline)
load(recipeInput.vertices()).asVertices {
labelField "~label"
key "~id", "id"
}
load(recipeInput.edges()).asEdges {
labelField "~label"
outV "outV", {
labelField "~label"
key "~id", "id"
}
inV "inV", {
labelField "~label"
key "~id", "id"
}
}
The GraphML data format will include ~label and ~id field values that must be used to create the label and key for each record loaded. For instance, a record that is an author will have a ~label of author. The ~id will similarly be set in the record, a difference from other data. The difference can be seen by looking at a record and noting the presence of the id field, based on the second item in each key setting in the mapping script:
g.V().hasLabel('author').valueMap()
{gender=[F], name=[Julia Child], id=[0]}
{gender=[F], name=[Simone Beck], id=[3]}
Mapping GraphSON data
Mapping GraphSON data with DSE Graph Loader.
Inserting GraphSON data requires a slightly modified map script. To load GraphSON data, allow DSE Graph Loader to create schema and load new data. Loading will require a graph schema_mode set to Development.
Procedure
//Specifies what data source to load using which mapper (as defined inline)
load(recipeInput.vertices()).asVertices {
labelField "~label"
key "~id", "id"
}
load(recipeInput.edges()).asEdges {
labelField "~label"
outV "outV", {
labelField "~label"
key "~id", "id"
}
inV "inV", {
labelField "~label"
key "~id", "id"
}
}
The GraphSON data format will include ~label and ~id field values that must be used to create the label and key for each record loaded. For instance, a record that is an author will have a ~label of author. The ~id will similarly be set in the record, a difference from other data. The difference can be seen by looking at a record and noting the presence of the id field, based on the second item in each key setting in the mapping script:
g.V().hasLabel('author').valueMap()
{gender=[F], name=[Julia Child], id=[0]}
{gender=[F], name=[Simone Beck], id=[3]}