Loading CSV data
A common file format for loading graph data is CSV (comma-delimited data).
An input CSV file generally identifies the property keys in a header line, which is the first line of the file.
However, the mapping script can also identify the property keys to be read with header() in the data input line.
If more flexibility is desired, such as manipulation of the vertex labels using labelField, use Loading TEXT data.
Mapping several different CSV files
DSE Graph Loader can load several different CSV files that exist in a directory using the following steps. Sample input data:
SAMPLE INPUT
// For the author.csv file:
name|gender
Julia Child|F
// For the book.csv file:
name|year|ISBN
Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:
bname|aname
Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck
Because the property key name is used for both vertex labels author and book, the authorBook file uses the variables aname and bname for author name and book name, respectively.
These variables are used in the mapping logic that creates the edges between author and book vertices.
-
If desired, add configuration to the mapping script.
-
Specify the data input files. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/CSV/'
authorInput = File.csv(inputfiledir + 'author.csv').delimiter('|')
bookInput = File.csv(inputfiledir + 'book.csv').delimiter('|')
authorBookInput = File.csv(inputfiledir + 'authorBook.csv').delimiter('|')
It is important to note that CSV files can have a header line that shows the field names. For example, the file read by authorInput will have the following as the first line in the file:
name|gender
If a header() is used in the mapping script and a header line is used in the data file, then both must match. Either a header line in the data file or a header() is required.
-
In each line, the file is specified as a csv file, the file name is specified, and a delimiter is set. A map, authorInput, is created that will be used to process the data. The map can be manipulated before loading using transforms.
authorInput = File.csv(inputfiledir + 'author.csv').delimiter('|')
If you need to trim excess whitespace from data, use trimWhitespace(true) in the File.csv() statement.
-
Create the main body of the mapping script. This part of the mapping script is the same regardless of the file format.
-
To run DSE Graph Loader for CSV loading as a dry run, use the following command:
$ graphloader authorBookMappingCSV.groovy -graph testCSV -address localhost -dryrun true
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema must be created prior to using graphloader.
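The header rule in the steps above can be checked before a load is attempted. The following Python sketch is not part of DSE Graph Loader; it is a hypothetical pre-flight check, and the function names read_header and header_matches are invented for illustration. It compares the first line of pipe-delimited data with the field names that a mapping script would pass to header().

```python
import csv
import io


def read_header(text, delimiter="|"):
    """Return the field names found on the first line of delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return next(reader)


def header_matches(text, declared, delimiter="|"):
    """True if the header line equals the declared field names, in order."""
    return read_header(text, delimiter) == list(declared)


# Contents of author.csv from the sample input, held in memory for the sketch.
author_csv = "name|gender\nJulia Child|F\n"

print(header_matches(author_csv, ["name", "gender"]))  # True
print(header_matches(author_csv, ["name", "year"]))    # False
```

A check like this only confirms that the two headers agree; it does not verify that the data rows have the right number of fields.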
The full script is shown:
/* SAMPLE INPUT
author: Julia Child|F
book: Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
authorBook: Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck
*/
// CONFIGURATION
// Configures the data loader to create the schema
config create_schema: true, load_new: true, load_vertex_threads: 3
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/CSV/'
authorInput = File.csv(inputfiledir + "author.csv").delimiter('|')
bookInput = File.csv(inputfiledir + "book.csv").delimiter('|')
authorBookInput = File.csv(inputfiledir + "authorBook.csv").delimiter('|')
// Specifies what data source to load using which mapper (as defined inline)
load(authorInput).asVertices {
    label "author"
    key "name"
}
load(bookInput).asVertices {
    label "book"
    key "name"
}
load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
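The edge mapping in the script pairs two columns of authorBook.csv with vertex keys: aname supplies the name of the outgoing author vertex and bname supplies the name of the incoming book vertex. As a rough illustration of that column-to-key pairing, the Python sketch below turns each data row into an (out key, label, in key) triple. It is an assumption-laden model, not the loader's implementation, and build_edges is a hypothetical helper.

```python
import csv
import io


def build_edges(text, label, out_field, in_field, delimiter="|"):
    """Turn each data row into an (out key, edge label, in key) triple."""
    rows = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [(row[out_field], label, row[in_field]) for row in rows]


# Contents of authorBook.csv from the sample input, held in memory.
author_book_csv = (
    "bname|aname\n"
    "Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck\n"
)

edges = build_edges(author_book_csv, "authored", "aname", "bname")
print(edges[0][0])  # Simone Beck
print(edges[0][1])  # authored
```

The column order in the file does not matter here, because the fields are addressed by name rather than by position.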
Mapping several files with the same format from a directory
DSE Graph Loader can load several CSV files with the same format that exist in a directory using the following steps. Sample input data:
SAMPLE INPUT
// For the author.csv file:
name|gender
Julia Child|F
Simone Beck|F
// For the knows.csv file:
aname|bname
Julia Child|James Beard
A number of files with the same format exist in a directory. If the files differ, the graphloader will issue an error and stop:
java.lang.IllegalArgumentException: /tmp/dirSource/data has more than 1 input type.
-
If desired, add configuration to the mapping script.
-
Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/dirSource/data'
personInput = File.directory(inputfiledir).delimiter('|').header('name','gender')
// Specifies what data source to load using which mapper (as defined inline)
load(personInput).asVertices {
    label "author"
    key "name"
}
The important element is File.directory(); this defines the directory where the files are stored. It is important to note that CSV files must have a header line that shows the field names. For example, the files read by personInput will have the following as the first line in the file:
name|gender
-
Note that two directories could be used to load vertices and edges:
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/dirSource/data'
vertexfiledir = inputfiledir+'/vertices'
edgefiledir = inputfiledir+'/edges'
personInput = File.directory(vertexfiledir).delimiter('|').header('name','gender')
personEdgeInput = File.directory(edgefiledir).delimiter('|').header('aname','bname')
// Specifies what data source to load using which mapper (as defined inline)
load(personInput).asVertices {
    label "author"
    key "name"
}
load(personEdgeInput).asEdges {
    label "knows"
    outV "aname", {
        label "author"
        key "name"
    }
    // Both ends of a knows edge are author vertices
    inV "bname", {
        label "author"
        key "name"
    }
}
-
To run DSE Graph Loader for CSV loading from a directory, use the following command:
$ graphloader dirSourceMapping.groovy -graph testdirSource -address localhost
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
Mapping files from a directory using a file pattern
Mapping several same format CSV files with DSE Graph Loader.
DSE Graph Loader can load several files from a directory using file pattern matching. Sample input files:
$ ls data
badOne.csv person1.csv person2.csv
A number of files with the same format exist in a directory. If the files differ, DSE Graph Loader will only load the files that match the pattern in the mapping script.
Several file patterns are defined for use:
Pattern | Description | Example |
---|---|---|
* | Matches zero or more characters. While matching, it will not cross directory boundaries. | *.csv matches all files ending in .csv. |
** | Same as *, but will cross directory boundaries. | Matches CSV files in more than one directory. |
? | Matches exactly one character. | person?.csv matches CSV files named person1.csv or personA.csv, but not person11.csv. |
\ | Escapes characters that would otherwise be interpreted as special characters. | \\ matches a single \. |
[ ] | Matches a set of designated characters, though only a single character is matched. | [efg] matches "e", "f", or "g", so person[efg] matches persone, personf, or persong. [1-9] matches any one digit from 1 through 9; person[1-9].csv matches files person1.csv through person9.csv. |
{ } | Matches a group of sub-patterns. | *.{csv,json} matches all CSV and JSON files in the directory. |
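
To get a feel for how the single-directory patterns in the table select files, the following Python sketch applies three of them to a list of file names. It uses Python's fnmatch module as a stand-in, which is an assumption: fnmatch handles *, ?, and [ ] in the way the table describes for plain file names, but it does not implement ** or { }, and it is not the matcher that DSE Graph Loader uses. The names person11.csv and personA.csv are hypothetical additions to the sample listing.

```python
from fnmatch import fnmatchcase

# Sample listing plus two hypothetical files to show the edge cases.
files = ["badOne.csv", "person1.csv", "person2.csv", "person11.csv", "personA.csv"]


def select(pattern):
    """Return the file names that match the pattern, in listing order."""
    return [name for name in files if fnmatchcase(name, pattern)]


print(select("person*.csv"))      # ['person1.csv', 'person2.csv', 'person11.csv', 'personA.csv']
print(select("person?.csv"))      # ['person1.csv', 'person2.csv', 'personA.csv']
print(select("person[1-9].csv"))  # ['person1.csv', 'person2.csv']
```

The output shows why ? and [1-9] are not interchangeable in general: both reject person11.csv, but only ? accepts personA.csv.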
-
Mapping using *
-
If desired, add configuration to the mapping script.
-
Sample input file:
/* SAMPLE CSV INPUT:
id|name|gender
001|Julia Child|F
*/
-
Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/filePattern'
inputfileCSV = inputfiledir+'/data'
personInput = File.directory(inputfileCSV).fileMatches("person*.csv").delimiter('|').header('id','name','gender')
// Specifies what data source to load using which mapper (as defined inline)
load(personInput).asVertices {
    label "person"
    key "name"
}
/* RESULT: person1.csv and person2.csv will be loaded, but not badOne.csv */
The important element is fileMatches("person*.csv"); this defines the pattern that will be matched for loaded files. The file badOne.csv will not be loaded, because the pattern does not match. Note that a file personExtra.csv would also be loaded, as it would match the pattern.
This same pattern matching can be used for JSON input files, by substituting person*.json for person*.csv and using JSON input file parameters.
-
To run DSE Graph Loader for CSV loading from a directory, use the following command:
$ graphloader filePatternCSV.groovy -graph testPattCSV -address localhost
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
-
Mapping using [ ]
-
If desired, add configuration to the mapping script.
-
Sample input file:
/* SAMPLE CSV INPUT:
id|name|gender
001|Julia Child|F
*/
-
Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/filePattern'
inputfileCSV = inputfiledir+'/data'
personInput = File.directory(inputfileCSV).fileMatches("person[1-9].csv").delimiter('|').header('id','name','gender')
// Specifies what data source to load using which mapper (as defined inline)
load(personInput).asVertices {
    label "person"
    key "name"
}
/* RESULT: person1.csv and person2.csv will be loaded, but not badOne.csv */
The important element is fileMatches("person[1-9].csv"); this defines the pattern that will be matched for loaded files. All files person1.csv through person9.csv will be loaded, but neither person15.csv nor badOne.csv matches the pattern, so they will not be loaded. Note that fileMatches("person?.csv") would achieve the same result for these sample files, although ? also matches non-digit characters, as in personA.csv.
This same pattern matching can be used for JSON input files, by substituting person[1-9].json for person[1-9].csv and using JSON input file parameters.
-
To run DSE Graph Loader for this example, use the following command:
$ graphloader filePatternRANGE.groovy -graph testPattRANGE -address localhost
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
-
Mapping using { } with multiple patterns
-
If desired, add configuration to the mapping script.
-
Sample input file:
/* SAMPLE CSV INPUT:
id|name|gender
001|Julia Child|F
*/
-
Specify the data input directory. The variable inputfiledir specifies the directory for the input files. Each of the identified files will be used for loading.
// DATA INPUT
// Define the data input source (a file which can be specified via command line arguments)
// inputfiledir is the directory for the input files
inputfiledir = '/tmp/filePattern/data'
personInput = File.directory(inputfiledir).fileMatches("{person*,badOne}.csv").delimiter('|').header('id','name','gender')
// Specifies what data source to load using which mapper (as defined inline)
load(personInput).asVertices {
    label "person"
    key "name"
}
/* RESULT: person1.csv, person2.csv and badOne.csv will all be loaded */
The important element is fileMatches("{person*,badOne}.csv"); this defines the pattern that will be matched for loaded files. The files person1.csv, person2.csv, and badOne.csv will be loaded, because the pattern matches all three files.
This same pattern matching can be used for JSON input files, by substituting {person*,badOne}.json for {person*,badOne}.csv and using JSON input file parameters.
-
To run DSE Graph Loader for this example, use the following command:
$ graphloader filePatternMULT.groovy -graph testPattMULT -address localhost
For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
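
The { } pattern can be understood as shorthand for several simpler patterns. The Python sketch below makes that expansion explicit: it rewrites {person*,badOne}.csv into person*.csv and badOne.csv, then matches each file name against the expanded list. This is an illustration under stated assumptions, not the loader's implementation: expand_braces and select are hypothetical helpers, nested groups are not handled, Python's fnmatch stands in for the real matcher, and notes.txt is an invented file name.

```python
import re
from fnmatch import fnmatchcase


def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of plain patterns (no nesting)."""
    group = re.search(r"\{([^{}]*)\}", pattern)
    if group is None:
        return [pattern]
    head, tail = pattern[:group.start()], pattern[group.end():]
    expanded = []
    for alternative in group.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def select(names, pattern):
    """Return the names that match any expansion of the pattern."""
    plain = expand_braces(pattern)
    return [n for n in names if any(fnmatchcase(n, p) for p in plain)]


files = ["badOne.csv", "person1.csv", "person2.csv", "notes.txt"]

print(expand_braces("{person*,badOne}.csv"))  # ['person*.csv', 'badOne.csv']
print(select(files, "{person*,badOne}.csv"))  # ['badOne.csv', 'person1.csv', 'person2.csv']
```

Seen this way, the pattern loads badOne.csv because one of its alternatives names that file exactly, while notes.txt is skipped because no alternative matches it.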