DSE Graph Loader reference

Synopsis

cassandra-env.sh

The location of the cassandra-env.sh file depends on the type of installation:

Package installations	/etc/dse/cassandra/cassandra-env.sh
Tarball installations	`installation_location`/resources/cassandra/conf/cassandra-env.sh

graphloader loadingScript [[-option value]...]

Table 1. Legend
Syntax conventions	Description
UPPERCASE	Literal keyword.
Lowercase	Not literal.
`Italics`	Variable value. Replace with a valid option or user-defined value.
`[ ]`	Optional. Square brackets ( `[ ]` ) surround optional command arguments. Do not type the square brackets.
`( )`	Group. Parentheses ( `( )` ) identify a group to choose from. Do not type the parentheses.
`|`	Or. A vertical bar ( `|` ) separates alternative elements. Type any one of the elements. Do not type the vertical bar.
`...`	Repeatable. An ellipsis ( `...` ) indicates that you can repeat the syntax element as often as required.
`'Literal string'`	Single quotation ( `'` ) marks must surround literal strings in CQL statements. Use single quotation marks to preserve upper case.
`{ key:value }`	Map collection. Braces ( `{ }` ) enclose map collections or key value pairs. A colon separates the key and the value.
`<datatype1,datatype2>`	Set, list, map, or tuple. Angle brackets ( `< >` ) enclose data types in a set, list, map, or tuple. Separate the data types with a comma.
`cql_statement;`	End CQL statement. A semicolon ( `;` ) terminates all CQL statements.
`[ -- ]`	Separate the command line options from the command arguments with two hyphens ( `--` ). This syntax is useful when arguments might be mistaken for command line options.
`' <schema> ... </schema> '`	Search CQL only: Single quotation marks ( `'` ) surround an entire XML schema declaration.
`@xml_entity='xml_entity_type'`	Search CQL only: Identify the entity and literal value to overwrite the XML element in the schema and solrconfig files.

Options can be invoked in the command line or included in the loading script. Required options are marked.


Option	Data type	Default	Description
-abort_on_num_failures	Integer	100	Number of failures after which loading is aborted.
-abort_on_prep_errors	Boolean	true	If errors occur during preparation or during the vertex insertion phase, loading is normally aborted. Setting this option to false forces the loader to continue up to the maximum number of allowed failures.
-address	String		The IP address (and port) of the DSE Graph instance to connect to. REQUIRED
-allow_remote_hosts_in_quorum	Boolean	false	Allows hosts in a different datacenter to participate in a local consistency level, so that a node from a remote datacenter can be used to reach a consistency level of QUORUM, for instance, for a query. Choices are: true, false.
-batch-size	Integer	100	Size of loading batches.
-compress	String	none	The compression of the file. Choices are none, gzip, and xzip.
-consistency_level	CL	ONE	Choices are: ANY, ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, SERIAL, LOCAL_SERIAL, LOCAL_ONE.
-create_graph	Boolean	true	When true, checks whether the target graph exists and creates it if it does not. Note that this option can fail on the default consistency level of QUORUM if a datacenter is unreachable.
-create_schema	Boolean	true	Whether to update or create the schema for missing schema elements. Note: It is strongly recommended that schema is created prior to data loading, so that the correct data types are enforced and indexes created. Setting `create_schema` to true is recommended only for testing. In DSE 6.0, this configuration option is deprecated and will be removed in a future release.
-driver_retry_attempts	Integer	3	Number of retry attempts. If greater than zero, requests will be resubmitted after some recoverable failures.
-driver_retry_delay	milliseconds	1000	Number of milliseconds between driver retries.
-dryrun	Boolean	false	Whether to only conduct a trial run to verify data integrity and schema consistency. Does not create a graph if it doesn't exist. Note: This configuration option discovers schema and suggests missing schema without executing any changes. In DSE 6.0, this option is deprecated and may be removed in a future release.
-filename	String		The file to load the vertex data from. REQUIRED if not defined in the mapping script.
-graph	String		The name of the graph to load into. REQUIRED
-label	String		The label of the vertex to be populated with data. If left blank, the name of the input file is used as the vertex label name.
-load_failure_log	String	load_failures.txt	Name and location of the file where failed records will be stored.
-load_new	Boolean	false	Whether the vertices loaded are new and do not yet exist in the graph.
-load_edge_threads	Integer	0	Number of threads to use for loading edge and property data into the graph (0 will force the value to be the number of nodes in the DC * 6).
-load_vertex_threads	Integer	0	Number of threads to use for loading vertices into the graph (0 will force the value to the number of cores/2).
-preparation	Boolean	true	Whether to do a preparation run to analyze the data and update the schema, if necessary. Note: This configuration option validates and creates schema if used in conjunction with `create_schema`. In DSE 6.0, this option is deprecated and its default will be set to `false`. It may be removed in a future release.
-preparation_limit	Integer	0	The number of records that the preparation phase will use to attempt to determine if the schema should be updated. Zero indicates no limit.
-queue-size	Integer	10000	Data retrieval queue size.
-read_threads	Integer	1	Number of threads to use for reading data from data input.
-remote_hosts_in_dc	Integer	2	Number of remote nodes that can participate in the consistency level for a query.
-reporting_interval	Integer	1	Number of seconds between each progress report written to the log.
-schema_output	String	proposed_schema.txt	The name of the file to save the proposed schema in when executing a dry-run. Leave blank to disable.
-skip_blank_values	Boolean	true	When false, loader will insert a blank ("") for all unspecified (empty/blank) property values in a CSV file.
-timeout	Integer	120000	Number of milliseconds until a connection times out.
-v | --version	N/A	N/A	Print the version of DSE Graph Loader.
-vertex_complete	Boolean	false	The loader assumes that all vertices referenced by properties and edges in this load are also included as vertices of this load. No new vertices will be created from edge data or property data files.

Security options:


Option	Data type	Default	Description
-kerberos	Boolean	false	Enable kerberos.
-password	String		Password for DSE authentication.
-sasl	String		An optional sasl protocol name used in conjunction with kerberos.
-ssl	Boolean	false	Enable SSL.
-username	String		Username for DSE authentication.

Description

DSE Graph Loader is a utility for loading up to 100 million vertices and 1 billion edges. The utility runs on a sufficiently powerful computer that can cache all vertices in memory and includes enough cores to parallelize the loading process. For larger loads, the utility must be run on a different machine.

DSE Graph Loader is invoked on the command line with a loading script as argument and a variable number of configuration option-value pairs. The loading script specifies what input data is being loaded and how that data maps onto the graph. The loading script can also configure the option-value pairs.

The three stages of load processing are:

Preparation

Reads the entire input data. This stage either ensures that the data conforms to the graph schema or, if enabled, updates the graph schema according to the provided data. At the end of this stage, statistical estimates are provided on how much data will be added to the graph, but no data is loaded. Set -dryrun true to abort the loading process after the preparation stage and before any changes are made. Inspect the output and verify that it matches your expectations. For large datasets, doing a dry run is important for spotting errors.

Vertex Loading

The second stage adds or retrieves all of the vertices in the input data and caches them locally to speed up the subsequent edge loading.

Edge and Property Loading

Adds all edges and properties from the input data to the graph.

A loading, or mapping, script is required to specify the particular mapping used to load the data from the input file to the graph. DSE Graph Loader supports four file-based data input types: CSV, JSON, delimited text, and text parsed by regular expressions. All file-based input formats support compression of the input data files.

Logging during the loading process can provide useful information if troubleshooting is required. The three stages of load processing are detailed in the log.

Examples

To get the listing of possible options, use -help.

graphloader -help

This example will use the loading script mymapscript.groovy to read data from a file /tmp/recipe/all.dat into the graph test that is running on the localhost. Dry run is specified to test the loading without inserting the data.

graphloader mymapscript.groovy -filename /tmp/recipe/all.dat -graph test -address localhost -dryrun true
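A dry run is usually followed by the real load with the same options, so the two steps can be wrapped together. The following is only a sketch: `load_with_dryrun` and the `LOADER` variable are hypothetical names, and the sketch assumes that graphloader returns a nonzero exit status when a dry run fails, which this reference does not confirm.

```shell
# Sketch: run a dry run first, then the real load only if the dry run succeeds.
# ASSUMPTION: graphloader exits with a nonzero status when the dry run fails.
# LOADER is a hypothetical variable naming the executable to invoke.
LOADER="${LOADER:-graphloader}"

load_with_dryrun() {
    # $1 is the loading script; the remaining arguments are passed through unchanged.
    script=$1
    shift
    "$LOADER" "$script" "$@" -dryrun true || {
        echo "dry run failed; nothing was loaded" >&2
        return 1
    }
    # Second pass without -dryrun, which defaults to false.
    "$LOADER" "$script" "$@"
}

# Usage (not executed here):
#   load_with_dryrun mymapscript.groovy -filename /tmp/recipe/all.dat -graph test -address localhost
```

Inspecting the dry run output by hand is still worthwhile; the wrapper only guards against continuing after an outright failure.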

This example will use the loading script csv2Vertex.groovy to read data from a file MyUsers.csv into the graph csvTest that is running on the localhost. The -label option specifies that the vertex label will be User, rather than the filename MyUsers.

graphloader ./scripts/csv2Vertex.groovy -filename MyUsers.csv -graph csvTest -label User -address 127.0.0.1

The configuration settings can also be specified in the loading script. A fragment of a loading script is shown here that sets load_new to true and load_vertex_threads to 3.

// CONFIGURATION
// Configures the data loader to treat the loaded vertices as new and set load_vertex_threads to 3
config load_new: true, load_vertex_threads: 3

By default, the graphloader logs debug information to the file loader.log in the directory from which graphloader is run. The location of the log can be specified with -load_failure_log:

graphloader mymapscript.groovy -graph test -address localhost -load_failure_log /tmp/dgl.log

To log information differently, create a log4j configuration file and use it in conjunction with the -load_failure_log option. Here is a sample configuration file:

# Set the root logger level to INFO and attach the appenders A1 and stdout
log4j.rootLogger=INFO, A1, stdout
#/dev/stdout
# Log INFO messages to A1.  A1 is set to be a ConsoleAppender.
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.Target=System.out
log4j.appender.A1.Threshold=INFO
# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
# Direct INFO log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.Threshold=INFO
# stdout uses PatternLayout.
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

and a sample graphloader command:

java -Dlog4j.configuration=file:./lib/log4j.properties -jar graphLoaderJar mymapscript.groovy -graph test -address localhost -load_failure_log /dev/stdout

that will write the log information to stdout.

The preparation stage has additional options. To use the input data to discover the schema, use -preparation true. If preparation discovers elements missing from the schema, graphloader adds them when -create_schema true is set. With -create_schema false, preparation still runs, but any missing schema must be created manually. Setting -create_schema true without -preparation true results in a stopped job: without sampling the data to discover the schema that the data describes, graphloader cannot know what schema to create. To summarize, to create schema manually, use -preparation true -create_schema false. To have graphloader create schema automatically, use -preparation true -create_schema true.
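The one combination that stops a job, -create_schema true without -preparation true, can be caught before graphloader starts. The check below is a minimal sketch. `check_schema_flags` is a hypothetical helper, not part of graphloader, and it only compares the two values you intend to pass.

```shell
# check_schema_flags is a hypothetical pre-flight helper, not a graphloader feature.
# Arguments: $1 = intended value for -preparation, $2 = intended value for -create_schema.
check_schema_flags() {
    preparation=$1
    create_schema=$2
    # Schema can only be created from what the preparation stage discovers.
    if [ "$create_schema" = true ] && [ "$preparation" != true ]; then
        echo "-create_schema true requires -preparation true" >&2
        return 1
    fi
    return 0
}

# Usage (not executed here):
#   check_schema_flags true false &&
#       graphloader mymapscript.groovy -graph test -address localhost \
#           -preparation true -create_schema false
```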

To use authentication, configure graphloader with -username and -password:

graphloader mymapscript.groovy -graph test -address localhost -username myName -password myPasswd

To configure graphloader with SSL encryption and Kerberos:

java -Djavax.net.ssl.trustStore=<TRUSTSTORE_PATH> -Djavax.net.ssl.trustStorePassword=<PASSWORD> -Djavax.net.ssl.keyStore=<KEYSTORE_PATH> \
-Djavax.net.ssl.keyStorePassword=<PASSWORD> -jar dse-graph-loader-5.0.3-uberjar.jar -kerberos true -sasl dsename -graph new -address localhost mymapscript.groovy

If the truststore and keystore java options are set in , the command is simplified:

java -jar dse-graph-loader.jar -kerberos true -sasl dsename -graph new -address localhost mymapscript.groovy

Runtime parameters

Some runtime conditions require changes to how graphloader is invoked. For instance, run the JAR file directly to pass Java options, or modify the graphloader script to set additional parameters.

When loading a large data set, configure enough heap space to cache all vertices. For example, this command runs Java with a maximum heap of 10 GB and calls the jar file for DSE Graph Loader:

java -Xmx10g -jar dse-graph-loader.jar

Vertex caching uses a temporary directory to store data during loading. If the temporary directory is not large enough, loading is blocked. To change the location of the temporary directory, set the runtime variable LOADER_TMP_DIR:

LOADER_TMP_DIR=/home/user ./graphloader -graph new -address localhost mymapscript.groovy
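Because loading is blocked when the temporary directory fills up, it can help to check free space on the candidate directory before starting. This is a sketch only: `free_kb` and `has_space` are hypothetical helpers, and the threshold in the usage comment is an arbitrary placeholder, since this reference does not say how much space vertex caching needs.

```shell
# free_kb is a hypothetical helper: prints the available space, in 1024-byte
# units, on the filesystem that holds the given directory.
free_kb() {
    # -P selects the POSIX output format; -k reports 1024-byte units.
    # The fourth field of the second line is the "Available" column.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_space DIR MIN_KB succeeds when DIR has at least MIN_KB kilobytes free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Usage (not executed here; 52428800 KB is an arbitrary example threshold):
#   has_space /home/user 52428800 &&
#       LOADER_TMP_DIR=/home/user ./graphloader -graph new -address localhost mymapscript.groovy
```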

Successful loading

When graphloader has successfully loaded the specified data, the results are logged to /var/lib/cassandra/system.log:

2017-02-09 23:27:22 INFO Reporter:97 - Current total additions: 1155735 vertices 1982536 edges 6583940 properties 0 anonymous
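To pull the totals out of a log for scripting or monitoring, a report line like the one above can be parsed with awk. This is a sketch that assumes report lines keep the layout of that sample line. `parse_totals` is a hypothetical helper, and it reports the last matching line it reads.

```shell
# parse_totals is a hypothetical helper that reads log text on stdin and prints
# the counts from the last "Current total additions" line it sees.
# ASSUMPTION: report lines keep the layout of the sample line in this section.
parse_totals() {
    awk '/Current total additions:/ {
        # Each count is the field immediately before its unit name.
        for (i = 2; i <= NF; i++) {
            if ($i == "vertices")   v = $(i - 1)
            if ($i == "edges")      e = $(i - 1)
            if ($i == "properties") p = $(i - 1)
        }
        found = 1
    }
    END {
        if (found) print "vertices=" v, "edges=" e, "properties=" p
    }'
}

sample='2017-02-09 23:27:22 INFO Reporter:97 - Current total additions: 1155735 vertices 1982536 edges 6583940 properties 0 anonymous'
printf '%s\n' "$sample" | parse_totals
# prints: vertices=1155735 edges=1982536 properties=6583940
```

Against a real log, the same function can be fed from the file, for example `parse_totals < loader.log`.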