DSE Graph Loader reference
Synopsis
graphloader loadingScript [[-option value]...]
Options can be specified on the command line or included in the loading script. Required options are marked REQUIRED.
Option | Data type | Default | Description |
---|---|---|---|
-abort_on_num_failures | Integer | 100 | Number of failures after which loading is aborted. |
-abort_on_prep_errors | Boolean | true | By default, loading aborts if errors occur during preparation or during the vertex insertion phase. Setting this option to false forces the loader to continue up to the maximum number of allowed failures. |
-address | String | | The IP address (and port) of the DSE Graph instance to connect to. REQUIRED |
-allow_remote_hosts_in_quorum | Boolean | false | DSE 5.0.6 and later. Allows hosts in a different datacenter to participate in a local consistency level, so that a node from a remote datacenter can be used to reach a consistency level of QUORUM, for instance, for a query. Choices are: true, false. |
-batch-size | Integer | 100 | Size of loading batches. |
-compress | String | none | The compression of the file. Choices are none, gzip, and xzip. |
-consistency_level | CL | ONE | DSE 5.0.6 and later. Choices are: ANY, ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, SERIAL, LOCAL_SERIAL, LOCAL_ONE |
-create_graph | Boolean | true | DSE 5.0.6 and later. If true, checks whether the target graph exists and creates it if it does not. Note that this option can fail on the default consistency level of QUORUM if a datacenter is unreachable. |
-create_schema | Boolean | false | Whether to update or create the schema for missing schema elements. Note: It is strongly recommended that schema is created prior to data loading, so that the correct data types are enforced and indexes created. Setting create_schema to true is recommended only for testing. In DSE 6.0, this configuration option is deprecated and will be removed in a future release. |
-driver_retry_attempts | Integer | 3 | Number of retry attempts. If greater than zero, requests will be resubmitted after some recoverable failures. |
-driver_retry_delay | milliseconds | 1000 | Number of milliseconds between driver retries. |
-dryrun | Boolean | false | Whether to only conduct a trial run to verify data integrity and schema consistency. Does not create a graph if it doesn't exist. Note: This configuration option discovers schema and suggests missing schema without executing any changes. In DSE 6.0, this option is deprecated and may be removed in a future release. |
-filename | String | | The file to load the vertex data from. REQUIRED if not defined in the mapping script. |
-graph | String | | The name of the graph to load into. REQUIRED |
-label | String | | The label of the vertex to be populated with data. If left blank, the name of the input file is used as the vertex label name. |
-load_failure_log | String | load_failures.txt | Name and location of the file where failed records will be stored. |
-load_new | Boolean | false | Whether the vertices loaded are new and do not yet exist in the graph. |
-load_edge_threads | Integer | 1 | Number of threads to use for loading edge and property data into the graph (0 will force the value to be the number of nodes in the DC * 6). |
-load_vertex_threads | Integer | 1 | Number of threads to use for loading vertices into the graph (0 will force the value to the number of cores/2). |
-preparation | Boolean | true | Whether to do a preparation run to analyze the data and update the schema, if necessary. Note: This configuration option validates and creates schema if used in conjunction with create_schema. The default will be set to false, and this option is deprecated with DSE 6.0. In a future release, it may be removed. |
-preparation_limit | Integer | 0 | The number of records that the preparation phase will use to attempt to determine if the schema should be updated. Zero indicates no limit. |
-queue-size | Integer | 10000 | Data retrieval queue size. |
-read_threads | Integer | 1 | Number of threads to use for reading data from data input. |
-remote_hosts_in_dc | Integer | 2 | DSE 5.0.6 and later. Number of remote nodes that can participate in the consistency level for a query. |
-reporting_interval | Integer | 1 | Number of seconds between each progress report written to the log. |
-schema_output | String | proposed_schema.txt | The name of the file to save the proposed schema in when executing a dry-run. Leave blank to disable. |
-skip_blank_values | Boolean | true | When false, loader will insert a blank ("") for all unspecified (empty/blank) property values in a CSV file. |
-timeout | Integer | 120000 | Number of milliseconds until a connection times out. |
-v, -version | N/A | N/A | Print the version of DSE Graph Loader. Version 5.0.4 and later. |
-vertex_complete | Boolean | false | If true, the loader assumes that all vertices referenced by properties and edges in this load are also included as vertices of this load. No new vertices will be created from edge data or property data files. |
-username | String | | Username for DSE authentication. |
-password | String | | Password for DSE authentication. |
-ssl | Boolean | false | Enable SSL. |
-kerberos | Boolean | false | Enable Kerberos. |
-sasl | String | | An optional SASL protocol name used in conjunction with Kerberos. |
Option | Data type | Default | Description |
---|---|---|---|
-kerberos | Boolean | false | Enable Kerberos. |
-password | String | | Password for DSE authentication. |
-sasl | String | | An optional SASL protocol name used in conjunction with Kerberos. |
-ssl | Boolean | false | Enable SSL. |
-username | String | | Username for DSE authentication. |
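The two thread options in the options table treat 0 as a request for a computed value: cores/2 for -load_vertex_threads and nodes in the DC * 6 for -load_edge_threads. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and integer division is an assumption, since the table does not say how odd core counts are rounded.

```shell
# Hypothetical helpers that mirror the documented meaning of a 0 thread setting.
# Assumption: cores/2 uses integer division (rounding is not documented).

effective_vertex_threads() {
    requested=$1   # value passed to -load_vertex_threads
    cores=$2       # number of cores on the loading machine
    if [ "$requested" -eq 0 ]; then
        echo $(( cores / 2 ))
    else
        echo "$requested"
    fi
}

effective_edge_threads() {
    requested=$1   # value passed to -load_edge_threads
    dc_nodes=$2    # number of nodes in the datacenter
    if [ "$requested" -eq 0 ]; then
        echo $(( dc_nodes * 6 ))
    else
        echo "$requested"
    fi
}

effective_vertex_threads 0 8   # 8 cores, setting 0
effective_edge_threads 0 3     # 3 nodes in the DC, setting 0
effective_vertex_threads 3 8   # an explicit setting is kept as-is
```

With the inputs above, the three calls print 4, 18, and 3.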
Syntax conventions | Description |
---|---|
Italics | Variable value. Replace with a user-defined value. |
[ ] | Optional. Square brackets ( [ ] ) surround optional command arguments. Do not type the square brackets. |
( ) | Group. Parentheses ( ( ) ) identify a group to choose from. Do not type the parentheses. |
\| | Or. A vertical bar ( \| ) separates alternative elements. Type any one of the elements. Do not type the vertical bar. |
[ -- ] | Separate the command line options from the command arguments with two hyphens ( -- ). This syntax is useful when arguments might be mistaken for command line options. |
Description
DSE Graph Loader is a utility for loading up to 100 million vertices and 1 billion edges. The utility runs on a sufficiently powerful computer that can cache all vertices in memory and includes enough cores to parallelize the loading process. For larger loads, the utility must be run on a different machine.
DSE Graph Loader is invoked on the command line with a loading script as argument and a variable number of configuration option-value pairs. The loading script specifies what input data is being loaded and how that data maps onto the graph. The loading script can also configure the option-value pairs.
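As a rough illustration of the invocation described above, the sketch below assembles a command line from a loading script and the two options marked REQUIRED in the options table (-graph and -address), plus an optional -dryrun value. The helper name build_graphloader_cmd is hypothetical and not part of DSE Graph Loader. It prints the command instead of running it.

```shell
# Hypothetical helper: build (but do not run) a graphloader command line.
# Only options listed in the options table are used.
build_graphloader_cmd() {
    script=$1            # loading (mapping) script
    graph=$2             # -graph, REQUIRED
    address=$3           # -address, REQUIRED
    dryrun=${4:-false}   # -dryrun, defaults to false

    # Refuse to build a command when a required value is missing.
    if [ -z "$script" ] || [ -z "$graph" ] || [ -z "$address" ]; then
        echo "usage: build_graphloader_cmd script graph address [dryrun]" >&2
        return 1
    fi

    printf 'graphloader %s -graph %s -address %s -dryrun %s\n' \
        "$script" "$graph" "$address" "$dryrun"
}

# Print the command so it can be reviewed before it is executed.
build_graphloader_cmd mymapscript.groovy test localhost true
```

Printing the command first makes it easy to confirm the option-value pairs before anything is sent to the graph.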
DSE Graph Loader loads data in three stages:
- Preparation
- Reads the entire input data. This stage either ensures that the data conforms to the graph schema, or updates the graph schema according to the provided data (if enabled). At the end of this stage, statistical estimates are provided on how much data will be added to the graph, but no data is loaded. Set -dryrun true to abort the loading process after the preparation stage and before any changes are made. Inspect the output and verify that it matches your expectations. For large datasets, doing a dry run is important for spotting errors.
- Vertex Loading
- Adds or retrieves all of the vertices in the input data and caches them locally to speed up the subsequent edge loading.
- Edge and Property Loading
- Adds all edges and properties from the input data to the graph.
A loading, or mapping, script is required to specify the particular mapping used to load the data from the input file to the graph. DSE Graph Loader supports four file-based data input types: CSV, JSON, delimited text, and text parsed by regular expressions. All file-based input formats support compression of the input data files.
Logging during the loading process can provide useful information if troubleshooting is required. The three stages of load processing are detailed in the log.
Examples
Display the available options with -help:
graphloader -help
Use the mapping script mymapscript.groovy to read data from the file /tmp/recipe/all.dat into the graph test that is running on localhost. A dry run is specified to test the loading without inserting the data:
graphloader mymapscript.groovy -filename /tmp/recipe/all.dat -graph test -address localhost -dryrun true
Use the mapping script csv2Vertex.groovy to read data from the file MyUsers.csv into the graph csvTest that is running on localhost. The -label option specifies that the vertex label will be User, rather than the filename MyUsers:
graphloader ./scripts/csv2Vertex.groovy -filename MyUsers.csv -graph csvTest -label User -address 127.0.0.1
Options can also be set in the loading script:
// CONFIGURATION
// Configures the data loader to treat the loaded vertices as new and sets load_vertex_threads to 3
config load_new: true, load_vertex_threads: 3
graphloader logs debug information to the file loader.log in the directory from which graphloader is run. The location of the file where failed records are stored can be specified with -load_failure_log:
graphloader mymapscript.groovy -graph test -address localhost -load_failure_log /tmp/dgl.log
The preparation stage has additional options. To use the input data to discover the schema, use -preparation true. If preparation discovers missing elements in the schema, those elements are added if -create_schema true is set. Alternatively, preparation can be performed and the schema created manually by setting -create_schema false. Setting -create_schema true without -preparation true results in a stopped job: without sampling the data to discover the schema that the data describes, graphloader cannot create schema because the shape of the schema is unknown. To summarize, to create schema manually, use -preparation true -create_schema false. To have graphloader create schema automatically, use -preparation true -create_schema true.
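The combinations of -preparation and -create_schema described above can be summarized as a small lookup. This is only an illustrative sketch: the function schema_mode and its output labels are hypothetical, and it does not call graphloader. The description for the case where both options are false is an inference, since that combination is not discussed.

```shell
# Hypothetical helper: describe what a -preparation / -create_schema
# combination means, following the paragraph above.
schema_mode() {
    preparation=$1     # value given to -preparation
    create_schema=$2   # value given to -create_schema
    case "$preparation,$create_schema" in
        true,true)   echo "automatic: schema is discovered and created by graphloader" ;;
        true,false)  echo "manual: schema is discovered, you create it yourself" ;;
        false,true)  echo "stopped: -create_schema true requires -preparation true" ;;
        # Inference, not stated in the text: nothing is discovered or created.
        false,false) echo "none: schema is neither discovered nor created" ;;
        *)           echo "error: both values must be true or false" >&2; return 1 ;;
    esac
}

schema_mode true false
schema_mode false true
```

A check like this could run in a wrapper script before graphloader is started, so that the stopped-job combination is caught early.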
Run graphloader with -username and -password:
graphloader mymapscript.groovy -graph test -address localhost -username myName -password myPasswd
Run graphloader with SSL encryption and using Kerberos:
java -Djavax.net.ssl.trustStore=<TRUSTSTORE_PATH> -Djavax.net.ssl.trustStorePassword=<PASSWORD> -Djavax.net.ssl.keyStore=<KEYSTORE_PATH> \
-Djavax.net.ssl.keyStorePassword=<PASSWORD> -jar dse-graph-loader-5.0.3-uberjar.jar -kerberos true -sasl dsename -graph new -address localhost mymapscript.groovy
If the truststore and keystore Java options are set in cassandra-env.sh, the command is simplified:
java -jar dse-graph-loader.jar -kerberos true -sasl dsename -graph new -address localhost mymapscript.groovy
Runtime parameters
Some runtime settings cannot be passed as graphloader options. To use Java modifiers, run the JAR file directly, or modify the graphloader script to allow additional parameters to be set. For example, to set the maximum Java heap size to 10 GB:
java -Xmx10g -jar dse-graph-loader.jar
Set LOADER_TMP_DIR in the environment when invoking graphloader:
LOADER_TMP_DIR=/home/user ./graphloader -graph new -address localhost mymapscript.groovy
Successful loading
2017-02-09 23:27:22 INFO Reporter:97 - Current total additions: 1155735 vertices 1982536 edges 6583940 properties 0 anonymous
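To pull the counters out of a progress line like the one above, a sed expression is usually enough. The sketch below assumes the exact line layout shown here, which may differ in other loader versions. The function name parse_additions is hypothetical.

```shell
# Hypothetical helper: read log lines on stdin and print the vertex, edge,
# and property counters from any "Current total additions" line.
# Lines that do not match the assumed layout produce no output.
parse_additions() {
    sed -n 's/.*Current total additions: \([0-9][0-9]*\) vertices \([0-9][0-9]*\) edges \([0-9][0-9]*\) properties.*/vertices=\1 edges=\2 properties=\3/p'
}

echo '2017-02-09 23:27:22 INFO Reporter:97 - Current total additions: 1155735 vertices 1982536 edges 6583940 properties 0 anonymous' | parse_additions
```

For the sample line this prints vertices=1155735 edges=1982536 properties=6583940. Against a real log, the last matching line could be fed in with something like grep 'Current total additions' loader.log | tail -n 1 | parse_additions.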