Configuring DSE Graph Loader

Before loading data using any of the methods detailed in the next topics, decide which configuration items to include in the mapping script.

Configuration settings can be applied on the command line as options prefixed with a hyphen, such as -read_threads, or they can be included in the mapping script. All configuration settings, including security options, are listed in the DSE Graph Loader reference.
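
As a minimal sketch of the two forms, the fragment below sets the same option in the mapping script and, in a comment, on the command line. It assumes read_threads is accepted in the mapping script in the same way as the other settings on this page. The value 2 is arbitrary, and the script name, graph name, and address are borrowed from the dryrun example further down.

```groovy
// Mapping-script form. The value 2 is an arbitrary illustration, not a recommendation.
config read_threads: 2

// Command-line form of the same setting (a shell command, shown here as a comment):
// graphloader map.groovy -graph food -address localhost -read_threads 2
```

Keeping a setting in the mapping script makes it part of every run of that script, while the command-line form suits settings that change from run to run.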

  • The dryrun setting runs DSE Graph Loader with a mapping script and outputs the results, but does not execute the loading process. It is useful for spotting potential errors in the mapping script or graphloader command.

    config dryrun: true

    This setting is often more useful as a command-line option, because it is rarely left in a mapping script after the script has been checked:

    graphloader map.groovy -graph food -address localhost -dryrun true

    This configuration option discovers schema and suggests missing schema without executing any changes. In DSE 6.0, this option is deprecated and may be removed in a future release.

  • The preparation setting is a validity-checking mechanism. If preparation is true, a sample of the data is analyzed to check whether the schema is valid. This setting is used in conjunction with create_schema. If create_schema and preparation are both true, the data is analyzed and compared to the schema, and any missing schema is created.

    /* CONFIGURATION                                   */
    /* Configures the data loader to analyze the schema */
    config preparation: true

    See the table below for all permutations.

    This configuration option validates and creates schema if used in conjunction with create_schema. The default will be set to false. In DSE 6.0, this option is deprecated and may be removed in a future release.

  • This example sets create_schema to true, so that schema is created from the data. Setting create_schema to true is a good way to try out new data and get feedback on what schema the data may require. It is not recommended for production data loading.

    /* CONFIGURATION                                   */
    /* Configures the data loader to create the schema */
    config create_schema: true

    It is strongly recommended that schema is created prior to data loading, so that the correct data types are enforced and indexes created. Setting create_schema to true is recommended only for testing. In DSE 6.0, this configuration option is deprecated and will be removed in a future release.

    The preparation and create_schema settings must be considered together.

    Schema preparation, creation, and results

    preparation  create_schema  Results
    true         true           Data is analyzed, and if schema is found missing, it is created; loading succeeds.
    true         false          Data is analyzed, and if schema is found missing, it is not created; loading fails. If schema is previously loaded manually, including indexes on the vertices loaded, loading succeeds.
    false        true           Data is not analyzed, and if schema is missing, it is not created; loading fails.
    false        false          Data is not analyzed, and if schema is found missing, it is not created; loading fails.

  • The load_new setting is used if vertex records do not yet exist in the graph at the beginning of the loading process, such as for a new graph. Configuring load_new can significantly speed up the loading process. However, it is important that the user guarantee that the vertex records are indeed new, or duplicate vertices can be created in the graph. Edges that are created in the same script will use the newly created vertices for the outgoing vertex outV and incoming vertex inV.

    config load_new: true

    Duplicate vertices will be created if load_new is set to true and the data being loaded contains any vertex that already exists in the graph.

  • The load_vertex_threads and load_edge_threads settings set the number of threads used for loading vertices and edges, respectively. The default for both is 0, which sets load_vertex_threads to the number of cores divided by 2 and load_edge_threads to the number of nodes in the datacenter multiplied by 6.

    config load_vertex_threads: 3, load_edge_threads: 0

  • Multiple configuration settings can be listed together.

    config load_new: true, dryrun: true, schema_output: '/tmp/loader_output.txt'
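
As a worked example of the thread defaults described above, the fragment below leaves both thread settings at 0 and spells out the arithmetic in comments. The hardware figures (8 cores on the loading machine, 3 nodes in the datacenter) are hypothetical and chosen only to make the numbers easy to follow.

```groovy
// Hypothetical environment: 8 cores on the loading machine, 3 nodes in the datacenter.
// load_vertex_threads: 0  ->  8 cores / 2  = 4 vertex-loading threads
// load_edge_threads:   0  ->  3 nodes * 6  = 18 edge-loading threads
config load_vertex_threads: 0, load_edge_threads: 0
```

Setting either value explicitly, as in the load_vertex_threads: 3 example, overrides the computed default for that setting only.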
