DSE Graph configuration
Configure DSE Graph.
Adjusting the DSE Graph configuration can make a development environment easier to use, while protecting a production environment and improving its performance. Some settings affect how applications interact with the graph database, while others affect internal processing within DSE. In addition, securing DSE Graph has important consequences, and a number of configuration settings secure cluster operation. Whether you are developing or running in production, a thorough knowledge of the configuration is vital.
General DSE Graph settings
Settings that affect DSE Graph core functionality.
dse.yaml
The location of the dse.yaml file depends on the type of installation:
| Installation type | Location |
|---|---|
| Package installations | /etc/dse/dse.yaml |
| Tarball installations | installation_location/resources/dse/conf/dse.yaml |
dse.yaml Graph options
DSE Graph stores its cluster-wide options in dse.yaml under the graph: and gremlin_server: keys. The most commonly modified options are discussed in the sections below. Of particular note, the Graph sandbox is configured in the Gremlin Server options of the dse.yaml file. This feature is enabled by default and provides protection from malicious attacks within the JVM.
To modify dse.yaml settings, edit the file on each node in the cluster and restart each node. Settings in dse.yaml apply at the node (system) level. The dse.yaml files can also be modified using OpsCenter. Alternatively, options can be set per graph, as described in the Schema API configuration.
remote.yaml Gremlin console options
The remote.yaml file is the primary configuration file for DSE Graph Gremlin console connections to the Gremlin Server. Most options are self-explanatory. Be aware that if you use analytic (OLAP) queries with DSE Graph, changes are required in this file.
Replication factor
The replication factor (RF) and system replication factor (system RF) for a graph can affect the performance of reads and writes in DSE. Just as for the DSE database, these factors control the number of replicas of data that the distributed graph database will store across multiple nodes.
| Number of nodes in each datacenter | Graph replication factor | Graph System replication factor | 
|---|---|---|
| 1 | 1 | 1 | 
| 2 | 2 | 2 | 
| 3 | 3 | 3 | 
| 4 | 3 | 4 | 
| 5 or greater | 3 | 5 | 
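Reading the table above as a lookup from datacenter size to replication settings, the pattern can be expressed as a small helper. This is a minimal Python sketch for illustration only; the function name recommended_replication is hypothetical and is not part of DSE.

```python
def recommended_replication(nodes_per_dc: int) -> tuple:
    """Return (graph RF, graph system RF) for a datacenter of the given size,
    following the rows of the table above."""
    if nodes_per_dc < 1:
        raise ValueError("a datacenter needs at least one node")
    graph_rf = min(nodes_per_dc, 3)   # graph RF follows node count up to 3
    system_rf = min(nodes_per_dc, 5)  # system RF follows node count up to 5
    return graph_rf, system_rf
```

For example, recommended_replication(4) yields a graph replication factor of 3 and a system replication factor of 4, matching the fourth row of the table.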
consistency_mode, datacenter_id, read_consistency, and write_consistency
Consistency level in DSE Graph is controlled for both graph operations and DSE database operations. The consistency_mode setting configures graph operations, and the read_consistency and write_consistency settings configure the consistency level of DSE database read and write operations within a graph transaction.
consistency_mode (default: GLOBAL) is appropriate for user-defined vertex ids. If auto-generated vertex ids are used, this setting can be changed to DC_LOCAL, with a concurrent change made to the datacenter_id setting. Both consistency_mode and datacenter_id must be configured on every node in the cluster. The datacenter_id setting is ignored if consistency_mode is set to GLOBAL.
Gremlin queries execute CQL commands to insert, read, and update graph data via traversals, so the DSE database consistency level settings can affect the execution of graph operations. The consistency level for reads or writes can generally be set per graph with the read_consistency (default: ONE) and write_consistency (default: LOCAL_QUORUM) settings for user-defined vertex ids. If a search index is used in a graph traversal, read_consistency is set to LOCAL_ONE in a multiple-datacenter cluster. These options are set with the Schema API.
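The interaction between these settings can be summarized in a toy model. The Python sketch below is illustrative only: both function names are hypothetical, and treating a missing datacenter_id under DC_LOCAL as an error is an interpretation of the "concurrent change" requirement above, not documented DSE behavior.

```python
def effective_datacenter_id(consistency_mode="GLOBAL", datacenter_id=None):
    """Return the datacenter_id that takes effect, or None when it is ignored."""
    if consistency_mode == "GLOBAL":
        return None  # datacenter_id is ignored in GLOBAL mode
    if consistency_mode == "DC_LOCAL":
        if datacenter_id is None:
            # Assumption: DC_LOCAL is only meaningful with a datacenter_id.
            raise ValueError("DC_LOCAL requires datacenter_id to be set")
        return datacenter_id
    raise ValueError("unknown consistency_mode: %r" % consistency_mode)


def effective_read_consistency(read_consistency="ONE",
                               uses_search_index=False,
                               datacenter_count=1):
    """Return the read consistency a traversal would use under the rules above."""
    if uses_search_index and datacenter_count > 1:
        return "LOCAL_ONE"  # search-index traversals in a multi-datacenter cluster
    return read_consistency
```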
schema_mode
To access data, two configuration items are important: schema_mode and allow_scan.
The schema_mode setting has two choices that identify whether automatic schema creation is allowed:
- Development: allows loading graph data before explicitly specifying a graph schema through the Graph Schema API
- Production (default): requires an explicit graph schema prior to loading graph data
The schema_mode setting has a hard-coded default value of Production, which can be overridden by either:
- including an option in the dse.yaml file: schema_mode: Development
- using a Schema API option: schema.config().option('schema_mode').set('Development')
Setting schema_mode: Development can help you discover the graph schema that you may want to use. However, setting schema_mode: Production is important once development is complete, to prevent unintended schema creation.
By default, schema_mode and allow_scan are set for production, not development, to ensure out-of-the-box operation conforms to the more restrictive environment.
The settings in effect can be checked with the following methods:
- schema.getEffectiveSchemaMode(): Returns the schema_mode value in effect: the graph-level setting, the dse.yaml value (if specified), or the hard-coded default, checked in that order.
- schema.getEffectiveAllowScan(): Returns the allow_scan value in effect: the graph-level setting, the dse.yaml value (if specified), or the hard-coded default, checked in that order. It does not consider any transaction-level setting that may have been set.
- graph.getEffectiveAllowScan(): Returns the allow_scan value for a graph's internal automatically-managed transaction. schema.getEffectiveAllowScan() is preferred; this method is necessary only when the user overrides the allow_scan value at the transaction level with tx().config().option('graph.allow_scan').
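The lookup order described for these methods can be modeled as a layered search. The following Python sketch is a toy model, not DSE code: the function names and the dictionary-based representation of each configuration layer are assumptions made for illustration, while the hard-coded defaults are the ones stated in this topic.

```python
# Hard-coded defaults as stated in this topic.
HARD_CODED_DEFAULTS = {"schema_mode": "Production", "allow_scan": False}


def effective_setting(name, graph_level=None, dse_yaml=None):
    """Resolve a setting: graph-level first, then dse.yaml, then the default.

    Models the order described for schema.getEffectiveSchemaMode() and
    schema.getEffectiveAllowScan(); each layer is a plain dict (assumption).
    """
    for layer in (graph_level or {}, dse_yaml or {}):
        if name in layer:
            return layer[name]
    return HARD_CODED_DEFAULTS[name]


def effective_allow_scan_in_tx(tx_override=None, graph_level=None, dse_yaml=None):
    """Like effective_setting('allow_scan'), but a transaction-level override,
    when present, wins. Models graph.getEffectiveAllowScan()."""
    if tx_override is not None:
        return tx_override
    return effective_setting("allow_scan", graph_level, dse_yaml)
```

With no overrides, effective_setting("schema_mode") returns the production default; supplying dse_yaml={"schema_mode": "Development"} changes the result only when no graph-level value is present.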
DSE Graph security settings
Settings that affect DSE Graph security.
Graph sandbox and whitelisted/blacklisted code
The Graph sandbox, configured under the gremlin_server: key of the dse.yaml file, is enabled by default. This security feature prevents malicious code execution in the JVM that could harm a DSE instance. Sandbox rules are defined to both blacklist (disallow execution) and whitelist (allow execution) packages, superclasses, and types. For Java/Groovy code entered in the Gremlin console, only the specified allowed operations will execute. The default sandbox rules may be overridden in the dse.yaml file. The sandbox rules are applied in the following order:
- blacklist_supers, including all classes that implement or extend the listed items
- blacklist_packages, including all sub-packages
- whitelist_packages, including all sub-packages
- whitelist_types, not including sub-classes, but only the specified type
- whitelist_supers, including all classes that implement or extend the listed items
The following restrictions also apply to two classes:
- java.lang.System: All methods other than currentTimeMillis and nanoTime are blocked (blacklisted).
- java.lang.Thread: currentThread().isInterrupted is an allowed method that can return a wrapped thread with toString, and sleep is also allowed. All other methods are disallowed.
The following example shows sandbox rules in the gremlin_server section of the dse.yaml file:
    gremlin_server:
        port: 8182
        threadPoolWorker: 2
        gremlinPool: 0
        scriptEngines:
            gremlin-groovy:
                config:
    #               sandbox_enabled: false
                    sandbox_rules:
                        whitelist_packages:
                        - org.apache.tinkerpop.gremlin.process
                        - java.nio
                        whitelist_types:
                        - java.lang.String
                        - java.lang.Boolean
                        - com.datastax.bdp.graph.spark.SparkSnapshotBuilderImpl
                        - com.datastax.dse.graph.api.predicates.Search
                        whitelist_supers:
                        - groovy.lang.Script
                        - java.lang.Number
                        - java.util.Map
                        - org.apache.tinkerpop.gremlin.process.computer.GraphComputer
                        blacklist_packages:
                        - java.io
                        - org.apache.tinkerpop.gremlin.structure.io
                        - org.apache.tinkerpop.gremlin.groovy.jsr223
                        - java.nio.channels
The Fluent API restricts the allowable operations to secure execution, but uses the sandbox to enable lambda functions.
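To make the rule ordering concrete, the Python sketch below evaluates a type name against the five rule lists in the order given above. It is a simplified model and not the sandbox implementation: it assumes the first matching rule decides the outcome and that a type matching no rule is denied, and the names SandboxRules, in_package, and is_allowed are hypothetical. Because Python cannot inspect Java class hierarchies, the caller passes the type's superclasses and interfaces explicitly.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SandboxRules:
    blacklist_supers: List[str] = field(default_factory=list)
    blacklist_packages: List[str] = field(default_factory=list)
    whitelist_packages: List[str] = field(default_factory=list)
    whitelist_types: List[str] = field(default_factory=list)
    whitelist_supers: List[str] = field(default_factory=list)


def in_package(type_name: str, package: str) -> bool:
    """True if the type lives in the package or in any of its sub-packages."""
    owner = type_name.rpartition(".")[0]
    return owner == package or owner.startswith(package + ".")


def is_allowed(type_name: str, supers: Iterable[str], rules: SandboxRules) -> bool:
    """Apply the rules in documented order; the first match wins (assumption)."""
    lineage = {type_name, *supers}  # the type plus everything it extends/implements
    if lineage & set(rules.blacklist_supers):
        return False
    if any(in_package(type_name, p) for p in rules.blacklist_packages):
        return False
    if any(in_package(type_name, p) for p in rules.whitelist_packages):
        return True
    if type_name in rules.whitelist_types:  # exact type only, no sub-classes
        return True
    if lineage & set(rules.whitelist_supers):
        return True
    return False  # assumption: anything not whitelisted is denied


# A subset of the rules from the dse.yaml example above.
EXAMPLE_RULES = SandboxRules(
    whitelist_packages=["org.apache.tinkerpop.gremlin.process", "java.nio"],
    whitelist_types=["java.lang.String", "java.lang.Boolean"],
    whitelist_supers=["groovy.lang.Script", "java.lang.Number", "java.util.Map"],
    blacklist_packages=["java.io", "org.apache.tinkerpop.gremlin.structure.io",
                        "org.apache.tinkerpop.gremlin.groovy.jsr223",
                        "java.nio.channels"],
)
```

Under this model, a type in java.nio is allowed while a type in java.nio.channels is denied, because the blacklist_packages check runs before the whitelist_packages check.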
Authentication, authorization, and encryption
DSE can authenticate and authorize user access, secure the stored data with encryption, and secure Gremlin console connections with SSL. These controls apply to graphs or to graph vertex labels, as applicable.
DSE Graph security is managed by DSE security. As noted in this topic, you can modify the Graph sandbox by customizing the gremlin_server: key of the dse.yaml file.
To configure the DSE Graph Gremlin console connection to the Gremlin Server, customize the remote.yaml file for your environment.
DSE Graph also supports auditing using DSE auditing; for details, refer to Setting up database auditing.
Restrict lambda
Whether lambda functions are restricted is controlled by the restrict_lambda (default: true) value.
DSE Graph traversal performance settings
Settings that affect DSE Graph traversal performance.
allow_scan
To access data, two configuration items are important: schema_mode and allow_scan.
The allow_scan setting is a Boolean that identifies whether full scans of the entire cluster are allowed:
- TRUE: allows any graph query to do full scans of the cluster, similar to ALLOW FILTERING in CQL queries. Although useful during development, allowing full scans can result in queries that do costly linear scans over one or more tables.
- FALSE (default): a query is not executed unless it includes restrictions that limit it to a subset of the cluster's data
The allow_scan setting has a hard-coded default value of FALSE, which can be overridden to TRUE in the dse.yaml file, at the graph level, or at the transaction level.
Setting allow_scan: true allows you to fully explore and visualize the relationships in small test datasets with very broad queries like g.V(). Be aware, however, that traversals depending on full scans will take too long to execute with large production-size datasets, and that once development is complete, allow_scan: false is the appropriate setting.
By default, schema_mode and allow_scan are set for production, not development, to ensure out-of-the-box operation conforms to the more restrictive environment.
The settings in effect can be checked with the following methods:
- schema.getEffectiveSchemaMode(): Checks the hard-coded value, dse.yaml value (if specified), and graph-level setting that may have been set.
- schema.getEffectiveAllowScan(): Checks the hard-coded value, dse.yaml value (if specified), and graph-level setting that may have been set.
- graph.getEffectiveAllowScan(): Checks the hard-coded value, dse.yaml value (if specified), graph-level setting that may have been set, and transaction-level setting that may have been set.
cache
DSE Graph maintains two caches:
- adjacency cache: stores the properties of vertices and the properties of those vertices' incident edges
- index cache: stores the results of graph traversals that include a global index, such as a hasLabel() or has() step
The cache setting (default: true) can be used to disable caching. In addition, both the adjacency cache and the index cache have settings that can be modified:
| Cache setting | Default | Location | Description |
|---|---|---|---|
| vertex_cache_size | 10000l | Set with Schema API. | Maximum size of transaction-level cache of recently-used vertices. | 
| adjacency_cache_clean_rate | 1024 | dse.yaml | Number of stale rows per second to clean from each graph's adjacency cache. | 
| adjacency_cache_max_entry_size_in_mb | 0 | dse.yaml | Maximum entry size in each graph's adjacency cache. | 
| adjacency_cache_size_in_mb | 128 | dse.yaml | Amount of RAM to allocate to each graph's adjacency (edge and property) cache. | 
| index_cache_clean_rate | 1024 | dse.yaml | Number of stale entries per second to clean from the index adjacency cache. | 
| index_cache_max_entry_size_in_mb | 0 | dse.yaml | Maximum entry size in the index adjacency cache. When set to zero, the default is calculated based on the cache size and the number of CPUs. | 
Timeouts
Timeout settings can cause DSE Graph requests to fail in a variety of ways, both client-side and server-side. On the client side, commands from the Gremlin console can time out before reaching the Gremlin Server. Any request typed into the Gremlin console is sent to the Gremlin Server, and the console waits for a response until the timeout expires, at which point it aborts the request and returns control to the user. The default maximum timeout is 3 minutes. Issuing the command :remote config timeout none in the Gremlin console removes the time limit, so that the request never times out. This can be useful if sending a request to the server and receiving the response takes longer than the default timeout, as with complex traversals or large datasets.
On the server side, the cluster-wide timeout settings realtime_evaluation_timeout_in_seconds (default: 30 seconds) and analytic_evaluation_timeout_in_minutes (default: 1008 minutes) are the maximum times to wait for a traversal to evaluate, for OLTP and OLAP traversals respectively. These settings are found in the dse.yaml file. If the timeout for traversal evaluation needs to be overridden for a particular graph, evaluation_timeout can be set on a graph-by-graph basis to override either the OLTP or the OLAP traversal evaluation timeout. If complex traversals are timing out during execution, lengthening the appropriate timeout setting should fix the error.
An additional server-side setting that can be adjusted in the dse.yaml file is schema_agreement_timeout_in_ms (default: 30 seconds), the maximum time to wait for schema versions to agree across a cluster when making schema changes. If a large schema is submitted to a cluster, especially with indexes defined, this setting may need to be lengthened before data is submitted to the graph.
Finally, in the dse.yaml file, system_evaluation_timeout_in_seconds (default: 180 seconds) is defined as the maximum time to wait for a graph system request to evaluate. Creating or dropping a graph is a system request affected by this setting, which does not interact with the other timeout options.
| Timeout | Default | Impact | 
|---|---|---|
| :remote config timeout none | 3 minutes | Lengthen if command transit from Gremlin console to Gremlin Server is timing out. | 
| realtime_evaluation_timeout_in_seconds | 30 seconds | Lengthen if the OLTP traversal evaluation is timing out. | 
| analytic_evaluation_timeout_in_minutes | 1008 minutes | Lengthen if the OLAP traversal evaluation is timing out. | 
| evaluation_timeout | N/A | Set per-graph to override OLTP or OLAP traversal evaluation timeout. | 
| schema_agreement_timeout_in_ms | 30 seconds | Lengthen if a large schema is submitted, especially with indexes. | 
| system_evaluation_timeout_in_seconds | 180 seconds | Lengthen if graph system requests are not completing. | 
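The relationship between the cluster-wide evaluation timeouts and the per-graph override can be sketched as follows. This Python model is illustrative only: the function name traversal_timeout and the use of a dict to stand in for dse.yaml are assumptions, while the setting names and defaults are those listed in the table above.

```python
from datetime import timedelta


def traversal_timeout(kind, dse_yaml=None, graph_evaluation_timeout=None):
    """Return the evaluation timeout that applies to a traversal.

    kind: "OLTP" or "OLAP".
    dse_yaml: dict standing in for cluster-wide dse.yaml settings (assumption).
    graph_evaluation_timeout: per-graph evaluation_timeout as a timedelta, if set.
    """
    if kind not in ("OLTP", "OLAP"):
        raise ValueError("kind must be 'OLTP' or 'OLAP'")
    if graph_evaluation_timeout is not None:
        return graph_evaluation_timeout  # the per-graph setting overrides either default
    dse_yaml = dse_yaml or {}
    if kind == "OLTP":
        return timedelta(
            seconds=dse_yaml.get("realtime_evaluation_timeout_in_seconds", 30))
    return timedelta(
        minutes=dse_yaml.get("analytic_evaluation_timeout_in_minutes", 1008))
```

The system_evaluation_timeout_in_seconds and schema_agreement_timeout_in_ms settings are deliberately left out of this model, because they do not interact with traversal evaluation.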
external_vertex_verify and internal_vertex_verify
These settings allow a tradeoff between correctness verification and load performance. They matter when loading large datasets: external_vertex_verify (default: true) applies to user-defined vertex ids, and internal_vertex_verify (default: false) applies to auto-generated vertex ids. If you have a fresh graph with no data yet and do not need to check whether vertex ids found in your data already exist in the graph, set the appropriate option to false to speed up data loading with DSE Graph Loader. If the graph already contains data that you do not want the newly loaded dataset to overwrite, set the appropriate option to true.
tx_autostart and max_query_queue
If you are loading large GraphSON files, tx_autostart can enable a query to automatically start a new transaction once 10,000 elements are reached during loading. Another way to avoid restrictions when loading large files is to configure max_query_queue in the dse.yaml file to remove restrictions at the node (system) level.
