Graph anti-patterns

Examine common mistakes made with DSE Graph.

Certain mistakes recur in DSE Graph applications. Reviewing them alongside best practices can ease the learning curve and improve graph application performance.

Not using indexing

Indexing is key to reducing query latency in a distributed database, and DSE Graph relies on it to keep OLTP read latency low for complex graph traversals. The important point is that global indexing in DSE Graph involves both a vertex label and a property key. The vertex label narrows the search in the underlying Cassandra datastore to a partition, which in turn narrows the search to one or a small number of Cassandra nodes in the cluster. If an indexed property key is used by more than one vertex label and the query does not supply the vertex label, the query amounts to an almost full scan of the cluster. Thus, using this query:

g.V().has('name','James Beard')...

requires the traversal to check all vertices that use the property key name. Changing this query to:

g.V().has('author', 'name', 'James Beard')...

allows the query to consult an index that can be built for all names in author records, and retrieve just one vertex to start the traversal. The index would be added during schema creation:

schema.vertexLabel('author').index('byName').secondary().by('name').add()

In effect, this one change turns a scan-style query, better suited to OLAP, into an OLTP query.
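
The difference between the two lookups can be sketched with a toy in-memory model. This is only an illustration in Python, not how DSE Graph stores or indexes data; the vertex list and the helper names (build_index, scan_by_property, lookup_by_label_and_property) are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Toy stand-in for a graph. This is NOT the DSE Graph storage engine;
# the data and helper names are assumptions made for illustration.
VERTICES = [
    {"label": "author", "name": "James Beard"},
    {"label": "author", "name": "Julia Child"},
    {"label": "reviewer", "name": "James Beard"},
    {"label": "recipe", "name": "Beef Bourguignon"},
]


def build_index(vertices, label, key):
    """Map each value of `key` to the vertices that carry the given label."""
    index = defaultdict(list)
    for vertex in vertices:
        if vertex["label"] == label and key in vertex:
            index[vertex[key]].append(vertex)
    return index


def scan_by_property(vertices, key, value):
    """Analogue of g.V().has(key, value): every vertex is examined."""
    examined = 0
    hits = []
    for vertex in vertices:
        examined += 1
        if vertex.get(key) == value:
            hits.append(vertex)
    return hits, examined


def lookup_by_label_and_property(index, value):
    """Analogue of g.V().has(label, key, value): a single index lookup."""
    return index.get(value, [])


# Analogue of the 'byName' index on the 'author' vertex label.
AUTHOR_BY_NAME = build_index(VERTICES, "author", "name")

hits, examined = scan_by_property(VERTICES, "name", "James Beard")
indexed_hits = lookup_by_label_and_property(AUTHOR_BY_NAME, "James Beard")
```

In this model the unlabeled scan examines every vertex and also matches a reviewer who happens to share the name, while the labeled lookup goes straight to the one author vertex.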

Property key creation

Property key creation can affect the performance of DSE Graph. Giving every vertex label its own uniquely named property keys can seem beneficial at first, but reusing property keys across vertex labels reduces the storage overhead of the graph schema. For example, consider the following:

schema.propertyKey('recipeCreationDate').Timestamp().create()
schema.propertyKey('mealCreationDate').Timestamp().create()
schema.propertyKey('reviewCreationDate').Timestamp().create()

While these property key names make code readable and ease tracking in graph traversals, each additional property key stored requires resources. Use one property key instead, such as:

schema.propertyKey('timestamp').Timestamp().create()

to decrease overhead. Since property keys are mostly used in graph traversals along with vertex labels, timestamp will be uniquely identified by the combination of vertex label and property key.

Vertex label creation

Vertex label creation can affect the performance of DSE Graph. Using many unique vertex labels can seem useful, but, as with property keys, keeping the number of vertex labels to a minimum reduces storage requirements. For example, consider the following:

schema.vertexLabel('recipeAuthor').create()
schema.vertexLabel('bookAuthor').create()
schema.vertexLabel('mealAuthor').create()
schema.vertexLabel('reviewAuthor').create()

While these vertex labels again have the advantage of readability, it is best to roll the functionality into a single vertex label unless a label will be queried in a distinct way. For instance, in the above code, it is likely that recipes, meals, and books will have the same authors, whereas reviews are likely to have a different set of writers and different types of queries. Use two vertex labels instead of four:

schema.vertexLabel('author').create()
schema.vertexLabel('reviewer').create()

In fact, this case may be even better suited to a single vertex label, person, if the overlap between authors and reviewers is great enough. In some cases, a property key that identifies whether a person is an author or a reviewer is a viable option.

schema.propertyKey('type').Text().create()
schema.vertexLabel('person').create()
graph.addVertex(label, 'person', 'type', 'author', 'name', 'Jamie Oliver')

Mixing schema creation or configuration setting with traversal queries

Consider the following statements. The first statement configures a graph setting for read consistency. The second statement counts all vertices whose name property has the value read vertex.

schema.config().option('graph.tx_groups.default.read_consistency').set('ALL');
g.V().has('name', 'read vertex').count()

In Gremlin Server, both statements run in one transaction; they are not processed sequentially as individual requests. Any changes made during the transaction are applied only when it commits successfully. The change in read consistency therefore does not take effect until the end of the transaction, and it affects only the next transaction, not the count in the same request.

To avoid such errors in processing, avoid mixing schema creation or configuration setting with traversal queries in applications. Best practice is to create schema and set configurations before querying the graph database with graph traversals.
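
The commit-time behavior described above can be pictured with a small toy model. This Python sketch is an assumption-laden illustration, not Gremlin Server code: the ToySession class, its run_request method, and the starting consistency level of ONE are all hypothetical placeholders.

```python
class ToySession:
    """Toy model of one-request-equals-one-transaction semantics.

    Hypothetical illustration only; it does not reproduce Gremlin Server
    internals. Configuration changes are buffered and applied at commit.
    """

    def __init__(self, read_consistency="ONE"):
        # "ONE" is an arbitrary placeholder for the setting in effect.
        self.read_consistency = read_consistency

    def run_request(self, statements):
        pending = {}   # config changes waiting for commit
        observed = []  # (read name, consistency level the read actually used)
        for kind, value in statements:
            if kind == "config":
                pending["read_consistency"] = value
            elif kind == "read":
                observed.append((value, self.read_consistency))
            else:
                raise ValueError("unknown statement kind: " + kind)
        # Commit: buffered configuration becomes visible to later requests.
        self.read_consistency = pending.get(
            "read_consistency", self.read_consistency
        )
        return observed


# Mixed request: the read still runs with the old setting.
mixed = ToySession()
mixed_result = mixed.run_request([("config", "ALL"), ("read", "count")])

# Separate requests: the read sees the committed setting.
separate = ToySession()
separate.run_request([("config", "ALL")])
separate_result = separate.run_request([("read", "count")])
```

In the model, the mixed request performs its read at the old level, while sending the configuration change as its own request first lets the following read use the new level.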

InterruptedException indicates OLTP query running too long

In general, log entries containing this exception indicate that an OLTP query is running too long. The typical cause is that indexes have not been created for elements used in graph traversal queries. Create the indexes and retry the queries.

g.V().count() and g.E().count() can cause long delays

Running a count on a large graph can cause serious issues. The command must iterate through all the vertices, which can take hours if the graph is large. Any table scan (iterating over all vertices) is simply not an OLTP process. Counting edges is essentially the same operation, a full table scan. Using Spark is currently the recommended method to get these counts.

Setting replication factor too low for graph_name_system

Creating a graph in turn creates three Cassandra keyspaces: graph_name, graph_name_system, and graph_name_pvt. The graph_name_system keyspace stores the graph schema, and loss of this data renders the entire graph inoperable. Be sure to set the replication factor of this keyspace appropriately for the cluster configuration.

Using string concatenation in application instead of parameterized queries

Building query strings by concatenation in graph applications will critically impair performance. Each unique query string creates an object that is cached on a node, using up node resources. Use parameterized queries (DSE Java Driver, DSE Python Driver, DSE Ruby Driver, DSE Node.js Driver, DSE C# Driver, DSE C/C++ Driver) so that the query string stays constant and only the bound values change.
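
The effect on the cache can be sketched with a toy model. This Python example does not call any DSE driver; the ToyScriptCache class and the binding name authorName are hypothetical, and the model assumes only what is stated above, namely that each unique query string produces one cached object.

```python
class ToyScriptCache:
    """Hypothetical stand-in for a per-node cache keyed by query string."""

    def __init__(self):
        self.compiled = {}

    def execute(self, query, bindings=None):
        # A new query string costs a new cache entry; bindings do not.
        if query not in self.compiled:
            self.compiled[query] = object()  # placeholder for a compiled script
        return self.compiled[query]


names = ["James Beard", "Julia Child", "Jamie Oliver"]

# Anti-pattern: the value is spliced into the query text, so every
# distinct name yields a distinct query string and a new cache entry.
concat_cache = ToyScriptCache()
for name in names:
    concat_cache.execute("g.V().has('author', 'name', '" + name + "')")

# Parameterized: the query text is constant and the value travels
# separately as a binding, so a single cache entry is reused.
param_cache = ToyScriptCache()
for name in names:
    param_cache.execute(
        "g.V().has('author', 'name', authorName)",
        {"authorName": name},
    )

concat_entries = len(concat_cache.compiled)
param_entries = len(param_cache.compiled)
```

Here the concatenated form grows the cache by one entry per distinct name, while the parameterized form keeps a single entry however many names are queried.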