Graph anti-patterns
Examine common mistakes made with DSE Graph.
Some common mistakes are made with DSE Graph. Examining best practices can ease the learning curve and improve graph application performance.
Not using indexing
g.V().has('name','James Beard')...
requires the traversal to check all vertices that use the property key name. Changing this query to:
g.V().has('author', 'name', 'James Beard')...
allows the query to consult an index that can be built for all names in author records, and retrieve just one vertex to start the traversal. The index would be added during schema creation:
schema.vertexLabel('author').index('byName').secondary().by('name').add()
In fact, this one change in the traversal will change the query from an OLAP query into an OLTP query.
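The difference between the two traversals can be illustrated with a toy model. The sketch below is plain Python, not DSE Graph code: the vertex list, the `scan_by_name` and `lookup_author` helpers, and the `author_by_name` dictionary are all hypothetical stand-ins, with the dictionary playing the role of the secondary index on author names.

```python
# Toy model of a vertex store; this is not DSE Graph code.
vertices = [
    {"label": "author", "name": "James Beard"},
    {"label": "author", "name": "Julia Child"},
    {"label": "reviewer", "name": "James Beard"},
    {"label": "recipe", "name": "Beef Bourguignon"},
]


def scan_by_name(name):
    """Full scan: every vertex is examined, whatever its label."""
    examined = 0
    matches = []
    for v in vertices:
        examined += 1
        if v.get("name") == name:
            matches.append(v)
    return matches, examined


# Hypothetical stand-in for a secondary index on the author label's name.
author_by_name = {}
for v in vertices:
    if v["label"] == "author":
        author_by_name.setdefault(v["name"], []).append(v)


def lookup_author(name):
    """Indexed lookup: only the vertices in the matching index entry are touched."""
    matches = author_by_name.get(name, [])
    return matches, len(matches)


scan_matches, scan_examined = scan_by_name("James Beard")
index_matches, index_examined = lookup_author("James Beard")
print(scan_examined, index_examined)  # prints: 4 1
```

In this model the scan also returns the reviewer named James Beard, which shows that adding the vertex label narrows the result as well as the amount of work.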
Property key creation
schema.propertyKey('recipeCreationDate').Timestamp().create()
schema.propertyKey('mealCreationDate').Timestamp().create()
schema.propertyKey('reviewCreationDate').Timestamp().create()
While these property key names make code readable and ease tracking in graph traversals, each additional property key stored requires resources. Use one property key instead, such as:
schema.propertyKey('timestamp').Timestamp().create()
to decrease overhead. Since property keys are mostly used in graph traversals along with vertex labels, timestamp will be uniquely identified by the combination of vertex label and property key.
Vertex label creation
schema.vertexLabel('recipeAuthor').create()
schema.vertexLabel('bookAuthor').create()
schema.vertexLabel('mealAuthor').create()
schema.vertexLabel('reviewAuthor').create()
While these vertex labels again have the advantage of readability, unless a vertex label will be uniquely queried, it is best to roll the functionality into a single vertex label. For instance, in the above code, it is likely that recipes, meals, and books will have the same authors, whereas reviews are likely to have a different set of writers and types of queries. Use two vertex labels instead of four:
schema.vertexLabel('author').create()
schema.vertexLabel('reviewer').create()
In fact, this case may even be better suited to using only one vertex label, person, if the overlap in authors and reviewers is great enough. In some cases, a property key that identifies whether a person is an author or a reviewer is a viable option.
schema.propertyKey('type').Text().create()
schema.vertexLabel('person').create()
graph.addVertex(label, 'person', 'type', 'author', 'name', 'Jamie Oliver')
Mixing schema creation or configuration setting with traversal queries
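Consider a request that sets the read consistency to ALL and then counts the vertices that have the property key name with a value read vertex for all vertices.
schema.config().option('graph.tx_groups.default.read_consistency').set('ALL');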
g.V().has('name', 'read vertex').count()
In Gremlin Server, both statements are run in one transaction. Any changes made during this transaction are applied when it successfully commits both actions. The change in read consistency is not actually applied until the end of a transaction and thereby only affects the next transaction. The statements are not processed sequentially as individual requests.
To avoid such errors in processing, avoid mixing schema creation or configuration setting with traversal queries in applications. Best practice is to create schema and set configurations before querying the graph database with graph traversals.
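The commit-time behavior described above can be sketched with a small model. This is plain Python under stated assumptions, not Gremlin Server code: `ToySession`, its methods, and the starting level of "ONE" are hypothetical and chosen only to make the ordering visible.

```python
class ToySession:
    """Hypothetical model: a setting changed inside a transaction takes effect at commit."""

    def __init__(self):
        self.active_consistency = "ONE"  # arbitrary placeholder starting level
        self.pending_consistency = None  # change waiting for the commit

    def set_read_consistency(self, level):
        # The change is recorded but not yet in force.
        self.pending_consistency = level

    def read(self):
        # A read reports the consistency level it actually ran with.
        return self.active_consistency

    def commit(self):
        # Pending changes become active only when the transaction commits.
        if self.pending_consistency is not None:
            self.active_consistency = self.pending_consistency
            self.pending_consistency = None


session = ToySession()

# Anti-pattern: the setting and the traversal share one transaction.
session.set_read_consistency("ALL")
mixed_result = session.read()
session.commit()

# Recommended: the setting was committed before the traversal runs.
separate_result = session.read()

print(mixed_result, separate_result)  # prints: ONE ALL
```

In the model, the read that shares a transaction with the configuration change still runs at the old level; only the read issued after the commit sees ALL.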
InterruptedException indicates OLTP query running too long
In general, log entries containing this exception indicate that an OLTP query is running too long. The typical cause is that indexes have not been created for elements used in graph traversal queries. Create the indexes and retry the queries.
g.V().count() and g.E().count() can cause long delays
Running a count on a large graph can cause serious issues. The command must iterate through all the vertices, which can take hours if the graph is large. Any table scan (iterating over all vertices) is simply not an OLTP process. Counting edges is essentially the same: a full table scan. Using Spark commands is currently the recommended method to get these counts.
Setting replication factor too low for graph_name_system
Each graph that is created in turn creates three DSE database keyspaces: graph_name, graph_name_system, and graph_name_pvt. The graph_name_system keyspace stores the graph schema, and loss of this data renders the entire graph inoperable. Be sure to set the replication factor appropriately based on cluster configuration.
Using string concatenation in application instead of parameterized queries
String concatenation in graph applications will critically impair performance. Each unique query string creates an object that is cached on a node, using up node resources. Use parameterized queries (DSE Java Driver, DSE Python Driver, DSE Ruby Driver, DSE Node.js Driver, DSE C# Driver, DSE C/C++ Driver) to prevent problems due to resource allocation.
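The caching effect can be illustrated with a toy model. The sketch below is plain Python and does not call any DSE driver: `ToyQueryCache`, its `execute` method, and the `authorName` parameter are hypothetical, and the cache simply stores one placeholder object per distinct query string.

```python
class ToyQueryCache:
    """Hypothetical per-node cache keyed by the exact query string."""

    def __init__(self):
        self.compiled = {}

    def execute(self, query, params=None):
        # Each previously unseen query string costs one new cache entry.
        if query not in self.compiled:
            self.compiled[query] = object()  # placeholder for a compiled script
        return self.compiled[query]


names = ["James Beard", "Julia Child", "Jamie Oliver"]

# String concatenation: the value is baked into the text, so every query is unique.
concat_cache = ToyQueryCache()
for name in names:
    concat_cache.execute("g.V().has('author', 'name', '" + name + "')")

# Parameterized: the text stays constant and the value travels separately.
param_cache = ToyQueryCache()
for name in names:
    param_cache.execute("g.V().has('author', 'name', authorName)", {"authorName": name})

print(len(concat_cache.compiled), len(param_cache.compiled))  # prints: 3 1
```

With concatenation the number of cached entries grows with the number of distinct values, while the parameterized form stays at one entry however many values are queried.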