DSE Graph, OLTP and OLAP 

Explains the relationship between OLTP and OLAP in DSE Graph.

Online transaction processing (OLTP) is characterized by a large number of short online transactions that require very fast query processing. OLTP is typically used for transaction-oriented applications that enter and retrieve data. Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. OLAP is typically used to perform multidimensional analysis of data, doing complex calculations on aggregated historical data.

OLTP applications require sub-second response times, whereas OLAP queries take much longer to finish. Graph databases are random-access data systems. In a graph database, an OLTP traversal is localized to a particular subgraph of the global graph: it leverages indexes to "jump" to a particular vertex in the graph before starting a scan of the subgraph. An OLAP traversal instead does a linear scan of all vertices in the graph.
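The difference between jumping to a vertex through an index and scanning every vertex can be sketched with a toy in-memory model. This is only an illustration, not how DSE Graph stores or indexes data: the dictionary-based index, the function names, and every vertex other than Julia Child are assumptions made for the example.

```python
# Toy vertex set. Only "Julia Child" comes from the text; the rest is made up.
vertices = [
    {"id": 1, "label": "author", "name": "Julia Child"},
    {"id": 2, "label": "author", "name": "author Y"},
    {"id": 3, "label": "recipe", "name": "recipe A"},
    {"id": 4, "label": "recipe", "name": "recipe B"},
    {"id": 5, "label": "book", "name": "book X"},
]


def find_by_scan(vertices, label, name):
    """OLAP-style start: examine every vertex. Returns (matching ids, vertices examined)."""
    examined = 0
    matches = []
    for vertex in vertices:
        examined += 1
        if vertex["label"] == label and vertex["name"] == name:
            matches.append(vertex["id"])
    return matches, examined


def build_index(vertices):
    """Hypothetical index keyed by (vertex label, name value)."""
    index = {}
    for vertex in vertices:
        index.setdefault((vertex["label"], vertex["name"]), []).append(vertex["id"])
    return index


def find_by_index(index, label, name):
    """OLTP-style start: a single keyed lookup, independent of graph size."""
    return index.get((label, name), [])


matches, examined = find_by_scan(vertices, "author", "Julia Child")
print(matches, examined)  # the scan examines all 5 vertices to find 1 match
print(find_by_index(build_index(vertices), "author", "Julia Child"))
```

Both approaches return the same vertex, but the work done by the scan grows with the number of vertices, while the lookup does not.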

OLTP queries are best for questions that require access to a limited subset of the entire graph. OLTP queries use filters to limit the number of vertices that will be walked to find answers. DSE Graph colocates vertices with their edges and adjacent neighbors. When a traversal uses indexes to specify a subgraph, fewer disk requests are needed to locate the requested subgraph and load it into memory. Once in memory, the traversal performs a link walk from vertex to vertex along the edges.
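A link walk over colocated edges can be pictured as following an adjacency list from a known start vertex. The sketch below is a toy model under stated assumptions: the `adjacency` structure, the `link_walk` helper, the `authored` edge label, and all vertex names are hypothetical, and the start vertex is assumed to have already been located through an index.

```python
# Toy adjacency lists: each vertex is stored together with its outgoing edges
# as (edge label, target vertex) pairs. All names are made up for illustration.
adjacency = {
    "julia_child": [
        ("created", "recipe_a"),
        ("created", "recipe_b"),
        ("authored", "book_x"),
    ],
    "recipe_a": [],
    "recipe_b": [],
    "book_x": [],
    "author_y": [("created", "recipe_c")],
    "recipe_c": [],
}


def link_walk(adjacency, start, edge_label):
    """Follow edges with the given label out of `start`.

    Returns (neighbors reached, set of vertices touched by the walk).
    """
    touched = {start}
    neighbors = []
    for label, target in adjacency[start]:  # edges live with their vertex
        if label == edge_label:
            neighbors.append(target)
            touched.add(target)
    return neighbors, touched


neighbors, touched = link_walk(adjacency, "julia_child", "created")
print(neighbors)
print(len(touched), "of", len(adjacency), "vertices touched")
```

Only the start vertex and the neighbors reached through matching edges are touched; the rest of the toy graph is never visited, which is the property that keeps OLTP traversals fast.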

OLAP queries are best for questions that must access a significant portion of the data stored in a graph. Evaluating OLAP queries with the method described above would not be efficient, so a different process is used. When OLAP queries are processed, the entire graph is interpreted as a sequence of star graphs, each composed of a single vertex along with its properties, its incident edges, and the edges' properties. The star graphs are processed linearly, jumping from one star graph to the next until all star graphs are processed and an aggregation of the discovered data is completed.
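The star graph view can be illustrated with a small sketch that makes one linear pass over a list of star graphs and aggregates as it goes. The data layout, the `authored` edge label, the function name, and all vertex names except Julia Child are assumptions for illustration; this is not the actual DSE Graph or Spark representation.

```python
from collections import Counter

# Each star graph holds one vertex, its properties, and its incident edges.
# An edge appears in the star graph of both of its endpoints.
star_graphs = [
    {"id": 1, "label": "author", "properties": {"name": "Julia Child"},
     "edges": [
         {"label": "created", "direction": "OUT", "other": 2, "properties": {}},
         {"label": "created", "direction": "OUT", "other": 3, "properties": {}},
         {"label": "authored", "direction": "OUT", "other": 4, "properties": {}},
     ]},
    {"id": 2, "label": "recipe", "properties": {"name": "recipe A"},
     "edges": [{"label": "created", "direction": "IN", "other": 1, "properties": {}}]},
    {"id": 3, "label": "recipe", "properties": {"name": "recipe B"},
     "edges": [{"label": "created", "direction": "IN", "other": 1, "properties": {}}]},
    {"id": 4, "label": "book", "properties": {"name": "book X"},
     "edges": [{"label": "authored", "direction": "IN", "other": 1, "properties": {}}]},
]


def count_edges_by_label(star_graphs):
    """One linear pass over all star graphs.

    Counts only OUT edges so each edge is counted once, not once per endpoint.
    Returns (edge counts per label, number of star graphs processed).
    """
    totals = Counter()
    processed = 0
    for star in star_graphs:
        processed += 1
        for edge in star["edges"]:
            if edge["direction"] == "OUT":
                totals[edge["label"]] += 1
    return totals, processed


totals, processed = count_edges_by_label(star_graphs)
print(dict(totals), processed)
```

Because each star graph is handled independently of the others, this style of processing lends itself to being split across workers, at the cost of always reading the whole graph.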

Understanding these underlying principles can lead to writing better graph traversals to query the graph data. A simple example illustrates the differences. Using the food graph, the query is "How many recipes has Julia Child created?"

Consider the following graph traversal:
g.V().in().has('name','Julia Child').count()
==>6
This traversal looks at all vertices, walks the incoming edges, finds the adjacent vertices that have a property key of name and property value of Julia Child, and counts the number of vertices. The count returned includes all vertices with edges to Julia Child, and not just the recipes, so as shown later, the count is incorrect and too high.
Consider the number of elements that must be traversed to complete this query. DSE Graph has profiling that aids in analyzing the traversal:
gremlin>  g.V().in().has('name','Julia Child').count().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep([],vertex)                                              61          61           6.733    17.52
  query-optimizer                                                                              0.052
  iterator-setup                                                                               0.015
DsegVertexStep(IN,vertex)                                             78          78          19.496    50.72
  query-optimizer-total                                                                        0.328
  iterator-setup-total                                                                         0.313
HasStep([name.eq(Julia Child)])                                        5           5          12.147    31.60
CountGlobalStep                                                        1           1           0.040     0.11
SideEffectCapStep([~metrics])                                          1           1           0.022     0.06
                                            >TOTAL                     -           -          38.441        -
Note: The time each step takes depends on caching and other factors. For the purposes of this discussion, ignore the times reported.
Studio profile output for Traversal 1
In the first step, all vertices in the graph are traversed. This graph is very small, so the number of vertices is negligible compared to production graphs. In the next step, the traversal must find all incoming edges to the vertices. Again, for a small graph, the number of edges is negligible, but in production graphs, edges can number in the millions to billions. The adjacent vertices are then filtered on the specified property key and value, narrowing the number of vertices to 6. The last two steps compute the count and collect the profiling metrics.

This graph traversal is a classic OLAP traversal that must touch all vertices and does not use indexes.

Now consider a modification that specifies the edge label for the incoming edges:
g.V().in('created').has('name','Julia Child').count()
==>3
This modified traversal still looks at all vertices, but in walking the incoming edges, it is limited to those that are labeled as created. Looking at the profile shows an improved picture:
gremlin>  g.V().in('created').has('name','Julia Child').count().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep([],vertex)                                              61          61           9.750    33.94
  query-optimizer                                                                              0.075
  iterator-setup                                                                               0.028
DsegVertexStep(IN,[created],vertex)                                    8           8          17.056    59.38
  query-optimizer-total                                                                        0.732
  iterator-setup-total                                                                         0.420
HasStep([name.eq(Julia Child)])                                        3           3           1.856     6.46
CountGlobalStep                                                        1           1           0.033     0.12
SideEffectCapStep([~metrics])                                          1           1           0.028     0.10
                                            >TOTAL                     -           -          28.725        -
Studio profile output for Traversal 2
As with the original traversal, the first step still finds all the vertices, which in a production graph will take a long time. In the next step, however, the number of edges walked decreases significantly. Note that the third step now reflects the true answer for how many recipes Julia Child has created; the first traversal also counted the edges from Julia Child to her books.

This graph traversal is still an OLAP traversal that touches all vertices and does not use indexes.
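Why the unlabeled traversal overcounts, and why adding the edge label fixes the count without avoiding the full scan, can be reproduced on a toy edge list. The sketch below only mimics the shape of the two traversals; the data, the `authored` edge label, and the helper name are assumptions, and the counts differ from those in the food graph.

```python
# Toy data: vertex id -> name, plus edges as (out vertex, edge label, in vertex).
# Only the name "Julia Child" comes from the text.
names = {1: "Julia Child", 2: "recipe A", 3: "recipe B", 4: "book X"}
edges = [
    (1, "created", 2),
    (1, "created", 3),
    (1, "authored", 4),
]


def count_in_neighbors_named(names, edges, name, edge_label=None):
    """Mimic g.V().in(edge_label).has('name', name).count() on the toy data.

    Returns (count, vertices visited) so the scan cost stays visible.
    """
    count = 0
    visited = 0
    for vertex_id in names:                   # g.V(): visit every vertex
        visited += 1
        for out_v, label, in_v in edges:      # in(): walk incoming edges
            if in_v != vertex_id:
                continue
            if edge_label is not None and label != edge_label:
                continue                      # in('created'): skip other labels
            if names[out_v] == name:          # has('name', name)
                count += 1
    return count, visited


print(count_in_neighbors_named(names, edges, "Julia Child"))             # books included
print(count_in_neighbors_named(names, edges, "Julia Child", "created"))  # recipes only
```

In both calls every vertex is still visited; the edge label changes which edges are followed, not where the traversal starts.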

What effect does specifying the vertex label have on improving the traversal?
g.V().hasLabel('recipe').in().has('name','Julia Child').count()
==>3
This modified traversal now is limited to the recipe vertices, but walks all incoming edges. The profile shows a somewhat better picture:
gremlin>  g.V().hasLabel('recipe').in().has('name','Julia Child').count().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep([~label.eq(recipe)])                                     8           8           7.940    28.76
  query-optimizer                                                                              0.084
  iterator-setup                                                                               0.011
DsegVertexStep(IN,vertex)                                             22          22          10.708    38.78
  query-optimizer-total                                                                        0.093
  iterator-setup-total                                                                         0.061
HasStep([name.eq(Julia Child)])                                        3           3           8.784    31.81
CountGlobalStep                                                        1           1           0.126     0.46
SideEffectCapStep([~metrics])                                          1           1           0.053     0.19
                                            >TOTAL                     -           -          27.613        -
Studio profile output for Traversal 3
The first step finds only the 8 recipe vertices, and the second step walks their incoming edges to 22 adjacent vertices. In a production graph, however, finding even a limited number of vertices takes some time without indexing, and the number of edges walked could be quite large.

This graph traversal is still an OLAP traversal that does not use indexes. As will be described in the next example, indexes are identified by vertex label and property key. Although this traversal narrows the query by limiting the vertex label initially, an index is not used to find the starting point for the traversal.

This graph traversal reverses the direction of the query:
g.V().has('author', 'name', 'Julia Child').outE('created').count()
==>3
This traversal starts at a single vertex by specifying both the vertex label author and the property key name with the value Julia Child, and walks only the outgoing edges that have the edge label created.
gremlin>  g.V().has('author','name','Julia Child').outE('created').count().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
DsegGraphStep([~label.eq(author), name.eq(Julia...                     1           1           2.248    77.28
  query-optimizer                                                                              0.210
  iterator-setup                                                                               0.013
DsegVertexStep(OUT,[created],edge)                                     3           3           0.618    21.25
  query-optimizer                                                                              0.097
  iterator-setup                                                                               0.008
CountGlobalStep                                                        1           1           0.029     1.00
SideEffectCapStep([~metrics])                                          1           1           0.013     0.47
                                            >TOTAL                     -           -           2.909        -
Studio profile output for Traversal 4
A single vertex starts the traversal. The edges are filtered by edge label.

This graph traversal is an OLTP traversal. An index on the vertex label author and property key name can be used to start the traversal directly at an indexed vertex. This example results in a single vertex, but queries that use indexing to limit the starting point to even several vertices will be more efficient than a linear scan that must check all vertices in the graph. Thus, only a subgraph, or portion of the graph, is traversed.

The key to creating OLTP graph traversals is considering how the graph will be traversed. Indexing is critical to fast transactional processing. The profiling tool included with DSE Graph is valuable for analyzing how a traversal performs.

For information on running OLAP queries using Spark, see DSE Graph and Graph Analytics.