Creating graph schema

Creating graph database schema.

Designing a data model is the critical first step toward creating a graph database schema. Once the data model is designed and a graph is created, the next step is to define the schema for the vertices and edges and their properties. Schema scripts are written in Gremlin-Groovy, which is packaged with the Apache TinkerPop engine and can be used with either DataStax Studio or the Gremlin console (dse gremlin-console) installed with DataStax Graph.

Before creating schema, create a graph. If you are reusing a graph that you previously created, drop the graph schema and data. The general order of creating schema is to:
  • create any user-defined types (UDTs) that will be used in vertex or edge labels
  • create the vertex labels along with vertex properties
  • create the edge labels along with edge properties
  • create, or analyze and apply, indexes for either vertex or edge labels
Both vertex and edge labels can have properties of any UDT or standard data type. Graph data types and CQL data types are identical. Because each vertex label and each edge label is stored in its own CQL table, property names are assigned as columns of that table and are not global: each property belongs to the label that defines it. Properties are used to retrieve selective subsets of the graph and to return stored values in graph queries. They can also be used to pinpoint a particular vertex or edge as the starting point for a graph traversal.
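
The general order described above can be sketched in Gremlin-Groovy as follows. This is a minimal illustration, not a definitive script: the graph name food_demo, the labels, the property names, and the sample query value are all hypothetical, and the calls assume the schema API of a recent DataStax Graph release.

```groovy
// Hypothetical names throughout; run in DataStax Studio or the Gremlin console.

// Create the graph first (system-level command).
system.graph('food_demo').ifNotExists().create()

// In the Gremlin console, point the traversal source at the new graph:
// :remote config alias g food_demo.g

// 1. User-defined types used by the labels
schema.type('address').
    ifNotExists().
    property('street', Text).
    property('city', Text).
    create()

// 2. Vertex labels with their properties
schema.vertexLabel('person').
    ifNotExists().
    partitionBy('person_id', Uuid).
    property('name', Text).
    property('home', frozen(typeOf('address'))).
    create()

schema.vertexLabel('book').
    ifNotExists().
    partitionBy('book_id', Int).
    property('title', Text).
    create()

// 3. Edge labels with their properties
schema.edgeLabel('authored').
    ifNotExists().
    from('person').to('book').
    property('year', Int).
    create()

// 4. Indexes: analyze a query to see which index it needs, then apply it
schema.indexFor(g.V().has('person', 'name', 'Jane Doe')).analyze()
schema.indexFor(g.V().has('person', 'name', 'Jane Doe')).apply()
```

Each step depends on the previous one: the person label refers to the address type, and the authored edge label refers to both vertex labels, which is why the order matters.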

Meta-properties, or properties of properties, can be stored in collections (set, list, map), tuples, or UDTs. Indexing these properties can facilitate graph queries that use the data stored in such data types. Collections, tuples and UDTs can be nested, and the frozen keyword can be used.
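
As an illustration of collection, tuple, and UDT properties, the following sketch defines one property of each kind. The label reviewer and all property names are hypothetical, and the type helpers (setOf, listOf, mapOf, tupleOf, typeOf, frozen) are assumed to be available as in recent DataStax Graph releases.

```groovy
// Hypothetical UDT used as a property type below.
schema.type('address').
    ifNotExists().
    property('street', Text).
    property('city', Text).
    create()

schema.vertexLabel('reviewer').
    ifNotExists().
    partitionBy('reviewer_id', Uuid).
    property('nicknames', setOf(Text)).                        // set
    property('phones', listOf(Text)).                          // list
    property('badges', mapOf(Text, Date)).                     // map
    property('countries', listOf(tupleOf(Text, Date, Date))).  // tuples nested in a list
    property('home', frozen(typeOf('address'))).               // frozen UDT
    create()
```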

Use ifNotExists() to check whether a vertex or edge label already exists before creating it. Vertex and edge labels can include a partition key that determines the partition in which each vertex or edge is stored. To facilitate ordering within a partition for either element, clustering columns can also be specified. The number of vertex and edge labels that can be implemented in a single graph is limited, because each label is stored as a CQL table and Cassandra can practically handle only a limited number of tables.
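
A minimal sketch of these options, assuming hypothetical labels person and meal: ifNotExists() guards each definition, partitionBy() names the partition key, and clusterBy() adds a clustering column.

```groovy
// Hypothetical labels and property names.
schema.vertexLabel('person').
    ifNotExists().                  // skip creation if the label already exists
    partitionBy('person_id', Uuid). // partition key
    property('name', Text).
    create()

schema.vertexLabel('meal').
    ifNotExists().
    partitionBy('type', Text).      // all meals of one type share a partition
    clusterBy('meal_id', Int).      // ordered by meal_id within the partition
    property('name', Text).
    create()

schema.edgeLabel('ate').
    ifNotExists().
    from('person').to('meal').
    property('meal_date', Date).
    create()
```

Because every label becomes a CQL table, this sketch creates three tables: one per vertex label and one for the edge label.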

Indexing plays a key role in graph traversal processing. Three types of indexes can be defined for both vertex and edge labels: materialized view, secondary, and search. See the section on indexing for a more thorough discussion of indexing.
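
The three index types might be declared as in the sketch below. It assumes a vertex label person with properties person_id, name, and nickname already exists; the index names are hypothetical, and the search index additionally assumes that DSE Search is enabled on the cluster.

```groovy
// Materialized view index: a copy of the table keyed by name.
schema.vertexLabel('person').
    materializedView('person_by_name').
    ifNotExists().
    partitionBy('name').
    clusterBy('person_id', Asc).
    create()

// Secondary index on a single property.
schema.vertexLabel('person').
    secondaryIndex('person_2i_by_nickname').
    ifNotExists().
    by('nickname').
    create()

// Search index (requires DSE Search).
schema.vertexLabel('person').
    searchIndex().
    ifNotExists().
    by('name').
    create()
```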

Schema can be added and dropped after initial creation. As with any database, this flexibility is useful during development, but changing schema in production can affect data continuity.
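
For example, schema changes after initial creation could look like the following sketch. The label and property names are hypothetical, and the statements are destructive, so they are best tried on a development graph.

```groovy
// Add a property to an existing vertex label.
schema.vertexLabel('person').addProperty('nickname', Text).alter()

// Drop that property again; its stored values are removed with it.
schema.vertexLabel('person').dropProperty('nickname').alter()

// Drop a single edge label or vertex label, along with its data.
schema.edgeLabel('authored').ifExists().drop()
schema.vertexLabel('person').ifExists().drop()

// Drop all schema and data in the current graph.
schema.drop()
```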