DSE Graph Loader overview
DSE Graph Loader is a customizable, highly tunable command line utility for loading graph datasets into DSE Graph from various input sources. It is not included as part of DataStax Enterprise installations and must be installed separately.
DSE Graph Loader is built to load datasets containing billions (10^9) of vertices and edges. It loads efficiently by using parallel loading and a persistent cache to store vertices, provided that the machine running the program has sufficient resources.
Data can be loaded from CSV files, JSON files, delimited text (CSV with a header line to identify the fields), text parsed by regular expressions, and binary Gryo files. Input files can also be read from distributed filesystems: Hadoop Distributed File System (HDFS) and AWS S3. In addition, DSE Graph Loader supports reading input data directly from a JDBC-compatible database or a Neo4J database. Input files can be compressed or uncompressed. All data can be transformed as it is read, to manipulate the data that is loaded into a graph.
Data from an input source file can be mapped to define vertices or edges, along with properties for both. The mapping script configures loading parameters, defines the input parameters, and identifies the mapping from each input record to a graph element. Both vertex and edge properties can be included in the data that is loaded.
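To make the idea of mapping input records to graph elements more concrete, the following is a minimal conceptual sketch in Python. It is not DSE Graph Loader's mapping script syntax; the helper names map_record and transform, the field names, and the sample data are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical sample input: delimited text with a header line naming the fields.
SAMPLE = "name|gender\nJulia Child|F\nSimone Beck|F\n"


def transform(record):
    """Illustrative transformation applied on read: lowercase the gender field."""
    changed = dict(record)
    changed["gender"] = changed["gender"].lower()
    return changed


def map_record(record, label, key_field):
    """Map one input record (a dict) to a vertex-like dict.

    The key field identifies the vertex; every other field becomes a property.
    """
    properties = {k: v for k, v in record.items() if k != key_field}
    return {"label": label, "key": record[key_field], "properties": properties}


# Read each record, transform it, then map it to a graph element.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
vertices = [map_record(transform(row), label="author", key_field="name")
            for row in reader]
```

In this sketch, each input line yields one vertex whose label is fixed by the mapping and whose remaining fields are carried along as properties. An edge mapping would follow the same pattern, with two key fields identifying the outgoing and incoming vertices.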
DSE Graph Loader processes input data in three stages:
- Preparation
  Reads the entire input data to check for graph schema conformity. Suggests graph schema updates or, if enabled, changes the graph schema. Supplies statistics about how much data will be added to the graph when loaded. The dryrun configuration option can be used to stop the loading process at this stage.
- Vertex loading
  Adds or retrieves all of the vertices in the input data and caches them locally to speed up subsequent edge loading. Vertex validation is enabled unless the data is identified as new data with load_new. If the data is new, validation is not executed, which improves performance.
- Edge and property loading
  Adds all edges and properties from the input data to the graph. Edge validation is enabled unless the data is identified as new data with load_new. If the data is new, validation is not executed, which improves performance. Another method of handling mixed new and existing data is the use of isNew() and exists(). If duplicate edges are required, isNew() must be used to designate those edges as additive to the edges that already exist.
Multiple cardinality input data must have graph schema created prior to data loading.
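The three stages can be pictured with a small Python model. This is a simplified sketch under stated assumptions, not the loader's implementation: the graph is a plain dictionary, vertex ids are sequential integers, and the function names (prepare, load_vertices, load_edges, run) are hypothetical. Only the dryrun behavior of stopping after preparation is taken from the description above.

```python
def prepare(records, schema_labels):
    """Stage 1: scan all input, report labels missing from the schema, count elements."""
    seen = {v["label"] for v in records["vertices"]}
    seen |= {e["label"] for e in records["edges"]}
    return {
        "missing_labels": sorted(seen - schema_labels),
        "vertex_count": len(records["vertices"]),
        "edge_count": len(records["edges"]),
    }


def load_vertices(records, graph):
    """Stage 2: add or retrieve each vertex and cache its id locally."""
    cache = {}
    for v in records["vertices"]:
        key = (v["label"], v["key"])
        if key not in graph["vertices"]:
            # Hypothetical id scheme: next sequential integer.
            graph["vertices"][key] = len(graph["vertices"])
        cache[key] = graph["vertices"][key]
    return cache


def load_edges(records, graph, cache):
    """Stage 3: add edges, resolving both endpoints from the local vertex cache."""
    for e in records["edges"]:
        graph["edges"].append((cache[e["out"]], e["label"], cache[e["in"]]))


def run(records, graph, schema_labels, dryrun=False):
    """Run the stages in order; a dry run stops after preparation."""
    stats = prepare(records, schema_labels)
    if dryrun:
        return stats
    cache = load_vertices(records, graph)
    load_edges(records, graph, cache)
    return stats
```

The point of the cache in this model is that edge loading never has to look up a vertex in the graph itself: every endpoint was resolved once during vertex loading.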
A critical feature to keep in mind when using DSE Graph Loader is the upsert nature of the underlying DSE database. If a vertex already exists, DSE Graph Loader updates the stored data with the new property values depending on the configuration choices made. Configuration can be used to identify if the data loaded is new or will overwrite data that currently exists. Edges will be duplicated if the same edge is loaded multiple times and the edge label is set to the default of multiple cardinality.
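The difference between vertex upserts and edge duplication can be illustrated with a toy in-memory model in Python. This is a sketch of the behavior described above, not of DSE internals; the functions upsert_vertex and add_edge, the single flag standing in for single cardinality, and the sample data are hypothetical.

```python
def upsert_vertex(graph, label, key, properties):
    """Insert the vertex if absent; otherwise overwrite the given property values."""
    stored = graph["vertices"].setdefault((label, key), {})
    stored.update(properties)


def add_edge(graph, out_key, label, in_key, single=False):
    """Append an edge.

    With multiple cardinality (single=False), loading the same edge again
    adds another copy. The single flag is a hypothetical stand-in for an
    edge label defined with single cardinality.
    """
    edge = (out_key, label, in_key)
    if single and edge in graph["edges"]:
        return
    graph["edges"].append(edge)


graph = {"vertices": {}, "edges": []}
author = ("author", "Julia Child")
book = ("book", "Some Cookbook")

# Load the same input twice, as would happen if a load were re-run.
for _ in range(2):
    upsert_vertex(graph, "author", "Julia Child", {"gender": "F"})
    upsert_vertex(graph, "book", "Some Cookbook", {"year": 1961})
    add_edge(graph, author, "authored", book)

vertex_count = len(graph["vertices"])  # vertices are upserted, not duplicated
edge_count = len(graph["edges"])       # edges accumulate under multiple cardinality
```

After the second pass, the model still holds one vertex per key, but the edge list contains the same edge twice, which mirrors why re-running a load against existing data calls for care.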
It is strongly recommended that graph schema be created before loading data using DSE Graph Loader. Without schema, the correct data types for the data are not enforced. Creating indexes greatly speeds up the loading process and is necessary to achieve acceptable loading performance.