Loading TEXT data using regular expressions (regex)

How to use the DSE Graph Loader to load text data using regex.

This page walks through the data mapping script for text data that is parsed using regular expressions (regex). The full script appears at the bottom of the page.

Procedure

  • If desired, add configuration to the mapping script.
  • The input data looks like the following sample:
    SAMPLE INPUT
    // These files use tabs between fields
    // For the authorREGEX.dat file:
    name:Julia Child	gender:F
    // For the bookREGEX.dat file:
    name:Simca's Cuisine: 100 Classic French Recipes for Every Occasion	year:1972	ISBN:0-394-40152-2
    // For the authorBookREGEX.dat file: 
    bname:Simca's Cuisine: 100 Classic French Recipes for Every Occasion	aname:Simone Beck
  • Specify the data input files. The variable inputfiledir specifies the directory name for the input files. Each of the identified files will be used for loading.
    // DATA INPUT
    // Define the data input source 
    // inputfiledir is the directory for the input files
    
    inputfiledir = '/tmp/REGEX/'
    authorInput = File.text(inputfiledir + "authorREGEX.dat").
        regex("name:(.*)\\tgender:([MF])").
        header('name', 'gender')
    bookInput = File.text(inputfiledir + "bookREGEX.dat").
        regex("name:(.*)\\tyear:([0-9]{4})\\tISBN:([0-9]{1}[-]{1}[0-9]{3}[-]{1}[0-9]{5}[-]{1}[0-9]{0,1})").
        header('name', 'year', 'ISBN')
    authorBookInput = File.text(inputfiledir + "authorBookREGEX.dat").
        regex("bname:(.*)\\taname:(.*)").
        header('bname', 'aname')

    Because both vertex labels, author and book, use the property key name, the authorBook file uses the fields aname and bname for the author name and book name, respectively. The mapping logic uses these fields to create the edges between author and book vertices.

  • In each input definition, the file is specified as a text file, the file name is specified, the regex used to parse each line of the text file is given, and a header must be specified to identify the fields that will be read. No delimiter is set in this script; the regex capture groups identify the field values, and the header names are listed in the same order as the capture groups. A map, authorInput, is created that will be used to process the data. The map can be manipulated before loading using transforms.
    authorInput = File.text(inputfiledir + "authorREGEX.dat").regex("name:(.*)\\tgender:([MF])").header('name', 'gender')
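
    To see what the regex and header yield for one line, the pairing of capture groups with header names can be sketched in plain Groovy, outside the loader. The helper parseLine is hypothetical and is not part of DSE Graph Loader; the sketch also assumes that the pattern is applied to the whole line, which this page does not state.

```groovy
import java.util.regex.Pattern

// Hypothetical helper, not part of DSE Graph Loader: pairs each capture
// group of a full-line match with the header name in the same position.
Map<String, String> parseLine(String line, String regex, List<String> header) {
    def matcher = Pattern.compile(regex).matcher(line)
    if (!matcher.matches()) {
        return null  // the line does not fit the pattern
    }
    def record = [:]
    header.eachWithIndex { name, i -> record[name] = matcher.group(i + 1) }
    return record
}

// Sample line from authorREGEX.dat; \t is the tab between the two fields
def sample = "name:Julia Child\tgender:F"
def record = parseLine(sample, "name:(.*)\\tgender:([MF])", ['name', 'gender'])
assert record == [name: 'Julia Child', gender: 'F']
```

    The regex string is written exactly as in the mapping script: inside a double-quoted Groovy string, \\t reaches the regex engine as \t, which matches a tab character.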
  • Create the main body of the mapping script. This part of the mapping script is the same regardless of the file format.
  • To run DSE Graph Loader for text loading as a dry run, use the following command:
    graphloader authorBookMappingREGEX.groovy -graph testREGEX -address localhost -dryrun true

    For testing purposes, the graph specified does not have to exist prior to running graphloader. However, for production applications, the graph and schema should be created prior to using graphloader.
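
    Before a dry run, it can help to check that every input line fits its pattern. The following standalone Groovy sketch is one way to do that; findUnmatchedLines is a hypothetical helper, not a graphloader feature, and it assumes a line counts as valid only when the pattern matches the whole line.

```groovy
import java.util.regex.Pattern

// Hypothetical pre-flight helper, not a graphloader feature: returns the
// 1-based numbers of the lines that the pattern does not fully match.
List<Integer> findUnmatchedLines(List<String> lines, String regex) {
    def pattern = Pattern.compile(regex)
    def unmatched = []
    lines.eachWithIndex { line, i ->
        if (!pattern.matcher(line).matches()) {
            unmatched << (i + 1)
        }
    }
    return unmatched
}

// Same pattern as bookInput in the mapping script
def bookRegex = "name:(.*)\\tyear:([0-9]{4})\\tISBN:([0-9]{1}[-]{1}[0-9]{3}[-]{1}[0-9]{5}[-]{1}[0-9]{0,1})"
def lines = [
    "name:Simca's Cuisine: 100 Classic French Recipes for Every Occasion\tyear:1972\tISBN:0-394-40152-2",
    "name:Broken Row year:1972 ISBN:0-394-40152-2"  // made-up bad row: spaces instead of tabs
]
assert findUnmatchedLines(lines, bookRegex) == [2]
```

    To check a real file, the list of lines could be read with new File('/tmp/REGEX/bookREGEX.dat').readLines() in a plain Groovy script.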

  • The full loading script is shown:
    /* SAMPLE INPUT - uses tabs
    author:
    name:Julia Child	gender:F
    book: 
    name:Simca's Cuisine: 100 Classic French Recipes for Every Occasion	year:1972	ISBN:0-394-40152-2
    authorBook: 
    bname:Simca's Cuisine: 100 Classic French Recipes for Every Occasion	aname:Simone Beck
     */
    
    // CONFIGURATION
    // Configures the data loader to create the schema
    config create_schema: true, load_new: true, load_vertex_threads: 3
    
    // DATA INPUT
    // Define the data input source
    // inputfiledir is the directory for the input files
    inputfiledir = '/tmp/REGEX/'
    authorInput = File.text(inputfiledir + "authorREGEX.dat").
        regex("name:(.*)\\tgender:([MF])").
        header('name', 'gender')
    bookInput = File.text(inputfiledir + "bookREGEX.dat").
        regex("name:(.*)\\tyear:([0-9]{4})\\tISBN:([0-9]{1}[-]{1}[0-9]{3}[-]{1}[0-9]{5}[-]{1}[0-9]{0,1})").
        header('name', 'year', 'ISBN')
    authorBookInput = File.text(inputfiledir + "authorBookREGEX.dat").
        regex("bname:(.*)\\taname:(.*)").
        header('bname', 'aname')
    
    //Specifies what data source to load using which mapper (as defined inline)
      
    load(authorInput).asVertices {
        label "author"
        key "name"
    }
    
    load(bookInput).asVertices {
        label "book"
        key "name"
    }
    
    load(authorBookInput).asEdges {
        label "authored"
        outV "aname", {
            label "author"
            key "name"
        }
        inV "bname", {
            label "book"
            key "name"
        }
    }