Loading data from Hadoop (HDFS)

How to load data from Hadoop (HDFS) using DSE Graph Loader.

This page walks through the data mapping script for loading from HDFS, with explanations. The full script is at the end of the page.

Procedure

If needed, add configuration to the mapping script.
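
As a sketch of this step, the configuration line below is the one used in the full script at the end of this page. The option descriptions in the comments are a hedged reading of the option names and of the script's own comment, not an authoritative reference.

```groovy
// CONFIGURATION
// create_schema: let the loader create schema elements that are missing
// load_new:      assumption: the loaded vertices do not yet exist in the graph
// preparation:   assumption: run a preparation pass over the data before loading
config create_schema: true, load_new: true, preparation: true
```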

Sample CSV data residing on HDFS:

// SAMPLE INPUT 
// For the author.csv file:           
// name|gender                    
// Julia Child|F          
// For the book.csv file:     
// name|year|ISBN                
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:       
// bname|aname                    
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck

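Because the property key name is used for both the author and book vertex labels, the authorBook file uses the variables aname and bname for the author name and the book name, respectively. The mapping logic uses these variables to create the edges between author and book vertices.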

Specify the data input using the HDFS reference dfs_uri and the file names.


// DATA INPUT   
// Define the data input sources
// dfs_uri specifies the URI to the HDFS directory in which the files are stored 

dfs_uri = 'hdfs://hadoopNode:9000/food/'
authorInput = File.csv(dfs_uri + 'author.csv.gz').
    gzip().
    delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv.gz').
    gzip().
    delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').
    gzip().
    delimiter('|')

This example uses compressed files, which require the additional gzip() step.
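
For comparison, if the files on HDFS were not compressed, the gzip() step would presumably be left out. This is a minimal sketch. The uncompressed file names author.csv, book.csv, and authorBook.csv are assumptions for illustration and are not part of the example above.

```groovy
// Assumption: uncompressed copies of the files exist in the same HDFS directory
dfs_uri = 'hdfs://hadoopNode:9000/food/'
authorInput = File.csv(dfs_uri + 'author.csv').
    delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv').
    delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv').
    delimiter('|')
```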

Create the main body of the mapping script. This part of the mapping script is the same regardless of the file format.
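
The body used in the full script at the end of this page is reproduced here as a sketch of this step. The comments are added for explanation. The mapping assumes the input variables authorInput, bookInput, and authorBookInput defined in the data input step.

```groovy
// Load author and book vertices; "name" identifies each vertex
load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

// Load edges: the aname and bname columns of authorBook.csv are matched
// against the "name" key of the author and book vertices
load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}
```
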

To run DSE Graph Loader as a dry run of this load, use the following command:
```
graphloader authorBookMappingHDFS.groovy -graph testHDFS -address localhost -dryrun true
```
For testing purposes, the specified graph does not need to exist before you run graphloader. For production applications, however, create the graph and schema before using graphloader. The -dryrun true option runs the command without loading any data.
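
For the production case, the graph and schema could be created in the Gremlin console before running graphloader. The following is a minimal sketch, not a definitive schema. It assumes the DSE Graph 5.x schema API and the graph name testHDFS from the command above. The property types (Text, Int) and the index name byName are assumptions inferred from the sample data.

```groovy
// Run in the Gremlin console (dse gremlin-console), not through graphloader
system.graph('testHDFS').ifNotExists().create()
:remote config alias g testHDFS.g

// Property keys; the types are assumptions based on the sample CSV rows
schema.propertyKey('name').Text().create()
schema.propertyKey('gender').Text().create()
schema.propertyKey('year').Int().create()
schema.propertyKey('ISBN').Text().create()

// Vertex and edge labels matching the mapping script
schema.vertexLabel('author').properties('name', 'gender').create()
schema.vertexLabel('book').properties('name', 'year', 'ISBN').create()
schema.edgeLabel('authored').connection('author', 'book').create()

// Assumption: index the lookup key so vertices can be found by name during loading
schema.vertexLabel('author').index('byName').materialized().by('name').add()
schema.vertexLabel('book').index('byName').materialized().by('name').add()
```

With a schema created this way, the create_schema option in the mapping script would likely be set to false so that the loader does not alter the schema.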

The full loading script is as follows:

// SAMPLE INPUT
// For the author.csv file:           
// name|gender                    
// Julia Child|F          
// For the book.csv file:     
// name|year|ISBN                
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|1972|0-394-40152-2
// For the authorBook.csv file:       
// bname|aname                    
// Simca's Cuisine: 100 Classic French Recipes for Every Occasion|Simone Beck

// CONFIGURATION  
// Configures the data loader to create the schema 

config create_schema: true, load_new: true, preparation: true

// DATA INPUT        
// Define the data input sources     
// dfs_uri specifies the URI to the HDFS directory in which the files are stored 

dfs_uri = 'hdfs://hadoopNode:9000/food/'
authorInput = File.csv(dfs_uri + 'author.csv.gz').
    gzip().
    delimiter('|')
bookInput = File.csv(dfs_uri + 'book.csv.gz').
    gzip().
    delimiter('|')
authorBookInput = File.csv(dfs_uri + 'authorBook.csv.gz').
    gzip().
    delimiter('|')


// Specifies what data source to load using which mapper (as defined inline)
  
load(authorInput).asVertices {
    label "author"
    key "name"
}

load(bookInput).asVertices {
    label "book"
    key "name"
}

load(authorBookInput).asEdges {
    label "authored"
    outV "aname", {
        label "author"
        key "name"
    }
    inV "bname", {
        label "book"
        key "name"
    }
}