Cassandra options to the import command

Sqoop options for importing data.

The demo created the MySQL data source, npa_nxx_demo, and in its last step migrated the table and its data to a Cassandra table. You run the dse command with Cassandra options from the bin directory of the DataStax Enterprise installation to import the data. For example, in the last few steps of the Sqoop demo, you first delete the npa_nxx.java file in $DSE_HOME/bin.
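
For example, assuming DSE_HOME points to the root of your DataStax Enterprise installation:

cd $DSE_HOME/bin
rm npa_nxx.java

Then run the import command: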

sudo ./dse sqoop import --connect jdbc:mysql://127.0.0.1/npa_nxx_demo \
  --username root \
  --table npa_nxx \
  --cassandra-keyspace newKS \
  --cassandra-column-family npa_nxx_cf \
  --cassandra-row-key npa_nxx_key \
  --cassandra-thrift-host 127.0.0.1 \
  --cassandra-create-schema

The dse command passes the external data source parameters:

  • The connection string for the external data source, which includes the IP address of the server
  • The database username
  • An optional password, not shown in the example (see the sketch after this list)
  • The name of the external data source table to migrate
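
For example, to authenticate with a password, you can add the standard Sqoop -P option, which prompts for the password at run time instead of placing it on the command line (--password <password> is the non-interactive alternative); the remaining options are unchanged:

sudo ./dse sqoop import --connect jdbc:mysql://127.0.0.1/npa_nxx_demo \
  --username root \
  -P \
  ...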

Next, the dse command passes the Cassandra parameters:

  • Name of the new Cassandra keyspace to use
  • Name of the new table object
  • Primary key of the table
  • IP address of the Cassandra node to connect to
  • The cassandra-create-schema parameter, which tells Sqoop to create a new keyspace. You can also use an existing keyspace.

DSE interprets these command options as follows:

--cassandra-row-key <key>

The values of the key column of the source table become the partition keys in Cassandra.

Note: No detection of duplicates is performed, so using a unique key in the source table is strongly recommended to prevent overwriting and losing data.
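
A quick way to check the source table for duplicate key values before importing (a sketch, assuming the mysql command-line client and the demo's root account; drop -p if that account has no password):

mysql -u root -p npa_nxx_demo \
  -e "SELECT npa_nxx_key, COUNT(*) FROM npa_nxx GROUP BY npa_nxx_key HAVING COUNT(*) > 1;"

Any rows returned identify keys that would overwrite one another in Cassandra.
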
--cassandra-thrift-host <cassandra-host(s)>

A comma-separated list of IP addresses identifies Cassandra nodes.
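
For example, to spread the import connections across several nodes (the addresses are hypothetical):

--cassandra-thrift-host 10.10.1.1,10.10.1.2,10.10.1.3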

--cassandra-create-schema

An option that creates a Cassandra keyspace and table during the import operation.

Note: The table is created with no column metadata and all data is imported as strings.

Usage notes

  • Using this option when the keyspace already exists is valid.
  • Using this option when the table already exists causes an error that aborts the import operation.
  • Using this option is required when the keyspace, the table, or both do not exist; otherwise, the import operation is aborted.
  • Using this option creates a table with no column metadata and all data is imported as strings. Do not use this option if you want column metadata.
    Note: To generate column metadata for imported data, create the table yourself first, declaring its columns as string types, and then import the data, as shown in the sketch after this list.
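
A minimal sketch of that workflow, assuming CQL 3 syntax (adjust the statements to the CQL version your DSE release supports) and illustrative column names that you replace with the columns of your source table:

cqlsh 127.0.0.1 <<'EOF'
CREATE KEYSPACE "newKS"
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE "newKS".npa_nxx_cf (
  npa_nxx_key text PRIMARY KEY,  -- same key column as --cassandra-row-key
  npa text,                      -- illustrative; match your source columns
  nxx text
);
EOF

Then run the same dse sqoop import command as before, but omit --cassandra-create-schema.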

About the generated Sqoop JAR file 

After running the dse sqoop import command, a generated Java class appears in the bin directory of the DSE installation. This file, named npa_nxx.java by default after the demo table, encapsulates one row of the imported data; Sqoop compiles it and writes the corresponding JAR file during the import (see the Writing jar file line in the output below). You can specify the class name, the package, and the output directories using standard Sqoop command-line options, as in the sketch below. For more information, see the Sqoop documentation.
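
For example, these standard Sqoop code-generation options set the generated class name and the directories for the generated source and compiled classes (the class name and paths here are hypothetical):

sudo ./dse sqoop import --connect jdbc:mysql://127.0.0.1/npa_nxx_demo \
  --username root \
  --table npa_nxx \
  --class-name NpaNxxRecord \
  --outdir /tmp/sqoop-src \
  --bindir /tmp/sqoop-classes \
  --cassandra-keyspace newKS \
  --cassandra-column-family npa_nxx_cf \
  --cassandra-row-key npa_nxx_key \
  --cassandra-thrift-host 127.0.0.1 \
  --cassandra-create-schema

A --package-name option is also available to place the generated class in a Java package.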

Sqoop output 

Snippets of the output and end result of the import operation look like this:

INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
INFO tool.CodeGenTool: Beginning code generation
INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `npa_nxx` AS t LIMIT 1
INFO orm.CompilationManager: HADOOP_HOME is /Users/robin/dev/dse-2.2/resources/hadoop/bin/..
INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-robin/compile/2e2b8b85fba83ccf1f52a8ee77c3b12f/npa_nxx.jar
 . . .
INFO mapreduce.ImportJobBase: Beginning import of npa_nxx
 . . .
INFO config.DseConfig: Load of settings is done.
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`npa_nxx_key`), MAX(`npa_nxx_key`) FROM `npa_nxx`
WARN db.TextSplitter: Generating splits for a textual index column.
WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
INFO mapred.JobClient: Running job: job_201203051624_0002
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: map 25% reduce 0%
INFO mapred.JobClient: map 50% reduce 0%
INFO mapred.JobClient: map 75% reduce 0%
INFO mapred.JobClient: map 100% reduce 0%

INFO mapred.JobClient: Job complete: job_201203051624_0002
 . . .
INFO mapred.JobClient:        FILE_BYTES_WRITTEN=88472
INFO mapred.JobClient:        CFS_BYTES_READ=587
INFO mapred.JobClient:        File Input Format Counters
INFO mapred.JobClient:        Bytes Read=0
INFO mapred.JobClient:        Map-Reduce Framework
INFO mapred.JobClient:        Map input records=105291
INFO mapred.JobClient:        Spilled Records=0
INFO mapred.JobClient:        Total committed heap usage (bytes)=340000768
INFO mapred.JobClient:        Map output records=105291
INFO mapred.JobClient:        SPLIT_RAW_BYTES=587
INFO mapreduce.ImportJobBase: Transferred 0 bytes in 10.724 seconds (0 bytes/sec)
INFO mapreduce.ImportJobBase: Retrieved 105291 records

In this version of Sqoop, the import runs as a map-only job: there is a Map phase, but no Reduce phase.
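
The TextSplitter warnings appear because this import splits its work on a text key column; as the log advises, you can point Sqoop at an integral column instead with the standard --split-by option, and change the number of parallel map tasks (four by default) with -m (--num-mappers). For example, appended to the import command shown earlier (the column name is a placeholder):

sudo ./dse sqoop import ... --split-by <integral-column> -m 8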