Automating a Sqoop operation

Use the Sqoop metastore to save configuration information for an import or export as a job.

DataStax Enterprise 4.5.2 and later supports a native Cassandra implementation of the Sqoop metastore. You use the Sqoop metastore to store jobs, which are operations that you can run directly from Sqoop, such as an import or export. The native implementation saves the jobs in the sqoop_meta_store table in the dse_system keyspace.

You can save the configuration information for an import or export as a job and run that job from the metastore repeatedly. Typically, you run a saved job to import data incrementally: Sqoop imports only the newest rows.
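Because the native metastore saves jobs in an ordinary CQL table, you can inspect them from cqlsh. The following query is only illustrative; the table is internal to DSE and its schema is not part of the documented API:

$ cqlsh
cqlsh> SELECT * FROM dse_system.sqoop_meta_store;  -- internal table; columns may vary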

Configuring the metastore 

You use the sqoop-site.xml file installed with DataStax Enterprise 4.5.2 to configure the metastore. The default configuration sets up the native Cassandra metastore for use in a development environment. To use the metastore correctly in a working cluster, change the following properties:
Cassandra metastore properties:

  • sqoop.cassandra.host

    A comma-separated list of nodes that the metastore can use to connect to Cassandra. Default: 127.0.0.1

  • sqoop.cassandra.port

    The native protocol port number that the metastore uses to connect to Cassandra. Default: 9042

  • sqoop.job.storage.write.consistency.level

    The consistency level for metastore writes. Default: LOCAL_ONE

  • sqoop.job.storage.read.consistency.level

    The consistency level for metastore reads. Default: LOCAL_ONE

  • sqoop.metastore.client.record.password

    Whether to save passwords with the job. Default: true
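As a sketch, overrides in the sqoop-site.xml file might look like the following. The host list and the LOCAL_QUORUM consistency levels are placeholder values for a hypothetical three-node cluster, not recommended settings:

<!-- sqoop-site.xml: placeholder values for a hypothetical cluster -->
<property>
  <name>sqoop.cassandra.host</name>
  <value>10.10.1.1,10.10.1.2,10.10.1.3</value>
</property>
<property>
  <name>sqoop.job.storage.write.consistency.level</name>
  <value>LOCAL_QUORUM</value>
</property>
<property>
  <name>sqoop.job.storage.read.consistency.level</name>
  <value>LOCAL_QUORUM</value>
</property>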

Job command syntax 

To create and manage a job, use the job tool. The syntax of the job command is:
$ dse sqoop job <option> [jobId] -- <sqoop command>
The following list describes Sqoop job options:
  • dse sqoop job --create <jobId> -- <sqoop commands>

    Creates a new job using the commands given after the '--'.

  • dse sqoop job --list

    Lists available jobs.

  • dse sqoop job --show <jobId>

    Displays information about a job.

  • dse sqoop job --delete <jobId>

    Deletes an existing job.

  • dse sqoop job --exec <jobId>

    Executes a saved job.

Creating a job 

This example creates a job named myjob that imports the Sqoop demo data from the npa_nxx table in the MySQL npa_nxx_demo database into a CQL table named npa_nxx_data in the npa_nxx keyspace:

$ dse sqoop job --create myjob -- cql-import --table npa_nxx \
  --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data \
  --cassandra-column-mapping npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city \
  --connect jdbc:mysql://127.0.0.1/npa_nxx_demo

The following output indicates success. A job named myjob is saved in the DseMetaStore for later execution.

14/09/10 16:58:22 INFO policies.DCAwareRoundRobinPolicy: Using data-center name 'Analytics' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
14/09/10 16:58:22 INFO core.Cluster: New Cassandra host /127.0.0.1:9042 added

Listing jobs 

This example shows how to list the jobs saved in the DseMetaStore:
$ dse sqoop job --list

Available jobs:
myjob

Viewing the job configuration 

This example shows how to view the configuration of a job:
$ dse sqoop job --show myjob 

Job: myjob
Tool: cql-import
Options:
----------------------------
verbose = false
db.connect.string = jdbc:mysql://127.0.0.1/npa_nxx_demo
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = false
hdfs.append.dir = false
db.table = npa_nxx
import.fetch.size = null
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
import.direct.split.size = 0
export.new.update = UpdateOnly
codegen.output.delimiters.field = 44
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-root/compile/498dc667d886a4c710b70c00624935de
direct.import = false
hive.fail.table.exists = false
db.batch = false
mapred.used.genericoptionsparser = true
sqoop.cassandra.keyspace = npa_nxx
sqoop.cassandra.column.family = npa_nxx_data
sqoop.cassandra.column.mapping = npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city
sqoop.cassandra.tool = cql-import

Running a job 

This example assumes that you have truncated the npa_nxx.npa_nxx_data table using cqlsh. The following command runs the saved job.
$ dse sqoop job --exec myjob -- --username someuser -P 
Enter password: ...   

MapReduce runs the saved job.
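For reference, truncating the target table from an interactive cqlsh session might look like this sketch (connection options omitted):

$ cqlsh
cqlsh> TRUNCATE npa_nxx.npa_nxx_data;  -- clears the rows imported by the earlier example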

Saved jobs and passwords 

DataStax recommends using the --username and -P options on the command line as shown in the example of running a job. Because multiple users can access the DseMetaStore, it does not store passwords. You can set the sqoop.metastore.client.record.password option to true in the sqoop-site.xml to make the password prompt appear each time you create a job that requires a password. No prompting occurs when you run show or exec.

For security reasons, configuring these parameters in the sqoop-site.xml is not recommended:
  • sqoop.metastore.client.autoconnect.username
  • sqoop.metastore.client.autoconnect.password

Importing data incrementally 

To import data in increments, you use the --incremental argument with the import command. Sqoop compares the values in a check column against a reference value from the most recent import. For example, these arguments import all rows having an id greater than 100 (see the sketch after this list):

  • --incremental append
  • --check-column id
  • --last-value 100
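As a sketch, you could save such an incremental import as a job. The job name myincjob is hypothetical, and the check column id assumes the source table has such a column; the other values reuse the npa_nxx_demo example above:

$ # myincjob and the check column "id" are illustrative assumptions
$ dse sqoop job --create myincjob -- cql-import --table npa_nxx \
  --incremental append --check-column id --last-value 100 \
  --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data \
  --connect jdbc:mysql://127.0.0.1/npa_nxx_demo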

If you run an incremental import from the command line, Sqoop prints the last value so that you can supply it in a subsequent incremental import. If you run an incremental import from a saved job, Sqoop retains the last value in the saved job. To import only rows newer than those previously imported, run the saved job with the --exec <jobId> option. Sqoop imports only rows having an id greater than the retained last value.
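For example, each run of the hypothetical myincjob job defined above picks up where the previous run left off:

$ dse sqoop job --exec myincjob -- --username someuser -P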