Automating a Sqoop operation (deprecated)

Use the Sqoop metastore to save configuration information for an import or export as a job.

Note: DSE Hadoop and BYOH (Bring Your Own Hadoop) are deprecated for use with DataStax Enterprise. Sqoop is also deprecated and will be removed when Hadoop is removed.

DataStax Enterprise supports a native Cassandra implementation of the Sqoop metastore. You use the Sqoop metastore to store jobs, which are operations that you can run directly from Sqoop, such as an import or export. The native implementation saves the jobs in the sqoop_meta_store table in the dse_system keyspace.

You can save configuration information for an import or export as a job and run the job from the metastore repeatedly. You typically run the job from the metastore to incrementally import data. Sqoop imports only the newest rows.

Configuring the metastore

You use the sqoop-site.xml file that is installed with DataStax Enterprise to configure the metastore. The default configuration sets up the native Cassandra metastore for use in a development environment. To use the metastore correctly in a working cluster, change the following properties:

Table 1. Cassandra metastore properties

  Property                                  Description                                                    Default
  sqoop.cassandra.host                      A comma-separated list of nodes that the metastore can use
                                            to connect to Cassandra
  sqoop.cassandra.port                      The native protocol port number that the metastore uses to     9042
                                            connect to Cassandra
                                            The consistency level for metastore writes                     LOCAL_ONE
                                            The consistency level for metastore reads                      LOCAL_ONE
  sqoop.metastore.client.record.password    Save passwords with the job                                    true
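For example, to override the defaults for the port and password-recording properties, add entries like the following to sqoop-site.xml. This is a minimal sketch that assumes the standard Hadoop-style XML configuration format that sqoop-site.xml uses; adjust the values for your cluster:

  <property>
    <name>sqoop.cassandra.port</name>
    <value>9042</value>
  </property>
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
  </property>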

Job command syntax

To create and manage a job, use the job tool. The syntax of the job command is:
$ dse sqoop job option [jobId] -- sqoop_commands
The following list describes Sqoop job options:
  • dse sqoop job --create jobId -- sqoop_commands

    Creates a new job using the commands given after the '--'.

  • dse sqoop job --list

    Lists available jobs.

  • dse sqoop job --show jobId

    Displays information about a job.

  • dse sqoop job --delete jobId

    Deletes an existing job.

  • dse sqoop job --exec jobId

    Executes a saved job.

Creating a job

This example creates a job named myjob that imports the Sqoop demo data from the npa_nxx table in the MySQL npa_nxx_demo database into a CQL table named npa_nxx_data in the npa_nxx keyspace in Cassandra:

$ dse sqoop job --create myjob -- cql-import --table npa_nxx --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data --cassandra-column-mapping npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city --connect

The following output indicates success. A job named myjob is saved in the DseMetaStore for execution later.

14/09/10 16:58:22 INFO policies.DCAwareRoundRobinPolicy: Using data-center name 'Analytics' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
14/09/10 16:58:22 INFO core.Cluster: New Cassandra host / added
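A complete form of the create command supplies a JDBC URL to --connect. In the following sketch, the MySQL host name is a hypothetical placeholder; replace it with the address of your MySQL server:

$ dse sqoop job --create myjob -- cql-import --table npa_nxx \
    --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data \
    --cassandra-column-mapping npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city \
    --connect jdbc:mysql://<mysql_host>/npa_nxx_demo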

Listing a job

This example shows how to list the jobs saved in the DseMetaStore:
$ dse sqoop job --list

Available jobs:
  myjob

Viewing the job configuration

This example shows how to view the configuration of a job:
$ dse sqoop job --show myjob 

Job: myjob
Tool: cql-import
verbose = false
db.connect.string = jdbc:mysql://
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = false
hdfs.append.dir = false
db.table = npa_nxx
import.fetch.size = null
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
import.direct.split.size = 0
export.new.update = UpdateOnly
codegen.output.delimiters.field = 44
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-root/compile/498dc667d886a4c710b70c00624935de
direct.import = false
hive.fail.table.exists = false
db.batch = false
mapred.used.genericoptionsparser = true
sqoop.cassandra.keyspace = npa_nxx
sqoop.cassandra.table = npa_nxx_data
sqoop.cassandra.column.mapping = npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city
sqoop.cassandra.tool = cql-import

Running a job

This example assumes that you have truncated the npa_nxx.npa_nxx_data table using cqlsh. The following command runs the saved job.
$ dse sqoop job --exec myjob -- --username someuser -P 
Enter password: ...   

MapReduce runs the saved job.

Saved jobs and passwords

DataStax recommends using the --username and -P options on the command line, as shown in the example of running a job. Because multiple users can access the DseMetaStore, it does not store passwords. Setting the sqoop.metastore.client.record.password option to true in sqoop-site.xml makes the password prompt appear each time you create a job that requires a password. No prompt appears when you run the --show or --exec options.

For security reasons, DataStax recommends against configuring these parameters in sqoop-site.xml:
  • sqoop.metastore.client.autoconnect.username
  • sqoop.metastore.client.autoconnect.password

Importing data incrementally

To import data incrementally, use the --incremental argument with the import command. Sqoop compares the values in a check column against a reference value from the most recent import. For example, the following arguments import all rows that have an id greater than 100:

  • --incremental append
  • --check-column id
  • --last-value 100
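Combined in a saved job, these arguments might look like the following sketch. The job name incremental_import and the MySQL host are hypothetical placeholders:

$ dse sqoop job --create incremental_import -- cql-import \
    --connect jdbc:mysql://<mysql_host>/npa_nxx_demo \
    --table npa_nxx \
    --cassandra-keyspace npa_nxx \
    --cassandra-table npa_nxx_data \
    --incremental append \
    --check-column id \
    --last-value 100

Each time you run this job with dse sqoop job --exec incremental_import, Sqoop updates the stored last value, so subsequent runs import only rows added since the previous run.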

If you run an incremental import from the command line, Sqoop prints the last value so that you can supply it to a subsequent incremental import. If you run an incremental import from a saved job, Sqoop retains the last value in the saved job. To import only rows newer than those previously imported, run the saved job again with the --exec option; Sqoop imports only rows whose check column value is greater than the retained last value.