Automating a Sqoop operation (deprecated)

Use the Sqoop metastore to save configuration information for an import or export as a job.

Note: DSE Hadoop and BYOH (Bring Your Own Hadoop) are deprecated for use with DataStax Enterprise. Sqoop is also deprecated and will be removed when Hadoop is removed.

DataStax Enterprise supports a native Cassandra implementation of the Sqoop metastore. You use the Sqoop metastore to store jobs, which are operations that you can run directly from Sqoop, such as an import or export. The native implementation saves the jobs in the sqoop_meta_store table in the dse_system keyspace.

You can save configuration information for an import or export as a job and run the job from the metastore repeatedly. You typically run the job from the metastore to incrementally import data. Sqoop imports only the newest rows.

Configuring the metastore

You use the sqoop-site.xml file that is installed with DataStax Enterprise to configure the metastore. The default configuration sets up the native Cassandra metastore for use in a development environment. To use the metastore correctly in a working cluster, change the following properties:

Table 1. Cassandra metastore properties

  Property                                  Description                                                    Default
  sqoop.cassandra.host                      A comma-separated list of nodes that the metastore can use
                                            to connect to Cassandra
  sqoop.cassandra.port                      The native protocol port number that the metastore uses to     9042
                                            connect to Cassandra
                                            The consistency level for metastore writes                     LOCAL_ONE
                                            The consistency level for metastore reads                      LOCAL_ONE
  sqoop.metastore.client.record.password    Save passwords with the job                                    true
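For example, to override the defaults for the port and password-recording properties, add entries like the following to sqoop-site.xml. This is a minimal sketch that assumes the standard Hadoop-style XML configuration format that sqoop-site.xml uses; adjust the values for your cluster:

  <property>
    <name>sqoop.cassandra.port</name>
    <value>9042</value>
  </property>
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
  </property>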

Job command syntax

To create and manage a job, use the job tool. The syntax of the job command is:
$ dse sqoop job option [jobId] -- sqoop_commands
The following list describes Sqoop job options:
  • dse sqoop job --create jobId -- sqoop_commands

    Creates a new job using the commands given after the '--'.

  • dse sqoop job --list

    Lists available jobs.

  • dse sqoop job --show jobId

    Displays information about a job.

  • dse sqoop job --delete jobId

    Deletes an existing job.

  • dse sqoop job --exec jobId

    Executes a saved job.

Creating a job

This example creates a job named myjob that imports the Sqoop demo data from the npa_nxx table in the MySQL npa_nxx_demo database into a CQL table named npa_nxx_data in the npa_nxx keyspace in Cassandra:

$ dse sqoop job --create myjob -- cql-import --table npa_nxx --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data --cassandra-column-mapping npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city --connect

The following output indicates success. A job named myjob is saved in the DseMetaStore for execution later.

14/09/10 16:58:22 INFO policies.DCAwareRoundRobinPolicy: Using data-center name 'Analytics' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
14/09/10 16:58:22 INFO core.Cluster: New Cassandra host / added
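A complete form of the create command supplies a JDBC URL to --connect. In the following sketch, the MySQL host name is a hypothetical placeholder; replace it with the address of your MySQL server:

$ dse sqoop job --create myjob -- cql-import --table npa_nxx \
    --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data \
    --cassandra-column-mapping npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city \
    --connect jdbc:mysql://<mysql_host>/npa_nxx_demo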

Listing a job

This example shows how to list the jobs saved in the DseMetaStore:
$ dse sqoop job --list

Available jobs:
  myjob

Viewing the job configuration

This example shows how to view the configuration of a job:
$ dse sqoop job --show myjob 

Job: myjob
Tool: cql-import
verbose = false
db.connect.string = jdbc:mysql://
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = false
hdfs.append.dir = false
db.table = npa_nxx
import.fetch.size = null
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
import.direct.split.size = 0
export.new.update = UpdateOnly
codegen.output.delimiters.field = 44
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-root/compile/498dc667d886a4c710b70c00624935de
direct.import = false
hive.fail.table.exists = false
db.batch = false
mapred.used.genericoptionsparser = true
sqoop.cassandra.keyspace = npa_nxx
sqoop.cassandra.table = npa_nxx_data
sqoop.cassandra.column.mapping = npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city
sqoop.cassandra.tool = cql-import

Running a job

This example assumes that you have truncated the npa_nxx.npa_nxx_data table using cqlsh. The following command runs the saved job.
$ dse sqoop job --exec myjob -- --username someuser -P 
Enter password: ...   

MapReduce runs the saved job.

Saved jobs and passwords

DataStax recommends using the --username and -P options on the command line, as shown in the example of running a job. Because multiple users can access the DseMetaStore, it does not store passwords. Setting the sqoop.metastore.client.record.password option to true in sqoop-site.xml makes the password prompt appear each time you create a job that requires a password. No prompt appears when you run the --show or --exec options.

For security reasons, DataStax recommends against configuring these parameters in sqoop-site.xml:
  • sqoop.metastore.client.autoconnect.username
  • sqoop.metastore.client.autoconnect.password

Importing data incrementally

To import data incrementally, use the --incremental argument with the import command. Sqoop compares the values in a check column against a reference value from the most recent import. For example, the following arguments import all rows that have an id greater than 100:

  • --incremental append
  • --check-column id
  • --last-value 100
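Combined in a saved job, these arguments might look like the following sketch. The job name incremental_import and the MySQL host are hypothetical placeholders:

$ dse sqoop job --create incremental_import -- cql-import \
    --connect jdbc:mysql://<mysql_host>/npa_nxx_demo \
    --table npa_nxx \
    --cassandra-keyspace npa_nxx \
    --cassandra-table npa_nxx_data \
    --incremental append \
    --check-column id \
    --last-value 100

Each time you run this job with dse sqoop job --exec incremental_import, Sqoop updates the stored last value, so subsequent runs import only rows added since the previous run.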

If you run an incremental import from the command line, Sqoop prints the last value so that you can supply it to a subsequent incremental import. If you run an incremental import from a saved job, Sqoop retains the last value in the saved job. To import only rows newer than those previously imported, run the saved job again with the --exec option; Sqoop imports only rows whose check column value is greater than the retained last value.