Automating a Sqoop operation
Use the Sqoop metastore to save configuration information for an import or export as a job.
DataStax Enterprise supports a native Cassandra implementation of the Sqoop metastore. You use the Sqoop metastore to store jobs, which are operations that you can run directly from Sqoop, such as an import or export. The native implementation saves the jobs in the sqoop_meta_store table in the dse_system keyspace.
You can save configuration information for an import or export as a job and run the job from the metastore repeatedly. You typically run the job from the metastore to incrementally import data. Sqoop imports only the newest rows.
Configuring the metastore
Property | Description | Default
---|---|---
sqoop.cassandra.host | A comma-separated list of nodes that the metastore can use to connect to Cassandra | 127.0.0.1
sqoop.cassandra.port | The native protocol port number that the metastore uses to connect to Cassandra | 9042
sqoop.job.storage.write.consistency.level | The consistency level for metastore writes | LOCAL_ONE
sqoop.job.storage.read.consistency.level | The consistency level for metastore reads | LOCAL_ONE
sqoop.metastore.client.record.password | Save passwords with the job | true
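The properties above follow the standard Hadoop configuration format and can be set in sqoop-site.xml. A minimal sketch, using example values rather than recommendations (the two-node host list and QUORUM level are hypothetical):

```xml
<!-- Hypothetical sqoop-site.xml fragment. Property names come from the
     table above; the values shown are examples, not defaults. -->
<configuration>
  <property>
    <name>sqoop.cassandra.host</name>
    <value>10.0.0.1,10.0.0.2</value>
  </property>
  <property>
    <name>sqoop.cassandra.port</name>
    <value>9042</value>
  </property>
  <property>
    <name>sqoop.job.storage.write.consistency.level</name>
    <value>QUORUM</value>
  </property>
</configuration>
```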
Job command syntax
To create and manage a job, use the job tool. The syntax of the job command is:
$ dse sqoop job option [jobId] -- sqoop_commands
The following list describes the Sqoop job options:
- dse sqoop job --create jobId -- sqoop_commands
Creates a new job using the commands given after the '--'.
- dse sqoop job --list
Lists available jobs.
- dse sqoop job --show jobId
Displays information about a job.
- dse sqoop job --delete jobId
Deletes an existing job.
- dse sqoop job --exec jobId
Executes a saved job.
Creating a job
This example creates a job named myjob that imports the Sqoop demo data from the MySQL npa_nxx_demo database into a CQL table named npa_nxx in Cassandra:
$ dse sqoop job --create myjob -- cql-import --table npa_nxx --cassandra-keyspace npa_nxx --cassandra-table npa_nxx_data --cassandra-column-mapping npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city --connect jdbc:mysql://127.0.0.1/npa_nxx_demo
The following output indicates success. A job named myjob is saved in the DseMetaStore for execution later.
14/09/10 16:58:22 INFO policies.DCAwareRoundRobinPolicy: Using data-center name 'Analytics' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
14/09/10 16:58:22 INFO core.Cluster: New Cassandra host /127.0.0.1:9042 added
Listing a job
$ dse sqoop job --list
Available jobs:
myjob
Viewing the job configuration
$ dse sqoop job --show myjob
Job: myjob
Tool: cql-import
Options:
----------------------------
verbose = false
db.connect.string = jdbc:mysql://127.0.0.1/npa_nxx_demo
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = false
hdfs.append.dir = false
db.table = npa_nxx
import.fetch.size = null
codegen.input.delimiters.escape = 0
codegen.input.delimiters.enclose.required = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hcatalog.create.table = false
db.clear.staging.table = false
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
mapreduce.num.mappers = 4
import.direct.split.size = 0
export.new.update = UpdateOnly
codegen.output.delimiters.field = 44
incremental.mode = None
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-root/compile/498dc667d886a4c710b70c00624935de
direct.import = false
hive.fail.table.exists = false
db.batch = false
mapred.used.genericoptionsparser = true
sqoop.cassandra.keyspace = npa_nxx
sqoop.cassandra.column.family = npa_nxx_data
sqoop.cassandra.column.mapping = npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city
sqoop.cassandra.tool = cql-import
Running a job
$ dse sqoop job --exec myjob -- --username someuser -P
Enter password: ...
MapReduce runs the saved job.
Saved jobs and passwords
Because multiple users can access the DseMetaStore, it does not store passwords. DataStax recommends supplying credentials on the command line with the --username and -P options, as shown in the example of running a job. You can set the sqoop.metastore.client.record.password option to true in sqoop-site.xml to make the password prompt appear each time you create a job that requires a password. No prompting occurs when you run --show or --exec. Alternatively, set these properties in sqoop-site.xml:
- sqoop.metastore.client.autoconnect.username
- sqoop.metastore.client.autoconnect.password
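A sketch of setting the autoconnect properties in sqoop-site.xml (the username and password values are placeholders; storing a plaintext password in a configuration file is a trade-off against being prompted each time):

```xml
<!-- Hypothetical sqoop-site.xml fragment with placeholder credentials. -->
<configuration>
  <property>
    <name>sqoop.metastore.client.autoconnect.username</name>
    <value>someuser</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.password</name>
    <value>example-password</value>
  </property>
</configuration>
```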
Importing data incrementally
To import data in increments, you use the --incremental argument with the import command. Sqoop compares the values in a check column against a reference value from the most recent import. For example, these arguments import all rows having an id greater than 100:
- --incremental append
- --check-column id
- --last-value 100
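Putting the arguments together, a sketch of saving and running an incremental import as a job. The job name incrementaljob and the check column id are hypothetical; the table, keyspace, and connection string reuse the earlier example:

```
# Hypothetical example: save an incremental cql-import as a job.
# On each --exec, Sqoop imports only rows whose id exceeds the retained
# last value, then updates that value in the saved job.
$ dse sqoop job --create incrementaljob -- cql-import \
    --table npa_nxx \
    --cassandra-keyspace npa_nxx \
    --cassandra-table npa_nxx_data \
    --incremental append \
    --check-column id \
    --last-value 100 \
    --connect jdbc:mysql://127.0.0.1/npa_nxx_demo

$ dse sqoop job --exec incrementaljob -- --username someuser -P
```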
If you run an incremental import from the command line, Sqoop prints the last value, which you must supply with --last-value in a subsequent incremental import. If you run an incremental import from a saved job, Sqoop retains the last value in the saved job. To import only rows newer than those previously imported, rerun the saved job with the --exec jobId option. Sqoop imports only rows whose check column value is greater than the retained last value.