Running the Sqoop demo
In this demo, you run SQL commands to put the data from a CSV file into a MySQL table in a MySQL database. Then import the SQL data from MySQL to a CQL table in Cassandra.
The Sqoop demo uses the MySQL database and data from the North American Numbering Plan. This data consists of the area-code (NPA) and telephone number (Nxx) for the USA and Canada. The demo runs SQL commands to put the data from a CSV file into a MySQL table in a MySQL database. You then import the SQL data from MySQL to a CQL table in Cassandra. The following steps show running the SQL commands from the mysql command line. Alternatively, you can run the commands on the operating system command line described below. The demo exports the data from MySQL, and then uses a subset of Sqoop commands to import the data into a CQL table.
Prerequisites
To run the demo, you need:
- Latest version of Oracle Java SE Development Kit (JDK) 7. The JRE alone does not work.
- An installation of MySQL
- Sufficient MySQL database privileges to create database objects
- A JDBC driver for MySQL in the directory specified by the following demo procedure
- The connection string that is appropriate for the JDBC driver
- A DataStax Enterprise Analytics node
- A PATH environment variable that includes the bin directory of the DSE installation
To import data to CQL, the keyspace and CQL table must exist prior to the importation. If the CQL table contains data prior to the importation, cql-import updates the data.
Procedure
- Install MySQL and download the JDBC driver for MySQL from the MySQL site.
-
Copy the JDBC driver for MySQL to the Sqoop library.
The default location of the Sqoop library depends on the type of installation:
Installer-Services /usr/share/dse/resources/sqoop/lib Package installations /usr/share/dse/sqoop/lib Installer-No Services and Tarball installations install_location/resources/sqoop/lib -
Start DataStax Enterprise as an analytics node. For example:
- Installer-Services and Package installations:
- Set HADOOP_ENABLED=1 in /etc/default/dse.
- Start an analytics node:
$ sudo service dse start
- Installer-No Services and Tarball installations:
$ install_location/bin/dse cassandra -t
- Installer-Services and Package installations:
-
Start MySQL and create the demo database:
mysql> CREATE DATABASE npa_nxx_demo ;
-
Connect to the database and create the table:
mysql> CONNECT npa_nxx_demo; mysql> CREATE TABLE npa_nxx ( npa_nxx_key int(11) NOT NULL, npa int(11) DEFAULT NULL, nxx int(11) DEFAULT NULL, lat float DEFAULT NULL, lon float DEFAULT NULL, linetype char(1) DEFAULT NULL, state varchar(2) DEFAULT NULL, city varchar(36) DEFAULT NULL, PRIMARY KEY (npa_nxx_key) ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-
Locate the npa_nxx_csv file of the DataStax Enterprise
installation.
The default location of the Sqoop demo depends on the type of installation:
Installer-Services and Package installations /usr/share/dse/demos/sqoop Installer-No Services and Tarball installations install_location/demos/sqoop -
Populate the table by loading the CSV file:
mysql> LOAD DATA LOCAL INFILE 'npa_nxx.csv' INTO TABLE npa_nxx_demo.npa_nxx FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
Output is:
Query OK, 105291 rows affected (1.01 sec) Records: 105291 Deleted: 0 Skipped: 0 Warnings: 0
-
On the analytics node you started in step 3, create a CQL keyspace and table that maps to the SQL table.
Use compatible data types. For
example, start cqlsh and run these commands:
cqlsh> CREATE KEYSPACE npa_nxx WITH REPLICATION = {'class':'NetworkTopologyStrategy', 'Analytics':1}; cqlsh> CREATE TABLE npa_nxx.npa_nxx_data (npa int, nxx int, latitude float, longitude float, state text, city text, PRIMARY KEY(npa, nxx));
Alternatively, you can run the commands on the operating system command line from a cql script in the demos/sqoop directory.
The default location of the Sqoop demo depends on the type of installation:Installer-Services and Package installations /usr/share/dse/demos/sqoop Installer-No Services and Tarball installations install_location/demos/sqoop -
In a text editor, open the import.options file in the
demos/sqoop directory.
The default location of the Sqoop demo depends on the type of installation:
Installer-Services and Package installations /usr/share/dse/demos/sqoop Installer-No Services and Tarball installations install_location/demos/sqoop The import.options file contains these options:
import.options file Contents Description cql-import Perform an import operation. --table A SQL table name follows this option. npa_nxx SQL table name for the demo. --cassandra-keyspace A keyspace name follows this option. npa_nxx The keyspace name for the demo. --cassandra-table A Cassandra table name follows this option. npa_nxx_data The Cassandra table name for the demo. --cassandra-column-mapping A CQL:SQL column mapping follows this option. npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city The Cassandra column names:corresponding MySQL column names, cql1:sql1,cql2:sql2, . . . --connect The JDBC connection string follows this option. jdbc:mysql://<mysql_host>/npa_nxx_demo The JDBC connection string. --username A MySQL user name follows this option. <mysql_user> The user name you configured as the MySQL admin. --password A MySQL password follows this option. <mysql_password> The MySQL administrative password you configured. --cassandra-host The IP address of the MySQL host node follows this option. <cassandra_host> The IP address of the host node. For example, 127.0.0.1. A fully-qualified domain name if using Kerberos. Alternatively, you can enter these commands as options to thedse sqoop command
to import the data from the SQL to the CQL table as shown in the example of exporting data. -
Modify the import.options file for your environment. For
example, assuming you plan to run the demo on a single-node cluster, modify the
options as follows:
- --connect
FROM: jdbc:mysql://<mysql_host>/npa_nxx_demo
TO: jdbc:mysql://127.0.0.1/npa_nxx_demo
- --username
FROM: mysql_user
TO: your MySQL user name
- --password
FROM: mysql_password
TO: your MySQL password
- --cassandra-host
FROM: cassandra_host
TO: 127.0.0.1
- --connect
-
Import the SQL data into Cassandra using the file you edited. Use the
dse import
command to import the data from the MySQL table to the table in Cassandra. On Linux, for example:$ bin/dse sqoop --options-file fully_qualified_path/demos/sqoop/import.options
The MapReduce job runs and the end of the output looks like this:. . . 14/05/23 14:41:17 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 50.5956 seconds (0 bytes/sec) 14/05/23 14:41:17 INFO mapreduce.ImportJobBase: Retrieved 105291 records.
-
In cqlsh, verify that the data import succeeded.
cqlsh> SELECT * FROM npa_nxx.npa_nxx_data LIMIT 5;
npa | nxx | city | latitude | longitude | state -----+-----+-------------+----------+-----------+------- 660 | 200 | Braymer | 39.59 | 93.8 | MO 660 | 202 | Sedalia | 38.7 | 93.22 | MO 660 | 213 | La Belle | 40.11 | 91.91 | MO 660 | 214 | Chillicothe | 39.79 | 93.55 | MO 660 | 215 | Maryville | 40.34 | 94.87 | MO