Running the Sqoop demo

In this demo, you run SQL commands to put the data from a CSV file into a MySQL table in a MySQL database. Then import the SQL data from MySQL to a CQL table in Cassandra.

The Sqoop demo uses the MySQL database and data from the North American Numbering Plan. This data consists of the area-code (NPA) and telephone number (Nxx) for the USA and Canada. The demo runs SQL commands to put the data from a CSV file into a MySQL table in a MySQL database. You then import the SQL data from MySQL to a CQL table in Cassandra. The following steps show running the SQL commands from the mysql command line. Alternatively, you can run the commands on the operating system command line described below. The demo exports the data from MySQL, and then uses a subset of Sqoop commands to import the data into a CQL table.

Prerequisites

To run the demo, you need:

  • Latest version of Oracle Java SE Development Kit (JDK) 7. The JRE alone does not work.
  • An installation of MySQL
  • Sufficient MySQL database privileges to create database objects
  • A JDBC driver for MySQL in the directory specified by the following demo procedure
  • The connection string that is appropriate for the JDBC driver
  • A DataStax Enterprise Analytics node
  • A PATH environment variable that includes the bin directory of the DSE installation

To import data to CQL, the keyspace and CQL table must exist prior to the importation. If the CQL table contains data prior to the importation, cql-import updates the data.

Procedure

  1. Install MySQL and download the JDBC driver for MySQL from the MySQL site.
  2. Copy the JDBC driver for MySQL to the Sqoop library.
    The default location of the Sqoop library depends on the type of installation:
    Installer-Services /usr/share/dse/resources/sqoop/lib
    Package installations /usr/share/dse/sqoop/lib
    Installer-No Services and Tarball installations install_location/resources/sqoop/lib
  3. Start DataStax Enterprise as an analytics node. For example:
    • Installer-Services and Package installations:
      1. Set HADOOP_ENABLED=1 in /etc/default/dse.
      2. Start an analytics node:
        $ sudo service dse start
    • Installer-No Services and Tarball installations:
      $ install_location/bin/dse cassandra -t
  4. Start MySQL and create the demo database:
    mysql> CREATE DATABASE npa_nxx_demo ;
  5. Connect to the database and create the table:
    mysql> CONNECT npa_nxx_demo;
    
    mysql> CREATE TABLE npa_nxx (
             npa_nxx_key int(11) NOT NULL,
             npa          int(11)  DEFAULT NULL,
             nxx          int(11)  DEFAULT NULL,
             lat          float  DEFAULT NULL,
             lon          float  DEFAULT NULL,
             linetype     char(1)  DEFAULT NULL,
             state       varchar(2)  DEFAULT NULL,
             city        varchar(36) DEFAULT NULL,
             PRIMARY KEY (npa_nxx_key)
           ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
  6. Locate the npa_nxx_csv file of the DataStax Enterprise installation.
    The default location of the Sqoop demo depends on the type of installation:
    Installer-Services and Package installations /usr/share/dse/demos/sqoop
    Installer-No Services and Tarball installations install_location/demos/sqoop
  7. Populate the table by loading the CSV file:
    mysql> LOAD DATA LOCAL INFILE 'npa_nxx.csv'
             INTO TABLE npa_nxx_demo.npa_nxx
             FIELDS TERMINATED BY ','
             ENCLOSED BY '"'
             LINES TERMINATED BY '\n';

    Output is:

    Query OK, 105291 rows affected (1.01 sec)
    Records: 105291 Deleted: 0 Skipped: 0 Warnings: 0
  8. On the analytics node you started in step 3, create a CQL keyspace and table that maps to the SQL table. Use compatible data types. For example, start cqlsh and run these commands:
    cqlsh> CREATE KEYSPACE npa_nxx WITH REPLICATION = 
             {'class':'NetworkTopologyStrategy', 'Analytics':1};
    
    cqlsh> CREATE TABLE npa_nxx.npa_nxx_data (npa int, nxx int, 
             latitude float, longitude float, state text, city text, 
             PRIMARY KEY(npa, nxx));

    Alternatively, you can run the commands on the operating system command line from a cql script in the demos/sqoop directory.

    The default location of the Sqoop demo depends on the type of installation:
    Installer-Services and Package installations /usr/share/dse/demos/sqoop
    Installer-No Services and Tarball installations install_location/demos/sqoop
  9. In a text editor, open the import.options file in the demos/sqoop directory.
    The default location of the Sqoop demo depends on the type of installation:
    Installer-Services and Package installations /usr/share/dse/demos/sqoop
    Installer-No Services and Tarball installations install_location/demos/sqoop

    The import.options file contains these options:

    import.options file
    Contents Description
    cql-import Perform an import operation.
    --table A SQL table name follows this option.
    npa_nxx SQL table name for the demo.
    --cassandra-keyspace A keyspace name follows this option.
    npa_nxx The keyspace name for the demo.
    --cassandra-table A Cassandra table name follows this option.
    npa_nxx_data The Cassandra table name for the demo.
    --cassandra-column-mapping A CQL:SQL column mapping follows this option.
    npa:npa,nxx:nxx,latitude:lat,longitude:lon,state:state,city:city The Cassandra column names:corresponding MySQL column names, cql1:sql1,cql2:sql2, . . .
    --connect The JDBC connection string follows this option.
    jdbc:mysql://<mysql_host>/npa_nxx_demo The JDBC connection string.
    --username A MySQL user name follows this option.
    <mysql_user> The user name you configured as the MySQL admin.
    --password A MySQL password follows this option.
    <mysql_password> The MySQL administrative password you configured.
    --cassandra-host The IP address of the MySQL host node follows this option.
    <cassandra_host> The IP address of the host node. For example, 127.0.0.1. A fully-qualified domain name if using Kerberos.
    Alternatively, you can enter these commands as options to the dse sqoop command to import the data from the SQL to the CQL table as shown in the example of exporting data.
  10. Modify the import.options file for your environment. For example, assuming you plan to run the demo on a single-node cluster, modify the options as follows:
    • --connect

      FROM: jdbc:mysql://<mysql_host>/npa_nxx_demo

      TO: jdbc:mysql://127.0.0.1/npa_nxx_demo

    • --username

      FROM: mysql_user

      TO: your MySQL user name

    • --password

      FROM: mysql_password

      TO: your MySQL password

    • --cassandra-host

      FROM: cassandra_host

      TO: 127.0.0.1

    As described in the Sqoop reference, you can list multiple IP addresses.
  11. Import the SQL data into Cassandra using the file you edited. Use the dse import command to import the data from the MySQL table to the table in Cassandra. On Linux, for example:
    $ bin/dse sqoop --options-file fully_qualified_path/demos/sqoop/import.options
    The MapReduce job runs and the end of the output looks like this:
    . . .
    14/05/23 14:41:17 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 50.5956 seconds (0 bytes/sec)
    14/05/23 14:41:17 INFO mapreduce.ImportJobBase: Retrieved 105291 records.
  12. In cqlsh, verify that the data import succeeded.
    cqlsh> SELECT * FROM npa_nxx.npa_nxx_data LIMIT 5;
    
    
     npa | nxx | city        | latitude | longitude | state
    -----+-----+-------------+----------+-----------+-------
     660 | 200 |     Braymer |    39.59 |      93.8 |    MO
     660 | 202 |     Sedalia |     38.7 |     93.22 |    MO
     660 | 213 |    La Belle |    40.11 |     91.91 |    MO
     660 | 214 | Chillicothe |    39.79 |     93.55 |    MO
     660 | 215 |   Maryville |    40.34 |     94.87 |    MO