How to merge the data from two CQL tables having simple primary keys using
Pig.
For Pig to access data in Cassandra, the target keyspace and table must already
exist. Pig can save data from a Pig relation to a table in Cassandra and from a
Cassandra table to a pig relation, but it cannot create the table. This example
shows how to merge the data from two CQL tables having simple primary keys using
Pig.
A subsequent example shows how to merge data from
CQL tables having compound primary keys into one CQL table using Pig.
Procedure
-
Start cqlsh.
-
Using cqlsh, create and use a keyspace named, for example, cql3ks.
cqlsh> CREATE KEYSPACE cql3ks WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': 1 };
cqlsh> USE cql3ks;
-
Create a two-column (a and b) Cassandra table named simple_table1 and another
two-column (x and y) table named simple_table2. Insert data into the
tables.
cqlsh:cql3ks> CREATE TABLE simple_table1 (a int PRIMARY KEY, b int);
cqlsh:cql3ks> CREATE TABLE simple_table2 (x int PRIMARY KEY, y int);
cqlsh:cql3ks> INSERT INTO simple_table1 (a,b) VALUES (1,1);
cqlsh:cql3ks> INSERT INTO simple_table1 (a,b) VALUES (2,2);
cqlsh:cql3ks> INSERT INTO simple_table1 (a,b) VALUES (3,3);
cqlsh:cql3ks> INSERT INTO simple_table2 (x, y) VALUES (4,4);
cqlsh:cql3ks> INSERT INTO simple_table2 (x, y) VALUES (5,5);
cqlsh:cql3ks> INSERT INTO simple_table2 (x, y) VALUES (6,6);
-
Using Pig, add logic to load the data (4, 5, 6) from the Cassandra
simple_table2 table into a Pig relation.
grunt> moretestvalues= LOAD 'cql://cql3ks/simple_table2/' USING CqlNativeStorage;
-
Convert the simple_table2 table data to a tuple. The key column is a chararray,
'a'.
grunt> insertformat= FOREACH moretestvalues GENERATE
TOTUPLE(TOTUPLE('a',x)),TOTUPLE(y);
-
Save the relation to the Cassandra simple_table1 table.
grunt> STORE insertformat INTO
'cql://cql3ks/simple_table1?output_query=UPDATE+cql3ks.simple_table1+set+b+%3D+%3F'
USING CqlNativeStorage;
Pig uses a URL-encoded prepared
statement to store the relation to Cassandra. The cql:// URL is
followed by an output_query, which specifies which key should be used in the
command. The rest of the arguments, the "?"s, for the prepared statement are
filled in by the values related to that key in Pig.
-
On the cqlsh command line, check that the simple_table1 table now contains its
original values plus the values from the simple_table2 table:
cqlsh:cql3ks> SELECT * FROM simple_table1;
a | b
--+--
5 | 5
1 | 1
2 | 2
4 | 4
6 | 6
3 | 3