COPY
Imports and exports CSV (comma-separated values) data to and from Apache Cassandra™.
Synopsis
COPY table_name ( column, ... )
FROM 'file_name' | STDIN
WITH option = 'value' AND ...

COPY table_name ( column, ... )
TO 'file_name' | STDOUT
WITH option = 'value' AND ...
A semicolon that terminates CQL statements is not included in the synopsis.
Description
The following options apply to both COPY TO and COPY FROM:

COPY Options | Default Value | Use To |
---|---|---|
DELIMITER | comma (,) | Set the character that separates fields in the file. |
QUOTE | quotation mark (") | Set the character that encloses field values. |
ESCAPE | backslash (\) | Set the character that escapes literal uses of the QUOTE character. |
HEADER | false | Set true to indicate that the first row of the file is a header. |
NULL | an empty string | Set the string that represents the absence of a value. |
DATETIMEFORMAT | '%Y-%m-%d %H:%M:%S%z' | Set the time format for reading or writing CSV time data. The timestamp uses the strftime format. If not set, the default value is set to the time_format value in the cqlshrc file. |
MAXATTEMPTS | 5 | Set the maximum number of attempts for errors. |
REPORTFREQUENCY | 0.25 | Set the frequency with which status is displayed, in seconds. |
DECIMALSEP | period (.) | Set a separator for decimal values. |
THOUSANDSSEP | None | Set a separator for thousands digit groups. |
BOOLSTYLE | True, False | Set a representation for boolean values of True and False. The values are case insensitive. Example: yes,no or 1,0. |
NUMPROCESSES | Number of cores - 1 | Set the number of worker processes. The maximum value is 16. |
CONFIGFILE | None | Specify a configuration file with the same format as the cqlshrc file to set WITH options. The following sections can be specified: [copy], [copy-to], [copy-from], [copy:ks.table], [copy-to:ks.table], and [copy-from:ks.table], where ks is the keyspace name and table is the table name. Options are read from these sections in the order listed above. Command line options always override options in configuration files. Depending on the COPY direction, only the relevant copy-from or copy-to sections are used. If no configuration file is specified, the cqlshrc file is searched instead. |
RATEFILE | None | Specify a file for printing output statistics. |
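The DATETIMEFORMAT value uses strftime codes. As a quick sketch of how the default pattern reads and writes timestamps (using Python's datetime module, which shares the same strftime codes; the sample timestamp string is made up for illustration):

```python
from datetime import datetime

# The default DATETIMEFORMAT pattern (strftime codes).
DEFAULT_FMT = "%Y-%m-%d %H:%M:%S%z"

# Parse a sample timestamp string the way an import would interpret it...
ts = datetime.strptime("2018-06-03 14:30:00+0000", DEFAULT_FMT)

# ...and format it back, the way an export would write it.
print(ts.strftime(DEFAULT_FMT))  # 2018-06-03 14:30:00+0000
```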
The following options apply only to COPY FROM:

COPY Options | Default Value | Use To |
---|---|---|
CHUNKSIZE | 1,000 | Set the size of chunks passed to worker processes. |
INGESTRATE | 100,000 | Set an approximate ingest rate in rows per second. Must be set to a value greater than the chunk size. |
MAXBATCHSIZE | 20 | Set the maximum size of an import batch. |
MINBATCHSIZE | 2 | Set the minimum size of an import batch. |
MAXROWS | -1 | Set the maximum number of rows to import. "-1" sets no maximum. |
SKIPROWS | 0 | Set the number of rows to skip. |
SKIPCOLS | None | Set a comma-separated list of column names to skip. |
MAXPARSEERRORS | -1 | Set the maximum global number of parsing errors. "-1" sets no maximum. |
MAXINSERTERRORS | -1 | Set the maximum global number of insert errors. "-1" sets no maximum. |
ERRFILE | None | Set a file to store all rows that are not imported. If no value is set, the information is stored in import_ks_table.err, where ks is the keyspace and table is the table name. |
TTL | 3600 | Set the time to live in seconds. When no TTL is specified, data does not expire. |
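The relationship between CHUNKSIZE and MAXBATCHSIZE can be pictured as two levels of splitting: rows are divided into chunks handed to worker processes, and each worker divides its chunk into insert batches. This is an illustrative sketch only, not cqlsh's actual implementation:

```python
# Toy model of the two-level splitting, not cqlsh internals.
def split(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = list(range(2500))                              # 2,500 rows to import
chunks = list(split(rows, 1000))                      # CHUNKSIZE = 1,000
batches = [b for c in chunks for b in split(c, 20)]   # MAXBATCHSIZE = 20
print(len(chunks), len(batches))  # 3 125
```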
The following options apply only to COPY TO:

COPY Options | Default Value | Use To |
---|---|---|
ENCODING | UTF8 | Set the output encoding for unicode strings. Available only for COPY TO. |
PAGESIZE | 1,000 | Set the page size for fetching results. |
PAGETIMEOUT | 10 | Set the page timeout for fetching results. |
BEGINTOKEN | None | Set the minimum token string for exporting data. |
ENDTOKEN | None | Set the maximum token string for exporting data. |
MAXREQUESTS | 6 | Set the maximum number of requests each worker process can work on in parallel. |
MAXOUTPUTSIZE | -1 | Set the maximum size of the output file, measured in number of lines. If a value is set, the output file is split into segments when the value is exceeded. "-1" sets no maximum. |
The ENCODING option is available only for the COPY TO command. As the tables show, by default Cassandra expects the CSV data to consist of fields separated by commas (,), records separated by line separators (a newline, \r\n), and field values enclosed in double quotation marks (""). To avoid ambiguity, escape a literal double quotation mark using a backslash inside a string enclosed in double quotation marks ("\""). By default, Cassandra does not expect the first line of the CSV file to be a header record of column names. COPY TO includes the header in the output if HEADER=TRUE. COPY FROM ignores the first line if HEADER=TRUE.
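These defaults map onto a conventional CSV dialect. A minimal sketch using Python's csv module, configured to mirror the defaults above (comma delimiter, double-quote quoting, backslash escape); the sample data is invented for illustration:

```python
import csv
import io

# Mirror the COPY defaults: DELIMITER=',', QUOTE='"', ESCAPE='\'.
data = 'name,notes\nP38-Lightning,"a \\"quoted\\" word"\n'
reader = csv.reader(io.StringIO(data), delimiter=",", quotechar='"',
                    escapechar="\\", doublequote=False)
header, row = list(reader)
print(row)  # ['P38-Lightning', 'a "quoted" word']
```

Note doublequote=False: with a backslash escape, a literal quotation mark inside a quoted field is written as \" rather than "".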
You cannot copy data to or from counter tables.
COPY FROM a CSV file
By default, when you use the COPY FROM command, Cassandra expects every row in the CSV input to contain the same number of columns as the Cassandra table metadata, and it assigns fields to columns in that order. To apply your input data to a particular set of columns, specify the column names in parentheses after the table name.

COPY FROM loads rows from a CSV file in a parallel, non-deterministic order. By default, an empty field is treated as a NULL value and overwrites any existing value for that column, which can produce unexpected results. If the CSV file contains more than one row with the same primary key, the data loads successfully, but only one of the rows survives, and which one is non-deterministic. Duplicate primary keys are not explicitly checked, so unintended results can occur.
COPY FROM is intended for importing small datasets (a few million rows or less) into Cassandra. For importing larger datasets, use the Cassandra bulk loader.
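The duplicate-primary-key behavior described above amounts to a last-write-wins overwrite. This toy Python sketch (not cqlsh code; the rows are invented) shows why only one of the duplicate rows survives:

```python
# Toy model, not cqlsh internals: imported rows keyed by primary key.
csv_rows = [
    ("P38-Lightning", 1937),
    ("P38-Lightning", 1942),  # duplicate primary key
]
table = {}
for name, year in csv_rows:
    table[name] = year  # a later duplicate silently overwrites the earlier row
print(table)  # {'P38-Lightning': 1942}
```

In a real import the chunks load in parallel, so which duplicate counts as "later" is not deterministic.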
COPY TO a CSV file
For example, assume you have the following table in CQL:
cqlsh> SELECT * FROM test.airplanes;
name | mach | manufacturer | year
---------------+------+--------------+------
P38-Lightning | 0.7 | Lockheed | 1937
After inserting data into the table, you can copy the data to a CSV file in another order by specifying the column names in parentheses after the table name:
COPY airplanes
(name, mach, year, manufacturer)
TO 'temp.csv';
Specifying the source or destination files
Specify the source file of the CSV input or the destination file of the CSV output by a file path. Alternatively, you can use the STDIN or STDOUT keywords to import from standard input or export to standard output. When using STDIN, signal the end of the CSV data with a backslash and period ("\.") on a separate line. If the data is being imported into a table that already contains data, COPY FROM does not truncate the table beforehand.

You can copy only a partial set of columns. Specify the entire set or a subset of column names in parentheses after the table name, in the order you want to import or export them. By default, when you use the COPY TO command, Cassandra copies data to the CSV file in the order defined in the Cassandra table metadata. You can also omit the column list entirely to import or export all the columns in the order they appear in the source table or CSV file.
Roundtrip copying of a simple table
Copy a table to a CSV file.
- Using CQL, create a table named airplanes and copy it to a CSV file.
CREATE KEYSPACE test WITH REPLICATION =
  { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };
USE test;
CREATE TABLE airplanes (
  name text PRIMARY KEY,
  manufacturer ascii,
  year int,
  mach float
);
INSERT INTO airplanes (name, manufacturer, year, mach)
  VALUES ('P38-Lightning', 'Lockheed', 1937, 0.7);
COPY airplanes (name, manufacturer, year, mach) TO 'temp.csv';
1 rows exported in 0.004 seconds.
- Clear the data from the airplanes table and import the data from the temp.csv file.
TRUNCATE airplanes;
COPY airplanes (name, manufacturer, year, mach) FROM 'temp.csv';
1 rows imported in 0.087 seconds.
Copy data from standard input to a table.
- Enter data directly during an interactive cqlsh session, using the COPY command defaults.
TRUNCATE airplanes;
COPY airplanes (name, manufacturer, year, mach) FROM STDIN;
The output is:
[Use \. on a line by itself to end input]
[copy]
- At the [copy] prompt, enter the following data:
'F-14D Super Tomcat', Grumman, 1987, 2.34
'MiG-23 Flogger', Russian-made, 1964, 2.35
'Su-27 Flanker', U.S.S.R., 1981, 2.35
\.
- Query the airplanes table to see data imported from STDIN:
SELECT * FROM airplanes;
Output is:
 name               | manufacturer | year | mach
--------------------+--------------+------+------
 F-14D Super Tomcat |      Grumman | 1987 | 2.34
     MiG-23 Flogger | Russian-made | 1964 | 2.35
      P38-Lightning |     Lockheed | 1937 |  0.7
      Su-27 Flanker |     U.S.S.R. | 1981 | 2.35
Copying collections
Cassandra supports round-trip copying of collections to and from CSV files. To perform this example, download the sample code, cql_collections.zip.
- Unzip the downloaded file named cql_collections.zip.
- Copy/paste all the CQL commands from the cql_collections.txt file to the cqlsh command line.
- Take a look at the contents of the songs table. The table contains a map of venues, a list of reviews, and a set of tags.
cqlsh> SELECT * FROM music.songs;
 id          | album | artist | data | reviews             | tags               | title | venue
-------------+-------+--------+------+---------------------+--------------------+-------+------------------------------------------------------------------------
 7db1a490... |  null |   null | null | ['hot dance music'] |           {'rock'} |  null | {'2013-09-22...': 'The Fillmore', '2013-10-01...': 'The Apple Barrel'}
 a3e64f8f... |  null |   null | null |                null |  {'1973', 'blues'} |  null | null
 8a172618... |  null |   null | null |                null | {'2007', 'covers'} |  null | null
- Copy the music.songs table to a CSV file named songs-20140603.csv.
cqlsh> COPY music.songs to 'songs-20140603.csv';
3 rows exported in 0.006 seconds.
- Check that the copy operation worked.
cqlsh> exit;
$ cat songs-20140603.csv
7db1a490...,,,,['hot dance music'],{'rock'},,"{'2013-09-22...': 'The Fillmore', '2013-10-01...': 'The Apple Barrel'}"
a3e64f8f...,,,,,"{'1973', 'blues'}",,
8a172618...,,,,,"{'2007', 'covers'}",,
- Start cqlsh again, and create a table definition that matches the data in the songs-20140603.csv file.
cqlsh> CREATE TABLE music.imported_songs (
         id uuid PRIMARY KEY,
         album text,
         artist text,
         data blob,
         reviews list<text>,
         tags set<text>,
         title text,
         venue map<timestamp, text>
       );
- Copy the data from the CSV file into the imported_songs table.
cqlsh> COPY music.imported_songs from 'songs-20140603.csv';
3 rows imported in 0.004 seconds.