Migrate or insert large amounts of data

When inserting large amounts of data, a programmatic approach is often more performant than loading the data through the Astra Portal.

For required permissions to insert data, see Ways to insert data in Astra DB Serverless.

Use the Data API for bulk inserts

To insert a large amount of data with the Data API, use the insertMany command to insert many documents into a collection or many rows into a table.

You must batch your requests to insert fewer than 100 documents or rows and fewer than 20 million characters at a time.
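
For example, here is a minimal sketch of a single batched insertMany request over HTTP. It assumes a hypothetical collection named my_collection, a KEYSPACE_NAME placeholder for your keyspace, and environment variables that hold your database's API endpoint and application token; check the Data API reference for the exact payload options for your use case:

    curl -X POST "$ASTRA_DB_API_ENDPOINT/api/json/v1/KEYSPACE_NAME/my_collection" \
      --header "Token: $ASTRA_DB_APPLICATION_TOKEN" \
      --header "Content-Type: application/json" \
      --data '{
        "insertMany": {
          "documents": [
            { "_id": "doc-0001", "title": "Example document 1" },
            { "_id": "doc-0002", "title": "Example document 2" }
          ],
          "options": { "ordered": false }
        }
      }'

To stay within the limits, split larger datasets across multiple requests of this form, each carrying fewer than 100 documents or rows.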

For Data API commands and examples, see:

If you’re new to the Data API, try the quickstart for collections or the quickstart for tables for a demo of some common operations, including inserting data from a file.

Use CQL and drivers for bulk inserts

There are multiple strategies you can use to handle bulk inserts in CQL statements, such as batching, prepared statements, and the COPY command.
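
For example, here is a minimal sketch of two of these strategies in cqlsh, assuming a hypothetical table KEYSPACE_NAME.events with id and payload columns and a local events.csv file; adapt the names, columns, and file path to your schema:

    -- Group a few writes into one request with an unlogged batch.
    -- Keep batches small; large multi-partition batches can hurt performance.
    BEGIN UNLOGGED BATCH
      INSERT INTO KEYSPACE_NAME.events (id, payload) VALUES (uuid(), 'first');
      INSERT INTO KEYSPACE_NAME.events (id, payload) VALUES (uuid(), 'second');
    APPLY BATCH;

    -- Load a CSV file with the cqlsh COPY command.
    COPY KEYSPACE_NAME.events (id, payload) FROM 'events.csv' WITH HEADER = TRUE;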

For information, see the following:

Use the DataStax Bulk Loader to insert large CSV and JSON files

You can use the DataStax Bulk Loader (DSBulk) to insert CSV and JSON files into collections and tables in Astra DB Serverless databases. For files larger than 40 MB, DataStax recommends DSBulk instead of the Astra Portal because DSBulk is more performant at this scale.

Your CSV or JSON file must be compatible with Astra DB and, if applicable, the table schema. For example, if you insert a CSV file into a table, the CSV file must contain the same column names and data types as the table.
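
For example, a compatible CSV file for a hypothetical table with country_name (text), regional_indicator (text), and ladder_score (float) columns would start with a header row that matches the column names, followed by rows whose values match the column types (the values shown here are illustrative only):

    country_name,regional_indicator,ladder_score
    Exampleland,Example Region,7.5
    Samplestan,Sample Region,6.2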

If you insert a JSON file exported from a database that isn’t based on Apache Cassandra®, you might need to transform the data into a format that is compatible with Astra DB before you insert the data. For more information, see Migrate from non-Cassandra sources.

The following steps explain how to install DSBulk and use it to insert a CSV file into a table. DSBulk can also load vector data, load JSON files, unload (export) data, and count rows or documents; a brief unload and count example follows the procedure. For more information, see the DSBulk documentation.

  1. Download DSBulk:

    curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
    Result
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   242  100   242    0     0    681      0 --:--:-- --:--:-- --:--:--   691
    100 40.4M  100 40.4M    0     0  20.7M      0  0:00:01  0:00:01 --:--:-- 31.6M

    DSBulk 1.11.0 or later is required to support vector data.

  2. Extract the DSBulk archive:

    tar -xzvf dsbulk.tar.gz
    Result

    This example uses DSBulk version 1.11.0:

    x dsbulk-1.11.0/README.md
    x dsbulk-1.11.0/LICENSE.txt
    x dsbulk-1.11.0/manual/
    x dsbulk-1.11.0/manual/driver.template.conf
    x dsbulk-1.11.0/manual/settings.md
    x dsbulk-1.11.0/manual/application.template.conf
    x dsbulk-1.11.0/bin/dsbulk
    x dsbulk-1.11.0/bin/dsbulk.cmd
    x dsbulk-1.11.0/conf/
    x dsbulk-1.11.0/conf/driver.conf
    x dsbulk-1.11.0/conf/application.conf
    x dsbulk-1.11.0/THIRD-PARTY.txt
    x dsbulk-1.11.0/lib/java-driver-core-4.17.0.jar
    x dsbulk-1.11.0/lib/native-protocol-1.5.1.jar
    x dsbulk-1.11.0/lib/netty-handler-4.1.94.Final.jar
       .
       .
       .
    x dsbulk-1.11.0/lib/lz4-java-1.8.0.jar
    x dsbulk-1.11.0/lib/snappy-java-1.1.7.3.jar
    x dsbulk-1.11.0/lib/jansi-1.18.jar
  3. In the directory where you extracted DSBulk, verify the installation:

    dsbulk-VERSION/bin/dsbulk --version
    Result
    DataStax Bulk Loader v1.11.0
  4. Create an application token with the Administrator User role, and then store the token securely.

  5. If you haven’t done so already, create an Astra DB Serverless database. You can use either a Serverless (Vector) or Serverless (Non-Vector) database.

  6. Download the database’s Secure Connect Bundle (SCB).

  7. Create a table in your database:

    1. In the Astra Portal, go to your database, and then click CQL Console.

    2. When the token@cqlsh> prompt appears, select the keyspace where you want to create the table:

      use KEYSPACE_NAME;
    3. Create a table to load a sample dataset:

      CREATE TABLE KEYSPACE_NAME.world_happiness_report_2021 (
        country_name text,
        regional_indicator text,
        ladder_score float,
        gdp_per_capita float,
        social_support float,
        healthy_life_expectancy float,
        generosity float,
        PRIMARY KEY (country_name)
      );

      If you want to load your own data, replace world_happiness_report_2021 with your own table name, and then adjust the column names, data types, and primary key for your data.

  8. To follow this example, download the World Happiness Report 2021 sample dataset.

    This is a small sample dataset, but DSBulk can load, unload, and count extremely large datasets.

    DSBulk can also load vector data. For more information, see Loading and unloading vector data. If you need a sample vector dataset, you can download the movie_openai_100.csv sample dataset (3.5MB).

  9. Use DSBulk to insert data into the table:

    dsbulk-VERSION/bin/dsbulk load -url PATH_TO_CSV_FILE -k KEYSPACE_NAME \
    -t TABLE_NAME -b PATH_TO_SECURE_CONNECT_BUNDLE -u token \
    -p APPLICATION_TOKEN

    Replace the following:

    • VERSION: Your version of DSBulk

    • PATH_TO_CSV_FILE: The path to the CSV file that you want to upload

    • KEYSPACE_NAME and TABLE_NAME: The name of the keyspace and table where you want to insert the data

    • PATH_TO_SECURE_CONNECT_BUNDLE: The path to your database’s SCB

    • APPLICATION_TOKEN: Your application token, ideally referenced through an environment variable or other secure method, as in the filled-in example after the result below

      Result
      Operation directory: /path/to/directory/log/LOAD ...
      total | failed | rows/s |  p50ms |  p99ms | p999ms | batches
        149 |      0 |    400 | 106.65 | 187.70 | 191.89 |    1.00
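
    A filled-in invocation might look like the following sketch, which assumes hypothetical file paths, DSBulk 1.11.0, a keyspace named default_keyspace, and the token read from an environment variable:

      # Hypothetical token value; store and load your real token securely.
      export ASTRA_DB_APPLICATION_TOKEN=AstraCS:example

      dsbulk-1.11.0/bin/dsbulk load -url data/world_happiness_report_2021.csv \
      -k default_keyspace -t world_happiness_report_2021 \
      -b ~/Downloads/secure-connect-my-database.zip -u token \
      -p "$ASTRA_DB_APPLICATION_TOKEN"
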
  10. After the upload completes, you can query the table to verify that the data is present. The same query works for Serverless (Vector) and Serverless (Non-Vector) databases:

    SELECT * FROM KEYSPACE_NAME.world_happiness_report_2021 LIMIT 1;
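
As noted before the procedure, DSBulk can also unload (export) and count data. Here is a minimal sketch that reuses the same placeholders as the load command above:

    # Export the table to CSV files in a local directory.
    dsbulk-VERSION/bin/dsbulk unload -url PATH_TO_OUTPUT_DIRECTORY -k KEYSPACE_NAME \
    -t world_happiness_report_2021 -b PATH_TO_SECURE_CONNECT_BUNDLE -u token \
    -p APPLICATION_TOKEN

    # Count the rows in the table.
    dsbulk-VERSION/bin/dsbulk count -k KEYSPACE_NAME -t world_happiness_report_2021 \
    -b PATH_TO_SECURE_CONNECT_BUNDLE -u token -p APPLICATION_TOKEN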

Migrate data to Astra DB

DataStax offers several options to help migrate your data to Astra DB.

After migrating your data to Astra DB, your applications can connect exclusively to your Astra DB databases. To get started on application development with Astra DB, see Connect to a database.

Migrate from DSE, HCD, or Apache Cassandra®

The following tools are designed to migrate Cassandra table data into a Cassandra-compatible cluster, such as Astra DB:

  • Cassandra Data Migrator (CDM): Migrate and validate tables between origin Cassandra clusters and target Astra DB databases, with available logging and reconciliation support.

    You can use CDM alone or in conjunction with Zero Downtime Migration (ZDM).

  • DSBulk Migrator: An extension of DSBulk Loader that you can use to read data from a table in your origin database, and then write that data to a table in your target Astra DB database.

    You can use DSBulk Migrator alone or in conjunction with ZDM.

  • DataStax Bulk Loader (DSBulk): An open-source command-line tool that you can use to extract and load CSV and JSON files containing Cassandra table data. You can use DSBulk to bring data from Cassandra, DataStax Enterprise (DSE), or Hyper-Converged Database (HCD) into Astra DB, as well as to move data between collections and tables in Astra DB databases.

For more information about all of these options, see the DataStax data migration documentation.

Migrate from non-Cassandra sources

Because Astra DB is based on Apache Cassandra, it expects data to be in a format that is compatible with Cassandra table schemas.

When migrating from a schemaless source, you can use the Data API to insert documents into Astra DB collections. However, the Data API cannot transform your data if it is incompatible with Data API limits or functionality. For example, if fields exceed the maximum character limit or contain invalid values, the Data API throws an error. You must modify the incompatible data, and then reattempt the insert operation.

You can also use techniques like super shredding to flatten, normalize, and map schemaless or semi-structured JSON/CSV data into a Cassandra-compatible fixed schema, and then load the data into Astra DB with DSBulk Loader or other tools. However, super shredding can be complex and cumbersome, depending on the structure (or lack thereof) of the source data. For more information, see Building Data Services with Apache Cassandra.
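
For example, here is a minimal sketch of the flattening step, assuming a hypothetical newline-delimited JSON file named orders.jsonl with nested customer objects; it uses jq to project each document onto a fixed set of columns and produce a CSV file that DSBulk can load into a matching table:

    # Hypothetical input (one JSON document per line), for example:
    # {"id":"a1","customer":{"name":"Ada","email":"ada@example.com"},"items":[{"sku":"x"},{"sku":"y"}]}

    # Write a header row, then one flattened CSV row per document.
    echo 'id,customer_name,customer_email,item_count' > orders.csv
    jq -r '[.id, .customer.name, .customer.email, (.items | length)] | @csv' orders.jsonl >> orders.csv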
