Load your data

After you create a collection or table, you can load your data.

To load data, you must have the appropriate permissions, such as the Database Administrator role. To load data with the Data API, you need an application token with sufficient permissions.

Load vector data into a collection

You can load structured and unstructured vector data into a vector-enabled collection.

Load structured JSON or CSV data

You can use the Astra Portal or the Data API to load data from a JSON or CSV file into a collection.
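As a sketch of how the structured formats map to a collection: each CSV row (or JSON array element) becomes one document. The column names below are hypothetical examples.

```python
# Hypothetical sketch: each CSV row becomes one JSON document that the
# data importer (or the Data API's insertMany) can load into a collection.
import csv
import io

csv_data = "text,category\nAI quilt,home\nTalking sneakers,wearables\n"
documents = [dict(row) for row in csv.DictReader(io.StringIO(csv_data))]
print(documents)
# [{'text': 'AI quilt', 'category': 'home'},
#  {'text': 'Talking sneakers', 'category': 'wearables'}]
```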

Load a sample vector dataset

You can use sample datasets to explore features and test your applications before loading your own data.

In the Astra Portal, you can create a sample collection that automatically loads a sample dataset:

  1. In the Astra Portal navigation menu, select your Serverless (Vector) database, and then click Data Explorer.

  2. In the Keyspace field, select a keyspace that has no collections. If you don’t have an empty keyspace, create a keyspace for the sample dataset.

  3. Click Create Sample Collection.

After you load the sample dataset, you can interact with it using the Data Explorer or the Data API.

To load your own structured data, use the Astra Portal or your preferred Data API client:

  • Astra Portal

  • Python

  • TypeScript

  • Java

  • curl

  1. In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.

  2. Click Data Explorer.

  3. Select the keyspace and collection where you want to load data.

  4. Click Load Data.

  5. If your collection supports loading unstructured data, click Structured data to load JSON or CSV files.

  6. Click Select File, and then select the JSON or CSV file that contains your dataset.

    After the file uploads, the first ten rows of your data appear in the Data Preview section.

    If you get a Selected embedding does not match collection dimensions error, create a new collection with vector dimensions that match your dataset.

  7. If your collection has an Astra DB vectorize integration, use the Vector Field drop-down to select the field from which to generate embeddings.

    The data importer applies the top-level $vectorize key to the selected field, and automatically generates an embedding vector from its contents. The resulting documents in the collection store the actual text in the $vectorize field and the resulting embedding in the $vector field. Documents in the collection do not retain the original field name for the $vectorize field.

    For more information, see Vector and vectorize.

  8. Optional: In the Data Preview section, select the data type for each field or column. The available types are String, Number, Array, Object, and Vector.

    If the data importer detects embeddings in your dataset, it automatically assigns the Vector data type to that field or column. Each collection can have only one vector field.

    These data type selections only apply to the initial data that you load, with the exception of Vector, which permanently maps the field to the reserved key $vector. Data type selections aren’t fixed in the schema, and they don’t apply to documents inserted later. For example, the same field can be a string in one document and a number in another. You can also have different sets of fields in different documents in the same collection.

  9. Click Load Data.
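The $vectorize remapping described in step 7 can be sketched locally as follows. The input record and its idea field are hypothetical; only the reserved $vectorize and $vector key names come from the Data API.

```python
# Sketch of how the data importer remaps a selected field to $vectorize.
# The input record and the "idea" field name are hypothetical examples.

def remap_for_vectorize(record: dict, vector_field: str) -> dict:
    """Move the selected field's text to the reserved $vectorize key.

    The original field name is not retained; the resulting document
    stores the text under $vectorize, and the service stores the
    generated embedding under $vector.
    """
    doc = {k: v for k, v in record.items() if k != vector_field}
    doc["$vectorize"] = record[vector_field]
    return doc

record = {"idea": "Chat bot integrated sneakers that talk to you", "price": 120}
doc = remap_for_vectorize(record, "idea")
print(doc)
```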

The Data API syntax depends on your embedding generation method and other configurations. For more information and options, see insertMany and Vector and vectorize.

  • Bring my own embeddings

  • Use an embedding provider integration

# Insert documents with embeddings into the collection.
documents = [
    {
        "text": "Chat bot integrated sneakers that talk to you",
        "$vector": [0.1, 0.15, 0.3, 0.12, 0.05],
    },
    {
        "text": "An AI quilt to help you sleep forever",
        "$vector": [0.45, 0.09, 0.01, 0.2, 0.11],
    },
    {
        "text": "A deep learning display that controls your mood",
        "$vector": [0.1, 0.05, 0.08, 0.3, 0.6],
    },
]
insertion_result = collection.insert_many(documents)
print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")

If you get a Mismatched vector dimension error, your collection and documents do not have the same dimensions.
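One way to avoid this error is to validate vector lengths locally before calling insert_many. The helper below is a hypothetical sketch, not part of the astrapy client; the dimension is the value you set when you created the collection.

```python
# Hypothetical pre-insert check: every $vector must match the collection's
# configured dimension, or the Data API rejects the document.

def check_vector_dimensions(documents: list[dict], dimension: int) -> None:
    for i, doc in enumerate(documents):
        vector = doc.get("$vector")
        if vector is not None and len(vector) != dimension:
            raise ValueError(
                f"Document {i}: vector has {len(vector)} dimensions, "
                f"but the collection expects {dimension}"
            )

documents = [
    {"text": "Chat bot integrated sneakers that talk to you",
     "$vector": [0.1, 0.15, 0.3, 0.12, 0.05]},
]
check_vector_dimensions(documents, dimension=5)  # passes: 5 == 5
```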

When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.

# Insert documents into the collection.
# (UUIDs here are version 7.)
documents = [
    {
        "$vectorize": "Chat bot integrated sneakers that talk to you",
    },
    {
        "$vectorize": "An AI quilt to help you sleep forever",
    },
    {
        "$vectorize": "A deep learning display that controls your mood",
    },
]
insertion_result = collection.insert_many(documents)
print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")

The Data API syntax depends on your embedding generation method and other configurations. For more information and options, see insertMany and Vector and vectorize.

  • Bring my own embeddings

  • Use an embedding provider integration

  // Insert documents with embeddings into the collection.
  const documents = [
    {
      idea: 'Chat bot integrated sneakers that talk to you',
      $vector: [0.1, 0.15, 0.3, 0.12, 0.05],
    },
    {
      idea: 'An AI quilt to help you sleep forever',
      $vector: [0.45, 0.09, 0.01, 0.2, 0.11],
    },
    {
      idea: 'A deep learning display that controls your mood',
      $vector: [0.1, 0.05, 0.08, 0.3, 0.6],
    },
  ];

  const inserted = await collection.insertMany(documents);
  console.log(`* Inserted ${inserted.insertedCount} items.`);

If you get a Mismatched vector dimension error, your collection and documents do not have the same dimensions.

When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.

  // Insert documents into the collection (using UUIDv7s)
  const documents = [
    {
      $vectorize: 'Chat bot integrated sneakers that talk to you',
    },
    {
      $vectorize: 'An AI quilt to help you sleep forever',
    },
    {
      $vectorize: 'A deep learning display that controls your mood',
    },
  ];

  try {
    const inserted = await collection.insertMany(documents);
    console.log(`* Inserted ${inserted.insertedCount} items.`);
  } catch (e) {
    // insertMany can fail partway through; inspect the error for details.
    console.error('* Insertion failed:', e);
  }

The Data API syntax depends on your embedding generation method and other configurations. For more information and options, see insertMany and Vector and vectorize.

  • Bring my own embeddings

  • Use an embedding provider integration

    // Insert documents with embeddings into the collection
    collection.insertMany(
            new Document("1")
                    .append("text", "Chat bot integrated sneakers that talk to you")
                    .vector(new float[]{0.1f, 0.15f, 0.3f, 0.12f, 0.05f}),
            new Document("2")
                    .append("text", "An AI quilt to help you sleep forever")
                    .vector(new float[]{0.45f, 0.09f, 0.01f, 0.2f, 0.11f}),
            new Document("3")
                    .append("text", "A deep learning display that controls your mood")
                    .vector(new float[]{0.1f, 0.05f, 0.08f, 0.3f, 0.6f}));
    System.out.println("Inserted documents into the collection");

If you get a Mismatched vector dimension error, your collection and documents do not have the same dimensions.

When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.

// Insert documents into the collection
InsertManyResult insertResult = collection.insertMany(
  new Document()
   .vectorize("Chat bot integrated sneakers that talk to you"),
  new Document()
   .vectorize("An AI quilt to help you sleep forever"),
  new Document()
   .vectorize("A deep learning display that controls your mood")
);
System.out.println("Inserted " + insertResult.getInsertedIds().size() + " items.");

The Data API syntax depends on your embedding generation method and other configurations. For more information and options, see insertMany and Vector and vectorize.

  • Bring my own embeddings

  • Use an embedding provider integration

# Insert documents with embeddings into the collection
curl -sS --location -X POST "$ASTRA_DB_API_ENDPOINT/api/json/v1/default_keyspace/vector_test" \
--header "Token: $ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "insertMany": {
    "documents": [
      {
        "text": "Chat bot integrated sneakers that talk to you",
        "$vector": [0.1, 0.15, 0.3, 0.12, 0.05]
      },
      {
        "text": "An AI quilt to help you sleep forever",
        "$vector": [0.45, 0.09, 0.01, 0.2, 0.11]
      },
      {
        "text": "A deep learning display that controls your mood",
        "$vector": [0.1, 0.05, 0.08, 0.3, 0.6]
      }
    ],
    "options": {
      "ordered": false
    }
  }
}' | jq

If you get a Mismatched vector dimension error, your collection and documents do not have the same dimensions.

When you load data into a collection that uses an embedding provider integration, Astra DB vectorize can automatically generate embeddings for your data. For more information, see Vector and vectorize.

# Insert documents into the collection and generate embeddings.
curl -sS --location -X POST "$ASTRA_DB_API_ENDPOINT/api/json/v1/default_keyspace/COLLECTION_NAME" \
--header "Token: $ASTRA_DB_APPLICATION_TOKEN" \
--header "Content-Type: application/json" \
--data '{
  "insertMany": {
    "documents": [
      {
        "$vectorize": "Chat bot integrated sneakers that talk to you"
      },
      {
        "$vectorize": "An AI quilt to help you sleep forever"
      },
      {
        "$vectorize": "A deep learning display that controls your mood"
      }
    ],
    "options": {
      "ordered": false
    }
  }
}' | jq

After loading data, you can interact with it using the Data Explorer or the Data API.

Load unstructured vector data

This Astra DB Serverless feature is currently in public preview. Development is ongoing, and the features and functionality are subject to change. Astra DB Serverless, and the use of such, is subject to the DataStax Preview Terms.

The Astra DB Unstructured.io integration transforms your unstructured PDF files into structured vector data. Astra DB processes your files with Unstructured Open Source and loads the resulting JSON or CSV data into your collection.

The Unstructured data loader integration has requirements and limitations, including limits on the file types and the number of files you can process at once.

To process PDFs with the Unstructured data loader integration, do the following:

  1. In the Astra Portal, go to Databases, and then select your Serverless (Vector) database.

  2. Click Data Explorer.

  3. Select the keyspace and collection where you want to load data.

  4. Click Load Data.

  5. Click Unstructured data.

  6. Click Select File, and then select up to 10 PDF files to process and load.

    You can load unstructured and structured vector data into the same collection, but you can’t load JSON and CSV files together with unstructured file types. You must upload JSON and CSV files separately through the Structured data option.

  7. Optional: Modify the chunking configuration:

    • Chunk max characters: Set the maximum chunk length. Unstructured splits oversized chunks to fit this limit. The default is 500. For more information, see chunk_max_characters.

      Your embedding model must support the chunk size you set. If the chunk size is larger than your model supports, an error occurs when you try to load data.

    • Chunk character overlap: Prefix each chunk with the last n characters of the previous chunk. This applies only to chunks that were split for exceeding the maximum length. The default is 0. For more information, see chunk_overlap.

  8. Click Load Data.

  9. Wait while Astra DB processes your files. You can cancel files that haven't started processing yet, but once a file begins processing, you can't cancel it.

    During processing, Astra DB converts your files to structured data with Unstructured Open Source, and then loads the resulting data into your collection.

    After loading data, you can interact with it using the Data Explorer or the Data API.
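To build intuition for the two chunking options above, here is a simplified character-based splitter. Unstructured's real chunker operates on parsed document elements, so treat this only as an approximation of Chunk max characters and Chunk character overlap.

```python
# Simplified character-based chunking sketch with the portal defaults:
# at most 500 characters per chunk and 0 characters of overlap.

def chunk_text(text: str, max_characters: int = 500, overlap: int = 0) -> list[str]:
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    step = max_characters - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        start += step
    return chunks

# A 1,200-character input with a 50-character overlap yields three chunks.
chunks = chunk_text("a" * 1200, max_characters=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```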

Count records

After you load data, the Data Explorer in the Astra Portal shows the count of Records in the collection.

For collections with fewer than 1,000 documents, Astra DB shows an exact number.

For collections with 1,000 or more documents, Astra DB shows an approximate number of records.

You can use the Data API countDocuments and estimatedDocumentCount commands to retrieve these counts programmatically.

Load non-vector data into a table

To load non-vector data into a table in the Astra Portal, do the following:

  1. In the Astra Portal, go to Databases, and then select your Serverless (Non-Vector) database.

  2. Click Load Data.

  3. In the Data Loader, click Select File and choose a CSV to upload. Wait for the upload to complete.

    If you don’t have a CSV file, you can click Load a sample dataset and then select a sample dataset.

  4. Click Next.

  5. Optional: Change the Table Name.

  6. Review the detected data type for each column. If any are incorrect, select the correct type from the drop-down.

  7. From the Partition keys drop-down, select the columns to use as the partition keys.

  8. Optional: From the Clustering columns drop-down, select the columns to use as the clustering columns.

  9. Click Next.

  10. Optional: To load the dataset into a different database than the one you started with, select it from the Target Database drop-down.

  11. Click the Target Keyspace drop-down and select the keyspace where you want to create your table.

  12. Click Finish.

You receive an email notification when the data import is complete.

Load data with DSBulk

If your CSV file is more than 40MB, you can upload data with the DataStax Bulk Loader (DSBulk). DSBulk provides commands like dsbulk load, dsbulk unload, and dsbulk count, along with extensive options. For more information, see the DataStax Bulk Loader reference.

  1. Download the dsbulk installation file. DSBulk 1.11.0 or later is required to support the vector CQL data type. The following command automatically downloads the latest DSBulk version:

    curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
    Results
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   242  100   242    0     0    681      0 --:--:-- --:--:-- --:--:--   691
    100 40.4M  100 40.4M    0     0  20.7M      0  0:00:01  0:00:01 --:--:-- 31.6M
  2. Extract the DSBulk archive:

    tar -xzvf dsbulk.tar.gz
    Results

    This example uses DSBulk version 1.11.0:

    x dsbulk-1.11.0/README.md
    x dsbulk-1.11.0/LICENSE.txt
    x dsbulk-1.11.0/manual/
    x dsbulk-1.11.0/manual/driver.template.conf
    x dsbulk-1.11.0/manual/settings.md
    x dsbulk-1.11.0/manual/application.template.conf
    x dsbulk-1.11.0/bin/dsbulk
    x dsbulk-1.11.0/bin/dsbulk.cmd
    x dsbulk-1.11.0/conf/
    x dsbulk-1.11.0/conf/driver.conf
    x dsbulk-1.11.0/conf/application.conf
    x dsbulk-1.11.0/THIRD-PARTY.txt
    x dsbulk-1.11.0/lib/java-driver-core-4.17.0.jar
    x dsbulk-1.11.0/lib/native-protocol-1.5.1.jar
    x dsbulk-1.11.0/lib/netty-handler-4.1.94.Final.jar
       .
       .
       .
    x dsbulk-1.11.0/lib/lz4-java-1.8.0.jar
    x dsbulk-1.11.0/lib/snappy-java-1.1.7.3.jar
    x dsbulk-1.11.0/lib/jansi-1.18.jar
  3. To verify the installation, run the following command in the same directory where you extracted DSBulk:

    dsbulk-VERSION/bin/dsbulk --version
    Results
    DataStax Bulk Loader v1.11.0
  4. Create an application token with the Administrator User role, and then store the token securely.

  5. If you haven’t done so already, create a database. You can use either a Serverless (Vector) or Serverless (Non-Vector) database.

  6. Download the database’s Secure Connect Bundle (SCB).

  7. Create a table in your database:

    1. In the Astra Portal, go to your database, and click CQL Console.

    2. When the token@cqlsh> prompt appears, select the keyspace where you want to create the table:

      use KEYSPACE_NAME;
    3. Create a table to load a sample dataset:

      CREATE TABLE KEYSPACE_NAME.world_happiness_report_2021 (
        country_name text,
        regional_indicator text,
        ladder_score float,
        gdp_per_capita float,
        social_support float,
        healthy_life_expectancy float,
        generosity float,
        PRIMARY KEY (country_name)
      );

      If you want to load your own data, replace world_happiness_report_2021 with your own table name, and then adjust the column names and data types for your data.

  8. To load the sample dataset, download the World Happiness Report 2021 sample dataset. This is a small sample dataset, but DSBulk can load, unload, and count extremely large files.

    DSBulk can also load vector data. For more information, see Loading and unloading vector data with DSBulk. If you need a sample vector dataset, you can download the movie_openai_100.csv sample dataset (3.5MB).

  9. Use DSBulk to load data into the table:

    dsbulk-VERSION/bin/dsbulk load -url PATH_TO_CSV_FILE -k KEYSPACE_NAME \
    -t TABLE_NAME -b PATH_TO_SECURE_CONNECT_BUNDLE -u token \
    -p APPLICATION_TOKEN
    Results
    Operation directory: /path/to/directory/log/LOAD ...
    total | failed | rows/s |  p50ms |  p99ms | p999ms | batches
      149 |      0 |    400 | 106.65 | 187.70 | 191.89 |    1.00
  10. After the upload completes, you can query the loaded data from the CQL Console:

    • Serverless (Vector) databases

    • Serverless (Non-Vector) databases

    SELECT * FROM KEYSPACE_NAME.world_happiness_report_2021;
    SELECT * FROM KEYSPACE_NAME.world_happiness_report_2021 LIMIT 1;


© 2024 DataStax | Privacy policy | Terms of use
