Load your data

You must load data into your database before you can perform a search.

Load vector data into a collection

Use the following methods to load vector data into a collection.

Alternatively, you can load a sample dataset.

  • Astra Portal

  • Python

  • TypeScript

  • Java

Use the Astra Portal to load a dataset from a JSON or CSV file.

  1. In the Astra Portal, go to Databases, and then select your database.

  2. Click Data Explorer.

  3. Select a collection.

    If you haven’t created a collection yet, see Create a collection.

  4. Click Load Data.

  5. In the Load Data dialog, click Select File.

  6. Select the file on your computer that contains your dataset.

    Once the file upload is complete, the first ten rows of your data appear in the Data Preview section.

    If you get a Selected embedding does not match collection dimensions error, you need to create a new collection with vector dimensions that match your dataset.

  7. If you’ve configured your collection to auto-generate embeddings using an embedding provider, then you can use the Vector Field dropdown to select the field that you want to auto-generate embeddings for.

    The Load Data dialog with Vector Field dropdown expanded.

    The data importer applies the reserved $vectorize key to the selected Vector Field and automatically generates an embedding vector from its contents. The resulting documents store the original text in the special $vectorize field and the generated embedding in the $vector field.

  8. Optional: Configure field data types.

    In the Data Preview section, use the drop-down controls to change the data type for each field or column.

    The options are:

    • String

    • Number

    • Array

    • Object

    • Vector

      If the data importer detects embeddings in your dataset, it automatically assigns the Vector data type to that field or column. Currently, only one vector field is supported per collection.

    Data type selections you make in the Data Preview section only apply to the initial data that you load (with the exception of Vector, which permanently maps the field to the reserved key $vector). These selections aren’t fixed in the schema, and don’t apply to documents inserted later on. The same field can be a string in one document, and a number in another. You can also have different sets of fields in different documents in the same collection.

  9. Click Load Data.

Once your dataset has loaded, you can interact with it using the Data Explorer or the client APIs.
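The schema flexibility noted in step 8 can be sketched in plain Python: the same field can hold different types in different documents, and documents in one collection can carry different field sets. The field names below are illustrative, not part of any fixed schema.

```python
# Two documents headed for the same collection: "rating" is a string in
# the first and a number in the second, and "tags" exists only in the
# second. No fixed schema rejects either shape.
documents = [
    {"title": "Sneaker review", "rating": "4.5"},
    {"title": "Quilt review", "rating": 4.5, "tags": ["ai", "sleep"]},
]

# Against a live collection, both would insert as-is:
# collection.insert_many(documents)

for doc in documents:
    print(type(doc["rating"]).__name__)  # str, then float
```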

Use the Python client to load data into your database. The syntax depends on whether you’re bringing your own embeddings or using an external embeddings provider.

  • Bring my own

  • Use an external provider

# Requires astrapy; `collection` comes from your client setup.
from astrapy.ids import UUID
from astrapy.exceptions import InsertManyException

# Insert documents into the collection.
# (UUIDs here are version 7.)
documents = [
    {
        "_id": UUID("018e65c9-df45-7913-89f8-175f28bd7f74"),
        "text": "ChatGPT integrated sneakers that talk to you",
        "$vector": [0.1, 0.15, 0.3, 0.12, 0.05],
    },
    {
        "_id": UUID("018e65c9-e1b7-7048-a593-db452be1e4c2"),
        "text": "An AI quilt to help you sleep forever",
        "$vector": [0.45, 0.09, 0.01, 0.2, 0.11],
    },
    {
        "_id": UUID("018e65c9-e33d-749b-9386-e848739582f0"),
        "text": "A deep learning display that controls your mood",
        "$vector": [0.1, 0.05, 0.08, 0.3, 0.6],
    },
]
try:
    insertion_result = collection.insert_many(documents)
    print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")
except InsertManyException:
    print("* Documents found on DB already. Let's move on.\n")

If you get a Mismatched vector dimension error, the vector dimensions of your documents don't match the dimension defined for your collection.
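One way to catch this error before calling insert_many is to compare each document's $vector length against the dimension you created the collection with. This is a plain-Python sketch; the dimension value and document shapes are illustrative.

```python
def find_dimension_mismatches(documents, expected_dim):
    """Return the _id of every document whose $vector length differs
    from the collection's configured vector dimension."""
    return [
        doc.get("_id")
        for doc in documents
        if "$vector" in doc and len(doc["$vector"]) != expected_dim
    ]

# Example: a collection created with dimension=5.
docs = [
    {"_id": "a", "$vector": [0.1, 0.15, 0.3, 0.12, 0.05]},  # 5 values: OK
    {"_id": "b", "$vector": [0.45, 0.09, 0.01]},            # 3 values: mismatch
]
print(find_dimension_mismatches(docs, expected_dim=5))  # ['b']
```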

# Requires astrapy; `collection` comes from your client setup.
from astrapy.ids import UUID
from astrapy.exceptions import InsertManyException

# Insert documents into the collection.
# (UUIDs here are version 7.)
documents = [
    {
        "_id": UUID("018e65c9-df45-7913-89f8-175f28bd7f74"),
        "$vectorize": "ChatGPT integrated sneakers that talk to you",
    },
    {
        "_id": UUID("018e65c9-e1b7-7048-a593-db452be1e4c2"),
        "$vectorize": "An AI quilt to help you sleep forever",
    },
    {
        "_id": UUID("018e65c9-e33d-749b-9386-e848739582f0"),
        "$vectorize": "A deep learning display that controls your mood",
    },
]
try:
    insertion_result = collection.insert_many(documents)
    print(f"* Inserted {len(insertion_result.inserted_ids)} items.\n")
except InsertManyException:
    print("* Documents found on DB already. Let's move on.\n")
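After loading $vectorize documents, you can sanity-check the collection with a similarity search that sorts on a $vectorize query string, which the server embeds with the same provider. This sketch only builds the sort clause; the commented-out call assumes the collection object from your own setup, and the query text is illustrative.

```python
# Query text is embedded server-side when used as a $vectorize sort.
query_text = "items that help you sleep"
sort_clause = {"$vectorize": query_text}

# Against a live collection with an embedding provider configured:
# for doc in collection.find({}, sort=sort_clause, limit=2):
#     print(doc["_id"], doc["$vectorize"])

print(sort_clause)
```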

Use the TypeScript client to load data into your database. The syntax depends on whether you’re bringing your own embeddings or using an external embeddings provider.

  • Bring my own

  • Use an external provider

  import { UUID } from '@datastax/astra-db-ts';

  // Insert documents into the collection (using UUIDv7s)
  const documents = [
    {
      _id: new UUID('018e65c9-df45-7913-89f8-175f28bd7f74'),
      text: 'ChatGPT integrated sneakers that talk to you',
      $vector: [0.1, 0.15, 0.3, 0.12, 0.05],
    },
    {
      _id: new UUID('018e65c9-e1b7-7048-a593-db452be1e4c2'),
      text: 'An AI quilt to help you sleep forever',
      $vector: [0.45, 0.09, 0.01, 0.2, 0.11],
    },
    {
      _id: new UUID('018e65c9-e33d-749b-9386-e848739582f0'),
      text: 'A deep learning display that controls your mood',
      $vector: [0.1, 0.05, 0.08, 0.3, 0.6],
    },
  ];

  try {
    const inserted = await collection.insertMany(documents);
    console.log(`* Inserted ${inserted.insertedCount} items.`);
  } catch (e) {
    console.log('* Documents found on DB already. Let\'s move on!');
  }

If you get a Mismatched vector dimension error, the vector dimensions of your documents don't match the dimension defined for your collection.

  import { UUID } from '@datastax/astra-db-ts';

  // Insert documents into the collection (using UUIDv7s)
  const documents = [
    {
      _id: new UUID('018e65c9-df45-7913-89f8-175f28bd7f74'),
      $vectorize: 'ChatGPT integrated sneakers that talk to you',
    },
    {
      _id: new UUID('018e65c9-e1b7-7048-a593-db452be1e4c2'),
      $vectorize: 'An AI quilt to help you sleep forever',
    },
    {
      _id: new UUID('018e65c9-e33d-749b-9386-e848739582f0'),
      $vectorize: 'A deep learning display that controls your mood',
    },
  ];

  try {
    const inserted = await collection.insertMany(documents);
    console.log(`* Inserted ${inserted.insertedCount} items.`);
  } catch (e) {
    console.log('* Documents found on DB already. Let\'s move on!');
  }

Use the Java client to load data into your database. The syntax depends on whether you’re bringing your own embeddings or using an external embeddings provider.

  • Bring my own

  • Use an external provider

    // Insert documents into the collection
    collection.insertMany(
            new Document("1")
                    .append("text", "ChatGPT integrated sneakers that talk to you")
                    .vector(new float[]{0.1f, 0.15f, 0.3f, 0.12f, 0.05f}),
            new Document("2")
                    .append("text", "An AI quilt to help you sleep forever")
                    .vector(new float[]{0.45f, 0.09f, 0.01f, 0.2f, 0.11f}),
            new Document("3")
                    .append("text", "A deep learning display that controls your mood")
                    .vector(new float[]{0.1f, 0.05f, 0.08f, 0.3f, 0.6f}));
    System.out.println("Inserted documents into the collection");

If you get a Mismatched vector dimension error, the vector dimensions of your documents don't match the dimension defined for your collection.

    // Insert documents into the collection
    collection.insertMany(
            new Document("1").vectorize("ChatGPT integrated sneakers that talk to you"),
            new Document("2").vectorize("An AI quilt to help you sleep forever"),
            new Document("3").vectorize("A deep learning display that controls your mood"));
    System.out.println("Inserted documents into the collection");

Load a sample vector dataset

You can use sample datasets to help you explore features and test your applications. Here’s how to load a sample vector dataset into a pre-configured collection.

  1. In the Astra Portal, go to Databases, and then select your database.

  2. Click Data Explorer.

  3. Use the Namespace drop-down to select an empty namespace. If you don’t have an empty namespace, create a new one.

  4. Click Create Sample Collection.

Once the sample dataset has loaded, you can interact with it using the Data Explorer or the client APIs.

If you’re using a client or the API directly, download the movie_openai_100.csv (3.5 MB) dataset.

Load non-vector data into a table

Here’s how to load non-vector data into a table using the Astra Portal.

  1. In the Astra Portal, go to Databases, and then select your Serverless (Non-Vector) database.

  2. Click Load Data.

  3. In the Data Loader, click Select File and choose a CSV to upload. Wait for the upload to complete.

    If you don’t have a CSV file, you can click Load a sample dataset and then select a sample dataset.

  4. Click Next.

  5. Optional: Change the Table Name.

  6. Review the data types for each column, and use the drop-downs to correct any that are wrong.

  7. From the Partition keys drop-down, select the columns to use as the partition keys.

  8. Optional: From the Clustering columns drop-down, select the columns to use as the clustering columns.

  9. Click Next.

  10. Optional: To upload the dataset to a database other than the one you started with, select it from the Target Database drop-down.

  11. Click the Target Keyspace drop-down and select the keyspace where you want to create your table.

  12. Click Finish.

You will receive an email notification when the data import is complete.

Load data with DSBulk

If your CSV file is more than 40 MB, you can upload data with DataStax Bulk Loader (DSBulk).

Here are the steps to load your CSV data into an Astra DB database using a dsbulk load command.

As a prerequisite, you must first install DSBulk 1.11.0 or higher, which includes support for the vector CQL data type.

  1. From your terminal, download the dsbulk installation file. The following command automatically downloads the latest DSBulk version.

    curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
    Sample result
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   242  100   242    0     0    681      0 --:--:-- --:--:-- --:--:--   691
    100 40.4M  100 40.4M    0     0  20.7M      0  0:00:01  0:00:01 --:--:-- 31.6M
  2. Unpack the folder:

    tar -xzvf dsbulk.tar.gz
    Sample result
    x dsbulk-1.11.0/README.md
    x dsbulk-1.11.0/LICENSE.txt
    x dsbulk-1.11.0/manual/
    x dsbulk-1.11.0/manual/driver.template.conf
    x dsbulk-1.11.0/manual/settings.md
    x dsbulk-1.11.0/manual/application.template.conf
    x dsbulk-1.11.0/bin/dsbulk
    x dsbulk-1.11.0/bin/dsbulk.cmd
    x dsbulk-1.11.0/conf/
    x dsbulk-1.11.0/conf/driver.conf
    x dsbulk-1.11.0/conf/application.conf
    x dsbulk-1.11.0/THIRD-PARTY.txt
    x dsbulk-1.11.0/lib/java-driver-core-4.17.0.jar
    x dsbulk-1.11.0/lib/native-protocol-1.5.1.jar
    x dsbulk-1.11.0/lib/netty-handler-4.1.94.Final.jar
       .
       .
       .
    x dsbulk-1.11.0/lib/lz4-java-1.8.0.jar
    x dsbulk-1.11.0/lib/snappy-java-1.1.7.3.jar
    x dsbulk-1.11.0/lib/jansi-1.18.jar
  3. Verify that the dsbulk executable runs correctly from the command line. For example, from the same directory where you ran the tar command, enter:

    dsbulk-1.11.0/bin/dsbulk --version
    Sample result
    DataStax Bulk Loader v1.11.0

The latest version of DSBulk is installed and ready to use.

DSBulk provides the following commands:

  • dsbulk load

  • dsbulk unload

  • dsbulk count

The DSBulk commands come with an extensive set of options. For details, see About DataStax Bulk Loader, especially the DataStax Bulk Loader reference topics that describe all of the options.

Astra credentials

Get the necessary credentials to connect to Astra DB. You also need the Secure Connect Bundle (SCB).

Use the following steps to generate a token.

  1. Open your Astra Portal and select Tokens from the left navigation.

  2. Select the Administrator User role from the Select a Token Role list.

  3. Select Generate Token, and then copy or download the Client ID and Client Secret.

  4. Navigate to the database that you want to load data into. If you don't have one, see Create a database.

Your database type determines how you copy or download an SCB.

  • Serverless (Vector) databases

  • Serverless (Non-Vector) databases

  1. In the Astra Portal, select Databases in the main navigation.

  2. Select a vector database you want to load data into.

  3. Go to Region and click More next to the region for which you want to download an SCB.

  4. Click Download SCB. You have five minutes to copy or download these details for later.

  5. Press ESC to close.

  1. In the Astra Portal, select Databases in the main navigation.

  2. Choose a non-vector database, and then select Connect.

  3. In the Get a Secure Connect Bundle section, click Get Bundle.

  4. Select a region, and copy or download your secure connect bundle.

    If your database has more than one region, select a region and then a domain before you copy or download the bundle. You have five minutes to copy or download these details for later.

  5. Press ESC or click Close.

Create a namespace or keyspace

To load your data with DSBulk, you must create a namespace for your vector database or a keyspace for your non-vector database. For more information, see Namespaces versus keyspaces.

Your database type determines how you create your namespace or keyspace.

  • Serverless (Vector) databases

  • Serverless (Non-Vector) databases

  1. In the Astra Portal, go to Databases, and then select your vector database.

  2. Click Data Explorer.

  3. From the Namespace drop-down, click Create Namespace.

  4. Type your namespace name in the Create Namespace field and select Add Namespace. Your database is in maintenance mode until the namespace is ready for use.

  1. In the Astra Portal, go to Databases, and then select your non-vector database.

  2. Click Add Keyspace.

  3. Type your keyspace name in the Keyspace field and select Add Keyspace. Your database is in maintenance mode until the keyspace is ready for use.

Create the table

Your database type determines how you create your table.

Here’s how to create an empty table using the Astra Portal.

  1. In the Astra Portal, go to Databases, and then select your Serverless (Non-Vector) database.

  2. In the Overview tab, note the list of available keyspaces in the Keyspaces section. You will create your table in one of these keyspaces.

  3. Click CQL Console. Wait a few seconds for the token@cqlsh> prompt to appear.

  4. Select the keyspace you want to create your table in.

    use KEYSPACE_NAME;
  5. Add the following code to the console. Change KEYSPACE_NAME to your actual keyspace name.

    CREATE TABLE KEYSPACE_NAME.world_happiness_report_2021 (
      country_name text,
      regional_indicator text,
      ladder_score float,
      gdp_per_capita float,
      social_support float,
      healthy_life_expectancy float,
      generosity float,
      PRIMARY KEY (country_name)
    );

The table is successfully created.

Load your data

To try out the DSBulk data loader, use our sample data.

Download the sample CSV file, World Happiness Report 2021.

Use DSBulk to upload this CSV file.

dsbulk-X.Y.Z/bin/dsbulk load \
  -url <path-to-csv-file> \
  -k <keyspace_name> \
  -t <table_name> \
  -b <path-to-secure-connect-bundle> \
  -u <client_id> \
  -p <client_secret>
Results
Operation directory: /path/to/directory/log/LOAD ...
total | failed | rows/s |  p50ms |  p99ms | p999ms | batches
  149 |      0 |    400 | 106.65 | 187.70 | 191.89 |    1.00

Your rows are loaded into the table. This is a small example dataset, but DSBulk can load, unload, and count extremely large files.

View your data in Astra DB

Your database type determines how you view the data.

  • Serverless (Vector) databases

  • Serverless (Non-Vector) databases

From the CQL Console, run this command:

SELECT * FROM YOUR_NAMESPACE.world_happiness_report_2021;

Check your email for two messages about the data: the first says when the job starts, and the second informs you that the data is successfully loaded. With the data successfully loaded, you can run the command from the second email in the CQL Console:

SELECT * FROM YOUR_KEYSPACE.world_happiness_report_2021 LIMIT 1;
