Count data
Use the dsbulk count command to return information about the loaded data:
dsbulk count -k KEYSPACE_NAME -t TABLE_NAME OPTIONS
Replace the following:
- KEYSPACE_NAME: The name of the keyspace that contains the table you want to query.
- TABLE_NAME: The name of the table you want to query.
- OPTIONS: Any options for the count command or common dsbulk options.
Count with authentication
Unless you run the dsbulk count command against a local cluster that doesn't require authentication, you must provide authentication and connection details.
Pass the relevant options with your count commands if your cluster requires authentication or uses SSL encryption.
You can pass the values directly or use a configuration file.
The following examples cover DSE, HCD, and Cassandra clusters first, followed by Astra DB.
If your database requires authentication, provide the username and password using the -u and -p options, respectively:
dsbulk count -k ks1 -t table1 \
-u username -p password
If the cluster is remote, include driver options like host, contact points, and port:
dsbulk count -k ks1 -t table1 \
-u username -p password -h '10.200.1.3, 10.200.1.4' -port 9876
If your cluster has both authentication and SSL enabled, pass -u, -p, and the SSL options.
For example:
dsbulk count -h '["fe80::f861:3eff:fe1d:9d7a"]' -u username -p password \
--driver.auth.provider DsePlainTextAuthProvider \
--driver.ssl.provider JDK \
--driver.ssl.keystore.path /etc/dse/keystores/client.keystore \
--driver.ssl.keystore.password sslkspassword \
--driver.ssl.truststore.path /etc/dse/keystores/client.truststore \
--driver.ssl.truststore.password ssltrustpassword \
-k ks1 -t table1
For Astra DB, pass the Secure Connect Bundle (SCB) and an application token:
dsbulk count -k ks1 -t table1 \
-b "path/to/SCB.zip" -u token -p AstraCS:...
The expected connection credentials are:
- -b: Provide the path to the database's Secure Connect Bundle (SCB) zip file. The SCB includes certificates and key files for SSL-encrypted connections as well as information about the database's contact points.
- -u: Set to the literal string token.
- -p: Provide an application token. DataStax recommends using secure references to tokens, such as environment variables, rather than specifying them directly on the command line.
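One way to follow that recommendation is to read the token from an environment variable instead of typing it into the command. The sketch below assumes the token is exported as ASTRA_DB_APPLICATION_TOKEN and the bundle path as SCB_PATH. Both variable names and the astra_count helper are illustrative and are not defined by dsbulk:

```shell
# Hypothetical helper: count rows in an Astra DB table without writing the
# token literal into the command text or a script.
# Assumed variables (not defined by dsbulk): ASTRA_DB_APPLICATION_TOKEN, SCB_PATH.
astra_count() {
  keyspace=$1
  table=$2
  # ${VAR:?message} stops with an error if the variable is unset or empty.
  dsbulk count -k "$keyspace" -t "$table" \
    -b "${SCB_PATH:?set SCB_PATH to the SCB zip file}" \
    -u token -p "${ASTRA_DB_APPLICATION_TOKEN:?set ASTRA_DB_APPLICATION_TOKEN}"
}
```

With both variables exported, astra_count ks1 table1 runs the same count as the example above. The expanded token is still passed to dsbulk as a command-line argument, so it can remain visible in the process list while the command runs.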
Count partition data
The following example gets information about the partition data used in a table named comments that is in a keyspace named cycling:
dsbulk count -k cycling -t comments --stats.modes partitions --stats.numPartitions 50
The console prints the log directory, performance metrics for the operation, and the data retrieved by the command. The retrieved data is presented in three columns:
- The first column is the partition key value.
- The second column is the number of rows using that partition key value.
- The third column is the percentage of rows in the partition compared to the total number of rows that were scanned for the query.
Operation directory: /home/automaton/cycling/logs/COUNT_20190424-213840-954894
total | failed | rows/s | mb/s | kb/row | p50ms | p99ms | p999ms
31 | 0 | 74 | 0.00 | 0.02 | 27.59 | 31.33 | 31.33
Operation COUNT_20190424-213840-954894 completed successfully in 2 seconds.
fb372533-eb95-4bb4-8685-6ef61e994caa 5 16.13
8566eb59-07df-43b1-a21b-666a3c08c08a 4 12.90
c7fceba0-c141-4207-9494-a29f9809de6f 4 12.90
e7ae5cf3-d358-4d99-b900-85902fda9bb0 4 12.90
6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 3 9.68
9011d3be-d35c-4a8d-83f7-a3c543789ee7 2 6.45
95addc4c-459e-4ed7-b4b5-472f19a67995 2 6.45
38ab64b6-26cc-4de9-ab28-c257cf011659 2 6.45
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 1 3.23
c4b65263-fe58-4846-83e8-f0e1c13d518f 1 3.23
e7cd5752-bc0d-4157-a80f-7523add8dbcd 1 3.23
6d5f1663-89c0-45fc-8cfd-60a373b01622 1 3.23
220844bf-4860-49d6-9a4b-6b5d3a79cbfb 1 3.23
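To act on this output in a script, the three columns can be filtered with standard tools. The following sketch assumes the partition rows arrive on standard input as whitespace-separated columns, as in the sample above. The hot_partitions name is illustrative and is not a dsbulk feature:

```shell
# Hypothetical helper: print the partition key and row count for every
# partition whose share of scanned rows is at least MIN_PERCENT.
# Usage: hot_partitions MIN_PERCENT < partition_output
hot_partitions() {
  awk -v min="$1" '
    # Keep only three-column lines with an integer row count,
    # which skips the log and metrics lines.
    NF == 3 && $2 ~ /^[0-9]+$/ && $3 + 0 >= min + 0 { print $1, $2 }
  '
}
```

Applied to the sample output with a threshold of 10, this keeps the first four partitions (16.13 and three at 12.90) and drops the rest.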
Count graph data
You can use the dsbulk count command to return information about loaded graph data.
The command can return either vertices or edges:
dsbulk count -g GRAPH_NAME -v VERTEX_LABEL
The following example returns the number of person vertices in the person vertex label table of the food graph:
dsbulk count -g food -v person
Similarly, you can retrieve the number of edges in an edge label table.
The following example returns the number of authored edges between person vertices and book vertices in the authored edge label table of the food graph:
dsbulk count -g food -e authored -from person -to book
Use dsbulk count as a replacement for SELECT COUNT(*)
The dsbulk count command can be more efficient than the SELECT COUNT(*) CQL query, particularly for large tables.
The dsbulk count command is optimized for counting rows and can provide better performance and lower resource consumption compared to executing a CQL query that selects all rows.
A query such as SELECT COUNT(*) FROM KEYSPACE_NAME.TABLE_NAME; can be replaced by the following dsbulk count command:
dsbulk count -k KEYSPACE_NAME -t TABLE_NAME
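In a script, the count can be captured much as a query result would be. This sketch assumes that dsbulk writes the total as the last line of standard output and sends log and metrics lines to standard error, which you should verify against your dsbulk version. The table_row_count name is illustrative:

```shell
# Hypothetical helper: print only the row count for a table.
table_row_count() {
  # Discard stderr (assumed to carry logs and metrics) and keep the
  # last line of stdout (assumed to be the total).
  dsbulk count -k "$1" -t "$2" 2>/dev/null | tail -n 1
}

# Example: rows=$(table_row_count cycling comments)
```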