Install DataStax Apache Pulsar Connector
Install DataStax Apache Pulsar Connector from the DataStax distribution tar file using an account that has write access to the Pulsar configuration directory.
System requirements
The system requirements for the Pulsar Connector depend on the workload and network capacity. Relevant factors include the characteristics of the Pulsar topic and the cluster's data model and data volume. DataStax recommends testing with realistic data flows before committing to an instance type for the Pulsar Connector.
- Database
  The Pulsar Connector supports the following databases:
  - DataStax Astra DB
  - DataStax Enterprise (DSE) 4.7 and later (non-EOL versions recommended)
  - Open source Apache Cassandra® 2.1 and later (non-EOL versions recommended, full compatibility with 5.x isn’t guaranteed)
- Pulsar
  The Pulsar Connector requires Apache Pulsar 2.7.0 or later.
  The Pulsar Connector supports the following data structures in Pulsar topics:
  - Primitive string values
  - Avro
  - JSON-formatted string with a JSON schema
  - JSON-formatted string inside a schemaless topic
- Operating system
  The supported operating systems are Linux and macOS.
- CPU
  The Pulsar Connector is bound by the amount of CPU available on the host. It holds all the records pulled from Pulsar topics in memory, along with the cluster metadata and prepared statements.
  Memory pressure is influenced by the following:
  - Record size of Pulsar topics
  - Number of records pulled at the same time, where the maximum is set by the worker's batchSize parameter
  - Number of simultaneous tasks run by the Pulsar Connector
- Network
  The Pulsar Connector needs adequate network capacity for the payload. This includes the connections from the Pulsar servers to the target platform. To increase overall throughput, scale the Pulsar Connector horizontally by adding workers.
  When workers are added, the Pulsar Connector framework automatically rebalances the load by reallocating the tasks among the workers.
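The memory factors listed under CPU can be combined into a rough sizing estimate. The following sketch assumes, as a worst case, that every task holds a full batch of records at once. The function name estimate_buffer_mib and the sample workload are hypothetical, and the result ignores cluster metadata, prepared statements, and JVM overhead, so treat it only as a starting point for testing with realistic data flows.

```shell
# Rough worst-case estimate of record memory held by the connector, in MiB.
# Assumption (not from the connector documentation):
#   memory ~ record size x batch size x simultaneous tasks
estimate_buffer_mib() {
  record_bytes=$1   # average record size in bytes
  batch_size=$2     # maximum records pulled at once (batchSize)
  tasks=$3          # number of simultaneous tasks
  echo $(( record_bytes * batch_size * tasks / 1024 / 1024 ))
}

# Hypothetical workload: 2 KiB records, batchSize of 1024, 4 tasks.
estimate_buffer_mib 2048 1024 4   # prints 8
```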
Install the Pulsar Connector
Perform the following steps on a Pulsar Connect node running Apache Pulsar 2.7.0 or later. If you need to install Pulsar, see Pulsar connector single instance quickstart for DSE.
- Download the Pulsar Connector tar file from the DataStax downloads site.
- Extract the files, replacing VERSION with the version number of the tar file you downloaded:
  tar zxf cassandra-enhanced-pulsar-sink-VERSION.tar.gz
  The following files are unpacked into a directory named after the tar file, such as cassandra-enhanced-pulsar-sink-1.4.0:
  LICENSE.txt
  README.md
  THIRD-PARTY.txt
  conf/example.yml
  cassandra-enhanced-pulsar-sink-1.4.0.nar
- In your Pulsar home directory, find the connectors directory. If there isn’t a connectors directory, create one.
- Move the Pulsar Connector NAR file to the Pulsar connectors directory:
  mv installation_location/cassandra-enhanced-pulsar-sink-1.4.0.nar pulsar_home/connectors
- Copy the sample configuration file example.yml from cassandra-enhanced-pulsar-sink-VERSION/conf/ to your Pulsar config directory.
  If you plan to create multiple sinks from this connector, give your configuration file a unique name.
- Edit the configuration file as necessary using the information provided in this documentation. For example, for information about connection, authentication, and encryption parameters, see Connect the DataStax Apache Pulsar Connector.
- Ensure that the user running Pulsar has permission to access the configuration and NAR files.
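The file-handling steps above can be sketched as a single shell function. This is a minimal sketch, not an official installer: the function name install_connector and its three arguments are hypothetical, and it assumes that the tar file unpacks into a directory named after itself and that the Pulsar config directory already exists.

```shell
# Sketch of the install steps as one function. The function name and its
# arguments are hypothetical; adjust the paths to match your environment.
install_connector() (
  set -eu
  tarball=$1       # path to cassandra-enhanced-pulsar-sink-VERSION.tar.gz
  pulsar_home=$2   # Pulsar home directory
  config_name=$3   # unique configuration file name for this sink

  workdir=$(dirname "$tarball")
  name=$(basename "$tarball" .tar.gz)   # unpacked directory and NAR share this name

  # Unpack next to the tar file.
  tar zxf "$tarball" -C "$workdir"

  # Create the connectors directory if it is missing, then move the NAR file.
  mkdir -p "$pulsar_home/connectors"
  mv "$workdir/$name/$name.nar" "$pulsar_home/connectors/"

  # Copy the sample configuration under a unique name.
  cp "$workdir/$name/conf/example.yml" "$pulsar_home/config/$config_name"

  # List both files so that ownership and permissions can be reviewed.
  ls -l "$pulsar_home/connectors/$name.nar" "$pulsar_home/config/$config_name"
)
```

For example, with a hypothetical Pulsar home of /opt/pulsar, a call might look like: install_connector ./cassandra-enhanced-pulsar-sink-1.4.0.tar.gz /opt/pulsar my-sink.yml. The function body runs in a subshell, so a failing step stops the function without affecting the calling shell.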
Start Pulsar with the Pulsar Connector
- Start, restart, or reload your Pulsar instance:
  bin/pulsar-admin sinks reload
- Check that the Pulsar Connector is available:
  bin/pulsar-admin sinks available-sinks
- Create a Pulsar sink:
  bin/pulsar-admin sinks create \
    --name SINK_NAME \
    --classname com.datastax.oss.sink.pulsar.StringCassandraSinkTask \
    --sink-config-file config/example.yml \
    --sink-type cassandra-enhanced \
    --tenant TENANT_NAME \
    --namespace NAMESPACE_NAME \
    --inputs "persistent://TENANT_NAME/NAMESPACE_NAME/TOPIC_NAME"
  Replace the following:
  - SINK_NAME: A unique name for the sink.
  - TENANT_NAME: The name of the relevant Pulsar tenant.
  - NAMESPACE_NAME: The name of the relevant Pulsar namespace.
  - TOPIC_NAME: The name of the Pulsar topic that you want to stream to your database.
  The topic mapping is set in the Pulsar Connector’s configuration YAML file. Make sure that it matches the schema of the table where you want to write the messages. For more information, see Pulsar topic-to-table parameters.
- Use pulsar-client produce to send some messages to your new sink, and then use cqlsh to verify that the messages were written to the table in your database.
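The final verification step can be sketched as follows. The helper names topic_url and smoke_test_sink, the tenant, namespace, and topic names, and the "hello" payload are all hypothetical. The sketch assumes that it runs from the Pulsar home directory, that cqlsh is on the PATH and can reach your database, and that the payload is valid for your topic mapping.

```shell
# Build the full topic URL in the same form used by --inputs.
topic_url() {
  printf 'persistent://%s/%s/%s\n' "$1" "$2" "$3"   # tenant, namespace, topic
}

# Send a few test messages, then read rows back from the target table.
# Assumes bin/pulsar-client and cqlsh are available; the caller supplies the
# topic URL and the table as keyspace.table.
smoke_test_sink() {
  topic=$1
  table=$2
  bin/pulsar-client produce "$topic" -m "hello" -n 3
  cqlsh -e "SELECT * FROM $table LIMIT 10;"
}

# Hypothetical names for illustration only.
topic_url my_tenant my_namespace my_topic
# prints persistent://my_tenant/my_namespace/my_topic
```

A call such as smoke_test_sink "$(topic_url my_tenant my_namespace my_topic)" my_keyspace.my_table would then produce three messages and query the table. Building the URL in one place helps keep the tenant and namespace consistent with the values passed to sinks create.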
Next steps
Explore the rest of the documentation to learn more about configuring and using the Pulsar Connector.