Question

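I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument and submits it to Spark as Spark SQL; Spark runs the SQL against Cassandra and writes the output to a csv file.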

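Now I feel like I'm going round in circles trying to figure out whether it is possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from Squirrel SQL). The Spark SQL documentation says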

Connect through JDBC or ODBC.

A server mode provides industry standard JDBC and ODBC connectivity for
business intelligence tools.

The Spark SQL Programming Guide says

Spark SQL can also act as a distributed query engine using its JDBC/ODBC or
command-line interface. In this mode, end-users or applications can interact
with Spark SQL directly to run SQL queries, without the need to write any 
code.

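So I can run the Thrift Server and submit SQL to it. What I can't figure out is how to get the Thrift Server to connect to Cassandra. Do I just drop the Datastax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and port of my Cassandra cluster? Has anyone done this already who can give me some pointers?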

Answer 1

Configure these properties in the spark-defaults.conf file

spark.cassandra.connection.host    192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in your cassandra cluster
spark.cassandra.auth.username   smb
spark.cassandra.auth.password   bigdata@123

Start the Thrift server with the spark-cassandra-connector dependency and the mysql-connector dependency, on the port that you will connect to via JDBC or Squirrel.

sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
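Once the server is up, a JDBC client such as Squirrel SQL needs a connection URL and a driver class. The Spark Thrift server speaks the HiveServer2 protocol, so the usual choice is the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) with a jdbc:hive2:// URL. A minimal sketch that builds that URL and an equivalent beeline command line; the helper names are made up for illustration, and the host and port are the ones from the command above:

```python
# Hypothetical helpers (not part of Spark): they only assemble strings.
HIVE_JDBC_DRIVER = "org.apache.hive.jdbc.HiveDriver"  # driver class to register in Squirrel SQL


def thrift_jdbc_url(host, port, database="default"):
    """Build the JDBC URL for a HiveServer2-compatible Thrift server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(host, port, username=None):
    """Build the argument list for a quick connectivity check with beeline."""
    cmd = ["bin/beeline", "-u", thrift_jdbc_url(host, port)]
    if username:
        cmd += ["-n", username]  # -n passes the user name to connect as
    return cmd


if __name__ == "__main__":
    print(thrift_jdbc_url("192.168.1.17", 10003))
    print(" ".join(beeline_command("192.168.1.17", 10003)))
```

Checking the connection with beeline first is a cheap way to separate server problems from Squirrel configuration problems.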

To access a Cassandra table, run a Spark SQL query such as

CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
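Each Cassandra table has to be registered this way before it can be queried, so with more than a handful of tables it may be easier to generate the statements. A rough sketch, assuming the same option names as the statement above; the function name is invented for this example, and it simply rejects values containing quotes rather than trying to escape them:

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def cassandra_temp_table_sql(alias, keyspace, table, cluster=None):
    """Render a CREATE TEMPORARY TABLE statement backed by the Cassandra connector.

    Hypothetical helper: it only formats text and does not contact Spark or Cassandra.
    """
    if not _IDENTIFIER.match(alias):
        raise ValueError(f"unsafe table alias: {alias!r}")
    options = {"keyspace": keyspace, "table": table}
    if cluster is not None:
        # The cluster option is optional; keep it first to mirror the example above.
        options = {"cluster": cluster, **options}
    for value in options.values():
        if "'" in value:
            raise ValueError(f"option value must not contain a quote: {value!r}")
    rendered = ", ".join(f"{key} '{value}'" for key, value in options.items())
    return (
        f"CREATE TEMPORARY TABLE {alias} "
        f"USING org.apache.spark.sql.cassandra OPTIONS ({rendered});"
    )


if __name__ == "__main__":
    print(cassandra_temp_table_sql("mytable", "testks", "testtable", cluster="BDI Cassandra"))
```

The generated statements can then be pasted into Squirrel SQL or sent over the same JDBC connection.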

Answer 2

Why not use spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your Spark context, and then you can read from and write to Cassandra using SQL.
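For what it's worth, that approach might look roughly like the PySpark sketch below. It assumes the spark-cassandra-connector jar is already on the classpath (for example via --jars) and reuses the hosts, keyspace, and table names from the first answer purely as placeholders; the two function names are invented for illustration. Note that this runs inside a Spark application and does not by itself give you a JDBC endpoint.

```python
def cassandra_conf(hosts, username=None, password=None):
    """Collect the connector settings as a plain dict (hypothetical helper)."""
    conf = {"spark.cassandra.connection.host": ",".join(hosts)}
    if username is not None:
        # Only needed when authentication is enabled on the Cassandra cluster.
        conf["spark.cassandra.auth.username"] = username
        conf["spark.cassandra.auth.password"] = password or ""
    return conf


def read_cassandra_table(spark, keyspace, table):
    """Load a Cassandra table as a DataFrame through the connector's data source."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )


# Usage, assuming pyspark and the connector are available:
#
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("cassandra-sql")
#   for key, value in cassandra_conf(["192.168.1.17", "192.168.1.19"]).items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
#   read_cassandra_table(spark, "testks", "testtable").createOrReplaceTempView("mytable")
#   spark.sql("SELECT * FROM mytable LIMIT 10").show()
```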

Querying Cassandra with Spark SQL over JDBC (e.g. from Squirrel SQL)
