Question

I am trying to execute CQL from pyspark. So far, I can read from and write to tables.

$ pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78

>>> sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load().show()

+-----+-----+--------+
|cCode|pCode|   mDate|
+-----+-----+--------+
|  135|  379|20180428|
|   31|  898|20180429|
|   31|  245|20180430|
+-----+-----+--------+

I would like to be able to execute CREATE statements from my pyspark shell, for example:

CREATE TABLE IF NOT EXISTS keyspace_name.table_name 
( column_definition, column_definition, ...)
WITH property AND property ...

Normally, when I run SQL against Hive, I use sqlContext.sql(). In this case, though, I presumably need to supply format("org.apache.spark.sql.cassandra") somewhere, and I don't know where to put it.

Answer 1

In Scala/Java, there is a CassandraConnector class that lets you execute arbitrary commands through its withSessionDo function (see the docs).

However, according to the documentation, the PySpark interface to Cassandra is limited to DataFrames:

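With the inclusion of the Cassandra Data Source, PySpark can now be used with the Connector to access Cassandra data. This does not require DataStax Enterprise but you are limited to DataFrame only operations.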

So the only option is to construct and use the Cluster/Session classes from the Python driver directly.
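As a rough illustration of that approach, here is a minimal sketch. It assumes the DataStax Python driver is installed (pip install cassandra-driver) on the machine running the pyspark shell. The helper names build_create_table and execute_cql, and the table and column names in the usage note, are hypothetical and chosen only for this example. The statement builder does no quoting or validation of identifiers, so it is only suitable for trusted input.

```python
def build_create_table(keyspace, table, columns, primary_key):
    """Build a CREATE TABLE IF NOT EXISTS statement (hypothetical helper).

    columns: dict mapping column name -> CQL type, in the desired order.
    primary_key: list of column names forming the primary key.
    Identifiers are inserted as-is, without quoting or validation.
    """
    column_defs = ", ".join(
        f"{name} {cql_type}" for name, cql_type in columns.items()
    )
    key = ", ".join(primary_key)
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} "
        f"({column_defs}, PRIMARY KEY ({key}))"
    )


def execute_cql(statement, hosts):
    """Run one CQL statement through the Python driver (hypothetical helper)."""
    # Imported inside the function so this sketch can be loaded
    # even where the driver is not installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(hosts)  # hosts: list of contact points
    try:
        session = cluster.connect()
        session.execute(statement)
    finally:
        # Always release the driver's connections.
        cluster.shutdown()
```

From the pyspark shell this could be called as execute_cql(build_create_table("my_keyspace", "my_new_table", {"c_code": "int", "p_code": "int", "m_date": "int"}, ["c_code", "p_code"]), ["12.34.56.78"]), reusing the contact point passed to spark.cassandra.connection.host. The statement runs on the driver side through a separate connection, not through Spark, so it should complete before any DataFrame write that targets the new table.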
