How to convert a Cassandra ResultSet to a Spark DataFrame?

Asked: 2016-01-21 16:31:32

Tags: apache-spark cassandra-2.0 datastax spark-cassandra-connector

I usually load data from Cassandra into Apache Spark in Java like this:

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer "
    + "WHERE CAST(store_id as string) = '" + storeId + "'");

But imagine I have a sharding scheme and need to load several partition keys into this DataFrame. I could put a WHERE IN (...) in my query and call the cassandraSql method again, but I am reluctant to use WHERE IN because of the well-known problem of turning the coordinator node into a single point of failure. This is explained here (the query form I would like to avoid is sketched after the link):

https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
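For reference, this is a rough sketch of the IN-based form being avoided, reusing the table and store IDs from my example (not code I actually ran):

DataFrame customers = sqlContext.cassandraSql(
    // One multi-partition query: the coordinator node must fan out
    // to every replica that owns one of these partition keys
    "SELECT email, first_name, last_name FROM store_customer "
    + "WHERE CAST(store_id as string) IN ('" + storeId1 + "', '" + storeId2 + "')");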

Is there a way to run several queries but load the results into a single DataFrame?

1 Answer:

Answer 0 (score: 1):

One way is to run one query per partition key and unionAll/union the resulting DataFrames/RDDs:

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

// One single-partition query per store ID
DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer "
    + "WHERE CAST(store_id as string) = '" + storeId1 + "'");

DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer "
    + "WHERE CAST(store_id as string) = '" + storeId2 + "'");

// Combine the two results; unionAll keeps duplicates and requires matching schemas
DataFrame allCustomers = customersOne.unionAll(customersTwo);
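If the number of partition keys is not fixed, the same idea extends to a loop. The following is a minimal sketch under the assumption that the keys arrive as a List<String> named storeIds (that list, and the class wrapper, are hypothetical, not from the answer):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.cassandra.CassandraSQLContext;

public class MultiPartitionLoad {
    public static void main(String[] args) {
        SparkContext sparkContext = StorakleSparkConfig.getSparkContext();
        CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
        sqlContext.setKeyspace("midatabase");

        // Hypothetical list of partition keys to load
        List<String> storeIds = Arrays.asList("store-1", "store-2", "store-3");

        DataFrame allCustomers = null;
        for (String storeId : storeIds) {
            // One single-partition query per key, exactly as in the answer above
            DataFrame current = sqlContext.cassandraSql(
                "SELECT email, first_name, last_name FROM store_customer "
                + "WHERE CAST(store_id as string) = '" + storeId + "'");
            // Fold the per-key DataFrames into one
            allCustomers = (allCustomers == null) ? current : allCustomers.unionAll(current);
        }
    }
}

Note that in Spark 2.x unionAll was deprecated in favor of union on Dataset/DataFrame; for the Spark 1.x API shown here, unionAll is the right call.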