Question

I am using SparkR (spark-2.1.0) with the DataStax Cassandra connector.

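I have a data frame that is connected to a table in Cassandra. Some of the columns in the Cassandra table are of type map and set. I need to perform various filtering/aggregation operations on these "collection" columns.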

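I would like to obtain:

A new data frame containing the unique string KEYS in the col2 map over all rows in my_data_frame.
The sum() of the VALUES in the col2 map for each row, placed in a new column in my_data_frame.
The set of unique values in the col3 array over all rows in my_data_frame, in a new data frame.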

The map data in Cassandra for col2 looks like this: VALUES ({'key1':100, 'key2':20, 'key3':50, ...})

The data frame is read from Cassandra as follows:

my_data_frame <- read.df(
    source = "org.apache.spark.sql.cassandra",
    keyspace = "my_keyspace", table = "some_table")

my_data_frame
SparkDataFrame[id:string,  col2:map<string,int>, col3:array<string>]

schema(my_data_frame)
StructType
|-name = "id", type = "StringType", nullable = TRUE
|-name = "col2", type = "MapType(StringType,IntegerType,true)", nullable = TRUE
|-name = "col3", type = "ArrayType(StringType,true)", nullable = TRUE

If the original Cassandra table looks like this:

id   col2
1    {'key1':100, 'key2':20}
2    {'key3':40,  'key4':10}
3    {'key1':10,  'key3':30}

I would like to obtain a data frame containing the unique keys:

col2_keys
key1
key2
key3
key4

The sum of the values for each id:

id  col2_sum
1   120
2   50
3   40

The maximum value for each id:

id  col2_max
1   100
2   40
3   30
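
For illustration only, here is a rough sketch of how these three results might be produced with SparkR's `explode()`, which turns each map entry or array element into its own row. It is not a confirmed answer to the question. The input is a hypothetical stand-in built with `sql()` so the sketch does not need a Cassandra connection, and the names `col2_long`, `col2_stats` and `col3_values` are made up for this example.

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical stand-in for the Cassandra-backed table (assumption: same
# column names and types as my_data_frame in the question).
my_data_frame <- sql("
  SELECT '1' AS id, map('key1', 100, 'key2', 20) AS col2, array('a', 'b') AS col3
  UNION ALL
  SELECT '2', map('key3', 40, 'key4', 10), array('b', 'c')
  UNION ALL
  SELECT '3', map('key1', 10, 'key3', 30), array('a')
")

# explode() on a map column gives one row per entry, in columns `key` and `value`.
col2_long <- select(my_data_frame, my_data_frame$id, explode(my_data_frame$col2))

# Unique keys of the col2 map over all rows.
col2_keys <- distinct(select(col2_long, alias(col2_long$key, "col2_keys")))

# Sum and maximum of the col2 values for each id.
col2_stats <- agg(groupBy(col2_long, "id"),
                  col2_sum = sum(col2_long$value),
                  col2_max = max(col2_long$value))

# Unique values of the col3 array over all rows.
col3_values <- distinct(select(my_data_frame,
                               alias(explode(my_data_frame$col3), "col3_values")))
```

Calling `head()` on each result shows its rows. To get the sums into a column next to the original data, `col2_stats` could presumably be joined back to `my_data_frame` on `id`.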

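Additional information:

col2_df <- select(my_data_frame, my_data_frame$col2)
head(col2_df)
                           col2
1 <environment: 0x7facfb4fc4e8>
2 <environment: 0x7facfb4f3980>
3 <environment: 0x7facfb4eb980>
4 <environment: 0x7facfb4e0068>

row1 <- first(my_data_frame)
row1
                           col2
1 <environment: 0x7fad00023ca0>

I am new to Spark and R and have probably missed something obvious, but I don't see any obvious functions for transforming maps and arrays in this manner.

I did see some references to using an "environment" as a map in R, but I am not sure how that would work for my requirements.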
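
On the `<environment>` values: SparkR hands map columns back to R as environments once rows are collected locally. The base R sketch below shows how such an environment can be read like a map. The three environments are hypothetical local stand-ins for collected col2 values, and `make_map`, `map_keys` and `map_values` are helper names invented for this example. It only applies to data small enough to collect to the driver.

```r
# Hypothetical local stand-ins for three collected col2 map values.
make_map <- function(...) list2env(list(...))
col2_local <- list(
  make_map(key1 = 100, key2 = 20),
  make_map(key3 = 40, key4 = 10),
  make_map(key1 = 10, key3 = 30)
)

# ls() lists the keys of an environment; mget() fetches the values by name.
map_keys <- function(env) ls(env)
map_values <- function(env) unlist(mget(ls(env), envir = env))

# Unique keys over all rows, then the sum and maximum of the values per row.
unique_keys <- sort(unique(unlist(lapply(col2_local, map_keys))))
row_sums <- vapply(col2_local, function(e) sum(map_values(e)), numeric(1))
row_max <- vapply(col2_local, function(e) max(map_values(e)), numeric(1))
```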

Thank you in advance for any help.

Spark-R: How to transform Cassandra map and array columns into a new DataFrame

0 answers: