Splitting counts in a Spark DataFrame column into multiple columns

Date: 2018-03-26 23:54:41

Tags: apache-spark dataframe apache-spark-sql spark-dataframe

Suppose I have a Spark DataFrame like this:

+------------------+----------+--------------+-----+
|              user|        dt|        action|count|
+------------------+----------+--------------+-----+
|Albert            |2018-03-24|Action1       |   19|
|Albert            |2018-03-25|Action1       |    1|
|Albert            |2018-03-26|Action1       |    6|
|Barack            |2018-03-26|Action2       |    3|
|Barack            |2018-03-26|Action3       |    1|
|Donald            |2018-03-26|Action3       |   29|
|Hillary           |2018-03-24|Action1       |    4|
|Hillary           |2018-03-26|Action2       |    2|
+------------------+----------+--------------+-----+

I want to split the counts for Action1 / Action2 / Action3 into separate columns, i.e. convert it into another DataFrame like this:

+------------------+----------+-------------+-------------+-------------+
|              user|        dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert            |2018-03-24|           19|            0|            0|
|Albert            |2018-03-25|            1|            0|            0|
|Albert            |2018-03-26|            6|            0|            0|
|Barack            |2018-03-26|            0|            3|            0|
|Barack            |2018-03-26|            0|            0|            1|
|Donald            |2018-03-26|            0|            0|           29|
|Hillary           |2018-03-24|            4|            0|            0|
|Hillary           |2018-03-26|            0|            2|            0|
+------------------+----------+-------------+-------------+-------------+

Since I'm new to Spark, my attempt to achieve this was quite clumsy:

  • get 3 new DFs by filtering the original on each "action" value
  • take the "count" column from each new DF as the per-action count
  • join the original DF with each of the new DFs

The code I tried looks like this:

val a1 = originalDf.filter("action = 'Action1'")
val df1 = originalDf.as('o)
  .join(a1,
        ($"o.user" === $"a1.user" && $"o.dt" === $"a1.dt"), 
        "left_outer")
  .select($"o.user", $"o.dt", $"a1.count".as("action1_count"))

Then do the same for Action2 / Action3, and join the results.

However, even at this stage I've already run into several problems with this approach:

  1. It doesn't work at all - I mean it fails with an error whose cause I don't understand: org.apache.spark.sql.AnalysisException: cannot resolve 'o.user' given input columns: [user, dt, action, count, user, dt, action, count];

  2. Even if it did succeed, I assume I would end up with nulls where I need zeros.

  3. I feel there should be a better way to achieve this, like some map construct or something. But at the moment I can't figure out the transformations needed to turn the first DataFrame into the second one.
  4. So right now I don't have a working solution at all, and I would greatly appreciate any advice.
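
For reference, the join-based attempt above can be made to resolve by actually giving the filtered DataFrame the "a1" alias, and the nulls from the left outer join can be coalesced to zero - a minimal sketch, assuming a spark-shell session with the originalDf above in scope:

```scala
import org.apache.spark.sql.functions.{coalesce, lit}

// In the attempt above, a1 was never aliased, so $"a1.user" could not resolve.
val a1 = originalDf.filter("action = 'Action1'").as("a1")

val df1 = originalDf.as("o")
  .join(a1, $"o.user" === $"a1.user" && $"o.dt" === $"a1.dt", "left_outer")
  // coalesce replaces the nulls produced by the left outer join with 0
  .select($"o.user", $"o.dt",
          coalesce($"a1.count", lit(0)).as("action1_count"))
```

This still has to be repeated for Action2 / Action3 and the three results joined again, which quickly becomes unwieldy.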

    UPD: I may also get a DF that doesn't contain all 3 possible "action" values, e.g.

    +------------------+----------+--------------+-----+
    |              user|        dt|        action|count|
    +------------------+----------+--------------+-----+
    |Albert            |2018-03-24|Action1       |   19|
    |Albert            |2018-03-25|Action1       |    1|
    |Albert            |2018-03-26|Action1       |    6|
    |Hillary           |2018-03-24|Action1       |    4|
    +------------------+----------+--------------+-----+
    

    For those, I still need the resulting DF with all 3 count columns:

    +------------------+----------+-------------+-------------+-------------+
    |              user|        dt|action1_count|action2_count|action3_count|
    +------------------+----------+-------------+-------------+-------------+
    |Albert            |2018-03-24|           19|            0|            0|
    |Albert            |2018-03-25|            1|            0|            0|
    |Albert            |2018-03-26|            6|            0|            0|
    |Hillary           |2018-03-24|            4|            0|            0|
    +------------------+----------+-------------+-------------+-------------+
    

2 Answers:

Answer 0 (score: 2)

You can avoid multiple joins by using when to select the appropriate column values. As for your join, I don't actually think it should throw an exception like cannot resolve 'o.user'; you may want to double-check your code.

import org.apache.spark.sql.functions.{lit, when}

val df = Seq(("Albert","2018-03-24","Action1",19),
("Albert","2018-03-25","Action1",1),
("Albert","2018-03-26","Action1",6),
("Barack","2018-03-26","Action2",3),
("Barack","2018-03-26","Action3",1),
("Donald","2018-03-26","Action3",29),
("Hillary","2018-03-24","Action1",4),
("Hillary","2018-03-26","Action2",2)).toDF("user", "dt", "action", "count")

val df2 = df.withColumn("count1", when($"action" === "Action1", $"count").otherwise(lit(0))).
withColumn("count2", when($"action" === "Action2", $"count").otherwise(lit(0))).
withColumn("count3", when($"action" === "Action3", $"count").otherwise(lit(0)))

+-------+----------+-------+-----+------+------+------+
|user   |dt        |action |count|count1|count2|count3|
+-------+----------+-------+-----+------+------+------+
|Albert |2018-03-24|Action1|19   |19    |0     |0     |
|Albert |2018-03-25|Action1|1    |1     |0     |0     |
|Albert |2018-03-26|Action1|6    |6     |0     |0     |
|Barack |2018-03-26|Action2|3    |0     |3     |0     |
|Barack |2018-03-26|Action3|1    |0     |0     |1     |
|Donald |2018-03-26|Action3|29   |0     |0     |29    |
|Hillary|2018-03-24|Action1|4    |4     |0     |0     |
|Hillary|2018-03-26|Action2|2    |0     |2     |0     |
+-------+----------+-------+-----+------+------+------+
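
To match the exact target schema from the question, the helper columns can then be renamed and the originals dropped - a small follow-up sketch on the df2 above:

```scala
// Drop the original action/count columns and rename the helper columns
// to the action?_count names from the question.
val result = df2
  .drop("action", "count")
  .withColumnRenamed("count1", "action1_count")
  .withColumnRenamed("count2", "action2_count")
  .withColumnRenamed("count3", "action3_count")
```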

Answer 1 (score: 1)

Here's one way using pivot and first, with the advantage of not needing to know what the action values are:

import org.apache.spark.sql.functions.first

val df = Seq(
  ("Albert", "2018-03-24", "Action1", 19),
  ("Albert", "2018-03-25", "Action1", 1),
  ("Albert", "2018-03-26", "Action1", 6),
  ("Barack", "2018-03-26", "Action2", 3),
  ("Barack", "2018-03-26", "Action3", 1),
  ("Donald", "2018-03-26", "Action3", 29),
  ("Hillary", "2018-03-24", "Action1", 4),
  ("Hillary", "2018-03-26", "Action2", 2)
).toDF("user", "dt", "action", "count")

val pivotDF = df.groupBy("user", "dt", "action").pivot("action").agg(first($"count")).
  na.fill(0).
  orderBy("user", "dt", "action")

// +-------+----------+-------+-------+-------+-------+
// |   user|        dt| action|Action1|Action2|Action3|
// +-------+----------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1|     19|      0|      0|
// | Albert|2018-03-25|Action1|      1|      0|      0|
// | Albert|2018-03-26|Action1|      6|      0|      0|
// | Barack|2018-03-26|Action2|      0|      3|      0|
// | Barack|2018-03-26|Action3|      0|      0|      1|
// | Donald|2018-03-26|Action3|      0|      0|     29|
// |Hillary|2018-03-24|Action1|      4|      0|      0|
// |Hillary|2018-03-26|Action2|      0|      2|      0|
// +-------+----------+-------+-------+-------+-------+
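
If the exact column names from the question are needed, the pivoted columns can be renamed afterwards - a sketch over the pivotDF above, assuming the three known action names:

```scala
// Rename each pivoted ActionN column to the actionN_count naming
// used in the question, and drop the now-redundant action column.
val renamed = List("Action1", "Action2", "Action3")
  .foldLeft(pivotDF)((acc, c) =>
    acc.withColumnRenamed(c, c.toLowerCase + "_count"))
  .drop("action")
```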

[UPDATE]

Per the comments, if you need to create more Action? columns than are present in the action column, you can iterate over the missing Action?s and add them as zero-filled columns:

val fullActionList = List("Action1", "Action2", "Action3", "Action4", "Action5")

val missingActions = fullActionList.diff(
  pivotDF.select($"action").as[String].collect.toList.distinct
)
// missingActions: List[String] = List(Action4, Action5)

missingActions.foldLeft( pivotDF )( _.withColumn(_, lit(0)) ).
show

// +-------+----------+-------+-------+-------+-------+-------+-------+
// |   user|        dt| action|Action1|Action2|Action3|Action4|Action5|
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1|     19|      0|      0|      0|      0|
// | Albert|2018-03-25|Action1|      1|      0|      0|      0|      0|
// | Albert|2018-03-26|Action1|      6|      0|      0|      0|      0|
// | Barack|2018-03-26|Action2|      0|      3|      0|      0|      0|
// | Barack|2018-03-26|Action3|      0|      0|      1|      0|      0|
// | Donald|2018-03-26|Action3|      0|      0|     29|      0|      0|
// |Hillary|2018-03-24|Action1|      4|      0|      0|      0|      0|
// |Hillary|2018-03-26|Action2|      0|      2|      0|      0|      0|
// +-------+----------+-------+-------+-------+-------+-------+-------+
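
Alternatively, pivot also accepts an explicit list of values, which creates the full set of columns up front (covering the missing-action case without the foldLeft) and spares Spark the extra pass it otherwise needs to discover the distinct values - a sketch, assuming the same df and fullActionList:

```scala
// Passing the value list to pivot guarantees one column per action,
// even for actions absent from the data.
val fullPivot = df.groupBy("user", "dt", "action")
  .pivot("action", fullActionList)
  .agg(first($"count"))
  .na.fill(0)
```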