Splitting counts in a Spark DataFrame column into multiple columns

Date: 2018-03-26 23:54:41

Tags: apache-spark dataframe apache-spark-sql spark-dataframe

Suppose I have a Spark DataFrame like this:

+------------------+----------+--------------+-----+
|              user|        dt|        action|count|
+------------------+----------+--------------+-----+
|Albert            |2018-03-24|Action1       |   19|
|Albert            |2018-03-25|Action1       |    1|
|Albert            |2018-03-26|Action1       |    6|
|Barack            |2018-03-26|Action2       |    3|
|Barack            |2018-03-26|Action3       |    1|
|Donald            |2018-03-26|Action3       |   29|
|Hillary           |2018-03-24|Action1       |    4|
|Hillary           |2018-03-26|Action2       |    2|
+------------------+----------+--------------+-----+

I want to split the counts for Action1 / Action2 / Action3 into separate columns, i.e. convert it into another DataFrame like this:

+------------------+----------+-------------+-------------+-------------+
|              user|        dt|action1_count|action2_count|action3_count|
+------------------+----------+-------------+-------------+-------------+
|Albert            |2018-03-24|           19|            0|            0|
|Albert            |2018-03-25|            1|            0|            0|
|Albert            |2018-03-26|            6|            0|            0|
|Barack            |2018-03-26|            0|            3|            0|
|Barack            |2018-03-26|            0|            0|            1|
|Donald            |2018-03-26|            0|            0|           29|
|Hillary           |2018-03-24|            4|            0|            0|
|Hillary           |2018-03-26|            0|            2|            0|
+------------------+----------+-------------+-------------+-------------+

Since I'm new to Spark, my attempt to achieve this was quite clumsy:

  • get 3 new DFs by filtering the original on each "action" value
  • take the "count" column from each new DF as the per-action count
  • join the original DF with each of the new DFs

The code I tried looks like this:

val a1 = originalDf.filter("action = 'Action1'")
val df1 = originalDf.as('o)
  .join(a1,
        ($"o.user" === $"a1.user" && $"o.dt" === $"a1.dt"), 
        "left_outer")
  .select($"o.user", $"o.dt", $"a1.count".as("action1_count"))

Then do the same for Action2 / Action3, and join the results.

However, even at this stage I've already run into several problems with this approach:

  1. It doesn't work at all - I mean it fails with an error whose cause I don't understand: org.apache.spark.sql.AnalysisException: cannot resolve 'o.user' given input columns: [user, dt, action, count, user, dt, action, count];

  2. Even if it did succeed, I assume I would end up with nulls where I need zeros.

  3. I feel there should be a better way to achieve this, like some map construct or something. But at the moment I can't figure out the transformations needed to turn the first DataFrame into the second one.
  4. So right now I don't have a working solution at all, and I would greatly appreciate any advice.
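
For reference, the join-based attempt above can be made to resolve by actually giving the filtered DataFrame the "a1" alias, and the nulls from the left outer join can be coalesced to zero - a minimal sketch, assuming a spark-shell session with the originalDf above in scope:

```scala
import org.apache.spark.sql.functions.{coalesce, lit}

// In the attempt above, a1 was never aliased, so $"a1.user" could not resolve.
val a1 = originalDf.filter("action = 'Action1'").as("a1")

val df1 = originalDf.as("o")
  .join(a1, $"o.user" === $"a1.user" && $"o.dt" === $"a1.dt", "left_outer")
  // coalesce replaces the nulls produced by the left outer join with 0
  .select($"o.user", $"o.dt",
          coalesce($"a1.count", lit(0)).as("action1_count"))
```

This still has to be repeated for Action2 / Action3 and the three results joined again, which quickly becomes unwieldy.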

    UPD: I may also get a DF that doesn't contain all 3 possible "action" values, e.g.

    +------------------+----------+--------------+-----+
    |              user|        dt|        action|count|
    +------------------+----------+--------------+-----+
    |Albert            |2018-03-24|Action1       |   19|
    |Albert            |2018-03-25|Action1       |    1|
    |Albert            |2018-03-26|Action1       |    6|
    |Hillary           |2018-03-24|Action1       |    4|
    +------------------+----------+--------------+-----+
    

    For those, I still need the resulting DF with all 3 count columns:

    +------------------+----------+-------------+-------------+-------------+
    |              user|        dt|action1_count|action2_count|action3_count|
    +------------------+----------+-------------+-------------+-------------+
    |Albert            |2018-03-24|           19|            0|            0|
    |Albert            |2018-03-25|            1|            0|            0|
    |Albert            |2018-03-26|            6|            0|            0|
    |Hillary           |2018-03-24|            4|            0|            0|
    +------------------+----------+-------------+-------------+-------------+
    

2 Answers:

Answer 0 (score: 2)

You can avoid multiple joins by using when to select the appropriate column values. As for your join, I don't actually think it should throw an exception like cannot resolve 'o.user'; you may want to double-check your code.

import org.apache.spark.sql.functions.{lit, when}

val df = Seq(("Albert","2018-03-24","Action1",19),
("Albert","2018-03-25","Action1",1),
("Albert","2018-03-26","Action1",6),
("Barack","2018-03-26","Action2",3),
("Barack","2018-03-26","Action3",1),
("Donald","2018-03-26","Action3",29),
("Hillary","2018-03-24","Action1",4),
("Hillary","2018-03-26","Action2",2)).toDF("user", "dt", "action", "count")

val df2 = df.withColumn("count1", when($"action" === "Action1", $"count").otherwise(lit(0))).
withColumn("count2", when($"action" === "Action2", $"count").otherwise(lit(0))).
withColumn("count3", when($"action" === "Action3", $"count").otherwise(lit(0)))

+-------+----------+-------+-----+------+------+------+
|user   |dt        |action |count|count1|count2|count3|
+-------+----------+-------+-----+------+------+------+
|Albert |2018-03-24|Action1|19   |19    |0     |0     |
|Albert |2018-03-25|Action1|1    |1     |0     |0     |
|Albert |2018-03-26|Action1|6    |6     |0     |0     |
|Barack |2018-03-26|Action2|3    |0     |3     |0     |
|Barack |2018-03-26|Action3|1    |0     |0     |1     |
|Donald |2018-03-26|Action3|29   |0     |0     |29    |
|Hillary|2018-03-24|Action1|4    |4     |0     |0     |
|Hillary|2018-03-26|Action2|2    |0     |2     |0     |
+-------+----------+-------+-----+------+------+------+
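
To match the exact target schema from the question, the helper columns can then be renamed and the originals dropped - a small follow-up sketch on the df2 above:

```scala
// Drop the original action/count columns and rename the helper columns
// to the action?_count names from the question.
val result = df2
  .drop("action", "count")
  .withColumnRenamed("count1", "action1_count")
  .withColumnRenamed("count2", "action2_count")
  .withColumnRenamed("count3", "action3_count")
```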

Answer 1 (score: 1)

Here's one way using pivot and first, with the advantage of not needing to know what the action values are:

import org.apache.spark.sql.functions.first

val df = Seq(
  ("Albert", "2018-03-24", "Action1", 19),
  ("Albert", "2018-03-25", "Action1", 1),
  ("Albert", "2018-03-26", "Action1", 6),
  ("Barack", "2018-03-26", "Action2", 3),
  ("Barack", "2018-03-26", "Action3", 1),
  ("Donald", "2018-03-26", "Action3", 29),
  ("Hillary", "2018-03-24", "Action1", 4),
  ("Hillary", "2018-03-26", "Action2", 2)
).toDF("user", "dt", "action", "count")

val pivotDF = df.groupBy("user", "dt", "action").pivot("action").agg(first($"count")).
  na.fill(0).
  orderBy("user", "dt", "action")

// +-------+----------+-------+-------+-------+-------+
// |   user|        dt| action|Action1|Action2|Action3|
// +-------+----------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1|     19|      0|      0|
// | Albert|2018-03-25|Action1|      1|      0|      0|
// | Albert|2018-03-26|Action1|      6|      0|      0|
// | Barack|2018-03-26|Action2|      0|      3|      0|
// | Barack|2018-03-26|Action3|      0|      0|      1|
// | Donald|2018-03-26|Action3|      0|      0|     29|
// |Hillary|2018-03-24|Action1|      4|      0|      0|
// |Hillary|2018-03-26|Action2|      0|      2|      0|
// +-------+----------+-------+-------+-------+-------+
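
If the exact column names from the question are needed, the pivoted columns can be renamed afterwards - a sketch over the pivotDF above, assuming the three known action names:

```scala
// Rename each pivoted ActionN column to the actionN_count naming
// used in the question, and drop the now-redundant action column.
val renamed = List("Action1", "Action2", "Action3")
  .foldLeft(pivotDF)((acc, c) =>
    acc.withColumnRenamed(c, c.toLowerCase + "_count"))
  .drop("action")
```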

[UPDATE]

Per the comments, if you need to create more Action? columns than are present in the action column, you can iterate over the missing Action?s and add them as zero-filled columns:

val fullActionList = List("Action1", "Action2", "Action3", "Action4", "Action5")

val missingActions = fullActionList.diff(
  pivotDF.select($"action").as[String].collect.toList.distinct
)
// missingActions: List[String] = List(Action4, Action5)

missingActions.foldLeft( pivotDF )( _.withColumn(_, lit(0)) ).
show

// +-------+----------+-------+-------+-------+-------+-------+-------+
// |   user|        dt| action|Action1|Action2|Action3|Action4|Action5|
// +-------+----------+-------+-------+-------+-------+-------+-------+
// | Albert|2018-03-24|Action1|     19|      0|      0|      0|      0|
// | Albert|2018-03-25|Action1|      1|      0|      0|      0|      0|
// | Albert|2018-03-26|Action1|      6|      0|      0|      0|      0|
// | Barack|2018-03-26|Action2|      0|      3|      0|      0|      0|
// | Barack|2018-03-26|Action3|      0|      0|      1|      0|      0|
// | Donald|2018-03-26|Action3|      0|      0|     29|      0|      0|
// |Hillary|2018-03-24|Action1|      4|      0|      0|      0|      0|
// |Hillary|2018-03-26|Action2|      0|      2|      0|      0|      0|
// +-------+----------+-------+-------+-------+-------+-------+-------+
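
Alternatively, pivot also accepts an explicit list of values, which creates the full set of columns up front (covering the missing-action case without the foldLeft) and spares Spark the extra pass it otherwise needs to discover the distinct values - a sketch, assuming the same df and fullActionList:

```scala
// Passing the value list to pivot guarantees one column per action,
// even for actions absent from the data.
val fullPivot = df.groupBy("user", "dt", "action")
  .pivot("action", fullActionList)
  .agg(first($"count"))
  .na.fill(0)
```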