Drop duplicate array structs by the last item of the array struct in a Spark DataFrame

Time: 2017-08-02 10:30:00

Tags: scala apache-spark apache-spark-sql

So my table looks like this:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(2024,201,310)| 21
    a     | NY  | b        |(2010,304,312)| 76
    c     | NY  | x        |(2010,304,310)| 11
    a     | NY  | b        |(453,131,235) | 10

I have tried the following, but it does not remove the duplicates, because the earlier array struct is still present (as it should be; I need it for the final result):

val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
      .groupBy(col("customer_1"), col("place"), col("customer_2"))
      .agg(max("vs").alias("vs"))
      .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I want to group by the customer_1, place and customer_2 columns and return only the array structs whose last item (-1) is unique, keeping the one with the highest count. Any ideas? (One possible approach is sketched after the expected output below.)

Expected output:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,312)| 76
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(453,131,235) | 10
    c     | NY  | x        |(2010,304,310)| 11
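
For reference, one way to produce this output, not taken from the thread below, is to group by the customer columns plus the array's last element and keep the struct with the highest count per group. A minimal sketch, assuming Spark 2.x and that count can be cast to an integer:

import org.apache.spark.sql.functions._

// sketch only: key each group by the customer columns plus the array's last element,
// then let max() over a (count, item) struct keep the highest-count row per key
val result = df_one
  .withColumn("last_item", col("item").getItem(size(col("item")) - 1))
  .withColumn("vs", struct(col("count").cast("int").as("count"), col("item")))  // the cast is an assumption; count may already be numeric
  .groupBy(col("customer_1"), col("place"), col("customer_2"), col("last_item"))
  .agg(max(col("vs")).alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

This relies on max() comparing struct fields left to right, so putting count first makes the aggregation keep the row with the highest count for each distinct last item.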

1 Answer:

Answer 0 (score: 1)

Given that the schema of the dataframe is

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- count: string (nullable = true)

you can apply the concat function to create a temp column for checking duplicate rows, as below:

import org.apache.spark.sql.functions._

// the $-column syntax assumes spark.implicits._ is in scope (as it is in spark-shell);
// the temp column keys each row by the grouping columns plus the array's last element
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
    .dropDuplicates("temp")
    .drop("temp")

You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+

Struct

Given that the schema of the dataframe is now as below, we can still do the same as above, with only a slight change in how the third item is taken from the struct (see the sketch after the schema):

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: integer (nullable = false)
 |-- count: string (nullable = true)
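
A minimal sketch of that change, assuming the same concat/dropDuplicates approach as above and the _3 field name from this schema:

import org.apache.spark.sql.functions._

// sketch only: the dedup key now reads the struct field _3 instead of indexing into an array;
// as in the array version above, the integer value is concatenated into the string key
df.withColumn("temp", concat(col("customer_1"), col("place"), col("customer_2"), col("item._3")))
  .dropDuplicates("temp")
  .drop("temp")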

Hope the answer is helpful.