Drop duplicate array structs by the last item of the array struct in a Spark DataFrame

Time: 2017-08-02 10:30:00

Tags: scala apache-spark apache-spark-sql

So my table looks like this:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(2024,201,310)| 21
    a     | NY  | b        |(2010,304,312)| 76
    c     | NY  | x        |(2010,304,310)| 11
    a     | NY  | b        |(453,131,235) | 10

I have tried the following, but it does not remove the duplicates, because the earlier array struct is still present (as it should be; I need it for the final result):

val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
      .groupBy(col("customer_1"), col("place"), col("customer_2"))
      .agg(max("vs").alias("vs"))
      .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I want to group by the customer_1, place and customer_2 columns and return only the array structs whose last item (-1) is unique, keeping the one with the highest count. Any ideas? (One possible approach is sketched after the expected output below.)

Expected output:

customer_1|place|customer_2|item          |count
-------------------------------------------------
    a     | NY  | b        |(2010,304,312)| 76
    a     | NY  | b        |(2010,304,310)| 34
    a     | NY  | b        |(453,131,235) | 10
    c     | NY  | x        |(2010,304,310)| 11
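
For reference, one way to produce this output, not taken from the thread below, is to group by the customer columns plus the array's last element and keep the struct with the highest count per group. A minimal sketch, assuming Spark 2.x and that count can be cast to an integer:

import org.apache.spark.sql.functions._

// sketch only: key each group by the customer columns plus the array's last element,
// then let max() over a (count, item) struct keep the highest-count row per key
val result = df_one
  .withColumn("last_item", col("item").getItem(size(col("item")) - 1))
  .withColumn("vs", struct(col("count").cast("int").as("count"), col("item")))  // the cast is an assumption; count may already be numeric
  .groupBy(col("customer_1"), col("place"), col("customer_2"), col("last_item"))
  .agg(max(col("vs")).alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

This relies on max() comparing struct fields left to right, so putting count first makes the aggregation keep the row with the highest count for each distinct last item.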

1 Answer:

Answer 0 (score: 1)

Given that the schema of the dataframe is

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- count: string (nullable = true)

you can apply the concat function to create a temp column for checking duplicate rows, as below:

import org.apache.spark.sql.functions._

// the $-column syntax assumes spark.implicits._ is in scope (as it is in spark-shell);
// the temp column keys each row by the grouping columns plus the array's last element
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
    .dropDuplicates("temp")
    .drop("temp")

You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+

Struct

Given that the schema of the dataframe is now as below, we can still do the same as above, with only a slight change in how the third item is taken from the struct (see the sketch after the schema):

root
 |-- customer_1: string (nullable = true)
 |-- place: string (nullable = true)
 |-- customer_2: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- _1: integer (nullable = false)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: integer (nullable = false)
 |-- count: string (nullable = true)
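
A minimal sketch of that change, assuming the same concat/dropDuplicates approach as above and the _3 field name from this schema:

import org.apache.spark.sql.functions._

// sketch only: the dedup key now reads the struct field _3 instead of indexing into an array;
// as in the array version above, the integer value is concatenated into the string key
df.withColumn("temp", concat(col("customer_1"), col("place"), col("customer_2"), col("item._3")))
  .dropDuplicates("temp")
  .drop("temp")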

Hope the answer is helpful.