Hello, I have a dataset that groups all products by Transaction_ID, and I want to exclude every Transaction_ID that has fewer than two products. To do that, I am using this:
val edges = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID"))).count().filter("count >= 2")
But when I run it, I get this error:
<console>:37: error: value filter is not a member of Long
How can I solve this?
Thank you very much!
Answer 0: (score: 0)
You can try the following.
val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc")).toDF("Transaction_ID", "Product_ID")
df.show
+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|       aaa|
|          tx-2|       bbb|
|          tx-1|       ccc|
|          tx-4|       ccc|
+--------------+----------+
If you only want the Transaction_ID, you can use:
val df4 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
df4.show
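The snippet above works where the original one fails because Spark has two different `count` methods. The following is a minimal sketch of that difference, assuming a standalone script with a local SparkSession; the session setup and the names `totalRows`, `perTransaction`, and `atLeastTwo` are illustrative and not part of the original post. In spark-shell, `spark` already exists, so the two session lines can be skipped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("count-types").getOrCreate()
import spark.implicits._

val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc"))
  .toDF("Transaction_ID", "Product_ID")

// count() on a DataFrame is an action: it returns the number of rows as a Long.
// A Long has no filter method, hence "value filter is not a member of Long".
val totalRows: Long = df.count()

// count() directly after groupBy returns a DataFrame
// with the grouping column plus a "count" column.
val perTransaction: DataFrame = df.groupBy(col("Transaction_ID")).count()

// Because this is still a DataFrame, it can be filtered.
val atLeastTwo: DataFrame = perTransaction.filter(col("count") >= 2)
```

In the question, `count()` is called after `agg(...).withColumn(...)`, so it is the first kind and the chain ends in a `Long`.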
If you want both Transaction_ID and Product_ID, then:
val df1 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
val df2 = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID")))
val df3 = df1.join(df2, df1("Transaction_ID") === df2("Transaction_ID"), "inner").select(df2("Transaction_ID"), df2("Product_ID"))
df3.show
+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|   aaa,ccc|
+--------------+----------+
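As a possible alternative to the join, the same result can probably be obtained in a single aggregation by filtering on the size of the collected list. This is only a sketch under the same assumptions as the sample data above; the intermediate column name `products` and the local SparkSession setup are illustrative and not from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, size}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("tx-filter").getOrCreate()
import spark.implicits._

val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc"))
  .toDF("Transaction_ID", "Product_ID")

val edges = df.groupBy(col("Transaction_ID"))
  // Collect the products of each transaction into an array column.
  .agg(collect_list(col("Product_ID")) as "products")
  // Keep only transactions with at least two products.
  .filter(size(col("products")) >= 2)
  // Join the array into a comma-separated string, as in the question.
  .withColumn("Product_ID", concat_ws(",", col("products")))
  .drop("products")

edges.show()
```

Note that `collect_list` does not guarantee the order of the collected values, so the products may come out as `ccc,aaa` rather than `aaa,ccc`.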