SPARK MLlib - <console>:37: error: value filter is not a member of Long

Asked: 2016-09-25 16:02:28

Tags: apache-spark scala

Hi, I have a dataset that groups all products by Transaction_ID, and I want to exclude the Transaction_IDs that have fewer than two products. To do this, I use the following:

val edges = df.groupBy(col("Transaction_ID"))
  .agg(collect_list(col("Product_ID")) as "Product_ID")
  .withColumn("Product_ID", concat_ws(",", col("Product_ID")))
  .count()
  .filter("count >= 2")

But when I run this, I get this error:

 <console>:37: error: value filter is not a member of Long

How can I fix this?

Thanks a lot!

1 Answer:

Answer 0 (score: 0)

The error occurs because in your chain, count() is called on the DataFrame produced by withColumn. DataFrame.count() is an action that returns a Long, and Long has no filter method. If you instead call count() on the grouped data (groupBy(...).count()), you get back a DataFrame with a count column that you can filter. You can try the following.
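As a minimal illustration of the difference (a hypothetical snippet, not part of the original answer):

df.count()                             // action: returns a Long, which has no filter method
df.groupBy("Transaction_ID").count()   // returns a DataFrame with a "count" column

First, some sample data: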

import org.apache.spark.sql.functions._   // provides col, collect_list, concat_ws, count

val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"), ("tx-4", "ccc")).toDF("Transaction_ID", "Product_ID")
df.show

+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|       aaa|
|          tx-2|       bbb|
|          tx-1|       ccc|
|          tx-4|       ccc|
+--------------+----------+

If you only want the Transaction_ID, you can use:

val df4 =df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
df4.show
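On the sample data above, this should produce:

+--------------+-----+
|Transaction_ID|count|
+--------------+-----+
|          tx-1|    2|
+--------------+-----+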

If you want both Transaction_ID and Product_ID, then:

val df1 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
val df2 = df.groupBy(col("Transaction_ID"))
  .agg(collect_list(col("Product_ID")) as "Product_ID")
  .withColumn("Product_ID", concat_ws(",", col("Product_ID")))
val df3 = df1.join(df2, df1("Transaction_ID") === df2("Transaction_ID"), "inner")
  .select(df2("Transaction_ID"), df2("Product_ID"))
df3.show

+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
|          tx-1|   aaa,ccc|
+--------------+----------+
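As an aside, a sketch of an alternative (not from the original answer; the dfSingle name and the "cnt" alias are made up here): the count can be computed inside the same aggregation, which avoids the join entirely:

val dfSingle = df.groupBy(col("Transaction_ID"))
  .agg(collect_list(col("Product_ID")) as "Product_ID", count("*") as "cnt")  // count rows per group alongside the list
  .filter(col("cnt") >= 2)                                                    // keep groups with at least two products
  .withColumn("Product_ID", concat_ws(",", col("Product_ID")))
  .drop("cnt")
dfSingle.show

This should give the same tx-1 | aaa,ccc row in a single pass over the grouped data.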