PySpark error when filtering rows

Time: 2017-07-23 05:34:11

Tags: python apache-spark pyspark pyspark-sql

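In PySpark, I am trying to remove the rows of customers whose row count, grouped by CustomerID, is less than 10. So I first get the CustomerIDs with a count < 10, then filter the DataFrame to keep only the rows whose CustomerID is not in that removal list. But I get a Py4JJavaError. Can anyone help me understand how to do this correctly?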

rm_user_1 = cleaned_df.groupBy('CustomerID').count().withColumnRenamed("count", "n").filter("n < 10").select('CustomerID').collect()

cleaned_df = cleaned_df.filter(~cleaned_df.CustomerID.isin(rm_user_1))

1 Answer:

Answer 0: (score: 1)

rm_user_1 = cleaned_df.groupBy('CustomerID').count().withColumnRenamed("count", "n").filter("n < 10").select('CustomerID').collect()

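The variable rm_user_1 is a list of Row objects, not a list of plain values. You need to access the CustomerID value inside each Row. A list comprehension is enough: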

rm_users = [x.CustomerID for x in rm_user_1]
cleaned_df = cleaned_df.filter(~cleaned_df.CustomerID.isin(rm_users))
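To see what the list comprehension changes, here is a minimal sketch that runs without Spark. It assumes a namedtuple as a stand-in for pyspark.sql.Row (both are tuple subclasses with attribute access to their fields), and the customer IDs are made up for illustration; neither comes from the question.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark;
# like a real Row, it is a tuple subclass with attribute access to its fields.
Row = namedtuple("Row", ["CustomerID"])

# Roughly the shape of what .select('CustomerID').collect() returns:
# a list of Row objects. The IDs below are invented for this example.
rm_user_1 = [Row(CustomerID=12346), Row(CustomerID=12350)]

# Each element is a Row wrapping the value, not the value itself.
print(rm_user_1[0])

# Extract the plain values, as in the answer above, before passing them to isin().
rm_users = [x.CustomerID for x in rm_user_1]
print(rm_users)
```

With a real DataFrame the comprehension is unchanged: rm_users ends up as a plain Python list of IDs, which is what isin() expects.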