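In PySpark, I am trying to remove the rows of customers whose row count, grouped by CustomerID, is less than 10. So I first collect the CustomerIDs whose count is < 10, then filter the DataFrame to keep only the rows whose CustomerID is not in that removal list. However, I get a Py4JJavaError. Can anyone help me understand how to do this correctly?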
rm_user_1 = cleaned_df.groupBy('CustomerID').count().withColumnRenamed("count", "n").filter("n < 10").select('CustomerID').collect()
cleaned_df = cleaned_df.filter(~cleaned_df.CustomerID.isin(rm_user_1))
Answer 0 (score: 1)
rm_user_1 = cleaned_df.groupBy('CustomerID').count().withColumnRenamed("count", "n").filter("n < 10").select('CustomerID').collect()
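The variable rm_user_1 is a list of Row objects (that is what collect() returns), not a list of plain values, so isin cannot use it directly. You need to access the CustomerID value inside each row. A list comprehension is enough: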
rm_users = [x.CustomerID for x in rm_user_1]
cleaned_df = cleaned_df.filter(~cleaned_df.CustomerID.isin(rm_users))