I'm new to Spark, and I'm trying to apply groupby and count to my DataFrame df on the users column.
import pandas as pd

comments = [(1, "Hi I heard about Spark"),
            (1, "Spark is awesome"),
            (2, None),
            (2, "And I don't know why..."),
            (3, "Blah blah")]
df = pd.DataFrame(comments)
df.columns = ["users", "comments"]
This is what it looks like in pandas:
   users                 comments
0      1   Hi I heard about Spark
1      1         Spark is awesome
2      2                     None
3      2  And I don't know why...
4      3                Blah blah
I want to find the PySpark equivalent of the following pandas code:
df.groupby(['users'])['users'].transform('count')
which outputs:
0    2
1    2
2    2
3    2
4    1
dtype: int64
Can you help me figure out how to do this in PySpark?
Answer 0 (score: 1)
This should work in PySpark: df.groupBy('users').count(). What pandas spells groupby() is groupBy() in PySpark.
The PySpark docs are pretty easy reading, with some good examples.
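For concreteness, here is a minimal sketch of that approach. The SparkSession setup is an assumption on my part (the question only builds a pandas DataFrame, which must first be converted to a Spark one):

from pyspark.sql import SparkSession

# assumed SparkSession; not part of the original post
spark = SparkSession.builder.getOrCreate()

# the question builds df with pandas, so convert it to a Spark DataFrame first
df = spark.createDataFrame(df)

# one row per user, with that user's row count
df.groupBy('users').count().show()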
Update:
Now that I understand the requirement: PySpark does not currently support transform. See this answer. But you can achieve the same result with a join.
# df here must be a Spark DataFrame, not the pandas one from the question
df2 = df.groupby('users').count()
df.join(df2, df.users == df2.users, "left") \
  .drop(df2.users).drop(df.comments).show()
+-----+-----+
|users|count|
+-----+-----+
|    1|    2|
|    1|    2|
|    3|    1|
|    2|    2|
|    2|    2|
+-----+-----+
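For what it's worth, the same per-row count can also be produced without a join by using a window function. This is a different technique from the join above, sketched here under the same assumption that df is a Spark DataFrame:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# count rows within each users partition, keeping one output row per
# input row, which matches the shape of pandas' transform('count')
df.select(F.count('users').over(Window.partitionBy('users')).alias('count')).show()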