Using PySpark to process some Yelp data, I'm trying to count which members are Elite and which are not.
df_Usr2.groupby(['name', 'business_id', 'Elite_Member']).count().sort('business_id', ascending=True).show(50, truncate=False)
Right now, when I build the count, it displays the values stacked top to bottom, as shown below. What I want is to have them side by side, with a separate column holding the non-Elite count.
Currently it looks like this:
[name] [Business_id] [EliteMem] [Count]
a      123           No         5
a      123           Yes        10
I'd like something more like this:
[name] [Business_id] [EliteMem] [NonEliteMem]
a      123           10         5
Answer:
You can use groupby together with pivot and sum (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.pivot).
from pyspark.sql import Row

# Build a small example DataFrame matching the shape of the question's data.
df1 = sqlContext.createDataFrame([
    Row(name="a", business_id=123, elitemem="no", count=5),
    Row(name="a", business_id=123, elitemem="yes", count=10),
])
df1.show()
+-----------+-----+--------+----+
|business_id|count|elitemem|name|
+-----------+-----+--------+----+
| 123| 5| no| a|
| 123| 10| yes| a|
+-----------+-----+--------+----+
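# Pivot the elitemem values into columns and aggregate the counts with sum.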
df1.groupby('name','business_id').pivot('elitemem',['no','yes']).sum('count').show()
+----+-----------+---+---+
|name|business_id| no|yes|
+----+-----------+---+---+
| a| 123| 5| 10|
+----+-----------+---+---+
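One side note not covered in the original answer: if some (name, business_id) group has no rows for one of the pivoted values, that cell comes out as null rather than 0. A minimal sketch, assuming you want zeros instead:
# Cells with no matching rows are null after the pivot; fillna(0) turns them into zeros.
pivoted = df1.groupby('name', 'business_id').pivot('elitemem', ['no', 'yes']).sum('count')
pivoted.fillna(0).show()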
If you're not happy with the resulting columns being named no and yes, you can take this approach instead.
from pyspark.sql import functions as F

# Relabel the flag so the pivoted columns get friendlier names.
df2 = df1.withColumn(
    "elitemem_new",
    F.when(df1["elitemem"] == "yes", "elitemem").otherwise("nonelitemem"),
)
df2.groupby("name", "business_id").pivot("elitemem_new").sum("count").show()
+----+-----------+--------+-----------+
|name|business_id|elitemem|nonelitemem|
+----+-----------+--------+-----------+
| a| 123| 10| 5|
+----+-----------+--------+-----------+
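Applied back to the DataFrame from the question, the same pattern should work with count() in place of sum(), since the question's data has one row per review rather than a precomputed count column. This is a sketch only: df_Usr2 and the columns name, business_id, and Elite_Member are taken from the question, and I'm assuming Elite_Member holds the strings Yes/No.
from pyspark.sql import functions as F

# Relabel Elite_Member so the pivoted columns read EliteMem / NonEliteMem,
# then count the rows in each (name, business_id) group.
df3 = df_Usr2.withColumn(
    "elite_flag",
    F.when(df_Usr2["Elite_Member"] == "Yes", "EliteMem").otherwise("NonEliteMem"),
)
df3.groupby("name", "business_id").pivot("elite_flag", ["EliteMem", "NonEliteMem"]) \
    .count().sort("business_id").show(50, truncate=False)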