PySpark - how to separate a specific value in a row into its own column

Date: 2018-12-10 20:34:19

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

Using PySpark to process some Yelp data, I am trying to count members who are elite and those who are not.

df_Usr2.groupby(['name', 'business_id', 'Elite_Member']).count().sort('business_id', ascending=True).show(50, truncate=False)

When I create a count, the results are stacked top to bottom as shown below. What I want instead is to have them side by side, with a separate column holding the count of non-elite members.

Currently it looks like this:

[name]   [Business_id]   [EliteMem] [Count]
   a          123            No        5
   a          123            Yes      10

I would like something more like this:

 [name]   [Business_id]   [EliteMem] [NonEliteMem]
    a          123            10           5

Here's a look at what my dataframe looks like exactly.

1 Answer:

Answer 0 (score: 0)

You can use groupby together with pivot and sum (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.pivot).

from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
        Row(name="a",
            business_id=123,
            elitemem="no",
            count=5),
        Row(name="a",
            business_id=123,
            elitemem="yes",
            count=10)]
        )

df1.show()
+-----------+-----+--------+----+
|business_id|count|elitemem|name|
+-----------+-----+--------+----+
|        123|    5|      no|   a|
|        123|   10|     yes|   a|
+-----------+-----+--------+----+

df1.groupby('name','business_id').pivot('elitemem',['no','yes']).sum('count').show()
+----+-----------+---+---+
|name|business_id| no|yes|
+----+-----------+---+---+
|   a|        123|  5| 10|
+----+-----------+---+---+

If you are concerned that the resulting column names are "no" and "yes", you can take this approach instead.

from pyspark.sql import functions as F
df2 = df1.withColumn("elitemem_new",F.when(df1['elitemem']=="yes","elitemem").otherwise("nonelitemem"))

df2.groupby('name','business_id').pivot('elitemem_new').sum('count').show()
+----+-----------+--------+-----------+
|name|business_id|elitemem|nonelitemem|
+----+-----------+--------+-----------+
|   a|        123|      10|          5|
+----+-----------+--------+-----------+