Simplify code and reduce the number of join statements on a PySpark DataFrame

Date: 2018-05-15 23:42:15

Tags: apache-spark pyspark

I have a DataFrame in PySpark that looks like the following.

df.show()

+---+-------------+
| id|       device|
+---+-------------+
|  3|      mac pro|
|  1|       iphone|
|  1|android phone|
|  1|   windows pc|
|  1|   spy camera|
|  2|   spy camera|
|  2|       iphone|
|  3|   spy camera|
|  3|         cctv|
+---+-------------+


phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']

from pyspark.sql.functions import col
import pyspark.sql.functions as f

phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")

phones_df.show()

+---+------+
| id|phones|
+---+------+
|  1|     2|
|  2|     1|
+---+------+


pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")

pc_df.show()

+---+---+
| id| pc|
+---+---+
|  1|  1|
|  3|  1|
+---+---+

security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")

security_df.show()

+---+--------+
| id|security|
+---+--------+
|  1|       1|
|  2|       1|
|  3|       2|
+---+--------+

Then I want to do a full outer join on all three DataFrames. I did that as shown below.

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)

final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)

final_df.show()

+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     2|   1|       1|
|  2|     1|null|       1|
|  3|  null|   1|       2|
+---+------+----+--------+

I am able to get what I want, but I would like to simplify my code.

1) I want to create phones_df, pc_df and security_df in a better way, because I am repeating the same code while creating these DataFrames and want to reduce that duplication.
2) I want to simplify the join statements into a single statement.

How can I do this? Can anyone explain?

1 Answer:

Answer 0 (score: 1)

Here is one approach: use when/otherwise to map each device to a category, then pivot on that category to produce the desired output:

import pyspark.sql.functions as F

# Map each device to a category, then pivot the categories into columns,
# counting the devices per id within each category.
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
    F.when(df.device.isin(pc_list), 'pc').otherwise(
    F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()

+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+
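
If you prefer to keep the three separate count DataFrames, the duplicated filter/groupBy blocks and the chained full outer joins can also be collapsed with a small helper and functools.reduce. This is a minimal sketch, assuming the same df, phone_list, pc_list and security_list as above; the category_counts helper and the categories mapping are illustrative names, not part of the original code:

from functools import reduce
import pyspark.sql.functions as F

# Map each output column name to its device list.
categories = {'phones': phone_list, 'pc': pc_list, 'security': security_list}

def category_counts(name, devices):
    # One count-per-id DataFrame per category, replacing the three copies above.
    return (df.filter(F.col('device').isin(devices))
              .groupBy('id')
              .agg(F.count('device').alias(name)))

dfs = [category_counts(name, devices) for name, devices in categories.items()]

# Full outer join all category DataFrames on id; joining on the column name
# keeps a single id column, so no coalesce is needed.
final_df = reduce(lambda left, right: left.join(right, on='id', how='full_outer'), dfs)
final_df.show()

Nulls in either result can be replaced with zeros via final_df.na.fill(0) if a count of 0 is preferred for missing categories. With the pivot approach above, passing the expected values explicitly, e.g. pivot('cat', ['phones', 'pc', 'security']), avoids an extra pass over the data to discover the distinct categories.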