我在pyspark
中有一个数据框,如下所示。
df.show()
+---+-------------+
| id| device|
+---+-------------+
| 3| mac pro|
| 1| iphone|
| 1|android phone|
| 1| windows pc|
| 1| spy camera|
| 2| spy camera|
| 2| iphone|
| 3| spy camera|
| 3| cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
| 1| 2|
| 2| 1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
| 1| 1|
| 3| 1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
| 1| 1|
| 2| 1|
| 3| 2|
+---+--------+
然后我想在所有三个数据帧上进行完全外连接。我在下面做过。
full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
Final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 2| 1| 1|
| 2| 1|null| 1|
| 3| null| 1| 2|
+---+------+----+--------+
我能够得到我想要的东西但想要简化我的代码。
1) I want to create phones_df, pc_df, security_df in a better way because I am using the same code while creating these data frames I want to reduce this.
2) I want to simplify the join statements to one statement
我该怎么做?任何人都可以解释一下。
答案 0 :(得分:1)
以下是使用when.otherwise
将列映射到类别,然后将pivot
映射到所需输出的一种方法:
import pyspark.sql.functions as F
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id| pc|phones|security|
+---+----+------+--------+
| 1| 1| 2| 1|
| 3| 1| null| 2|
| 2|null| 1| 1|
+---+----+------+--------+