I have a DataFrame in PySpark that looks like this:
+---+-------------+----+
| id| device| val|
+---+-------------+----+
| 3| mac pro| 1|
| 1| iphone| 2|
| 1|android phone| 2|
| 1| windows pc| 2|
| 1| spy camera| 2|
| 2| spy camera| 3|
| 2| iphone| 3|
| 3| spy camera| 1|
| 3| cctv| 1|
+---+-------------+----+
I want to create columns based on the lists below:
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera']
ucg_list = ['ipad']
This is what I have tried:
from pyspark.sql.functions import col, when, lit
from pyspark.sql.types import IntegerType
df1 = df.withColumn('phones', lit(None).cast(IntegerType())) \
        .withColumn('pc', lit(None).cast(IntegerType())) \
        .withColumn('security', lit(None).cast(IntegerType())) \
        .withColumn('null', lit(None).cast(IntegerType())) \
        .withColumn('ucg', lit(None).cast(IntegerType()))
import pyspark.sql.functions as F
df1.withColumn('cat',
    F.when(df1.device.isin(phone_list), 'phones').otherwise(
    F.when(df1.device.isin(pc_list), 'pc').otherwise(
    F.when(df1.device.isin(security_list), 'security')))
).groupBy('id', 'phones', 'pc', 'security', 'null', 'ucg').pivot('cat').agg(F.count('cat')).show()
The output I am getting:
+---+------+----+--------+----+----+----+---+------+--------+
| id|phones| pc|security|null| ucg|null| pc|phones|security|
+---+------+----+--------+----+----+----+---+------+--------+
| 3| null|null| null|null|null| 0| 1| 0| 1|
| 2| null|null| null|null|null| 0| 0| 1| 1|
| 1| null|null| null|null|null| 0| 1| 2| 1|
+---+------+----+--------+----+----+----+---+------+--------+
What I want is to first create the columns based on the list names and then fill in the values.
Expected output:
+---+------+---+------+--------+----+
| id| ucg| pc|phones|security|null|
+---+------+---+------+--------+----+
| 1| 0| 1| 2| 1| 0|
| 2| 0| 0| 1| 1| 0|
| 3| 0| 1| 0| 1| 1|
+---+------+---+------+--------+----+
How can I get what I want?
EDIT
When I do the following:
df1 = df.withColumn('cat',
    f.when(df.device.isin(phone_list), 'phones').otherwise(
    f.when(df.device.isin(pc_list), 'pc').otherwise(
    f.when(df.device.isin(ucg_list), 'ucg').otherwise(
    f.when(df.device.isin(security_list), 'security')))))
Output:
+---+-------------+---+--------+
| id| device|val| cat|
+---+-------------+---+--------+
| 3| mac pro| 1| pc|
| 3| spy camera| 1|security|
| 3| cctv| 1| null|
| 1| iphone| 2| phones|
| 1|android phone| 2| phones|
| 1| windows pc| 2| pc|
| 1| spy camera| 2|security|
| 2| spy camera| 3|security|
| 2| iphone| 3| phones|
+---+-------------+---+--------+
In the output you can see that id 3 has a null value in the cat column.
Answer 0 (score: 1)
Creating the 'phones', 'pc', 'ucg', 'security' and 'null' columns and populating them with None just for the groupBy makes no sense: grouping by id together with all of those all-null columns is the same as grouping by id alone.
What you can do is find the difference between the target columns and the columns that actually come out of the pivot, then create the missing ones and fill them with 0.
So the following should work for you:
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera']
ucg_list = ['ipad']
from pyspark.sql import functions as f
df = df.withColumn('cat',
    f.when(df.device.isin(phone_list), 'phones').otherwise(
    f.when(df.device.isin(pc_list), 'pc').otherwise(
    f.when(df.device.isin(ucg_list), 'ucg').otherwise(
    f.when(df.device.isin(security_list), 'security'))))
) \
    .groupBy('id').pivot('cat').agg(f.count('val')) \
    .na.fill(0)

# add any target columns the pivot did not produce, filled with 0
columnList = ['phones', 'pc', 'ucg', 'security', 'null']
actualcolumnList = df.columns[1:]
diffColumns = [x for x in columnList if x not in actualcolumnList]
for y in diffColumns:
    df = df.withColumn(y, f.lit(0))

df.show(truncate=False)
which should give you:
+---+----+---+------+--------+---+
|id |null|pc |phones|security|ucg|
+---+----+---+------+--------+---+
|3 |1 |1 |0 |1 |0 |
|1 |0 |1 |2 |1 |0 |
|2 |0 |0 |1 |1 |0 |
+---+----+---+------+--------+---+
I hope the answer is helpful.