I have a PySpark DataFrame that looks like this:
Subscription_id Subscription parameters
5516 ["'catchupNotificationsEnabled': True","'newsNotificationsEnabled': True","'autoDownloadsEnabled': False"]
I need the output DataFrame to be:
Subscription_id catchupNotificationsEnabled newsNotificationsEnabled autoDownloadsEnabled
5516 True True False
How can I achieve this in PySpark? I have tried several approaches using UDFs, but none of them worked.
Any help would be greatly appreciated.
Answer 0 (score: 1)
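You can use something like the following. It assumes Subscription_parameters is a string column, and because it parses only the first collected row and cross-joins the result, it works as written only for a single-row DataFrame: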
>>> df.show()
+---------------+-----------------------+
|Subscription_id|Subscription_parameters|
+---------------+-----------------------+
| 5516| ["'catchupNotific...|
+---------------+-----------------------+
>>>
>>> df1 = df.select('Subscription_id')
>>>
>>> data = df.select('Subscription_parameters').rdd.map(list).collect()
>>> data = [i[0][1:-1].split(',') for i in data]
>>> data = {i.split(':')[0][2:-1]:i.split(':')[1].strip()[:-1] for i in data[0]}
>>>
>>> df2 = spark.createDataFrame(sc.parallelize([data]))
>>>
>>> df3 = df1.crossJoin(df2)
>>>
>>> df3.show()
+---------------+--------------------+---------------------------+------------------------+
|Subscription_id|autoDownloadsEnabled|catchupNotificationsEnabled|newsNotificationsEnabled|
+---------------+--------------------+---------------------------+------------------------+
| 5516| False| True| True|
+---------------+--------------------+---------------------------+------------------------+
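For more than one row, the same parsing could be applied per row rather than to data[0] only. The following is a minimal sketch, not part of the original answer: parse_parameters and flatten_subscriptions are hypothetical helper names, and it assumes every row carries the same set of keys so that Spark can infer one schema.

```python
import ast


def parse_parameters(raw):
    """Turn the raw parameters value into a dict such as {'autoDownloadsEnabled': False}.

    Accepts either the string form shown in the question or an already-split
    list of strings (what an ArrayType column yields on the Python side).
    """
    items = ast.literal_eval(raw) if isinstance(raw, str) else list(raw)
    parsed = {}
    for item in items:
        # Each item looks like "'name': True", i.e. the inside of a dict literal.
        parsed.update(ast.literal_eval("{" + item + "}"))
    return parsed


def flatten_subscriptions(df):
    """Hypothetical helper: parse each row on the executors instead of collecting to the driver."""
    from pyspark.sql import Row  # imported lazily so parse_parameters can be tried without Spark

    return df.rdd.map(
        lambda r: Row(Subscription_id=r["Subscription_id"],
                      **parse_parameters(r["Subscription_parameters"]))
    ).toDF()


# The parsing step on its own, using the value from the question:
raw = '''["'catchupNotificationsEnabled': True","'newsNotificationsEnabled': True","'autoDownloadsEnabled': False"]'''
parsed = parse_parameters(raw)
```

With a live session, flatten_subscriptions(df).show() would be the counterpart of df3.show() above. Using ast.literal_eval keeps the values as real booleans instead of the strings 'True' and 'False'.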
Answer 1 (score: 0)
Let's assume your Subscription_parameters column is an ArrayType().
from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql import SparkSession
# Get (or create) a SparkSession; createDataFrame is a SparkSession method, not a SparkContext one
spark = SparkSession.builder.getOrCreate()
First, create the DataFrame:
df = spark.createDataFrame([Row(Subscription_id=5516,
    Subscription_parameters=["'catchupNotificationsEnabled': True",
    "'newsNotificationsEnabled': True", "'autoDownloadsEnabled': False"])])
Split the array into three columns by simple indexing:
df = df.select("Subscription_id",
F.col("Subscription_parameters")[0].alias("catchupNotificationsEnabled"),
F.col("Subscription_parameters")[1].alias("newsNotificationsEnabled"),
F.col("Subscription_parameters")[2].alias("autoDownloadsEnabled"))
Your DataFrame is now split correctly, and each new column contains a string such as "'catchupNotificationsEnabled': True":
+---------------+---------------------------+------------------------+--------------------+
|Subscription_id|catchupNotificationsEnabled|newsNotificationsEnabled|autoDownloadsEnabled|
+---------------+---------------------------+------------------------+--------------------+
| 5516| 'catchupNotificat...| 'newsNotification...|'autoDownloadsEna...|
+---------------+---------------------------+------------------------+--------------------+
Next, I suggest updating each column's value by checking whether it contains "True":
df = df.withColumn('catchupNotificationsEnabled',
F.when(F.col("catchupNotificationsEnabled").contains("True"), True).otherwise(False))\
.withColumn('newsNotificationsEnabled',
F.when(F.col("newsNotificationsEnabled").contains("True"), True).otherwise(False))\
.withColumn('autoDownloadsEnabled',
F.when(F.col("autoDownloadsEnabled").contains("True"), True).otherwise(False))
The resulting DataFrame is as expected:
+---------------+---------------------------+------------------------+--------------------+
|Subscription_id|catchupNotificationsEnabled|newsNotificationsEnabled|autoDownloadsEnabled|
+---------------+---------------------------+------------------------+--------------------+
| 5516| true| true| false|
+---------------+---------------------------+------------------------+--------------------+
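The index-then-test steps above could also be generated from a list of names instead of being written out once per column. This is only a sketch under assumptions: build_select_exprs is a hypothetical helper, the expressions are meant for DataFrame.selectExpr on the DataFrame as first created (before it is reassigned), and, like the answer, it relies on the parameters always appearing in the same order.

```python
def build_select_exprs(names, array_col="Subscription_parameters", id_col="Subscription_id"):
    """Build Spark SQL expressions: element i of the array becomes a boolean column names[i]."""
    exprs = [id_col]
    for i, name in enumerate(names):
        # LIKE '%True%' mirrors the contains("True") check used in the answer.
        exprs.append(f"{array_col}[{i}] LIKE '%True%' AS {name}")
    return exprs


names = ["catchupNotificationsEnabled", "newsNotificationsEnabled", "autoDownloadsEnabled"]
exprs = build_select_exprs(names)

# With a live DataFrame `df` that still has the array column:
# result = df.selectExpr(*exprs)
```

One difference worth noting: a null array element makes the LIKE expression null, whereas the when/otherwise form in the answer turns it into False.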
PS: If the column is not an ArrayType(), you may need to modify this code slightly. See this question for an example.
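If the column turns out to be a plain string, one possible adjustment is to convert it to an array first. The value shown in the question happens to be a valid JSON array of strings, so from_json may be enough. The sketch below is an assumption rather than the linked question's method: string_column_to_array is a hypothetical helper, and it presumes a Spark version whose from_json accepts an "array<string>" schema.

```python
import json

RAW = '''["'catchupNotificationsEnabled': True","'newsNotificationsEnabled': True","'autoDownloadsEnabled': False"]'''

# Plain-Python check that the string form parses as a JSON array of strings.
items = json.loads(RAW)


def string_column_to_array(df, column="Subscription_parameters"):
    """Hypothetical helper: replace a JSON-array string column with an array<string> column."""
    from pyspark.sql import functions as F  # imported lazily; needs a Spark installation

    return df.withColumn(column, F.from_json(F.col(column), "array<string>"))
```

After string_column_to_array(df), the indexing code above should apply unchanged.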