Update a PySpark DF column based on an array in another column

Asked: 2016-04-22 16:39:34

Tags: python apache-spark dataframe pyspark apache-spark-sql

Here is my PySpark dataframe schema:

root
 |-- user: string (nullable = true)
 |-- table: string (nullable = true)
 |-- changeDate: string (nullable = true)
 |-- fieldList: string (nullable = true)
 |-- id: string (nullable = true)
 |-- value2: integer (nullable = false)
 |-- value: double (nullable = false)
 |-- name: string (nullable = false)
 |-- temp: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- num_cols_changed: integer (nullable = true)

Data in the dataframe:

+--------+-----+--------------------+--------------------+------+------+-----+----+--------------------+----------------+
|    user|table|          changeDate|           fieldList|     id|value2|value|name|                temp|num_cols_changed|
+--------+-----+--------------------+--------------------+------+------+-----+----+--------------------+----------------+
| user11 | TAB1| 2016-01-24 19:10...|         value2 = 100|555555|   200|  0.5| old|      [value2 = 100]|               1|
| user01 | TAB1| 2015-12-31 13:12...|value = 0.34,name=new|  1111|   200|  0.5| old|[value = 0.34,  n...|               2|
+--------+-----+--------------------+--------------------+------+------+-----+----+--------------------+----------------+

I want to read the array in the temp column and, based on its values, update columns in the dataframe. For example, in the first row only one column was changed, value2, so I want to update df.value2 with the new value 100. Likewise, in the next row two columns were changed, so I need to extract value and name together with their new values and update the corresponding columns. The output should therefore be:

+--------+-----+--------------------+------+------+-----+----+
|    user|table|          changeDate|    id|value2|value|name|
+--------+-----+--------------------+------+------+-----+----+
| user11 | TAB1| 2016-01-24 19:10...|555555|   100|  0.5| old|
| user01 | TAB1| 2015-12-31 13:12...|  1111|   200| 0.34| new|
+--------+-----+--------------------+------+------+-----+----+

I want to keep performance in mind, so I would prefer an approach that sticks to DataFrames, but if there is no other option I can also go the RDD route. Basically, I don't know how to process multiple values per row and then compare them. I know I can compare column names with column in df.columns, but doing that per row against an array is what confuses me. Any help or new ideas are appreciated.
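For reference, the per-row parsing this calls for can be sketched in plain Python; wrapped in a UDF it would turn each fieldList string into a column-to-value mapping (the helper name here is hypothetical, not part of the question's code):

```python
def parse_field_list(field_list):
    """Parse a string like 'value = 0.34,name=new' into a dict of
    column name -> new value (values kept as strings)."""
    updates = {}
    for pair in field_list.split(","):
        key, _, val = pair.partition("=")
        updates[key.strip()] = val.strip()
    return updates

# Example from the second row of the dataframe:
parse_field_list("value = 0.34,name=new")
# {'value': '0.34', 'name': 'new'}
```

The remaining difficulty, which the answer below addresses, is applying such a mapping back onto the right columns without leaving the DataFrame API.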

1 answer:

Answer 0 (score: 0)

Here is how I solved this using explode:

from pyspark.sql.functions import split, explode

df = df.withColumn('temp', split(df.fieldList, ','))
df = df.withColumn('cols', explode(df.temp))
df = df.withColumn('col_value', split(df.cols, '='))
df = df.withColumn('deltaCol', df.col_value[0]) \
       .withColumn('deltaValue', df.col_value[1])

The final output of the above (after dropping the irrelevant columns) is:

+------+-----+--------+--------------------+--------+----------+
|    id|table|    user|          changeDate|deltaCol|deltaValue|
+------+-----+--------+--------------------+--------+----------+
|555555| TAB2| user11 | 2016-01-24 19:10...| value2 |       100|
|  1111| TAB1| user01 | 2015-12-31 13:12...|  value |      0.34|
|  1111| TAB1| user01 | 2015-12-31 13:12...|   name | 'newName'|
+------+-----+--------+--------------------+--------+----------+

After this, I registered it as a table and ran a SQL query to pivot the data:

>>> res = sqlContext.sql("select id, table, user, changeDate, max(value2) as value2, max(value) as value, max(name) as name \
... from (select id, table, user, changeDate, case when trim(deltaCol) == 'value2' then deltaValue else Null end value2,\
... case when trim(deltaCol) == 'value' then deltaValue else Null end value,\
... case when trim(deltaCol) == 'name' then deltaValue else Null end name from delta) t group by id, table, user, changeDate")

The result is:

+------+-----+--------+--------------------+------+-----+----------+
|    id|table|    user|          changeDate|value2|value|      name|
+------+-----+--------+--------------------+------+-----+----------+
|555555| TAB2| user11 | 2016-01-24 19:10...|   100| null|      null|
|  1111| TAB1| user01 | 2015-12-31 13:12...|  null| 0.34| 'newName'|
+------+-----+--------+--------------------+------+-----+----------+
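Note that this pivoted result still carries nulls where a column was not changed; to reach the output asked for in the question, each updated column would still need to be coalesced with the original value (e.g. by joining back on id and applying coalesce per column). A pure-Python illustration of that per-column rule, with illustrative names:

```python
def apply_updates(original, updates):
    """For each column, take the updated value when one exists,
    otherwise fall back to the original (the SQL coalesce rule)."""
    return {col: updates[col] if updates.get(col) is not None else val
            for col, val in original.items()}

row = {"value2": 200, "value": 0.5, "name": "old"}
apply_updates(row, {"value2": 100, "value": None, "name": None})
# {'value2': 100, 'value': 0.5, 'name': 'old'}
```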

To make the code reusable across different tables, I prepare the list of aggregation columns from the master DF (my final target table):

>>> string = [(", max(" + c + ") as " + c) for c in masterDF.columns]
>>> string = "".join(string)
>>> string
', max(id) as id, max(value) as value, max(name) as name, max(value2) as value2'
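Splicing that generated fragment into the outer query might look like the sketch below. This is plain Python; the hard-coded column list stands in for masterDF.columns, and the table name delta matches the one registered above:

```python
# Stand-in for masterDF.columns (hypothetical list, for illustration only).
master_columns = ["id", "value", "name", "value2"]

# Same construction as in the answer: one ", max(c) as c" per target column.
agg_fragment = "".join(", max({0}) as {0}".format(c) for c in master_columns)

# Embed the fragment after the grouping keys of the outer select.
query = ("select user, changeDate" + agg_fragment +
         " from delta group by user, changeDate")
```

The leading comma in each generated piece means the fragment can be appended directly after the last fixed column in the select list.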