Replace Dataframe column values using a condition on another column

Date: 2018-03-12 02:18:54

Tags: pyspark

I am currently creating a new column from another column's value:
targetDf = resultDataFrame.withColumn("weekday",psf.when(resultDataFrame["day"] == 0 , 'MON')
    .when(resultDataFrame["day"] == 1 , 'TUE')
    .when(resultDataFrame["day"] == 2 , 'WED')
    .when(resultDataFrame["day"] == 3 , 'THU')
    .when(resultDataFrame["day"] == 4 , 'FRI')
    .when(resultDataFrame["day"] == 5 , 'SAT')
    .otherwise('SUN'))      

I would like to simplify this to something like:
dayList = ['SUN' , 'MON' , 'TUE' , 'WED' , 'THR' , 'FRI' , 'SAT']
resultDataFrame.withColumn("weekday" , dayList[resultDataFrame.day])

But I get an error saying the index must be an integer, not a Column. Is there another option?

1 Answer:

Answer 0 (score: 2)

Sample data:


df = spark.createDataFrame([[0], [3], [5]], ['day'])
df.show()
+---+
|day|
+---+
|  0|
|  3|
|  5|
+---+

Use reduce to build a chained column expression:

import pyspark.sql.functions as F
from functools import reduce

df.withColumn('weekday', reduce(lambda col, i: col.when(df.day == i, dayList[i]), range(7), F)).show()
+---+-------+
|day|weekday|
+---+-------+
|  0|    SUN|
|  3|    WED|
|  5|    FRI|
+---+-------+

The column expression generated by reduce is:

reduce(lambda col, i: col.when(df.day == i, dayList[i]), range(7), F)
# Column<b'CASE WHEN (day = 0) THEN SUN WHEN (day = 1) THEN MON WHEN (day = 2) THEN TUE WHEN (day = 3) THEN WED WHEN (day = 4) THEN THR WHEN (day = 5) THEN FRI WHEN (day = 6) THEN SAT END'>

Or make a udf: