Add a new column to a PySpark DataFrame (an alternative to .apply in a pandas DataFrame)

Time: 2018-04-02 12:22:21

Tags: pyspark

 

I have a PySpark DataFrame (pyspark.sql.DataFrame) named df:

id    col1
1       abc
2       bcd
3       lal
4       bac

I want to add another column, flag, to df so that the flag is 'odd' when id is odd and 'even' when id is even.

The final output should be:

id    col1    flag
1       abc    odd
2       bcd    even
3       lal    odd
4       bac    even

I tried:

def myfunc(num):
    if num % 2 == 0:
        flag = 'EVEN' 
    else:
        flag = 'ODD' 
    return flag

df['new_col'] = df['id'].map(lambda x: myfunc(x))
df['new_col'] = df['id'].apply(lambda x: myfunc(x))

It gave me the error: TypeError: 'Column' object is not callable

How can I use .apply in PySpark, the way I do with a pandas DataFrame?

1 answer:

Answer 0: (score: 1)

 

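PySpark DataFrames do not provide .apply. An alternative is the withColumn method, combined with F.when(...).otherwise(...) to express the condition as a column expression: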

from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
    [1,"abc"],
    [2,"bcd"],
    [3,"lal"],
    [4,"bac"]
 ],
 ["id","col1"]
)
df.show()
+---+----+
| id|col1|
+---+----+
|  1| abc|
|  2| bcd|
|  3| lal|
|  4| bac|
+---+----+

# Build the flag from a column expression; lowercase labels match the requested output.
df.withColumn(
    "flag",
    F.when(F.col("id") % 2 == 0, F.lit("even")).otherwise(F.lit("odd"))
).show()

+---+----+----+
| id|col1|flag|
+---+----+----+
|  1| abc| odd|
|  2| bcd|even|
|  3| lal| odd|
|  4| bac|even|
+---+----+----+
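
If the goal is to reuse an existing Python function the way pandas .apply does, a user-defined function (UDF) is probably the closest equivalent. The sketch below is one possible way to do that, not part of the original answer: parity_flag and add_flag_with_udf are hypothetical names chosen for illustration, and it assumes the id column holds integers. As for the error in the question, it most likely arises because df['id'] is a Column, and attribute access such as .map or .apply on a Column yields another Column rather than a method, so calling it fails.

```python
def parity_flag(num):
    """Return 'even' or 'odd' for an integer, or None for a missing value."""
    if num is None:
        return None
    return "even" if num % 2 == 0 else "odd"


def add_flag_with_udf(df, id_col="id", flag_col="flag"):
    """Return a new DataFrame with flag_col computed by parity_flag (hypothetical helper)."""
    # Imported here so parity_flag itself can be used and tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap the plain Python function as a UDF that returns a string column.
    parity_udf = F.udf(parity_flag, StringType())
    return df.withColumn(flag_col, parity_udf(F.col(id_col)))
```

With the df created above, add_flag_with_udf(df).show() should display the same table with the extra flag column. A UDF ships every row through the Python interpreter, so the built-in F.when expression is usually the faster choice when the logic can be written that way; the UDF is mainly useful when the function is too complex for column expressions.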