pyspark: user-defined function for detecting NaN or Null is not working

Time: 2017-11-03 14:48:03

Tags: null pyspark nan

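I'm trying to write a user-defined function in pyspark that determines whether a given entry in a dataframe is bad (Null or NaN). I can't figure out what I'm doing wrong in this function.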

The code below fails with a cryptic error:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *

def is_bad(value):
   if (value != value | (value.isNull())):
       return True
   else:
       return False

isBadEntry = UserDefinedFunction(lambda x: is_bad(x),BooleanType())

df_test = sql.createDataFrame([(1,1,None ), (1,2, 5), (1,3, None), (1,4, None), (1,5, 10), (1,6,None )], ('session',"timestamp", "id"))
df_test =df_test.withColumn("testing", isBadEntry(df_test.id)).show()

Can anyone help?

1 answer:

Answer 0: (score: 6)

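As Psidom hinted in the comments, in Python the NULL object is the singleton None, so a null entry reaches the UDF as None and has no isNull() method. Changing the function as follows works: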

def is_bad(value):
   if (value != value) | (value is None):
       return True
   else:
       return False

isBadEntry = UserDefinedFunction(lambda x: is_bad(x),BooleanType())
# apply the wrapped UDF (isBadEntry), not the plain Python function is_bad
df_test.withColumn("testing", isBadEntry(df_test.id)).show()
# +-------+---------+----+-------+ 
# |session|timestamp|  id|testing|
# +-------+---------+----+-------+
# |      1|        1|null|   true|
# |      1|        2|   5|  false|
# |      1|        3|null|   true|
# |      1|        4|null|   true|
# |      1|        5|  10|  false|
# |      1|        6|null|   true|
# +-------+---------+----+-------+

It also works with NaN:

from pyspark.sql import Row

# toy data:
df = spark.createDataFrame([Row(1.0, 7., None),
                          Row(2., 4., float('nan')),
                          Row(3., 3., 5.0),
                          Row(4., 1., 4.0),
                          Row(5., 1., 1.0)],
                          ["col_1", "col_2", "col_3"])

df.withColumn("testing", isBadEntry(df.col_3)).show()
# +-----+-----+-----+-------+ 
# |col_1|col_2|col_3|testing|
# +-----+-----+-----+-------+ 
# |  1.0|  7.0| null|   true|
# |  2.0|  4.0|  NaN|   true|
# |  3.0|  3.0|  5.0|  false|
# |  4.0|  1.0|  4.0|  false|
# |  5.0|  1.0|  1.0|  false|
# +-----+-----+-----+-------+
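
The check itself can be tried without a Spark session. The following is only a minimal sketch in plain Python, on the assumption that the UDF receives ordinary Python values (None, int, float) rather than Column objects. The name original_is_bad is a hypothetical label for the question's version, kept here to show one way it can fail: `|` binds more tightly than `!=`, and plain Python values have no isNull() method.

```python
def is_bad(value):
    """Return True for None (how a null entry arrives in the UDF) or NaN."""
    # Test for None first; NaN is the only float that is not equal to itself.
    return value is None or value != value


def original_is_bad(value):
    # Hypothetical name for the question's version, kept only for comparison.
    # This parses as value != (value | value.isNull()), and isNull() does not
    # exist on plain Python values such as int, float or None.
    return value != value | (value.isNull())


samples = [None, float("nan"), 5, 10.0]
results = [is_bad(v) for v in samples]
print(results)  # [True, True, False, False]

try:
    original_is_bad(5)
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__
print(error_name)  # AttributeError
```

Using `or` instead of `|` makes the expression short-circuit, so the self-inequality comparison is skipped for None. The result is the same as in the answer's version.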