I am trying to write a user-defined function in PySpark that determines whether a given entry in a dataframe is bad (Null or NaN). I can't seem to figure out what I am doing wrong in this function.
Here is the code, which dies with a cryptic error:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *

def is_bad(value):
    if (value != value | (value.isNull())):
        return True
    else:
        return False

isBadEntry = UserDefinedFunction(lambda x: is_bad(x), BooleanType())

df_test = sql.createDataFrame([(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)], ('session', "timestamp", "id"))
df_test = df_test.withColumn("testing", isBadEntry(df_test.id)).show()
Can anyone help?
Answer 0 (score: 6)
As Psidom has already hinted in the comments, in Python the NULL object is the singleton None (source). Note also that inside a UDF your function receives plain Python values, not Spark Columns, so value.isNull() fails (None has no isNull method), and that | binds tighter than !=, so the original condition does not parse the way it reads. Changing the function as follows makes it work fine:
def is_bad(value):
    if (value != value) | (value is None):
        return True
    else:
        return False

isBadEntry = UserDefinedFunction(lambda x: is_bad(x), BooleanType())

df_test.withColumn("testing", isBadEntry(df_test.id)).show()
# +-------+---------+----+-------+
# |session|timestamp| id|testing|
# +-------+---------+----+-------+
# | 1| 1|null| true|
# | 1| 2| 5| false|
# | 1| 3|null| true|
# | 1| 4|null| true|
# | 1| 5| 10| false|
# | 1| 6|null| true|
# +-------+---------+----+-------+
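The (value != value) clause is what will catch NaN: under the IEEE 754 floating-point rules that Python follows, NaN is the only value that compares unequal to itself, while the separate is None test handles NULL. A quick plain-Python illustration (not from the original answer):

nan = float('nan')
print(nan != nan)    # True  - NaN is the only value unequal to itself
print(5.0 != 5.0)    # False - ordinary values equal themselves
print(None is None)  # True  - NULLs arrive in the UDF as the None singleton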
and it works with NaN too:
from pyspark.sql import Row

# toy data:
df = spark.createDataFrame([Row(1.0, 7., None),
                            Row(2., 4., float('nan')),
                            Row(3., 3., 5.0),
                            Row(4., 1., 4.0),
                            Row(5., 1., 1.0)],
                           ["col_1", "col_2", "col_3"])
df.withColumn("testing", isBadEntry(df.col_3)).show()
# +-----+-----+-----+-------+
# |col_1|col_2|col_3|testing|
# +-----+-----+-----+-------+
# | 1.0| 7.0| null| true|
# | 2.0| 4.0| NaN| true|
# | 3.0| 3.0| 5.0| false|
# | 4.0| 1.0| 4.0| false|
# | 5.0| 1.0| 1.0| false|
# +-----+-----+-----+-------+
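As a side note (this goes beyond the original answer), the same check can be expressed with Spark's built-in column functions, avoiding a Python UDF entirely. A minimal sketch, assuming Spark 1.6+ where isnan is available:

from pyspark.sql.functions import col, isnan

# NULL and NaN tests as native Column expressions; Spark evaluates these
# without shipping each row to a Python worker, unlike a UDF.
df.withColumn("testing", col("col_3").isNull() | isnan(col("col_3"))).show()

Built-in expressions like these are generally preferable to a UDF where they exist, since Spark's optimizer can work with them directly.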