I've come across some strange behavior in PySpark. Maybe one of you knows what's going on. If I do this:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def create_my_date(mydate):
    try:
        return mydate.strftime('%Y%m')
    except:
        return None

df = df.withColumn(
    "date_string",
    F.udf(create_my_date, StringType())(df.mydate)
)
df.filter(~df.mydate.isNotNull()).count()
df.filter(df.mydate.isNotNull()).count()
this outputs:
0
10
which means there are no null values in the df.mydate column.
But if I change the create_my_date function and remove the try/except:
def create_my_date(mydate):
    return mydate.strftime('%Y%m')

df = df.withColumn(
    "date_string",
    F.udf(create_my_date, StringType())(df.mydate)
)
df.filter(~df.mydate.isNotNull()).count()
df.filter(df.mydate.isNotNull()).count()
the JVM breaks and says:
Py4JJavaError: An error occurred while calling o7058.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 997.0 failed 4 times, most recent failure: Lost task 22.3 in stage 997.0 (TID 335940, 126.102.230.110, executor 29): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
func = lambda _, it: map(mapper, it)
File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 92, in <lambda>
mapper = lambda a: udf(*a)
File "/home/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
return lambda *a: f(*a)
File "<ipython-input-109-422e4b5e07cf>", line 2, in create_my_date
AttributeError: 'NoneType' object has no attribute 'strftime'
Does anyone have an explanation for me?
Thanks!
Answer 0 (score: 2)
The reason you are getting the attribute error is that you are trying to call strftime on None. You can see the error is triggered inside create_my_date because the udf works with the Python representation of each row's value, so a null cell arrives as Python's None. So essentially it is doing this:
>>> None.strftime("%Y%m")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'
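If you do want to keep a udf, checking for None explicitly is safer than a bare except, which hides every other error as well. A minimal sketch, assuming F and StringType are imported as in the question:

def create_my_date(mydate):
    # a null cell arrives as Python's None; pass it through untouched
    if mydate is None:
        return None
    return mydate.strftime('%Y%m')

df = df.withColumn(
    "date_string",
    F.udf(create_my_date, StringType())(df.mydate)
)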
That said, you can accomplish what you want with the built-in DataFrame functions (faster than a udf, and no try/except block needed):
from pyspark.sql.functions import date_format
from datetime import datetime

# one row with a real date, one row with a null
df = spark.createDataFrame([[datetime(2018, 3, 2).date()], [None]], ["mydate"])

# "yyyyMM" is the Java pattern equivalent of strftime's '%Y%m'
df = df.withColumn("date_string", date_format("mydate", "yyyyMM"))
df.show()
Resulting dataframe:
+----------+-----------+
| mydate|date_string|
+----------+-----------+
|2018-03-02| 201803|
| null| null|
+----------+-----------+
Then your counts:
df.filter(df["mydate"].isNotNull()).count()
df.filter(df["mydate"].isNull()).count()
return as expected:
1
1
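For completeness, date_format propagates nulls, so you can check the derived column the same way (using the df built above):

df.filter(df["date_string"].isNotNull()).count()  # 1
df.filter(df["date_string"].isNull()).count()     # 1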