Question

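I've run into a very strange problem in PySpark. I'm using a regular expression to extract a Unix timestamp from a verbose datetime string (the stored string isn't suitable for direct conversion). Written inline in a withColumn call, this works fine: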

import pyspark.sql.functions as f

# Raw string so that "\." reaches the regex engine unchanged
r = r"([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.([0-9]*)"
latencyhops.select('time') \
.withColumn('TimeSec',f.unix_timestamp(f.regexp_extract('time', r, 1))) \
.show(5,False)

+---------------------------+----------+
|time                       |TimeSec   |
+---------------------------+----------+
|2018-01-22 14:39:00.0743640|1516631940|
|2018-01-23 05:47:34.2797780|1516686454|
|2018-01-23 05:47:34.2797780|1516686454|
|2018-01-23 05:47:34.2797780|1516686454|
|2018-01-24 08:06:29.2989410|1516781189|
+---------------------------+----------+

However, when I run it through a UDF, it fails:

from pyspark.sql.functions import udf
from pyspark.sql.types import *
def timeConversion (time):
    return f.unix_timestamp(f.regexp_extract(time, "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.([0-9]*)", 1))
to_nano =udf(timeConversion, IntegerType())

latencyhops.select('time') \
.withColumn('TimeSec',to_nano('time', r, 1)) \
.show(5,False)

with the error:

..../pyspark/sql/functions.py", line 1521, in regexp_extract
    jc = sc._jvm.functions.regexp_extract(_to_java_column(str), pattern, idx)
AttributeError: 'NoneType' object has no attribute '_jvm'

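As far as I can tell, these two versions should be exactly equivalent. I've tried several ways of defining the UDF (lambda expressions, etc.) but always hit the same error. Does anyone have a suggestion?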

Thanks

Answer 1

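As @mkaran pointed out, many PySpark functions only work at the column level and are therefore not suitable for use inside a UDF. The functions in pyspark.sql.functions build JVM column expressions through the active SparkContext, which exists only on the driver; inside a UDF the code runs on a worker where that context is None, hence the '_jvm' AttributeError. I found a working solution using .withColumn instead.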
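If a UDF is still wanted, its body has to be plain Python rather than calls into pyspark.sql.functions. The following is only a minimal sketch of that idea, not the solution the answer refers to. The names `TIME_RE`, `time_to_epoch_seconds` and `as_spark_udf` are hypothetical, and the conversion assumes the timestamps are in UTC, which is consistent with the sample output in the question but may not match your Spark session time zone.

```python
import re
from calendar import timegm
from datetime import datetime

# Same pattern as in the question; group 1 is the part up to whole seconds.
TIME_RE = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})\.([0-9]*)"
)


def time_to_epoch_seconds(time_str):
    """Return epoch seconds for a 'YYYY-MM-DD HH:MM:SS.fffffff' string.

    Assumption: the string is in UTC. Returns None for null or
    non-matching input, which Spark stores as a null value.
    """
    if time_str is None:
        return None
    match = TIME_RE.search(time_str)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    # timegm interprets the time tuple as UTC, independent of the local zone.
    return timegm(parsed.timetuple())


def as_spark_udf():
    """Wrap the helper as a Spark UDF (hypothetical convenience function).

    pyspark is imported lazily so the pure-Python helper above can be
    tested without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    return udf(time_to_epoch_seconds, LongType())
```

With the DataFrame from the question this could be used as `latencyhops.withColumn('TimeSec', as_spark_udf()('time'))`. Note that the UDF takes only the column; the pattern and group index live inside the Python function. Where a built-in column expression works, as in the first snippet of the question, it is usually preferable to a Python UDF because it avoids shipping every row to a Python worker.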
