I have some long-running tasks (UDFs) that I need to run on PySpark. Some of them can run for hours, but I'd like to add some kind of timeout wrapper in case they really do run too long; on timeout I just want to return None.

I've put something together with signal, but I'm fairly sure it's not the safest approach:
import pyspark
import signal
import time
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf
conf = pyspark.SparkConf()
sc = pyspark.SparkContext.getOrCreate(conf=conf)
spark = SQLContext(sc)
schema = StructType([
    StructField("sleep", IntegerType(), True),
    StructField("value", StringType(), True),
])
data = [[1, "a"], [2, "b"], [3, "c"], [4, "d"], [1, "e"], [2, "f"]]
df = spark.createDataFrame(data, schema=schema)
def handler(signum, frame):
    raise TimeoutError()

def squared_typed(s):
    def run_timeout():
        # Arrange for SIGALRM to fire after 3 seconds and abort the work
        signal.signal(signal.SIGALRM, handler)
        signal.alarm(3)
        try:
            time.sleep(s)
            return s * s
        finally:
            # Cancel any pending alarm so it can't fire during a later row
            signal.alarm(0)
    try:
        return run_timeout()
    except TimeoutError:
        return None
squared_udf = udf(squared_typed, IntegerType())
df.withColumn('sq', squared_udf('sleep')).show()
It works and gives me the expected output, but is there a more PySpark-idiomatic way to implement this?
+-----+-----+----+
|sleep|value| sq|
+-----+-----+----+
| 1| a| 1|
| 2| b| 4|
| 3| c|null|
| 4| d|null|
| 1| e| 1|
| 2| f| 4|
+-----+-----+----+
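For reference, one alternative I've been considering is to run the work in a separate process, since SIGALRM is only delivered to the main thread and terminating a process seems more reliable than raising from a signal handler. This is just an untested sketch (slow_square is a placeholder for my real task, and the 3-second timeout is hard-coded), reusing the imports and DataFrame from above:

import multiprocessing

def slow_square(queue, s):
    # Placeholder for the real long-running work
    time.sleep(s)
    queue.put(s * s)

def squared_with_timeout(s):
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=slow_square, args=(queue, s))
    p.start()
    p.join(3)  # wait at most 3 seconds for the child to finish
    if p.is_alive():
        # Still running after the timeout: kill it and give up on this row
        p.terminate()
        p.join()
        return None
    return queue.get() if not queue.empty() else None

squared_udf = udf(squared_with_timeout, IntegerType())
df.withColumn('sq', squared_udf('sleep')).show()

This should give the same null-on-timeout behaviour, but every call pays process start-up cost, which feels wasteful for short tasks, so I'm not sure it's actually better.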
Thanks.