TypeError when submitting a custom UDF with spark-submit

Asked: 2018-07-03 14:34:31

Tags: python apache-spark pyspark

I get a TypeError when submitting via spark-submit --py-files udf:

TypeError: 'in <string>' requires string as left operand, not NoneType

I have written all my UDFs in proj_udf.py:

group_1 =['EAST','NORTH','SOUTH','SOUTHEAST','SOUTHWEST']
group_2 =['AUTORX','CAREWORKS','CHIROSPORT']

mearged_list = group_1 + group_2
str1 = ''.join(mearged_list)

def search_list(column):
    return any(column in item for item in str1)

sqlContext.udf.register("search_list_udf", search_list, BooleanType())

Calling this function from the pyspark shell raises no error. When I run it with spark-submit, I get the following error.

Error:

  File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "NAM_Udfs.py", line 115, in search_list
    return any(column in item for item in str1)
  File "NAM_Udfs.py", line 115, in <genexpr>
    return any(column in item for item in str1)
TypeError: 'in <string>' requires string as left operand, not NoneType

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
        at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

1 Answer:

Answer 0 (score: 1)

You just need to change your UDF to handle NULLs, as shown below. You may also want to account for empty strings in the column values.

def search_list(column):
    if column is None:
        return False
    return any(column in item for item in str1)
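The failure and the guarded fix can be reproduced in plain Python, without Spark. This is a minimal sketch: the `str1` construction mirrors the question's code, and the sample inputs are assumptions for illustration.

```python
# Mirrors the question's setup: joining the group list into one
# string. Note that iterating over str1 yields single characters,
# not the original list entries.
str1 = ''.join(['EAST', 'NORTH', 'SOUTH'])

def search_list(column):
    # Guard against NULL column values: `x in <str>` raises
    # "TypeError: 'in <string>' requires string as left operand,
    # not NoneType" when x is None, which is the error seen
    # under spark-submit.
    if column is None:
        return False
    return any(column in item for item in str1)

print(search_list(None))    # False instead of a TypeError
print(search_list('E'))     # True: 'E' matches a single character
print(search_list('EAST'))  # False: no single character contains 'EAST'
```

As the last call suggests, iterating over the joined string compares `column` against individual characters rather than the region names, so multi-character values never match; if matching against the whole names was intended, iterating over the merged list itself (rather than `str1`) would likely be closer to the goal.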