'JavaPackage' object is not callable error when calling a Scala UDF from PySpark

Time: 2019-07-11 04:37:57

Tags: scala apache-spark pyspark user-defined-functions

I am trying to use a UDF written in Scala from PySpark. My Scala UDF looks like this:

package com.ParseGender
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._


object ParseGender {
  def testudffunction(s: String): String = {
    if (List("cis female", "f", "female", "woman", "femake", "female ",
             "cis-female/femme", "female (cis)", "femail").contains(s.toLowerCase))
      "Female"
    else if (List("male", "m", "male-ish", "maile", "mal", "male (cis)",
                  "make", "male ", "man", "msle", "mail", "malr", "cis man", "cis male").contains(s.toLowerCase))
      "Male"
    else
      "Transgender"
  }

  def getFun(): UserDefinedFunction = udf(testudffunction _)
}

I package it into a jar with sbt package. My build.sbt looks like this:

name := "ParseGender"
version := "1.0"
organization := "testcase"
scalaVersion := "2.11.8"
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion
)

Finally, to test it, I wrote a small PySpark program:

from pyspark.sql import functions as F
from pyspark.sql import types as T
import pandas as pd
from pyspark.sql import SparkSession
import numpy as np
from pyspark.sql.column import Column, _to_java_column, _to_seq

spark = SparkSession.builder \
     .master("local") \
     .appName("UDAF") \
     .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

def test_udf(col):
  sc = spark.sparkContext
  _test_udf = sc._jvm.com.test.ParseGender.getFun()
  return Column(_test_udf.apply(_to_seq(sc, [col], _to_java_column)))

df = spark.createDataFrame(
  [("female",), ("male",), ("femail",)],
  ("text",)
)

df = df.withColumn('text2', test_udf(df['text']))
df.show(3)

I run it locally with spark-submit like this:

spark-submit --jars /somepath/scala-2.11/parsegender_2.11-1.0.jar xyz.py

where xyz.py contains the PySpark code above.

After submitting, I get the following error:

    _test_udf = sc._jvm.com.test.ParseGender.getFun()
    TypeError: 'JavaPackage' object is not callable

I suspect this is because PySpark cannot read or detect the package, but I am not sure. Can anyone give me some pointers on how to fix this?
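One way to narrow this down is to check what the dotted name actually resolves to before calling it. Py4J returns a JavaPackage placeholder for any name it cannot resolve to a class, and calling anything on that placeholder produces the TypeError shown above. The following is a minimal sketch, not a definitive fix: `resolve_jvm_name` is a hypothetical helper, and it detects the placeholder by its type name only so that it has no hard dependency on py4j.

```python
from functools import reduce


def resolve_jvm_name(jvm, dotted_name):
    """Walk `jvm` one attribute at a time and fail early on unresolved names.

    `jvm` is expected to be something like `spark.sparkContext._jvm`;
    `dotted_name` is the fully qualified class name, e.g. "pkg.sub.MyObject".
    """
    target = reduce(getattr, dotted_name.split("."), jvm)
    # Assumption: an unresolved name comes back as an instance of a type
    # called "JavaPackage" (which is what Py4J uses for such placeholders).
    if type(target).__name__ == "JavaPackage":
        raise ImportError(
            f"{dotted_name} resolved to a JavaPackage: check the package "
            "name and that the jar is on the driver classpath"
        )
    return target
```

In the program above this could be used as `resolve_jvm_name(sc._jvm, "com.test.ParseGender").getFun()`. Note that the Scala source declares `package com.ParseGender` and `object ParseGender`, so the fully qualified name is presumably `com.ParseGender.ParseGender` rather than `com.test.ParseGender`; if so, the mismatch alone would explain the error even when the jar is loaded correctly.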

0 answers:

No answers