I am trying to call a Scala UDF from PySpark. My Scala UDF looks like this:
package com.ParseGender

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._

object ParseGender {
  def testudffunction(s: String): String = {
    if (List("cis female", "f", "female", "woman", "femake", "female ",
        "cis-female/femme", "female (cis)", "femail").contains(s.toLowerCase))
      "Female"
    else if (List("male", "m", "male-ish", "maile", "mal", "male (cis)",
        "make", "male ", "man", "msle", "mail", "malr", "cis man", "cis male").contains(s.toLowerCase))
      "Male"
    else
      "Transgender"
  }

  def getFun(): UserDefinedFunction = udf(testudffunction _)
}
I package it into a jar with sbt package. My build.sbt looks like this:
name := "ParseGender"
version := "1.0"
organization := "testcase"
scalaVersion := "2.11.8"
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion)
Finally, to test it, I wrote a small PySpark program:
from pyspark.sql import SparkSession
from pyspark.sql.column import Column, _to_java_column, _to_seq

spark = SparkSession.builder \
    .master("local") \
    .appName("UDAF") \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

def test_udf(col):
    sc = spark.sparkContext
    _test_udf = sc._jvm.com.test.ParseGender.getFun()
    return Column(_test_udf.apply(_to_seq(sc, [col], _to_java_column)))

df = spark.createDataFrame(
    [("female",), ("male",), ("femail",)],
    ("text",)
)
df = df.withColumn('text2', test_udf(df['text']))
df.show(3)
I run it locally with spark-submit like this:
spark-submit --jars /somepath/scala-2.11/parsegender_2.11-1.0.jar xyz.py
where xyz.py contains the PySpark code above. After submitting, I get the following error:
_test_udf = sc._jvm.com.test.ParseGender.getFun()
TypeError: 'JavaPackage' object is not callable
I suspect this has something to do with PySpark not being able to find/detect the package, but I am not sure. Could someone give me some pointers on how to resolve this?
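For reference, here is the output I expect per row, sanity-checked with a pure-Python mirror of the Scala `testudffunction` logic above (`parse_gender` is just an illustrative helper for this post, not part of my actual job):

```python
# Lowercase spellings the Scala UDF maps to each bucket,
# copied from the Lists in testudffunction above.
FEMALE = {"cis female", "f", "female", "woman", "femake", "female ",
          "cis-female/femme", "female (cis)", "femail"}
MALE = {"male", "m", "male-ish", "maile", "mal", "male (cis)",
        "make", "male ", "man", "msle", "mail", "malr", "cis man", "cis male"}

def parse_gender(s: str) -> str:
    # Mirrors testudffunction: compare the lowercased input
    # against the female list first, then the male list.
    low = s.lower()
    if low in FEMALE:
        return "Female"
    if low in MALE:
        return "Male"
    return "Transgender"

for text in ("female", "male", "femail"):
    print(text, "->", parse_gender(text))
# female -> Female
# male -> Male
# femail -> Female
```

So the `text2` column should read `Female`, `Male`, `Female` for my three sample rows once the JVM wiring works.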