Question

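I have two DataFrames: dataDf and regexDf. dataDf has a large number of records, and regexDf has two columns of regular expressions. I need to filter dataDf, keeping the rows where two of its columns match the two regular expressions in some row of regexDf. I came up with this: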

dataDf.registerTempTable("dataTable")
sqlContext.udf.register("matchExpressionCombination", matchExpressionCombination _)

val matchingResults = sqlContext.sql("SELECT * FROM dataTable WHERE matchExpressionCombination(col1, col2)")

def matchExpressionCombination(col1Text: String, col2Text: String): Boolean = {
  val regexDf = getRegexDf()
  var isMatch = false
  for(row <- regexDf.collect) {
    if(col1Text.matches(row(0).toString) && col2Text.matches(row(1).toString)) {
      isMatch = true
    }
  }
  isMatch
}

When I run the following, I get an error:

matchingResults.count().println

Answer 1

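You cannot call collect inside a UDF. collect sends all of the data to the node that calls it, and it should only be used for experimentation in spark-shell-style environments. Beyond that, you cannot use any operation that needs the Spark context inside a UDF: those operations run on the driver, while the UDF code is shipped to the executor nodes, and executors have no Spark context object.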

Answer 2

Your UDF matchExpressionCombination is called for every row of dataTable, yet it collects an RDD (regexDf.collect). That means one collect operation per row of dataTable, which will be very inefficient.

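You should either join the two RDDs, using a UDF to decide where the tables match, or collect the regex RDD into a local val outside the UDF and use that val inside the UDF.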
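Both suggestions can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code. It assumes Spark 2.x or later (SparkSession instead of the question's sqlContext), and the sample rows, the column names regex1 and regex2, and the helper matchesAnyPair are hypothetical. The join variant uses a DataFrame left semi join, which is one possible reading of "join the RDDs".

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object RegexFilterSketch {
  // Pure helper (hypothetical name): true if some (regex1, regex2) pair
  // matches both values. Null inputs never match.
  def matchesAnyPair(pairs: Seq[(String, String)], c1: String, c2: String): Boolean =
    c1 != null && c2 != null &&
      pairs.exists { case (r1, r2) => c1.matches(r1) && c2.matches(r2) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("regex-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for dataDf and regexDf.
    val dataDf = Seq(("abc", "123"), ("xyz", "foo")).toDF("col1", "col2")
    val regexDf = Seq(("a.*", "\\d+")).toDF("regex1", "regex2")

    // Option 1: collect the (small) regex table once, on the driver,
    // and let the UDF closure carry the plain local value to the executors.
    val regexPairs: List[(String, String)] =
      regexDf.collect().map(r => (r.getString(0), r.getString(1))).toList
    val matchUdf = udf((c1: String, c2: String) => matchesAnyPair(regexPairs, c1, c2))
    val viaLocalVal = dataDf.filter(matchUdf(col("col1"), col("col2")))

    // Option 2: join the two DataFrames, with a UDF as the join condition.
    // A left semi join keeps each data row at most once, even if it
    // matches several regex rows.
    val fullMatch = udf((text: String, regex: String) =>
      text != null && regex != null && text.matches(regex))
    val viaJoin = dataDf.join(
      regexDf,
      fullMatch(col("col1"), col("regex1")) && fullMatch(col("col2"), col("regex2")),
      "left_semi")

    viaLocalVal.show()
    viaJoin.show()
    spark.stop()
  }
}
```

Option 1 only makes sense when the regex table is small enough to fit comfortably in driver memory. Option 2 keeps everything distributed, but because the condition is not an equality, Spark cannot use a hash join and has to compare the row pairs.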

Your exception shows Caused by: java.io.NotSerializableException: com.salesforce.RegexDeduper, so you should look closely at where this class is used in your code.
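One common way to hit this kind of exception is a function value that refers to a method of a non-serializable instance: the function captures `this`, so serializing the function tries to serialize the whole object. Whether that is what happens with RegexDeduper is an assumption, since the class is not shown. The sketch below uses plain Java serialization and hypothetical names (Deduper, Matchers, SerializationCheck) to show the difference.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a class like RegexDeduper: it is not Serializable.
class Deduper {
  def matchesBoth(a: String, b: String): Boolean =
    a.matches("a.*") && b.matches("\\d+")

  // The lambda calls an instance method, so it captures `this`
  // (the same happens with `matchesBoth _`).
  def asFunction: (String, String) => Boolean = (a, b) => matchesBoth(a, b)
}

// The same logic as a self-contained lambda: nothing is captured.
object Matchers {
  val asFunction: (String, String) => Boolean =
    (a, b) => a.matches("a.*") && b.matches("\\d+")
}

object SerializationCheck {
  // True if Java serialization accepts the value, false if it hits
  // a non-serializable object along the way.
  def isSerializable(value: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(isSerializable(new Deduper().asFunction))
    println(isSerializable(Matchers.asFunction))
  }
}
```

If this is the cause, the usual remedies are to move the UDF logic into a standalone object, to copy the needed fields into local vals before building the function, or to make the enclosing class Serializable.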

UDF in Spark SQL
