I have two columns of comma-separated strings (sourceAuthors and targetAuthors).
val df = Seq(
("Author1,Author2,Author3","Author2,Author3,Author1")
).toDF("source","target")
I want to add another column nCommonAuthors containing the number of common authors.
I tried to do it this way:
def myUDF = udf { (s1: String, s2: String) =>
  s1.split(",")
  s2.split(",")
  s1.intersect(s2).length
}
val newDF = myDF.withColumn("nCommonAuthors", myUDF($"source", $"target"))
I get the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
Any idea why I get this error? How can I find the common elements between the two columns?
Answer 0 (score: 1)
Unless I misunderstood your question, there are standard functions, split and array_intersect, that can help you (so you don't have to write a UDF): split each column into an array of authors, then take the intersection of the two arrays.
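The answer's code blocks did not survive in this copy; here is a sketch of what that structured query could look like on the dataset from the question, assuming Spark 2.4+ (where array_intersect is available):

import org.apache.spark.sql.functions.{array_intersect, size, split}

// split each column into an array of authors, intersect the two arrays,
// and count the elements they have in common
val newDF = df.withColumn(
  "nCommonAuthors",
  size(array_intersect(split($"source", ","), split($"target", ","))))
newDF.show(false)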
Answer 1 (score: 0)
The error means that your UDF is returning Unit (i.e., not returning anything, like void in Java).
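For illustration (my addition, not part of the original answer), a minimal sketch of how a UDF body ends up typed Unit, which triggers exactly this exception when Spark tries to derive a result schema:

import org.apache.spark.sql.functions.udf

// The lambda's last line is a definition (a statement), so the block's
// value is Unit; Spark cannot derive a schema for Unit and fails with
// "Schema for type Unit is not supported" when the udf is defined.
val badUDF = udf { (s1: String, s2: String) =>
  val common = s1.split(",").intersect(s2.split(",")) // result discarded
}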
Try the following. You were applying intersect to the original s1 and s2, not to the split arrays:
def myUDF = udf((s1: String, s2: String) => {
  // split each string into author names, then count the common ones
  val splitted1 = s1.split(",")
  val splitted2 = s2.split(",")
  splitted1.intersect(splitted2).length
})
Answer 2 (score: 0)
Following SCouto's answer, here is the complete solution that worked for me:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def myUDF: UserDefinedFunction = udf(
  (s1: String, s2: String) => {
    val splitted1 = s1.split(",")
    val splitted2 = s2.split(",")
    splitted1.intersect(splitted2).length
  })
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = Seq(("Author1,Author2,Author3","Author2,Author3,Author1")).toDF("source","target")
df.show(false)
+-----------------------+-----------------------+
|source |target |
+-----------------------+-----------------------+
|Author1,Author2,Author3|Author2,Author3,Author1|
+-----------------------+-----------------------+
val newDF: DataFrame = df.withColumn("nCommonAuthors", myUDF('source,'target))
newDF.show(false)
+-----------------------+-----------------------+--------------+
|source |target |nCommonAuthors|
+-----------------------+-----------------------+--------------+
|Author1,Author2,Author3|Author2,Author3,Author1|3 |
+-----------------------+-----------------------+--------------+