How to find common elements among two array columns?

Asked: 2018-12-12 21:52:05

Tags: scala apache-spark apache-spark-sql

I have two comma-separated string columns (sourceAuthors and targetAuthors).

val df = Seq(
  ("Author1,Author2,Author3","Author2,Author3,Author1")
).toDF("source","target")

I want to add another column, nCommonAuthors, containing the number of common authors.

I tried doing it this way:

def myUDF = udf { (s1: String, s2: String) =>
  s1.split(",")
  s2.split(",")
  s1.intersect(s2).length
}
val newDF = myDF.withColumn("nCommonAuthors", myUDF($"source", $"target"))

I get the following error:

  

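Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported

Any idea why I get this error? How do I find the common elements between the two columns?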

3 Answers:

Answer 0: (score: 1)

Unless I misunderstood your question, there are standard functions that can help you (so you don't have to write a UDF): split and array_intersect.

Given a dataset like the one in the question, you can write a structured query that applies split to each column and passes the two resulting arrays to array_intersect.
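As a rough sketch of the UDF-free approach with split and array_intersect, the following assumes Spark 2.4 or later (where array_intersect is available) and the column names from the question. The object name, the method name withCommonAuthorCount, and the use of size to turn the intersected array into a count are illustrative choices, not taken from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_intersect, col, size, split}

object BuiltinCommonAuthors {
  // Hypothetical helper: adds nCommonAuthors using only built-in column functions.
  def withCommonAuthorCount(df: DataFrame): DataFrame = {
    // Split both comma-separated strings into arrays, then intersect them.
    val common = array_intersect(split(col("source"), ","), split(col("target"), ","))
    // size counts the elements of the intersected array.
    df.withColumn("nCommonAuthors", size(common))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("common-authors").getOrCreate()
    import spark.implicits._

    val df = Seq(("Author1,Author2,Author3", "Author2,Author3,Author1")).toDF("source", "target")
    withCommonAuthorCount(df).show(false)

    spark.stop()
  }
}
```

Note that array_intersect returns an array without duplicates, so repeated names in a column are counted once here.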

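Answer 1: (score: 0)

The error means that your UDF is returning Unit (it returns nothing, like void in Java).

Try this. You are applying intersect to the original s1 and s2, not to the results of the split.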

def myUDF = udf((s1: String, s2: String) => {
  val splitted1 = s1.split(",")
  val splitted2 = s2.split(",")
  splitted1.intersect(splitted2).length
})
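To illustrate the point about intersecting the raw strings versus the split results, here is a small plain-Scala sketch with no Spark involved. The object and value names are made up for the example. Since strings are immutable, calling split without binding its result leaves s1 and s2 unchanged, and intersect on two strings compares individual characters.

```scala
object IntersectDemo {
  val s1 = "Author1,Author2,Author3"
  val s2 = "Author2,Author3,Author1"

  // split returns a new Array and does not modify s1 or s2, so intersecting
  // the raw strings works character by character.
  val charLevel: Int = s1.intersect(s2).length

  // Intersecting the arrays produced by split compares whole author names.
  val authorLevel: Int = s1.split(",").intersect(s2.split(",")).length

  def main(args: Array[String]): Unit = {
    println(s"character-level count: $charLevel") // 23 for this input
    println(s"author-level count: $authorLevel")  // 3 for this input
  }
}
```

For this input both strings contain the same 23 characters in a different order, so the character-level count equals the full string length rather than the number of shared authors.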

Answer 2: (score: 0)

Based on SCouto's answer, here is the complete solution that worked for me:

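  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.expressions.UserDefinedFunction
  import org.apache.spark.sql.functions.udf

  // Splits both strings on commas and counts the elements they share.
  def myUDF: UserDefinedFunction = udf(
    (s1: String, s2: String) => {
      val splitted1 = s1.split(",")
      val splitted2 = s2.split(",")
      splitted1.intersect(splitted2).length
    })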

  val spark = SparkSession.builder().master("local").getOrCreate()

  import spark.implicits._

  val df = Seq(("Author1,Author2,Author3","Author2,Author3,Author1")).toDF("source","target")

  df.show(false)

+-----------------------+-----------------------+
|source                 |target                 |
+-----------------------+-----------------------+
|Author1,Author2,Author3|Author2,Author3,Author1|
+-----------------------+-----------------------+

  val newDF: DataFrame = df.withColumn("nCommonAuthors", myUDF('source,'target))

  newDF.show(false)

+-----------------------+-----------------------+--------------+
|source                 |target                 |nCommonAuthors|
+-----------------------+-----------------------+--------------+
|Author1,Author2,Author3|Author2,Author3,Author1|3             |
+-----------------------+-----------------------+--------------+
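One possible variation, sketched under assumptions that go beyond the answers above: the counting logic can be kept in a plain function so it can be checked without a SparkSession. The names CommonAuthors, authors, and countCommon are hypothetical. Treating null as "no authors", trimming whitespace, and counting each name once (set semantics) are design choices of this sketch, not behavior of the UDF shown above.

```scala
object CommonAuthors {
  // Hypothetical helper: parses a comma-separated string into a set of names.
  // Assumption: null or empty input means "no authors".
  private def authors(s: String): Set[String] =
    if (s == null) Set.empty
    else s.split(",").map(_.trim).filter(_.nonEmpty).toSet

  // Counts the distinct names that appear in both strings.
  def countCommon(source: String, target: String): Int =
    (authors(source) intersect authors(target)).size

  def main(args: Array[String]): Unit = {
    println(countCommon("Author1,Author2,Author3", "Author2,Author3,Author1"))
  }
}
```

A function like this could then be wrapped for use on columns, for example with udf(CommonAuthors.countCommon _), assuming the same imports as in the solution above.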