Spark map over nested WrappedArrays

Date: 2018-07-11 15:58:12

Tags: scala apache-spark apache-spark-sql

I'm trying to count word co-occurrences with Spark 2.1.

My input data looks like this:

+---+-------------------+
| id|           keywords|
+---+-------------------+
|  8|       [mouse, cat]|
|  9|         [bat, cat]|
| 10|[mouse, house, cat]|
+---+-------------------+

The result I want is the pairwise combinations of the keywords within each row, like this:

+-------+--------+
| word1 |  word2 |
+-------+--------+
| cat   |  mouse |
| bat   |  cat   |
| cat   |  house |
| cat   |  mouse |
| house |  mouse |
+-------+--------+

Since the input rows rarely contain more than 20 or so keywords, Scala's combinations() seems sufficient for building the co-occurrence pairs.

Given a function and a UDF wrapping it:

def combine(items: Seq[String]) = {
  items.sorted.combinations(2).toList
}
val combineUDF = udf(combine _)

With a simple sequence I can get the keyword pairs like this:

val simpleSeq = Seq("cat", "mouse", "house")
println(combine(simpleSeq))

List(List(cat, house), List(cat, mouse), List(house, mouse))

Using a DataFrame
Applying the UDF to a DataFrame of the input data, as follows:

val comboDF = sourceDF.withColumn("combinations", combineUDF($"keywords"))

comboDF.printSchema
comboDF.show

comboDF: org.apache.spark.sql.DataFrame = [id: int, keywords: array<string> ... 1 more field]
root
 |-- id: integer (nullable = false)
 |-- keywords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- combinations: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
+---+-------------------+--------------------+
| id|           keywords|        combinations|
+---+-------------------+--------------------+
|  9|       [mouse, cat]|[WrappedArray(cat...|
|  8|         [bat, cat]|[WrappedArray(bat...|
| 10|[mouse, house, cat]|[WrappedArray(cat...|
+---+-------------------+--------------------+

Next, I want to extract each pair from the combinations column and turn each pair into its own row.

I can't figure out how to do that.

The added column has type WrappedArray[WrappedArray[String]], and I can't seem to map over it:

import scala.collection.mutable.WrappedArray

comboDF.map(row => row.get(2).asInstanceOf[WrappedArray[Seq[String]]].array).show

<console>:55: error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
       comboDF.map(row => row.get(2).asInstanceOf[WrappedArray[Seq[String]]].array).show
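
One possible workaround for the encoder error, sketched here and not part of the original attempt, is to skip the asInstanceOf cast and instead flatMap the rows into plain String tuples, a type Spark already knows how to encode once spark.implicits._ is in scope:

import spark.implicits._  // provides encoders for tuples of primitive types

// Sketch: read the nested array with getSeq and flatten each pair into a (String, String) row.
val pairsDS = comboDF.flatMap { row =>
  row.getSeq[Seq[String]](2).map(pair => (pair(0), pair(1)))
}
pairsDS.toDF("word1", "word2").show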

Using an RDD

To work around the DataFrame approach's apparent inability to handle the nested wrapped arrays, I tried an RDD (which I'm not very familiar with).

I can get the per-row keyword combinations like this:

val wrappedPairs = listDF.select("keywords").
   rdd.collect.map(r => 
combine(r.get(0).asInstanceOf[WrappedArray[String]].array.toList))

Array[List[Seq[String]]] = Array(List(List(cat, mouse)), List(List(bat, cat)), List(List(cat, house), List(cat, mouse), List(house, mouse)))

Which essentially gives me:

Array(
  List(
    List(cat, mouse)
  ), 
  List(
    List(bat, cat)
  ), 
  List(
    List(cat, house), 
    List(cat, mouse), 
    List(house, mouse)
  )
)

And I want to get to:

+-------+--------+
| word1 |  word2 |
+-------+--------+
| cat   |  mouse |
| bat   |  cat   |
| cat   |  house |
| cat   |  mouse |
| house |  mouse |
+-------+--------+

I can print the pairs with println, but I can't work out how to lift them into rows:

wrappedPairs.map(outerList => outerList.asInstanceOf[List[List[String]]].
    map(innerList => innerList.asInstanceOf[List[String]]).
        map(pair => (pair(0),pair(1).toSeq)).foreach(println)
    )


wrappedPairs: Array[List[Seq[String]]] = Array(List(List(cat, mouse)), List(List(bat, cat)), List(List(cat, house), List(cat, mouse), List(house, mouse)))
(cat,mouse)
(bat,cat)
(cat,house)
(cat,mouse)
(house,mouse)
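
One possible way to lift these pairs into rows, sketched here under the assumption that spark.implicits._ is in scope for toDF, is to flatMap on the RDD (rather than collecting to the driver) and convert the resulting tuples into a DataFrame:

import spark.implicits._  // needed so an RDD of (String, String) tuples gets toDF

// Sketch: compute the combinations per row on the executors, flatten them to pairs,
// and turn the RDD of tuples into a two-column DataFrame.
val pairDF = listDF.select("keywords").rdd
  .flatMap(r => combine(r.getSeq[String](0)))
  .map(pair => (pair(0), pair(1)))
  .toDF("word1", "word2")

pairDF.show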

1 Answer:

Answer 0 (score: 1):

You're very close! The two things you need are explode and knowing how to project the values out of the exploded column. explode expands a list of items in a single row into individual rows (duplicating all the other fields). After the explode, you select the values inside the "combinations" array. There are a couple of ways to do this, but I generally use the approach shown below.

Take a look at this example code:

import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._  // for toDF and the $ column syntax

val input = Seq(
    (Seq("mouse", "cat")),
    (Seq("bat", "cat")),
    (Seq("mouse", "house", "cat"))
).toDF("keywords")

def combine(items: Seq[String]) = {
  items.sorted.combinations(2).toList
}
val combineUDF = udf(combine _)

val df = input.withColumn("combinations", explode(combineUDF($"keywords")))
df.show

That gives you a DataFrame like this:

+-------------------+--------------+
|           keywords|  combinations|
+-------------------+--------------+
|       [mouse, cat]|  [cat, mouse]|
|         [bat, cat]|    [bat, cat]|
|[mouse, house, cat]|  [cat, house]|
|[mouse, house, cat]|  [cat, mouse]|
|[mouse, house, cat]|[house, mouse]|
+-------------------+--------------+

Now you can select the two elements of the array in each row like this:

val df2 = df.selectExpr("combinations[0] as word1", "combinations[1] as word2")
df2.show

Output:

+-----+-----+
|word1|word2|
+-----+-----+
|  cat|mouse|
|  bat|  cat|
|  cat|house|
|  cat|mouse|
|house|mouse|
+-----+-----+
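
For completeness, the same projection can also be written with the Column API (one of the other ways alluded to above); this is just an equivalent sketch:

// Equivalent projection using Column.getItem instead of a SQL expression.
val df2Alt = df.select(
  $"combinations".getItem(0).as("word1"),
  $"combinations".getItem(1).as("word2")
)
df2Alt.show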