Trying to compute word co-occurrence with Spark 2.1
My input data looks like this:
+---+-------------------+
| id| keywords|
+---+-------------------+
| 8| [mouse, cat]|
| 9| [bat, cat]|
| 10|[mouse, house, cat]|
+---+-------------------+
The result I want is every combination of two keywords within each row, like this:
+-------+--------+
| word1 | word2 |
+-------+--------+
| cat | mouse |
| bat | cat |
| cat | house |
| cat | mouse |
| house | mouse |
+-------+--------+
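Since an input row rarely contains more than 20 or so keywords, Scala's combinations() seems adequate for building the co-occurrence pairs.
Given a function and a UDF that wraps it: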
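import org.apache.spark.sql.functions.udf
// Sort first so each pair comes out in a stable (alphabetical) order.
def combine(items: Seq[String]) = {
  items.sorted.combinations(2).toList
}
val combineUDF = udf(combine _)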
With a simple sequence, I get the keyword pairs as expected:
val simpleSeq = Seq("cat", "mouse", "house")
println(combine(simpleSeq))
List(List(cat, house), List(cat, mouse), List(house, mouse))
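As a rough sanity check on the "20 or so keywords" bound: a row with n distinct keywords yields n(n-1)/2 pairs, so 20 keywords give 190 pairs per row. The sketch below is plain Scala (no Spark needed) and is meant to be pasted into a Scala REPL or spark-shell; the generated keyword names are hypothetical placeholders, not data from the question.

```scala
// Same helper as above, with an explicit return type.
def combine(items: Seq[String]): List[Seq[String]] =
  items.sorted.combinations(2).toList

// 20 hypothetical distinct keywords: "kw00" ... "kw19"
val keywords: Seq[String] = (0 until 20).map(i => f"kw$i%02d")
val pairs = combine(keywords)

// n choose 2 = n * (n - 1) / 2
val expected = keywords.size * (keywords.size - 1) / 2
println(s"${pairs.size} pairs, expected $expected")
```

The pair count grows quadratically with the number of keywords in a row, so the approach stays cheap only while rows remain short.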
Using a DataFrame
With the input data in a DataFrame, I apply the UDF like this:
val comboDF = sourceDF.withColumn("combinations", combineUDF($"keywords"))
comboDF.printSchema
comboDF.show
comboDF: org.apache.spark.sql.DataFrame = [id: int, keywords: array<string> ... 1 more field]
root
|-- id: integer (nullable = false)
|-- keywords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- combinations: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
+---+-------------------+--------------------+
| id| keywords| combinations|
+---+-------------------+--------------------+
| 9| [mouse, cat]|[WrappedArray(cat...|
| 8| [bat, cat]|[WrappedArray(bat...|
| 10|[mouse, house, cat]|[WrappedArray(cat...|
+---+-------------------+--------------------+
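Next I want to extract each pair from the combinations column and turn each pair into its own row.
I don't know how to do that.
The added column has type [WrappedArray[WrappedArray[String]]], and I can't seem to map over it: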
import scala.collection.mutable.WrappedArray
comboDF.map(row => row.get(2).asInstanceOf[WrappedArray[Seq[String]]].array).show
<console>:55: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
comboDF.map(row => row.get(2).asInstanceOf[WrappedArray[Seq[String]]].array).show
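Using an RDD
To work around the DataFrame API's apparent inability to handle nested wrapped arrays, I tried RDDs (which I'm less familiar with).
I can get the keyword combinations for each row like this: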
val wrappedPairs = listDF.select("keywords").
  rdd.collect.map(r =>
    combine(r.get(0).asInstanceOf[WrappedArray[String]].array.toList))
Array[List[Seq[String]]] = Array(List(List(cat, mouse)), List(List(bat, cat)), List(List(cat, house), List(cat, mouse), List(house, mouse)))
Which basically gives me:
Array(
List(
List(cat, mouse)
),
List(
List(bat, cat)
),
List(
List(cat, house),
List(cat, mouse),
List(house, mouse)
)
)
What I want to get to is:
+-------+--------+
| word1 | word2 |
+-------+--------+
| cat | mouse |
| bat | cat |
| cat | house |
| cat | mouse |
| house | mouse |
+-------+--------+
I can print the pairs with println, but I can't work out how to promote them to rows:
wrappedPairs.map(outerList => outerList.asInstanceOf[List[List[String]]].
map(innerList => innerList.asInstanceOf[List[String]]).
map(pair => (pair(0),pair(1).toSeq)).foreach(println)
)
wrappedPairs: Array[List[Seq[String]]] = Array(List(List(cat, mouse)), List(List(bat, cat)), List(List(cat, house), List(cat, mouse), List(house, mouse)))
(cat,mouse)
(bat,cat)
(cat,house)
(cat,mouse)
(house,mouse)
Answer 0 (score: 1)
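You're very close! The two things you need are explode and a way to project elements out of an array column. explode expands a list of items in one row into individual rows (copying all the other fields). After the explode, you need to select the values inside the array in "combinations". There are a couple of ways to do that, but I usually do it the way shown below.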
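Take a look at this sample code:
import org.apache.spark.sql.functions.{explode, udf}
// toDF and the $"..." syntax need spark.implicits._ (already imported in spark-shell).
val input = Seq(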
(Seq("mouse", "cat")),
(Seq("bat", "cat")),
(Seq("mouse", "house", "cat"))
).toDF("keywords")
def combine(items: Seq[String]) = {
items.sorted.combinations(2).toList
}
val combineUDF = udf(combine _)
val df = input.withColumn("combinations", explode(combineUDF($"keywords")))
df.show
This gives you a DataFrame like this:
+-------------------+--------------+
| keywords| combinations|
+-------------------+--------------+
| [mouse, cat]| [cat, mouse]|
| [bat, cat]| [bat, cat]|
|[mouse, house, cat]| [cat, house]|
|[mouse, house, cat]| [cat, mouse]|
|[mouse, house, cat]|[house, mouse]|
+-------------------+--------------+
Now you can select the two elements from the array in each row like this:
val df2 = df.selectExpr("combinations[0] as word1", "combinations[1] as word2")
df2.show
Output:
+-----+-----+
|word1|word2|
+-----+-----+
| cat|mouse|
| bat| cat|
| cat|house|
| cat|mouse|
|house|mouse|
+-----+-----+
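For comparison, here is a sketch of a variant that skips the UDF and the nested array column: it builds the pairs as tuples inside flatMap, which sidesteps the "Unable to find encoder" error from the question because tuples of Strings have a built-in encoder. It then counts how often each pair occurs, on the assumption that a co-occurrence count is the end goal. The names pairs and counts are illustrative, and the code assumes a Spark 2.x session in spark-shell style; treat it as one possible approach rather than the recommended one.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cooccurrence-sketch")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  Seq("mouse", "cat"),
  Seq("bat", "cat"),
  Seq("mouse", "house", "cat")
).toDF("keywords")

// Emit one (word1, word2) tuple per pair; each tuple becomes its own row.
val pairs = input
  .flatMap(row =>
    row.getSeq[String](0).sorted.combinations(2).map(p => (p(0), p(1))).toList)
  .toDF("word1", "word2")

// Count how many rows each pair co-occurs in.
val counts = pairs.groupBy("word1", "word2").count()
counts.show()
```

With the three sample rows, the (cat, mouse) pair should come out with a count of 2 and the other three pairs with a count of 1. The row order of show is not guaranteed after a groupBy.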