Question

I have a dataframe (DF1) with two columns:

+-------+------+
|words  |value |
+-------+------+
|ABC    |1.0   |
|XYZ    |2.0   |
|DEF    |3.0   |
|GHI    |4.0   |
+-------+------+

and another dataframe (DF2) like this:

+-----------------------------+
|string                       |
+-----------------------------+
|ABC DEF GHI                  |
|XYZ ABC DEF                  |                
+-----------------------------+

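I have to replace each individual word in the DF2 strings with its corresponding value from DF1. For example, after the operation I should get back this dataframe: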

+-----------------------------+
|stringToDouble               |
+-----------------------------+
|1.0 3.0 4.0                  |
|2.0 1.0 3.0                  |                
+-----------------------------+

I have tried multiple approaches, but I can't seem to find a solution.

 def createCorpus(conversationCorpus: Dataset[Row], dataDictionary: Dataset[Row]): Unit = {
 import spark.implicits._

 def getIndex(word: String): Double = {
 val idxRow = dataDictionary.selectExpr("index").where('words.like(word))
 val idx = idxRow.toString
 if (!idx.isEmpty) idx.trim.toDouble else 1.0
 }

 conversationCorpus.map { // Eclipse doesn't like this map here; it throws an error.
    r =>
    def row = {
       val arr = r.getString(0).toLowerCase.split(" ")
       val arrList = ArrayBuffer[Double]()
       arr.map {
          str =>
          val index = getIndex(str)
       }
       Row.fromSeq(arrList.toSeq)
       }
       row

   }
 }

Answer 1

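Combining multiple dataframes to create a new column requires a join. Looking at your two dataframes, it seems we can join on the words column of df1 and the string column of df2, but the string column first needs an explode and has to be recombined later (which can be done by giving each row a unique ID before the explode). monotonically_increasing_id gives a unique ID to each row in df2. The split function turns the string column into an array so it can be exploded. Then you can join the two dataframes. The remaining step is to merge the exploded rows back into the original rows with a groupBy and an aggregation.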

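Finally, the collected array column can be converted into the desired string column with a udf function.

Long story short, the following solution should work for you: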

import org.apache.spark.sql.functions._

// UDF that joins the collected values into one space-separated string
val arrayToString = udf((array: Seq[Double]) => array.mkString(" "))

df2.withColumn("rowId", monotonically_increasing_id())       // unique ID per original row
  .withColumn("string", explode(split(col("string"), " ")))  // one row per word
  .join(df1, col("string") === col("words"))                 // look up each word's value
  .groupBy("rowId")                                          // merge the words of a row back together
  .agg(collect_list("value").as("stringToDouble"))
  .select(arrayToString(col("stringToDouble")).as("stringToDouble"))
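One caveat with the snippet above: collect_list does not guarantee the order of the collected values after a shuffle, so the numbers could in principle come back in a different order than the words. Below is a minimal sketch of an order-preserving variant, not a definitive implementation. It assumes Spark 2.1 or later and a local SparkSession; the object name MapWordsToValues and the function name mapWords are made up for illustration. It swaps explode for posexplode to keep each word's position, sorts the collected (pos, value) pairs with sort_array, and uses concat_ws in place of the udf.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical names (MapWordsToValues, mapWords), used for illustration only.
object MapWordsToValues {

  // df1 is assumed to have columns (words, value); df2 a single column (string).
  def mapWords(df1: DataFrame, df2: DataFrame): DataFrame =
    df2
      .withColumn("rowId", monotonically_increasing_id())
      // posexplode emits (pos, word), so the original word order can be restored
      .select(col("rowId"), posexplode(split(col("string"), " ")).as(Seq("pos", "word")))
      // inner join, as in the answer: words with no match in df1 are dropped
      .join(df1, col("word") === col("words"))
      .groupBy("rowId")
      // structs sort by their first field, so the pairs end up ordered by pos
      .agg(sort_array(collect_list(struct(col("pos"), col("value")))).as("pairs"))
      .orderBy("rowId")
      // pull the values out of the sorted pairs and join them with spaces
      .select(concat_ws(" ", col("pairs.value").cast("array<string>")).as("stringToDouble"))

  def main(args: Array[String]): Unit = {
    // Local session, assumed here only so the sketch runs on its own.
    val spark = SparkSession.builder().master("local[*]").appName("map-words").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("ABC", 1.0), ("XYZ", 2.0), ("DEF", 3.0), ("GHI", 4.0)).toDF("words", "value")
    val df2 = Seq("ABC DEF GHI", "XYZ ABC DEF").toDF("string")

    mapWords(df1, df2).show(false)
    spark.stop()
  }
}
```

If df1 is small, collecting it into a Map and looking words up inside a udf would avoid the join entirely; the join-based version is kept here because it matches the approach described in the answer.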

Map individual values in one dataframe with values in another dataframe
