Question

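I found a similar post here, but applying it to a String variable raises some additional problems. Let me explain what I am trying to do. I have a single-column DataFrame df1 that contains place information: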

+-------+
|place  |
+-------+
|Place A|
|Place B|
|Place C|
+-------+

Another DataFrame, df2, looks like this:

+--+-------+
|id|place  |
+--+-------+
|1 |Place A|
|2 |Place C|
|3 |Place C|
|4 |Place B|
+--+-------+

I need to loop over df2 to check which place each id matches, and do something with the matching ids. The code snippet is:

  val places = df1.distinct.map(_.toString).collect
  for (place <- places) {
    val students = df2.where(s"place = '$place'").select("id","place")
    // do something on students (add some unique columns depending on the place)
    students.show(2)
  }

The error I get is a SQL ParseException:

extraneous input '[' expecting {'(', ....}
== SQL ==
academic_college = [Place A]
-------------------^^^

My current understanding is that this ParseException comes from the places data I get back after running collect. Each element inherently contains "[]":

places = Array([Place A], [Place B], [Place C])

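My question has two parts:

I only know how to collect df1 into an Array and loop over it to get what I want, because the operation differs for each place. If we go with this approach, what is the best way to remove the "[]", change it to "()", or otherwise get around the ParseException?
Is there a better way to achieve this without collecting (materializing) df1, keeping everything in DataFrames?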

Answer 1

You can get an Array[String] from df1:

val places = df1.distinct().collect().map(_.getString(0))

Now you can filter df2 by each element of the array:

places.foreach(place => {
  val student = df2.where($"place" === place).select("id","place")
  student.show()
})

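But make sure this does not affect the original DataFrame.

If df1 is small and fits in memory, you can collect it on the driver; otherwise this is not recommended.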
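To make the difference concrete, here is a minimal sketch of why the brackets show up and two ways to use the string variable in a filter. The object name PlaceFilterSketch, the helper placeNames, and the local SparkSession settings are made up for this illustration; the sample data is copied from the question.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical names throughout; this is an illustration, not the asker's actual job.
object PlaceFilterSketch {
  // Row.toString renders a row as "[Place A]", which is where the brackets come from.
  // getString(0) returns the bare value of the first column instead.
  def placeNames(rows: Array[Row]): Array[String] = rows.map(_.getString(0))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("place-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq("Place A", "Place B", "Place C").toDF("place")
    val df2 = Seq((1, "Place A"), (2, "Place C"), (3, "Place C"), (4, "Place B"))
      .toDF("id", "place")

    val places = placeNames(df1.distinct().collect())

    places.foreach { place =>
      // Option 1: Column expression. The variable is compared as a value,
      // so no quoting is needed.
      val byColumn = df2.where(col("place") === place).select("id", "place")

      // Option 2: SQL string. The value must be wrapped in single quotes,
      // and this breaks if the value itself contains a single quote.
      val bySql = df2.where(s"place = '$place'").select("id", "place")

      byColumn.show()
      bySql.show()
    }

    spark.stop()
  }
}
```

The Column form is usually the safer of the two, since the place name never has to survive being pasted into SQL text.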

If you provide some sample input and the expected output, it will be easier to get more help.

Answer 2

"I need to loop over df2 to check which place each id matches, and do something with the matching ids."

Calling collect() and iterating over the collected data is expensive, because all of that processing happens on the driver node.

I suggest you use a join.

Say you have

df1
+-------+
|place  |
+-------+
|Place A|
|Place B|
+-------+

and

df2
+---+-------+
|id |place  |
+---+-------+
|1  |Place A|
|2  |Place C|
|3  |Place C|
|4  |Place B|
+---+-------+

You can use a join as follows.

Get the matching places together with their ids:

df2.join(df1, Seq("place"))

which should give you

+-------+---+
|place  |id |
+-------+---+
|Place A|1  |
|Place B|4  |
+-------+---+

Now you can do something on the matched ids on this DataFrame.
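Since the question says the operation differs per place, one possible way to keep that logic inside the DataFrame is a when/otherwise chain on the joined result. This is only a sketch: the object name PlaceJoinSketch, the function tagMatched, the column name "tag" and its values are all placeholders, because the question never says what the per-place operation actually is.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical names and rules, for illustration only.
object PlaceJoinSketch {
  // Keep only the rows of df2 whose place also appears in df1 (inner join),
  // then add a column whose value depends on the place.
  def tagMatched(df1: DataFrame, df2: DataFrame): DataFrame =
    df2.join(df1.distinct(), Seq("place"))
      .withColumn(
        "tag", // placeholder column name
        when(col("place") === "Place A", lit("rule for A"))
          .when(col("place") === "Place B", lit("rule for B"))
          .otherwise(lit("default rule"))
      )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("place-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq("Place A", "Place B").toDF("place")
    val df2 = Seq((1, "Place A"), (2, "Place C"), (3, "Place C"), (4, "Place B"))
      .toDF("id", "place")

    tagMatched(df1, df2).show()

    spark.stop()
  }
}
```

Nothing is collected to the driver here, so the same code works whether df1 has three rows or three million. If the per-place operations are too different to express as column expressions, looping over a collected array may still be the more practical choice.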

I hope the answer is helpful.

How do I use a string variable in a Spark SQL expression?
