Question

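We want to filter a dataframe based on the values in the rows of another dataframe. The filtering is performed inside a udf, but it never actually runs. Even when we try to display the dataframe (df.show()), the job neither finishes nor produces any output.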

main()
{
    val x = udf(y _)
    val df1 = // read from source file1
    val df2 = // read from source file2
    df1.select(x(df1(col1)))
}

// y is called once per row of df1 and tries to filter df2 from inside the udf
def y(col1: String): String = {
    val output = df2.filter(df2(col1) === col1).select(df2(col2)).first().get(0).toString()
    return output
}

Sample input:

  Dataframe1:

  |PERSON_SK|         STATE|          ADDRESS1|
  |---------|--------------|------------------|
  |   111101|      Delaware|3020 Ode Turner Rd|
  |    11111|       Alabama| 2136 Pine Tree Ln|  
  |   211111|       mexico |3320 Burke Mill Rd|


  Dataframe2:

  |PERSON_SK|         STATE|          ADDRESS1|  city code|
  |---------|--------------|------------------|-----------|
  |         |      Delaware|3020 Ode Turner Rd|      62410|
  |         |       Alabama| 2136 Pine Tree Ln|      64128|

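Sample output:
(We want to fill the PERSON_SK column of Dataframe2 with the matching PERSON_SK values from Dataframe1, using a filter condition rather than a join.)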

  |PERSON_SK|         STATE|          ADDRESS1|  city code|
  |---------|--------------|------------------|-----------|
  |   111101|      Delaware|3020 Ode Turner Rd|      62410|
  |    11111|       Alabama| 2136 Pine Tree Ln|      64128|

Answer 1

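The usual approach to this problem is to use a join instead of a user-defined function. A DataFrame cannot be used from inside a udf, because the udf runs on the executors while DataFrame operations are only available on the driver.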

val df1 = //
val df2 = //
val df3 = df1.join(df2, df1(col1) === df2(col1)) // joins dataframes by values in col1 column
val df = df3.select(df2(col2)) // gets only needed column from the join
df.show()
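
Applied to the sample data from the question, the join might look like the sketch below. It assumes the rows are matched on STATE and ADDRESS1 and that PERSON_SK and city code are integers. The object name FillPersonSk, the helper fillPersonSk, and the local SparkSession are made up for illustration and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillPersonSk {
  // Hypothetical helper: drops the empty PERSON_SK column from df2, then takes
  // PERSON_SK from df1 for every row whose STATE and ADDRESS1 match.
  def fillPersonSk(df1: DataFrame, df2: DataFrame): DataFrame =
    df2.drop("PERSON_SK")
      .join(df1.select("PERSON_SK", "STATE", "ADDRESS1"), Seq("STATE", "ADDRESS1"), "inner")
      .select("PERSON_SK", "STATE", "ADDRESS1", "city code")

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would use its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-person-sk")
      .getOrCreate()
    import spark.implicits._

    // Dataframe1 from the question
    val df1 = Seq(
      (111101, "Delaware", "3020 Ode Turner Rd"),
      (11111, "Alabama", "2136 Pine Tree Ln"),
      (211111, "mexico", "3320 Burke Mill Rd")
    ).toDF("PERSON_SK", "STATE", "ADDRESS1")

    // Dataframe2 from the question, with PERSON_SK still empty
    val df2 = Seq(
      (Option.empty[Int], "Delaware", "3020 Ode Turner Rd", 62410),
      (Option.empty[Int], "Alabama", "2136 Pine Tree Ln", 64128)
    ).toDF("PERSON_SK", "STATE", "ADDRESS1", "city code")

    fillPersonSk(df1, df2).show()
    spark.stop()
  }
}
```

With an inner join, rows of Dataframe2 that have no match in Dataframe1 are dropped. Passing "left" as the join type instead would keep them, with PERSON_SK left null.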
