Check whether specific identifiers exist in another dataframe

Time: 2019-06-26 10:23:01

Tags: scala apache-spark dataframe apache-spark-sql

val df1 = Seq(("[1,10,20]", "bat","43243"),("[20,4,10]","mouse","4324432"),("[30,20,3]", "horse","4324234")).toDF("id", "word","userid") 

val df2 = Seq((1, "raj", "name"),(2, "kiran","name"),(3,"karnataka","state"),(4, "Andrapradesh","state")).toDF("id", "name", "code")

Description

I have two dataframes, df1 and df2. The id column of df1 holds a list of IDs.

I need to check whether any of those IDs are present in the df2 dataframe.

Condition

If an id from df1's list appears in df2's id column and the corresponding code is state, fetch that row's name from df2 and build a new dataframe that includes the name column.

Expected output

id

2 answers:

Answer 0: (score: 1)

You can first flatten the id column by converting it into an array and applying explode. Then you can perform a regular join between the two dataframes.

For example:

val df1 = Seq(("[1,10,20]", "bat","43243"),("[20,4,10]","mouse","4324432"),("[30,20,3]", "horse","4324234")).toDF("id", "word","userid") 
val df2 = Seq((1, "raj", "name"),(2, "kiran","name"),(3,"karnataka","state"),(4, "Andrapradesh","state")).toDF("id", "name", "code")

val flattenDf1 = df1.
  select(
    col("id"),
    expr("""split(regexp_replace(id, "\\[|\\]",""), ",")""").as("idArray"), col("word"),
    col("userid")).
  withColumn("id_", explode(col("idArray"))).
  drop("idArray")

df2.as("df2").
  join(
    flattenDf1.as("df1"),
    col("df2.id") === col("df1.id_")).
  filter("code = 'state'").
  select("df1.id", "word", "userid", "name").
  show
// Result: 
// +---------+-----+-------+------------+
// |       id| word| userid|        name|
// +---------+-----+-------+------------+
// |[30,20,3]|horse|4324234|   karnataka|
// |[20,4,10]|mouse|4324432|Andrapradesh|
// +---------+-----+-------+------------+
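The join-and-filter logic above can also be checked without a Spark session. Below is a minimal plain-Scala sketch of the same explode/join/filter steps over the sample rows; the names `rows1`, `rows2`, and `parseIds` are mine, not Spark API:

```scala
// Plain-Scala sketch (no Spark) of the flatten + join + filter logic
case class Row1(id: String, word: String, userid: String)
case class Row2(id: Int, name: String, code: String)

val rows1 = Seq(Row1("[1,10,20]", "bat", "43243"),
                Row1("[20,4,10]", "mouse", "4324432"),
                Row1("[30,20,3]", "horse", "4324234"))
val rows2 = Seq(Row2(1, "raj", "name"), Row2(2, "kiran", "name"),
                Row2(3, "karnataka", "state"), Row2(4, "Andrapradesh", "state"))

// Parse "[1,10,20]" into Seq(1, 10, 20)
def parseIds(s: String): Seq[Int] =
  s.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt).toSeq

val result = for {
  r2 <- rows2 if r2.code == "state"              // keep only the state rows
  r1 <- rows1 if parseIds(r1.id).contains(r2.id) // membership test stands in for the join
} yield (r1.id, r1.word, r1.userid, r2.name)

result.foreach(println)
// ([30,20,3],horse,4324234,karnataka)
// ([20,4,10],mouse,4324432,Andrapradesh)
```

This reproduces the two rows in the Spark result above.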

Hope this helps.

Answer 1: (score: 0)

You can simply use a UDF as the join condition. Note that df1's id column is a string such as "[1,10,20]", not an array, so the UDF has to parse it before the membership test:

import org.apache.spark.sql.functions.udf

// Parse the bracketed string into ints, then check membership of a single id
val arrayJoin = udf { (ids: String, v: Int) =>
  ids.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt).contains(v)
}

val result = df1
  .join(df2, arrayJoin(df1("id"), df2("id"))) // join using the UDF
  .filter(df2("code") === "state")            // keep only the state rows, per the question
  .drop(df2("id")).drop(df2("code"))          // drop the unneeded df2 columns