val df1 = Seq(("[1,10,20]", "bat","43243"),("[20,4,10]","mouse","4324432"),("[30,20,3]", "horse","4324234")).toDF("id", "word","userid")
val df2 = Seq((1, "raj", "name"),(2, "kiran","name"),(3,"karnataka","state"),(4, "Andrapradesh","state")).toDF("id", "name", "code")
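Explanation:
I have two data frames, df1 and df2. The id column of df1 holds a list of IDs. I need to check whether any of those IDs exist in the id column of df2.
Condition:
If any ID from df1 is present in the id column of df2 and its code is state, get the name for that id from df2 and create a new data frame that includes the name column.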
Expected output:
id
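Answer 0 (score: 1)
You can first flatten the id column by converting it to an array and applying explode. You can then apply an ordinary join between the two data frames.
For example: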
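import org.apache.spark.sql.functions.{col, explode, expr}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark (spark-shell imports this automatically)

val df1 = Seq(("[1,10,20]", "bat","43243"),("[20,4,10]","mouse","4324432"),("[30,20,3]", "horse","4324234")).toDF("id", "word","userid")
val df2 = Seq((1, "raj", "name"),(2, "kiran","name"),(3,"karnataka","state"),(4, "Andrapradesh","state")).toDF("id", "name", "code")

// Strip the brackets, split on commas, then explode to one row per single id.
val flattenDf1 = df1.
  select(
    col("id"),
    expr("""split(regexp_replace(id, "\\[|\\]",""), ",")""").as("idArray"),
    col("word"),
    col("userid")).
  withColumn("id_", explode(col("idArray"))).
  drop("idArray")

// Join on the exploded id and keep only the df2 rows whose code is 'state'.
df2.as("df2").
  join(
    flattenDf1.as("df1"),
    col("df2.id") === col("df1.id_")).
  filter("code = 'state'").
  select("df1.id", "word", "userid", "name").
  show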
// Result:
// +---------+-----+-------+------------+
// | id| word| userid| name|
// +---------+-----+-------+------------+
// |[30,20,3]|horse|4324234| karnataka|
// |[20,4,10]|mouse|4324432|Andrapradesh|
// +---------+-----+-------+------------+
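To see why only the mouse and horse rows survive, the same flatten-then-join logic can be sketched with plain Scala collections, without Spark. This is only an illustration of the idea, not how Spark executes it; Row1, Row2, parseIds and FlattenJoinSketch are hypothetical names introduced for the sketch.

```scala
object FlattenJoinSketch {
  // Hypothetical stand-ins for the rows of df1 and df2.
  case class Row1(id: String, word: String, userid: String)
  case class Row2(id: Int, name: String, code: String)

  val rows1 = Seq(
    Row1("[1,10,20]", "bat", "43243"),
    Row1("[20,4,10]", "mouse", "4324432"),
    Row1("[30,20,3]", "horse", "4324234"))
  val rows2 = Seq(
    Row2(1, "raj", "name"), Row2(2, "kiran", "name"),
    Row2(3, "karnataka", "state"), Row2(4, "Andrapradesh", "state"))

  // Strip the brackets and split on commas (the role of regexp_replace + split).
  def parseIds(s: String): Seq[Int] =
    s.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt).toSeq

  // "Explode": one (row, single id) pair per element of the id list.
  val exploded = for (r <- rows1; i <- parseIds(r.id)) yield (r, i)

  // Inner join on the single id, keeping only rows2 entries whose code is "state".
  val joined = for {
    (r, i) <- exploded
    s <- rows2 if s.id == i && s.code == "state"
  } yield (r.id, r.word, r.userid, s.name)

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

The bat row does match id 1 in df2, but that entry has code name, so the state filter removes it.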
Hope this helps.
Answer 1 (score: 0)
You can simply use a UDF as the join condition:
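import org.apache.spark.sql.functions.udf

// df1("id") is a string such as "[1,10,20]", not an array, so parse it before testing membership.
val arrayJoin = udf {
  (a: String, v: Int) => a.stripPrefix("[").stripSuffix("]").split(",").map(_.trim).contains(v.toString)
}
val result = df1
  .join(df2, arrayJoin(df1("id"), df2("id"))) // join using the UDF: id list first, single id second
  .drop(df2("id")).drop(df2("code")) // drop the unnecessary df2 columns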
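The membership test that such a UDF has to perform can be checked outside Spark. The following is a minimal sketch, assuming the id list is stored as a bracketed, comma-separated string as in df1; IdListPredicate and idListContains are hypothetical names used only here. It compares whole tokens rather than substrings, so looking up 2 does not accidentally match 20.

```scala
object IdListPredicate {
  // Hypothetical helper: true when `v` is one of the comma-separated ids in `list`.
  def idListContains(list: String, v: Int): Boolean =
    list.stripPrefix("[").stripSuffix("]")
      .split(",")
      .map(_.trim)
      .contains(v.toString)

  def main(args: Array[String]): Unit = {
    println(idListContains("[20,4,10]", 4)) // true: 4 is an element of the list
    println(idListContains("[20,4,10]", 2)) // false: 2 is only a substring of 20
  }
}
```

Keep in mind that a UDF used as a join condition cannot be turned into an equi-join, so Spark generally has to compare every pair of rows; on large inputs the explode-and-join approach is likely to scale better.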