鉴于两个Spark 数据集,A和B我可以在单列上进行连接,如下所示:
a.joinWith(b, $"a.col" === $"b.col", "left")
我的问题是你是否可以使用多列进行连接。基本上相当于以下 DataFrames api代码:
a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")
答案 0 :(得分:12)
您可以使用与Dataframe
完全相同的方式执行此操作:
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS
xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// | _1| _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]| null|
// +------------+-----------+
在Spark< 2.0.0你可以使用这样的东西:
xs.as("xs").joinWith(
ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
答案 1 :(得分:8)
通过链接where
一个接一个地加入另一种方式。首先指定一个连接(以及可选的类型),后跟where
运算符,即
scala> case class A(id: Long, name: String)
defined class A
scala> case class B(id: Long, name: String)
defined class B
scala> val as = Seq(A(0, "zero"), A(1, "one")).toDS
as: org.apache.spark.sql.Dataset[A] = [id: bigint, name: string]
scala> val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS
bs: org.apache.spark.sql.Dataset[B] = [id: bigint, name: string]
scala> as.join(bs).where(as("id") === bs("id")).show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
| 0|zero| 0| zero|
| 1| one| 1|jeden|
+---+----+---+-----+
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show
+---+----+---+----+
| id|name| id|name|
+---+----+---+----+
| 0|zero| 0|zero|
+---+----+---+----+
这样一个好东西的原因是Spark优化器将加入(没有双关语)连续where
s到join
的一个。使用explain
运算符查看基础逻辑和物理计划。
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
+- Join Inner
:- LocalRelation [id#30L, name#31]
+- LocalRelation [id#35L, name#36]
== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
+- Join Inner
:- LocalRelation [id#30L, name#31]
+- LocalRelation [id#35L, name#36]
== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
: +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
+- LocalRelation [id#35L, name#36]
== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
: +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
+- *Filter isnotnull(name#36)
+- LocalTableScan [id#35L, name#36]
答案 2 :(得分:0)
在Java中,&&
运算符不起作用。基于Spark-Java中的多个列进行联接的正确方法如下:
Dataset<Row> datasetRf1 = joinedWithDays.join(
datasetFreq,
datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
.and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
"inner"
);
and
函数的作用类似于&&
运算符。