How do I implement a CROSS JOIN in Spark?

Asked: 2014-07-21 05:56:54

Tags: apache-spark cross-join

We are planning to move our Apache Pig code to the new Spark platform.

Pig has the concepts of Bag / Tuple / Field and behaves much like a relational database. Pig provides support for CROSS / INNER / OUTER joins.

For a CROSS JOIN, we can use alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];

But as we move to the Spark platform, I cannot find any counterpart in the Spark API. Do you have any ideas?

2 Answers:

Answer 0 (score: 20):

oneRDD.cartesian(anotherRDD)
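
A minimal spark-shell sketch of how cartesian behaves (the two RDDs below are illustrative):

val oneRDD = sc.parallelize(Seq(1, 2, 3))
val anotherRDD = sc.parallelize(Seq("a", "b"))

// cartesian pairs every element of oneRDD with every element of anotherRDD: RDD[(Int, String)]
val crossed = oneRDD.cartesian(anotherRDD)
crossed.count   // 3 * 2 = 6 pairs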

Answer 1 (score: 3):

Here is the recommended version for Spark 2.x Datasets and DataFrames:

scala> val ds1 = spark.range(10)
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> ds1.cache.count
res1: Long = 10

scala> val ds2 = spark.range(10)
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> ds2.cache.count
res2: Long = 10

scala> val crossDS1DS2 = ds1.crossJoin(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]

scala> crossDS1DS2.count
res3: Long = 100

Alternatively, the traditional JOIN syntax with no join condition can be used. Set the configuration option below to avoid the error shown further down.

spark.conf.set("spark.sql.crossJoin.enabled", true)
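
With the option set, the same no-condition join goes through as a cartesian product. A minimal sketch, reusing the ds1 and ds2 Datasets defined above:

scala> spark.conf.set("spark.sql.crossJoin.enabled", true)

scala> ds1.join(ds2).count   // no join condition; should return 100, the same result as crossJoin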

The error when that configuration is omitted (using the plain join syntax):

scala> val crossDS1DS2 = ds1.join(ds2)
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint]

scala> crossDS1DS2.count
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
...
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
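
As the error message suggests, the explicit CROSS JOIN syntax also works without changing the configuration flag. A minimal sketch, assuming the Datasets above are registered as temporary views (the view names t1 and t2 are illustrative):

scala> ds1.createOrReplaceTempView("t1")
scala> ds2.createOrReplaceTempView("t2")
scala> spark.sql("SELECT * FROM t1 CROSS JOIN t2").count   // explicit CROSS JOIN; should also return 100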

Related: spark.sql.crossJoin.enabled for Spark 2.x