I am trying to get a CROSS JOIN of 2 DataFrames. I am using Spark 2.0. How can I implement a CROSS JOIN with 2 DataFrames?
Edit:
val df=df.join(df_t1, df("Col1")===df_t1("col")).join(df2,joinType=="cross join").where(df("col2")===df2("col2"))
Answer 0 (score: 6)
Upgrade to the latest spark-sql_2.11 version, 2.1.0, and use the crossJoin function on Datasets.
Answer 1 (score: 3)
Use crossJoin if you don't need to specify a condition.
Here is an excerpt of working code:
people.crossJoin(area).show()
Answer 2 (score: 2)
Call join with the other DataFrame, without a join condition.
See the following example. Given a first DataFrame of people:
+---+------+-------+------+
| id| name| mail|idArea|
+---+------+-------+------+
| 1| Jack|j@j.com| 1|
| 2|Valery|x@v.com| 1|
| 3| Karl|k@k.com| 2|
| 4| Nick|n@n.com| 2|
| 5| Luke|l@f.com| 3|
| 6| Marek|a@b.com| 3|
+---+------+-------+------+
and a second DataFrame of areas:
+------+--------------+
|idArea| areaName|
+------+--------------+
| 1|Amministration|
| 2| Public|
| 3| Store|
+------+--------------+
the cross join is simply given by:
val cross = people.join(area)
+---+------+-------+------+------+--------------+
| id| name| mail|idArea|idArea| areaName|
+---+------+-------+------+------+--------------+
| 1| Jack|j@j.com| 1| 1|Amministration|
| 1| Jack|j@j.com| 1| 3| Store|
| 1| Jack|j@j.com| 1| 2| Public|
| 2|Valery|x@v.com| 1| 1|Amministration|
| 2|Valery|x@v.com| 1| 3| Store|
| 2|Valery|x@v.com| 1| 2| Public|
| 3| Karl|k@k.com| 2| 1|Amministration|
| 3| Karl|k@k.com| 2| 2| Public|
| 3| Karl|k@k.com| 2| 3| Store|
| 4| Nick|n@n.com| 2| 3| Store|
| 4| Nick|n@n.com| 2| 2| Public|
| 4| Nick|n@n.com| 2| 1|Amministration|
| 5| Luke|l@f.com| 3| 2| Public|
| 5| Luke|l@f.com| 3| 3| Store|
| 5| Luke|l@f.com| 3| 1|Amministration|
| 6| Marek|a@b.com| 3| 1|Amministration|
| 6| Marek|a@b.com| 3| 2| Public|
| 6| Marek|a@b.com| 3| 3| Store|
+---+------+-------+------+------+--------------+
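Conceptually, a join with no condition is just the Cartesian product of the two row sets. As a minimal pure-Python sketch of the same semantics (independent of Spark, using the data from the example above):

```python
from itertools import product

# Rows copied from the people and area DataFrames above.
people = [
    (1, "Jack", "j@j.com", 1),
    (2, "Valery", "x@v.com", 1),
    (3, "Karl", "k@k.com", 2),
    (4, "Nick", "n@n.com", 2),
    (5, "Luke", "l@f.com", 3),
    (6, "Marek", "a@b.com", 3),
]
areas = [
    (1, "Amministration"),
    (2, "Public"),
    (3, "Store"),
]

# A cross join pairs every left row with every right row.
cross = [p + a for p, a in product(people, areas)]
print(len(cross))  # 6 people x 3 areas = 18 rows
```

This also explains the row count in the output above: 6 × 3 = 18 rows, with both idArea columns preserved.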
Answer 3 (score: 1)
You may need to enable crossJoin in the Spark configuration. Example:
spark = SparkSession \
    .builder \
    .appName("distance_matrix") \
    .config("spark.sql.crossJoin.enabled", True) \
    .getOrCreate()
and then use something like:
df1.join(df2, <condition>)
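With crossJoin enabled, a join with a condition is equivalent to filtering the full Cartesian product. A hedged pure-Python sketch of those semantics (the idArea-equality condition here is illustrative, not from the original answer):

```python
from itertools import product

# Small illustrative row sets; the last person field is idArea.
people = [(1, "Jack", 1), (2, "Valery", 1), (3, "Karl", 2)]
areas = [(1, "Amministration"), (2, "Public"), (3, "Store")]

# df1.join(df2, condition) keeps only the pairs of the product
# for which the condition holds -- here, matching idArea values.
joined = [p + a for p, a in product(people, areas) if p[2] == a[0]]
print(joined)
```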
Answer 4 (score: 1)
If the area data is small, you can do it via explode without shuffling:
import org.apache.spark.sql.functions.{array, explode, lit, struct}
import spark.implicits._  // for toDF and the $ column syntax

val df1 = Seq(
  (1, "Jack", "j@j.com", 1),
  (2, "Valery", "x@v.com", 1),
  (3, "Karl", "k@k.com", 2),
  (4, "Nick", "n@n.com", 2),
  (5, "Luke", "l@f.com", 3),
  (6, "Marek", "a@b.com", 3)
).toDF("id", "name", "mail", "idArea")

// Build a literal array of structs holding the small area table.
val arr = array(
  Seq(
    (1, "Amministration"),
    (2, "Public"),
    (3, "Store")
  ).map(r => struct(lit(r._1).as("idArea"), lit(r._2).as("areaName"))): _*
)

// Explode the array into one row per area, then flatten the struct.
val cross = df1
  .withColumn("d", explode(arr))
  .withColumn("idArea", $"d.idArea")  // overwrites the original idArea
  .withColumn("areaName", $"d.areaName")
  .drop("d")
df1.show
cross.show
Output:
+---+------+-------+------+
| id| name| mail|idArea|
+---+------+-------+------+
| 1| Jack|j@j.com| 1|
| 2|Valery|x@v.com| 1|
| 3| Karl|k@k.com| 2|
| 4| Nick|n@n.com| 2|
| 5| Luke|l@f.com| 3|
| 6| Marek|a@b.com| 3|
+---+------+-------+------+
+---+------+-------+------+--------------+
| id| name| mail|idArea| areaName|
+---+------+-------+------+--------------+
| 1| Jack|j@j.com| 1|Amministration|
| 1| Jack|j@j.com| 2| Public|
| 1| Jack|j@j.com| 3| Store|
| 2|Valery|x@v.com| 1|Amministration|
| 2|Valery|x@v.com| 2| Public|
| 2|Valery|x@v.com| 3| Store|
| 3| Karl|k@k.com| 1|Amministration|
| 3| Karl|k@k.com| 2| Public|
| 3| Karl|k@k.com| 3| Store|
| 4| Nick|n@n.com| 1|Amministration|
| 4| Nick|n@n.com| 2| Public|
| 4| Nick|n@n.com| 3| Store|
| 5| Luke|l@f.com| 1|Amministration|
| 5| Luke|l@f.com| 2| Public|
| 5| Luke|l@f.com| 3| Store|
| 6| Marek|a@b.com| 1|Amministration|
| 6| Marek|a@b.com| 2| Public|
| 6| Marek|a@b.com| 3| Store|
+---+------+-------+------+--------------+
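The explode trick avoids a shuffle because the small area list is embedded as a literal array in every row and expanded locally. A pure-Python sketch of that per-row expansion (not the Spark API, just the semantics):

```python
# The small dimension table is held in memory, like the literal
# array of structs built with lit/struct in the Scala code above.
areas = [(1, "Amministration"), (2, "Public"), (3, "Store")]

people = [(1, "Jack", "j@j.com", 1), (2, "Valery", "x@v.com", 1)]

def explode_areas(rows, areas):
    # Each input row expands in place into one output row per
    # area, overwriting idArea with the exploded value, just as
    # withColumn("idArea", $"d.idArea") does in the Scala example.
    for (pid, name, mail, _id_area) in rows:
        for (id_area, area_name) in areas:
            yield (pid, name, mail, id_area, area_name)

print(list(explode_areas(people, areas)))  # 2 x 3 = 6 rows
```

Because every row is expanded independently, no data needs to move between partitions, which is why this is cheaper than a shuffle-based cross join when one side is tiny.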