I have defined two tables like this:
val tableName = "table1"
val tableName2 = "table2"
val format = new SimpleDateFormat("yyyy-MM-dd")
val data = List(
  List("mike", 26, true),
  List("susan", 26, false),
  List("john", 33, true)
)
val data2 = List(
  List("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)),
  List("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)),
  List("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime))
)
val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
val rdd2 = sparkContext.parallelize(data2).map(Row.fromSeq(_))
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("isBoy", BooleanType, false)
))
val schema2 = StructType(Array(
  StructField("name", StringType, true),
  StructField("grade", StringType, true),
  StructField("howold", IntegerType, true),
  StructField("hobby", StringType, true),
  StructField("birthday", DateType, false)
))
val df = sqlContext.createDataFrame(rdd, schema)
val df2 = sqlContext.createDataFrame(rdd2, schema2)
df.createOrReplaceTempView(tableName)
df2.createOrReplaceTempView(tableName2)
I'm trying to build a query that returns the rows of table1 that have no matching row in table2. I tried to do it with this query:
Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold AND table2.name IS NULL AND table2.howold IS NULL
But this just gives me all the rows of table1:
List({"name":"john","age":33,"isBoy":true}, {"name":"susan","age":26,"isBoy":false}, {"name":"mike","age":26,"isBoy":true})
How can I do this type of join efficiently in Spark?

I'm looking for an SQL query because I need to be able to specify which columns to compare between the two tables, rather than comparing row by row as in other recommended questions (e.g. with subtract, except, etc.).
Answer 0 (score: 18)
You can use the "left anti" join type, either with the DataFrame API or with SQL (the DataFrame API supports everything SQL supports, including any join condition you need):
DataFrame API:
df.as("table1").join(
  df2.as("table2"),
  $"table1.name" === $"table2.name" && $"table1.age" === $"table2.howold",
  "leftanti"
)
SQL:
sqlContext.sql(
"""SELECT table1.* FROM table1
| LEFT ANTI JOIN table2
| ON table1.name = table2.name AND table1.age = table2.howold
""".stripMargin)
Note: it's also worth mentioning that the sample data could have been created more briefly and concisely with tuples and the implicit toDF method, without specifying the schema separately, and then "fixing" the automatically inferred schema where needed.
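For example, a minimal sketch of that tuple-based approach (assuming a Spark 2.x `SparkSession` named `spark` is in scope; the `withColumn` cast is one way to "fix" an inferred column type):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// Tuples let Spark infer the schema; column names are passed to toDF.
val df = Seq(
  ("mike", 26, true),
  ("susan", 26, false),
  ("john", 33, true)
).toDF("name", "age", "isBoy")

// "Fix" an inferred type afterwards, e.g. cast the string birthday to a date.
val df2 = Seq(
  ("mike", "grade1", 45, "baseball", "1957-12-10"),
  ("john", "grade2", 33, "soccer", "1978-06-07")
).toDF("name", "grade", "howold", "hobby", "birthday")
  .withColumn("birthday", $"birthday".cast("date"))
```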
Answer 1 (score: 3)
You can do it with the built-in function except.

(I would have used the code you provided, but you didn't include the imports, so I couldn't just c/p it :()
val a = sc.parallelize(Seq((1,"a",123),(2,"b",456))).toDF("col1","col2","col3")
val b = sc.parallelize(Seq((4,"a",432),(2,"t",431),(2,"b",456))).toDF("col1","col2","col3")
scala> a.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 123|
| 2| b| 456|
+----+----+----+
scala> b.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 4| a| 432|
| 2| t| 431|
| 2| b| 456|
+----+----+----+
scala> a.except(b).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 123|
+----+----+----+
Answer 2 (score: 0)
You can use left anti.
dfRcc20.as("a").join(dfClientesDuplicados.as("b"),
  col("a.eteerccdiid") === col("b.eteerccdiid") &&
    col("a.eteerccdinr") === col("b.eteerccdinr"),
  "left_anti")
Answer 3 (score: -1)
In SQL you can simply use the query below (not sure if it works in SPARK):
Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold where table2.name IS NULL
This will return all rows of table1 for which the join failed.
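For what it's worth, this left-join-plus-null-filter pattern does run in Spark SQL as well; a sketch against the temp views registered in the question (assuming the same `sqlContext` is in scope):

```scala
// LEFT JOIN plus a WHERE ... IS NULL filter is equivalent to a LEFT ANTI JOIN:
// it keeps only the table1 rows for which no table2 row matched.
val unmatched = sqlContext.sql(
  """SELECT table1.*
    |FROM table1
    |LEFT JOIN table2
    |  ON table1.name = table2.name AND table1.age = table2.howold
    |WHERE table2.name IS NULL
  """.stripMargin)
unmatched.show()
```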
Answer 4 (score: -1)
Left anti join on Datasets in Spark (Java):

A left anti join returns all rows from the first dataset that do not have a match in the second dataset.
Example with code:
/*Read data from Employee.csv */
Dataset<Row> employee = sparkSession.read().option("header", "true")
.csv("C:\\Users\\Desktop\\Spark\\Employee.csv");
employee.show();
/*Read data from Employee1.csv */
Dataset<Row> employee1 = sparkSession.read().option("header", "true")
.csv("C:\\Users\\Desktop\\Spark\\Employee1.csv");
employee1.show();
/*Apply left anti join*/
Dataset<Row> leftAntiJoin = employee.join(employee1, employee.col("name").equalTo(employee1.col("name")), "leftanti");
leftAntiJoin.show();
Output:
1) Employee dataset
+-------+--------+-------+
| name| address| salary|
+-------+--------+-------+
| Arun| Indore| 500|
|Shubham| Indore| 1000|
| Mukesh|Hariyana| 10000|
| Kanha| Bhopal| 100000|
| Nandan|Jabalpur|1000000|
| Raju| Rohtak|1000000|
+-------+--------+-------+
2) Employee1 dataset
+-------+--------+------+
| name| address|salary|
+-------+--------+------+
| Arun| Indore| 500|
|Shubham| Indore| 1000|
| Mukesh|Hariyana| 10000|
+-------+--------+------+
3) Applied leftanti join and final data
+------+--------+-------+
| name| address| salary|
+------+--------+-------+
| Kanha| Bhopal| 100000|
|Nandan|Jabalpur|1000000|
| Raju| Rohtak|1000000|
+------+--------+-------+