For example, suppose the data is:
customer = spark.createDataFrame([
    (0, "Bill Chambers"),
    (1, "Matei Zaharia"),
    (2, "Michael Armbrust")])\
    .toDF("customerid", "name")
order = spark.createDataFrame([
    (0, 0, "Product 0"),
    (1, 1, "Product 1"),
    (2, 1, "Product 2"),
    (3, 3, "Product 3"),
    (4, 1, "Product 4")])\
    .toDF("orderid", "customerid", "product_name")
To get the customers that have orders, I can use a left semi join:
customer.join(order, ['customerid'], "left_semi").show()
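This returns only the customers that have at least one order.
Now, for comparison purposes, I want to add a flag column instead of filtering rows out. The desired output looks like this: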
+----------+----------------+---------+
|customerid|            name|has_order|
+----------+----------------+---------+
|         0|   Bill Chambers|     true|
|         1|   Matei Zaharia|     true|
|         2|Michael Armbrust|    false|
+----------+----------------+---------+
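How can I do this? Is there an elegant way? I searched but could not find anything relevant; maybe I used the wrong keywords?
Can SQL EXISTS / IN be used? See: Spark replacement for EXISTS and IN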
Answer 0 (score: 0)
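You can do a left join and use pyspark.sql.Column.isNotNull() to create the has_order column based on whether the orderid column is non-null. A customer with several orders matches several rows in the join, so use distinct() afterwards to remove the duplicate rows.
import pyspark.sql.functions as f
customer.alias("c").join(order.alias("o"), on=["customerid"], how="left")\
    .select(
        "c.*",
        # orderid is NULL only for customers with no matching order
        f.col("o.orderid").isNotNull().alias("has_order")
    )\
    .distinct()\
    .show()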
#+----------+----------------+---------+
#|customerid|            name|has_order|
#+----------+----------------+---------+
#|         0|   Bill Chambers|     true|
#|         1|   Matei Zaharia|     true|
#|         2|Michael Armbrust|    false|
#+----------+----------------+---------+
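If you would rather use something similar to the left semi join you are already using, you can combine the results of a left semi join and a left anti join: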
cust_left_semi = customer.join(order, ['customerid'], "leftsemi")\
    .withColumn('has_order', f.lit(True))
cust_left_semi.show()
#+----------+-------------+---------+
#|customerid|         name|has_order|
#+----------+-------------+---------+
#|         0|Bill Chambers|     true|
#|         1|Matei Zaharia|     true|
#+----------+-------------+---------+
cust_left_anti = customer.join(order, ['customerid'], "leftanti")\
    .withColumn('has_order', f.lit(False))
cust_left_anti.show()
#+----------+----------------+---------+
#|customerid|            name|has_order|
#+----------+----------------+---------+
#|         2|Michael Armbrust|    false|
#+----------+----------------+---------+
cust_left_semi.union(cust_left_anti).show()
#+----------+----------------+---------+
#|customerid|            name|has_order|
#+----------+----------------+---------+
#|         0|   Bill Chambers|     true|
#|         1|   Matei Zaharia|     true|
#|         2|Michael Armbrust|    false|
#+----------+----------------+---------+