Pyspark / SQL:添加一个标志列,如左半连接

时间:2018-05-24 13:10:55

标签: sql apache-spark pyspark

例如,数据是

customer = spark.createDataFrame([
    (0, "Bill Chambers"),
    (1, "Matei Zaharia"),
    (2, "Michael Armbrust")])\
  .toDF("customerid", "name")

order = spark.createDataFrame([
    (0, 0, "Product 0"),
    (1, 1, "Product 1"),
    (2, 1, "Product 2"),
    (3, 3, "Product 3"),
    (4, 1, "Product 4")])\
  .toDF("orderid", "customerid", "product_name")

通过订单获取客户,我可以使用left semi

来完成
customer.join(order, ['customerid'], "left_semi").show()

可以返回

enter image description here

现在出于比较原因,我想添加一个标志列,而不是直接过滤掉一些行。所需的输出如下所示:

+----------+----------------+---------+ 
|customerid|            name|has_order| 
+----------+----------------+---------+ 
|         0| Bill Chambers  |     true| 
|         1| Matei Zaharia  |     true| 
|         2|Michael Armbrust|    false| 
+----------+----------------+---------+

我该怎么办?有没有优雅的方法呢?我试图搜索但没找到相关的东西,也许我得错了关键词?

可以使用SQL存在/ in?:Spark replacement for EXISTS and IN

1 个答案:

答案 0 :(得分:0)

您可以执行左连接,并使用pyspark.sql.Column.isNull()根据has_order列是否为空来创建orderid列。然后使用distinct()删除重复的行。

import pyspark.sql.functions as f
customer.alias("c").join(order.alias("o"), on=["customerid"], how="left")\
    .select(
        "c.*",
        f.col("o.orderid").isNull().alias("has_order")
    )\
    .distinct()\
    .show()
#+----------+----------------+---------+
#|customerid|            name|has_order|
#+----------+----------------+---------+
#|         0|   Bill Chambers|     true|
#|         1|   Matei Zaharia|     true|
#|         2|Michael Armbrust|    false|
#+----------+----------------+---------+

如果你想使用与你正在使用的左半连接类似的东西,你可以结合左半连接和左反连接的结果:

cust_left_semi = customer.join(order, ['customerid'], "leftsemi")\
    .withColumn('has_order', f.lit(True))
cust_left_semi.show()
#+----------+-------------+---------+
#|customerid|         name|has_order|
#+----------+-------------+---------+
#|         0|Bill Chambers|     true|
#|         1|Matei Zaharia|     true|
#+----------+-------------+---------+

cust_left_anti = customer.join(order, ['customerid'], "leftanti")\
    .withColumn('has_order', f.lit(False))
cust_left_anti.show()
#+----------+----------------+---------+
#|customerid|            name|has_order|
#+----------+----------------+---------+
#|         2|Michael Armbrust|    false|
#+----------+----------------+---------+

cust_left_semi.union(cust_left_anti).show()
#+----------+----------------+---------+
#|customerid|            name|has_order|
#+----------+----------------+---------+
#|         0|   Bill Chambers|     true|
#|         1|   Matei Zaharia|     true|
#|         2|Michael Armbrust|    false|
#+----------+----------------+---------+