Joining two pyspark dataframes on the unique values of a column

Asked: 2019-10-16 09:55:19

Tags: dataframe pyspark

Let's say I have two pyspark dataframes, users and shops. A few sample rows from both dataframes are shown below.

Users dataframe:

+---------+-------------+---------+
| idvalue | day-of-week | geohash |
+---------+-------------+---------+
| id-1    |           2 | gcutjjn |
| id-1    |           3 | gcutjjn |
| id-1    |           5 | gcutjht |
+---------+-------------+---------+

Shops dataframe:

+---------+-----------+---------+
| shop-id | shop-name | geohash |
+---------+-----------+---------+
| sid-1   | kfc       | gcutjjn |
| sid-2   | mcd       | gcutjhq |
| sid-3   | starbucks | gcutjht |
+---------+-----------+---------+

I need to join these two dataframes on the geohash column. I could certainly do a naive equi-join, but the users dataframe is huge, containing billions of rows, and geohashes are likely to repeat both within and across idvalues. So I was wondering whether there is a way to perform the join between the unique geohashes in the users dataframe and the geohashes in the shops dataframe. If we could do that, it would then be easy to replicate the shop entries in the resulting dataframe to match the geohashes.
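Roughly, what I have in mind is something like the following sketch (just an illustration, assuming the dataframes are named users and shops):

# Sketch only: join shops against the distinct user geohashes first,
# then attach the shop columns back to the full users dataframe.
distinct_geo = users.select("geohash").distinct()
geo_to_shop = distinct_geo.join(shops, "geohash", "inner")   # unique geohash -> shop mapping
result = users.join(geo_to_shop, "geohash", "inner")         # shop rows replicated per user row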

This could probably also be done with a pandas udf: I would do a groupby on users.idvalue, join with shops inside the udf, and keep just the first row of each group (since all ids within a group are always the same) to build a one-row dataframe. Logically this seems like it should work, but I'm not sure about the performance, since udfs are usually slower than Spark's native transformations. Any ideas are welcome.
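A rough sketch of the pandas udf idea (the underscore column names and the output schema are my assumptions, shops is collected to the driver so this only works if shops is reasonably small, and applyInPandas requires Spark 3.0+):

import pandas as pd

# Assumption: shops is small enough to bring to the driver as a pandas DataFrame.
shops_pd = shops.toPandas()

def attach_shops(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all user rows for one idvalue; merge its geohashes with the shops table
    return pdf.merge(shops_pd, on="geohash", how="inner")

result = users.groupBy("idvalue").applyInPandas(
    attach_shops,
    schema="idvalue string, day_of_week int, geohash string, shop_id string, shop_name string",
)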

2 Answers:

Answer 0: (score: 1)

You said that your users dataframe is huge and that "geohashes are likely to repeat within and across idvalues". However, you did not mention whether there may be duplicated geohashes in the shops dataframe.

If there are no duplicated hashes in the latter, I think a simple join will solve your problem:

val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht")).toDF("shop_id","shop_name","geohash")

userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
|   id-1|          2|gcutjjn|
|   id-2|          2|gcutjjn|
|   id-1|          3|gcutjjn|
|   id-1|          5|gcutjht|
+-------+-----------+-------+

shopDf.show
+-------+---------+-------+
|shop_id|shop_name|geohash|
+-------+---------+-------+
|  sid-1|      kfc|gcutjjn|
|  sid-2|      mcd|gcutjhq|
|  sid-3|starbucks|gcutjht|
+-------+---------+-------+

shopDf
    .join(userDf,Seq("geohash"),"inner")
    .groupBy($"geohash",$"shop_id",$"idvalue")
    .agg(collect_list($"day_of_week").alias("days"))
    .show
+-------+-------+-------+------+
|geohash|shop_id|idvalue|  days|
+-------+-------+-------+------+
|gcutjjn|  sid-1|   id-1|[2, 3]|
|gcutjht|  sid-3|   id-1|   [5]|
|gcutjjn|  sid-1|   id-2|   [2]|
+-------+-------+-------+------+

If you do have duplicated hash values in the shops dataframe, one possible approach is to remove those duplicated hashes from the shops dataframe (if your requirements allow it) and then perform the same join operation:

val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht"),("sid-4","burguer king","gcutjjn")).toDF("shop_id","shop_name","geohash")

userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
|   id-1|          2|gcutjjn|
|   id-2|          2|gcutjjn|
|   id-1|          3|gcutjjn|
|   id-1|          5|gcutjht|
+-------+-----------+-------+

shopDf.show
+-------+------------+-------+
|shop_id|   shop_name|geohash|
+-------+------------+-------+
|  sid-1|         kfc|gcutjjn|  <<  Duplicated geohash
|  sid-2|         mcd|gcutjhq|
|  sid-3|   starbucks|gcutjht|
|  sid-4|burguer king|gcutjjn|  <<  Duplicated geohash
+-------+------------+-------+

//Dataframe with hashes to exclude:
val excludedHashes = shopDf.groupBy("geohash").count.filter("count > 1")
excludedHashes.show
+-------+-----+
|geohash|count|
+-------+-----+
|gcutjjn|    2|
+-------+-----+

//Create a dataframe of shops without the ones with duplicated hashes
val cleanShopDf = shopDf.join(excludedHashes,Seq("geohash"),"left_anti")
cleanShopDf.show
+-------+-------+---------+
|geohash|shop_id|shop_name|
+-------+-------+---------+
|gcutjhq|  sid-2|      mcd|
|gcutjht|  sid-3|starbucks|
+-------+-------+---------+

//Perform the same join operation
cleanShopDf.join(userDf,Seq("geohash"),"inner")
    .groupBy($"geohash",$"shop_id",$"idvalue")
    .agg(collect_list($"day_of_week").alias("days"))
    .show
+-------+-------+-------+----+
|geohash|shop_id|idvalue|days|
+-------+-------+-------+----+
|gcutjht|  sid-3|   id-1| [5]|
+-------+-------+-------+----+

The code provided is written in Scala, but it can easily be translated to Python.
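For example, a minimal PySpark sketch of the first join above, assuming an active SparkSession named spark:

from pyspark.sql import functions as F

user_df = spark.createDataFrame(
    [("id-1", 2, "gcutjjn"), ("id-2", 2, "gcutjjn"),
     ("id-1", 3, "gcutjjn"), ("id-1", 5, "gcutjht")],
    ["idvalue", "day_of_week", "geohash"])
shop_df = spark.createDataFrame(
    [("sid-1", "kfc", "gcutjjn"), ("sid-2", "mcd", "gcutjhq"),
     ("sid-3", "starbucks", "gcutjht")],
    ["shop_id", "shop_name", "geohash"])

(shop_df
    .join(user_df, ["geohash"], "inner")
    .groupBy("geohash", "shop_id", "idvalue")
    .agg(F.collect_list("day_of_week").alias("days"))
    .show())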

Hope this helps!

Answer 1: (score: 0)

Here is an idea: with pyspark SQL you could select the distinct geohashes and register them as a temporary table, then join against that table instead of the full dataframe.
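A sketch of that approach (assuming the dataframes are named users and shops, underscore column names as in the other answer, and an active SparkSession named spark):

# Register the dataframes and build a temp view holding only the distinct user geohashes.
users.createOrReplaceTempView("users")
shops.createOrReplaceTempView("shops")

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW distinct_user_geohashes AS
    SELECT DISTINCT geohash FROM users
""")

# Join shops against the distinct geohashes instead of the full users table.
spark.sql("""
    SELECT s.shop_id, s.shop_name, g.geohash
    FROM shops s
    JOIN distinct_user_geohashes g
      ON s.geohash = g.geohash
""").show()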