Using Spark Streaming with Java, I am trying to denormalize two dataframes into a single flattened dataframe. The practice dataframe can contain duplicate records (for the primary key practice_id), so before joining I want to filter out the older records (based on the updated_ts column).
Practice dataframe:
+--------------------+----------------+-----------+------------------+
|          updated_ts|   practice_name|practice_id|primary_address_id|
+--------------------+----------------+-----------+------------------+
|2019-03-23T17:08:42Z|    Fal Vet Shop|          1|                 1|
|2019-03-29T03:06:42Z|Fal Vet Shop AAA|          1|                 1|
|2019-03-27T01:45:26Z|       Test Shop|          2|                 2|
+--------------------+----------------+-----------+------------------+
Address dataframe:
+--------------------+------------+------------+--------+------------+----------+----------+-----------+
|          updated_ts|country_code|address_type|    city|    address1|address_id|state_code|postal_code|
+--------------------+------------+------------+--------+------------+----------+----------+-----------+
|2019-01-20T20:10:39Z|          US|        HOME|Falmouth|5 Country Ln|         1|        ME|      04105|
|2019-01-20T15:09:09Z|          US|         BIZ|Falmouth| 13 Main St.|         2|        ME|      04105|
+--------------------+------------+------------+--------+------------+----------+----------+-----------+
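For a reproducible setup, the two frames can be built roughly like this (a minimal batch sketch; in the real job they come from the stream, and the column types, the local SparkSession, and the variable names spark, dfPractices and dfAddresses are just my assumptions here):

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
        .appName("practice-denormalize")
        .master("local[*]")
        .getOrCreate();

// Practice frame: practice_id 1 appears twice, with different updated_ts values.
StructType practiceSchema = new StructType()
        .add("updated_ts", DataTypes.StringType)
        .add("practice_name", DataTypes.StringType)
        .add("practice_id", DataTypes.LongType)
        .add("primary_address_id", DataTypes.LongType);

Dataset<Row> dfPractices = spark.createDataFrame(Arrays.asList(
        RowFactory.create("2019-03-23T17:08:42Z", "Fal Vet Shop", 1L, 1L),
        RowFactory.create("2019-03-29T03:06:42Z", "Fal Vet Shop AAA", 1L, 1L),
        RowFactory.create("2019-03-27T01:45:26Z", "Test Shop", 2L, 2L)), practiceSchema);

// Address frame: exactly one row per address_id.
StructType addressSchema = new StructType()
        .add("updated_ts", DataTypes.StringType)
        .add("country_code", DataTypes.StringType)
        .add("address_type", DataTypes.StringType)
        .add("city", DataTypes.StringType)
        .add("address1", DataTypes.StringType)
        .add("address_id", DataTypes.LongType)
        .add("state_code", DataTypes.StringType)
        .add("postal_code", DataTypes.StringType);

Dataset<Row> dfAddresses = spark.createDataFrame(Arrays.asList(
        RowFactory.create("2019-01-20T20:10:39Z", "US", "HOME", "Falmouth", "5 Country Ln", 1L, "ME", "04105"),
        RowFactory.create("2019-01-20T15:09:09Z", "US", "BIZ", "Falmouth", "13 Main St.", 2L, "ME", "04105")), addressSchema);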
How can I get the records shown below? Before joining the two dataframes I tried dfPractices.dropDuplicates("practice_id"), but it kept the older practice record (practice_name = Fal Vet Shop) instead of the newer one shown below (practice_name = Fal Vet Shop AAA). My current attempt is sketched at the end of the question.
+--------------------+----------------+-----------+------------------+--------------------+------------+------------+--------+------------+----------+----------+-----------+
|          updated_ts|   practice_name|practice_id|primary_address_id|          updated_ts|country_code|address_type|    city|    address1|address_id|state_code|postal_code|
+--------------------+----------------+-----------+------------------+--------------------+------------+------------+--------+------------+----------+----------+-----------+
|2019-03-29T03:06:42Z|Fal Vet Shop AAA|          1|                 1|2019-01-20T20:10:39Z|          US|        HOME|Falmouth|5 Country Ln|         1|        ME|      04105|
|2019-03-27T01:45:26Z|       Test Shop|          2|                 2|2019-01-20T15:09:09Z|          US|         BIZ|Falmouth| 13 Main St.|         2|        ME|      04105|
+--------------------+----------------+-----------+------------------+--------------------+------------+------------+--------+------------+----------+----------+-----------+
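And this is roughly the code as it stands (continuing from the setup sketch above; the join on primary_address_id = address_id matches the sample data, and the dropDuplicates call is the part that is not doing what I need):

// Current attempt: drop duplicate practices by key before joining.
// dropDuplicates("practice_id") keeps whichever row Spark happens to see first
// for a given key, not the row with the latest updated_ts, so the older
// "Fal Vet Shop" record can survive instead of "Fal Vet Shop AAA".
Dataset<Row> dedupedPractices = dfPractices.dropDuplicates("practice_id");

// Flatten each practice with its primary address.
Dataset<Row> flattened = dedupedPractices.join(
        dfAddresses,
        dedupedPractices.col("primary_address_id").equalTo(dfAddresses.col("address_id")),
        "inner");

flattened.show(false);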