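I want to update old records in PySpark based on information that I may or may not receive from new instances of the same records. This is what the old table/DataFrame looks like: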
| FirstName | LastName | JoinDate   | SnapshotBeginDate | SnapshotEndDate |
---------------------------------------------------------------------------
| John      | Doe      | 2017-04-05 | 2017-05-04        | 2099-12-31      |
---------------------------------------------------------------------------
| Jane      | Smith    | 2018-04-05 | 2017-05-04        | 2099-12-31      |
---------------------------------------------------------------------------
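I don't want to simply append the new data to the existing DataFrame, and I don't want to overwrite the existing records either. Instead, I want to update the SnapshotEndDate of the old records and keep the new records alongside them.
For example: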
| FirstName | LastName | JoinDate   | SnapshotBeginDate | SnapshotEndDate |
---------------------------------------------------------------------------
| John      | Doe      | 2017-04-05 | 2017-05-04        | 2019-04-03      |
---------------------------------------------------------------------------
| Jane      | Smith    | 2018-04-05 | 2017-05-04        | 2019-04-03      |
---------------------------------------------------------------------------
| John      | Doe      | 2017-04-05 | 2019-04-03        | 2099-12-31      |
---------------------------------------------------------------------------
| Jane      | Smith    | 2018-04-05 | 2019-04-03        | 2099-12-31      |
---------------------------------------------------------------------------
Answer 0 (score: 0)
The first thing to do is create two DataFrames from your data (dfold and dfnew in the example below):
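import datetime

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

l = [
    ('John', 'Doe', '2017-04-05', '2017-05-04', '2099-12-31'),
    ('Jane', 'Smith', '2018-04-05', '2017-05-04', '2099-12-31'),
]
columns = ['FirstName', 'LastName', 'JoinDate', 'SnapshotBeginDate', 'SnapshotEndDate']

dfold = spark.createDataFrame(l, columns)
# Parse the snapshot columns from strings into dates.
dfold = dfold.withColumn('SnapshotBeginDate', F.to_date(dfold.SnapshotBeginDate, 'yyyy-MM-dd'))
dfold = dfold.withColumn('SnapshotEndDate', F.to_date(dfold.SnapshotEndDate, 'yyyy-MM-dd'))
# DataFrames are immutable, so dfnew can safely start from the same data.
dfnew = dfold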
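You can use the withColumn function to update the SnapshotEndDate column of dfold and the SnapshotBeginDate column of dfnew. This function lets you apply an operation to a column. You also need the current date for the new values. Python's datetime module provides it (if you don't want the current date, just specify any other date as a string), but what it returns is a plain Python object, not a column. To turn that object into a column, we can use PySpark's lit function.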
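today = datetime.date.today()
# Close the old snapshots and open the new ones on the same date.
dfold = dfold.withColumn('SnapshotEndDate', F.lit(today))
dfnew = dfnew.withColumn('SnapshotBeginDate', F.lit(today))
# union matches columns by position; both DataFrames share the same schema.
dfold.union(dfnew).show()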
Output:
+---------+--------+----------+-----------------+---------------+
|FirstName|LastName| JoinDate|SnapshotBeginDate|SnapshotEndDate|
+---------+--------+----------+-----------------+---------------+
| John| Doe|2017-04-05| 2017-05-04| 2019-06-01|
| Jane| Smith|2018-04-05| 2017-05-04| 2019-06-01|
| John| Doe|2017-04-05| 2019-06-01| 2099-12-31|
| Jane| Smith|2018-04-05| 2019-06-01| 2099-12-31|
+---------+--------+----------+-----------------+---------------+
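
The code above closes every old record. The question mentions that a new instance may or may not arrive for a given record, so the row-level rule is sketched below in plain Python, without Spark, to make that case explicit. This is only an illustration under stated assumptions, not the answerer's method: the helper name close_and_append, the tuple layout, the use of (FirstName, LastName) as the record key, and the fixed date are all made up for the example.

```python
import datetime

OPEN_END = datetime.date(2099, 12, 31)  # sentinel end date used in the question


def close_and_append(old_rows, new_keys, today):
    """Hypothetical helper: close open snapshots whose key is in new_keys
    and append a fresh open snapshot for each of them.

    Rows are tuples of
    (FirstName, LastName, JoinDate, SnapshotBeginDate, SnapshotEndDate).
    """
    result = []
    reopened = []
    for first, last, join, begin, end in old_rows:
        if (first, last) in new_keys and end == OPEN_END:
            # A new instance arrived: close the current snapshot today...
            result.append((first, last, join, begin, today))
            # ...and open a new snapshot that starts today.
            reopened.append((first, last, join, today, OPEN_END))
        else:
            # No new instance for this record: keep the row unchanged.
            result.append((first, last, join, begin, end))
    return result + reopened


old = [
    ("John", "Doe", "2017-04-05", datetime.date(2017, 5, 4), OPEN_END),
    ("Jane", "Smith", "2018-04-05", datetime.date(2017, 5, 4), OPEN_END),
]
# Assumption for the example: only John Doe has a new instance.
rows = close_and_append(old, {("John", "Doe")}, datetime.date(2019, 4, 3))
for row in rows:
    print(row)
```

Here John Doe's old row is closed on 2019-04-03 and a new open row is appended, while Jane Smith's row stays as it was. In Spark the same rule would typically be expressed with a join on the key columns followed by a conditional update of SnapshotEndDate; that translation is not shown here.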