Updating old records with new information in pyspark, without overwriting

Asked: 2019-05-29 22:51:24

Tags: python dataframe pyspark

I want to update old records based on information that I may or may not receive from a new instance of the same record in pyspark. This is what the old table/dataframe looks like:

| FirstName | LastName | JoinDate   | SnapshotBeginDate | SnapshotEndDate |
|-----------|----------|------------|-------------------|-----------------|
| John      | Doe      | 2017-04-05 | 2017-05-04        | 2099-12-31      |
| Jane      | Smith    | 2018-04-05 | 2017-05-04        | 2099-12-31      |

I don't want to simply append the new data to the existing dataframe, nor do I want to overwrite the existing records. Instead, I want to update the SnapshotEndDate of the old records.

For example:

| FirstName | LastName | JoinDate   | SnapshotBeginDate | SnapshotEndDate |
|-----------|----------|------------|-------------------|-----------------|
| John      | Doe      | 2017-04-05 | 2017-05-04        | 2019-04-03      |
| Jane      | Smith    | 2018-04-05 | 2017-05-04        | 2019-04-03      |
| John      | Doe      | 2017-04-05 | 2019-04-03        | 2099-12-31      |
| Jane      | Smith    | 2018-04-05 | 2019-04-03        | 2099-12-31      |

1 answer:

Answer 0 (score: 0)

The first thing to do is to create two dataframes from your data (`dfold` and `dfnew` in the example below):

```python
import datetime
import pyspark.sql.functions as F

l = [
    ('John', 'Doe',   '2017-04-05', '2017-05-04', '2099-12-31'),
    ('Jane', 'Smith', '2018-04-05', '2017-05-04', '2099-12-31')
]

columns = ['FirstName', 'LastName', 'JoinDate', 'SnapshotBeginDate', 'SnapshotEndDate']

dfold = spark.createDataFrame(l, columns)
dfold = dfold.withColumn('SnapshotBeginDate', F.to_date(dfold.SnapshotBeginDate, 'yyyy-MM-dd'))
dfold = dfold.withColumn('SnapshotEndDate', F.to_date(dfold.SnapshotEndDate, 'yyyy-MM-dd'))
dfnew = dfold
```

You can use the `withColumn` function to update the `SnapshotEndDate` column of `dfold` and the `SnapshotBeginDate` column of `dfnew`; this function lets you apply an operation to a column. You also need the current date for the updated value. Python's `datetime` module provides this (in case you don't want the current date, just specify any other date instead), but it does not return a column. To turn the returned object into a column, we can use pyspark's `lit` function:

```python
dfold = dfold.withColumn('SnapshotEndDate', F.lit(datetime.date.today()))
dfnew = dfnew.withColumn('SnapshotBeginDate', F.lit(datetime.date.today()))
dfold.union(dfnew).show()
```

Output:

```
+---------+--------+----------+-----------------+---------------+
|FirstName|LastName|  JoinDate|SnapshotBeginDate|SnapshotEndDate|
+---------+--------+----------+-----------------+---------------+
|     John|     Doe|2017-04-05|       2017-05-04|     2019-06-01|
|     Jane|   Smith|2018-04-05|       2017-05-04|     2019-06-01|
|     John|     Doe|2017-04-05|       2019-06-01|     2099-12-31|
|     Jane|   Smith|2018-04-05|       2019-06-01|     2099-12-31|
+---------+--------+----------+-----------------+---------------+
```