How to append new DF results to an old output file (using datetime)

Asked: 2019-07-19 05:17:28

Tags: python pyspark

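I created a DF that is used for live reporting, but the project I am working on has asked for a separate copy of it to be saved so that its history can be tracked.

The final dataframe is generated (and overwritten) once a week, so I want to copy it at build time and append it to the data from previous builds, stamped with the current datetime.

To be honest, I don't know where to start.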

    import datetime

    from pyspark.sql.functions import lit, unix_timestamp

    df = my_input

    # Add a datetime column holding the time of this build
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

    df = df.withColumn('build_time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))

    # Append the previously written output to the new build (union matches columns by position)
    my_output.write_dataframe(
        df.union(my_output.dataframe())
    )
Original DF
+---+-----+
|age|name |
+---+-----+
|1  |Alice|
|2  |Bob  |
+---+-----+

New file
Week 1
+---+-----+---------------------+
|age|name |time                 |
+---+-----+---------------------+
|1  |Alice|2017-08-02 16:16:14.0|
|2  |Bob  |2017-08-02 16:16:14.0|
+---+-----+---------------------+

Week 2
+---+-----+---------------------+
|age|name |time                 |
+---+-----+---------------------+
|1  |Alice|2017-08-02 16:16:14.0|
|2  |Bob  |2017-08-02 16:16:14.0|
|1  |Alice|2017-08-09 16:16:14.0|
|2  |Bob  |2017-08-09 16:16:14.0|
+---+-----+---------------------+

Week 3
+---+-----+---------------------+
|age|name |time                 |
+---+-----+---------------------+
|1  |Alice|2017-08-02 16:16:14.0|
|2  |Bob  |2017-08-02 16:16:14.0|
|1  |Alice|2017-08-09 16:16:14.0|
|2  |Bob  |2017-08-09 16:16:14.0|
|1  |Alice|2017-08-16 16:16:14.0|
|2  |Bob  |2017-08-16 16:16:14.0|
+---+-----+---------------------+
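To make the intended accumulation concrete, here is a minimal plain-Python sketch of the Week 1 to Week 3 behaviour shown above. It is not PySpark and not the platform API used in my snippet: `append_build`, `read_history`, and the `history.csv` file are hypothetical names chosen only for illustration, and the snapshot rows are copied from the "Original DF" table.

```python
import csv
import datetime
import os
import tempfile


def append_build(history_path, rows, build_time):
    """Append one timestamped copy of `rows` to the CSV at `history_path`."""
    stamp = build_time.strftime('%Y-%m-%d %H:%M:%S')
    is_new = not os.path.exists(history_path)
    with open(history_path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            # Write the header only when the history file is first created
            writer.writerow(['age', 'name', 'time'])
        for age, name in rows:
            writer.writerow([age, name, stamp])


def read_history(history_path):
    """Return every row of the history file as a list of dicts."""
    with open(history_path, newline='') as f:
        return list(csv.DictReader(f))


# Hypothetical weekly snapshot matching the "Original DF" table
snapshot = [(1, 'Alice'), (2, 'Bob')]
history_path = os.path.join(tempfile.mkdtemp(), 'history.csv')

# Simulate three weekly builds, seven days apart
first_build = datetime.datetime(2017, 8, 2, 16, 16, 14)
for week in range(3):
    append_build(history_path, snapshot, first_build + datetime.timedelta(weeks=week))

history = read_history(history_path)
```

Each build leaves earlier rows untouched and adds one stamped copy of the current snapshot. In PySpark, the equivalent would presumably be either a union with the previously written output, as attempted in my snippet, or an append-mode write such as `df.write.mode("append")`, if the output layer supports it.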

Any help would be greatly appreciated.

0 answers:

No answers yet