Question

I have two datasets. df1 = old data saved as a .csv file and loaded into pandas, with the following structure:

df1:

                    Date     Open     High      Low    Close  Volume
0    2019-12-13 11:29:00  19.6804  19.6955  19.6755  19.6804     744
1    2019-12-13 11:27:00  19.6600  19.6600  19.6400  19.6400      64
.
.
.
305  2019-12-09 03:19:00  19.3400  19.4000  19.3400  19.4000    1604
306  2019-12-09 03:00:00  19.4000  19.4000  19.4000  19.4000       0

............................................... ...............................................

df2 = new data, already a pandas DataFrame, with the same structure but different timestamps:

df2:

                    Date   Open   High    Low  Close  Volume
0    2019-12-16 04:32:00  19.60  19.60  19.60  19.60     204
1    2019-12-16 04:24:00  19.62  19.62  19.62  19.62     200
.
.
.
249  2019-12-10 03:08:00  19.20  19.20  19.12  19.12     235
250  2019-12-10 03:00:00  19.30  19.30  19.30  19.30       0

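Some of the rows in df2 are also in df1, but df2 is newer. I don't want to lose the old rows, and I want to merge them with the new rows based on the date. How can I combine the two datasets by date into a single pandas DataFrame (df12)? And how can I refill that combined dataset with the missing timestamps (df_accu), like this:

df12: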

    Date                  Open              High              Low               Close             Volume
0    2019-12-13 11:29:00  19.6804           19.6955           19.6755           19.6804           744
1    2019-12-13 11:28:00  [previous value]  [previous value]  [previous value]  [previous value]  0
2    2019-12-13 11:27:00  19.6600           19.6600           19.6400           19.6400           64

...

Answer 1

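I would concatenate df1 and df2, then build a new series containing every date, merge it back into the concatenated frame, and fill all the nan values. For simplicity my setup contains only one value column, but the code should also run with multiple value columns.

Setup:

    import pandas as pd

    df1 = pd.DataFrame({'Date': ['2019-12-13 11:29:00', '2019-12-09 03:19:00', '2019-12-09 03:00:00'], 'Value': [1, 2, 3]})
    df2 = pd.DataFrame({'Date': ['2019-12-16 04:32:00', '2019-12-10 03:00:00', '2019-12-10 03:08:00'], 'Value': [1, 2, 3]})

If there are no duplicate dates between df1 and df2, simply calling pd.concat on them solves the first part. If your 'Date' column is already datetime, you can skip the assign call:

    df12 = pd.concat((df1, df2))
    df12 = df12.assign(Date=pd.to_datetime(df12['Date']))

If there are duplicates, remove them after concatenating. Pay attention to the keep parameter, because it decides whether the values from df1 or from df2 are kept (use one of the two lines):

    df12 = df12.drop_duplicates('Date', keep='first') # if keeping values from df1
    df12 = df12.drop_duplicates('Date', keep='last') # if keeping values from df2

For the second part, we build a new series with all possible dates and merge it back into df12. Then we sort by date and fill all nan values with the previous value (I'm assuming 'previous value' means the value at the previous date; if that is not the case, sort by 'Date' in descending order before filling):

    s1 = pd.Series(
        pd.date_range(df12['Date'].min(), df12['Date'].max(), freq='min'), # freq 'min' = minutes ('T' in older pandas), I'm assuming this given your example
        name='Date'
    )

    df12 = df12.merge(s1, 'outer', left_on='Date', right_on='Date').sort_values('Date')
    df12[['Value']] = df12[['Value']].ffill() # same as fillna(method='ffill'), which newer pandas deprecates
    # Uncomment the next line to fill the values in 'Volume' col with 0 instead of the previous one
    # df12[['Volume']] = df12[['Volume']].fillna(0)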
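
As a quick check of the steps in this answer, here is a minimal sketch that wraps them in one function and runs it on the small setup frames. The function name merge_and_fill and its keep argument are illustrative and not part of the answer; the sketch assumes one-minute spacing and fills every non-date column with the previous value.

```python
import pandas as pd


def merge_and_fill(old, new, keep="last"):
    """Concatenate two frames, drop duplicate dates, add every missing minute, forward-fill.

    Hypothetical helper for illustration; assumes a 'Date' column and one-minute spacing.
    """
    # Old rows come first, so keep='first' prefers `old` and keep='last' prefers `new`
    both = pd.concat((old, new), ignore_index=True)
    both = both.assign(Date=pd.to_datetime(both["Date"]))
    both = both.drop_duplicates("Date", keep=keep)

    # One row per minute between the earliest and the latest timestamp
    all_minutes = pd.DataFrame(
        {"Date": pd.date_range(both["Date"].min(), both["Date"].max(), freq="min")}
    )

    # The outer merge adds the missing minutes as NaN rows; sort ascending before filling
    out = all_minutes.merge(both, how="outer", on="Date").sort_values("Date")
    return out.ffill().reset_index(drop=True)


df1 = pd.DataFrame({"Date": ["2019-12-13 11:29:00", "2019-12-09 03:19:00", "2019-12-09 03:00:00"], "Value": [1, 2, 3]})
df2 = pd.DataFrame({"Date": ["2019-12-16 04:32:00", "2019-12-10 03:00:00", "2019-12-10 03:08:00"], "Value": [1, 2, 3]})

df12 = merge_and_fill(df1, df2)
print(df12.head())
print(len(df12), "rows, one per minute")
```

Passing keep="first" instead would prefer the rows from the old frame whenever both frames contain the same timestamp.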

Answer 2

Volume alone should not be filled with the last (previous) value; it should be zero instead.
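
A minimal sketch of that adjustment, assuming the question's column names (Open, High, Low, Close, Volume) and one-minute spacing. It uses asfreq on a DatetimeIndex rather than a merge with a date series, which needs unique dates, so duplicates would have to be dropped first. The two input rows are taken from the question's df1; the variable names are illustrative, and "previous value" is read as the value at the previous timestamp.

```python
import pandas as pd

# Two rows from the question's df1; the 11:28 minute is missing
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-12-13 11:29:00", "2019-12-13 11:27:00"]),
    "Open": [19.6804, 19.6600],
    "High": [19.6955, 19.6600],
    "Low": [19.6755, 19.6400],
    "Close": [19.6804, 19.6400],
    "Volume": [744, 64],
})

# Put the rows on a complete one-minute grid; missing minutes become NaN rows
full = df.set_index("Date").sort_index().asfreq("min")

price_cols = ["Open", "High", "Low", "Close"]
full[price_cols] = full[price_cols].ffill()             # carry the previous prices forward
full["Volume"] = full["Volume"].fillna(0).astype(int)   # inserted minutes get zero volume

# Newest first, like the tables in the question
result = full.sort_index(ascending=False).reset_index()
print(result)
```

Filling the price columns and Volume separately is what keeps the inserted rows from inheriting a non-zero volume.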

Python Pandas - Time series merge of two DataFrames with different timestamps, refilling missing timestamps with the last values
