Pandas: combining two time-series DataFrames of different lengths

Time: 2020-09-26 10:04:12

Tags: python pandas dataframe

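I am trying to merge two pandas DataFrames that use different time frames. The first DataFrame holds a 1-hour time series. The second holds a 1-minute time series.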

1 hour dataframe
     get_time   value
0  1599739200  123.10
1  1599742800  136.24
2  1599750000  224.14

1 minute dataframe
       get_time  value
0   1599739200   2.11
1   1599739260   3.11
2   1599739320   3.12
3   1599742800   4.23
4   1599742860   2.22
5   1599742920   1.11
6   1599746400   7.24
7   1599746460  22.10
8   1599746520   2.13
9   1599750000   5.14
10  1599750060  12.10
11  1599750120  21.30

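I want to merge these two DataFrames so that the values of the 1-hour DataFrame are mapped onto the rows of the 1-minute DataFrame. Where there is no 1-hour value for a row, the mapped value should be NaN.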

Desired Result:
     get_time    value         1h mapped value
0   1599739200   2.11                  123.10
1   1599739260   3.11                  123.10
2   1599739320   3.12                  123.10
3   1599742800   4.23                  136.24
4   1599742860   2.22                  136.24
5   1599742920   1.11                  136.24
6   1599746400   7.24                     NaN
7   1599746460  22.10                     NaN
8   1599746520   2.13                     NaN
9   1599750000   5.14                  224.14
10  1599750060  12.10                  224.14
11  1599750120  21.30                  224.14

Basically, I want to combine these DataFrames using the following logic:

if (1m_get_time >= 1h_get_time) and (1m_get_time < 1h_get_time + 60 minutes):
    1h mapped value = 1h value
else:
    1h mapped value = NaN
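The rule above can be sketched literally with half-open intervals, one per hourly row. This is only an illustration under assumptions: the array names (`hourly_times`, `hourly_values`, `minute_times`) are hypothetical, the timestamps are integer epoch seconds, and the hourly intervals do not overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical sample arrays taken from the tables above (epoch seconds).
hourly_times = np.array([1599739200, 1599742800, 1599750000])
hourly_values = np.array([123.1, 136.24, 224.14])
minute_times = np.array([1599739200, 1599739260, 1599742920, 1599746400, 1599750120])

# One half-open interval [start, start + 3600) per hourly row.
intervals = pd.IntervalIndex.from_arrays(
    hourly_times, hourly_times + 3600, closed='left'
)

# Position of the interval containing each 1-minute timestamp, or -1 if none does.
pos = intervals.get_indexer(minute_times)

# Take the hourly value where an interval matched, NaN otherwise.
mapped = np.where(pos >= 0, hourly_values[pos], np.nan)
```

Here `mapped` carries the hourly value for every minute timestamp that falls inside an hourly interval, and NaN for 1599746400, which has no hourly row.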

Currently I use a recursive approach, but it takes a long time on large data sets. Here are the sample DataFrames:

import pandas as pd

dfhigh_ = pd.DataFrame({
    'get_time' : [1599739200, 1599742800, 1599750000],
    'value' : [123.1, 136.24, 224.14],
})

dflow_ = pd.DataFrame({
    'get_time' : [1599739200, 1599739260, 1599739320, 1599742800, 1599742860, 1599742920, 1599746400, 1599746460, 1599746520, 1599750000, 1599750060, 1599750120],
    'value' : [2.11, 3.11, 3.12, 4.23, 2.22, 1.11, 7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})

2 answers:

Answer 0 (score: 2)

Floor get_time in dflow_ to the start of its hour, then use Series.map to map the values from dfhigh_ onto dflow_ based on that floored timestamp:

hr = dflow_['get_time'] // 3600 * 3600
dflow_['mapped_value'] = hr.map(dfhigh_.set_index('get_time')['value'])

      get_time  value  mapped_value
0   1599739200   2.11        123.10
1   1599739260   3.11        123.10
2   1599739320   3.12        123.10
3   1599742800   4.23        136.24
4   1599742860   2.22        136.24
5   1599742920   1.11        136.24
6   1599746400   7.24           NaN
7   1599746460  22.10           NaN
8   1599746520   2.13           NaN
9   1599750000   5.14        224.14
10  1599750060  12.10        224.14
11  1599750120  21.30        224.14
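The floor-and-map approach above relies on the hourly timestamps sitting exactly on hour boundaries. If that cannot be guaranteed, one possible alternative is `pd.merge_asof`, sketched below. The assumptions are mine, not the answer's: timestamps are whole epoch seconds, both frames are sorted by get_time, and a tolerance of 3599 seconds stands in for "less than 60 minutes".

```python
import pandas as pd

dfhigh_ = pd.DataFrame({
    'get_time': [1599739200, 1599742800, 1599750000],
    'value': [123.1, 136.24, 224.14],
})
dflow_ = pd.DataFrame({
    'get_time': [1599739200, 1599739260, 1599739320, 1599742800,
                 1599742860, 1599742920, 1599746400, 1599746460,
                 1599746520, 1599750000, 1599750060, 1599750120],
    'value': [2.11, 3.11, 3.12, 4.23, 2.22, 1.11,
              7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})

# Rename the hourly column so it does not clash with the 1-minute 'value'.
hourly = dfhigh_.rename(columns={'value': 'mapped_value'}).sort_values('get_time')

# For each 1-minute row, take the latest hourly row at or before it,
# but only if it lies within the 3599-second tolerance; otherwise NaN.
result = pd.merge_asof(
    dflow_.sort_values('get_time'),
    hourly,
    on='get_time',
    direction='backward',
    tolerance=3599,
)
```

On the sample data, `result` has the same mapped_value column as the table above, with NaN for the three rows in the hour that has no hourly entry.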

Answer 1 (score: 1)

This should work (including edge cases):

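import pandas as pd
import numpy as np

# Rename so the hourly value does not collide with the 1-minute 'value' column.
dfhigh_ = dfhigh_.rename(columns={'value': '1h mapped value'})
# After the outer merge, only rows whose get_time matches exactly carry an hourly value.
df_new = pd.merge(dflow_, dfhigh_, how='outer', on=['get_time'])
# Convert epoch seconds to UTC datetimes (independent of the local time zone).
df_new['get_time'] = pd.to_datetime(df_new['get_time'], unit='s')

# Copy each hourly value to the still-empty rows that fall in the same hour.
# Comparing the floored timestamp (not just .dt.hour) keeps different days apart.
for idx, row in df_new.iterrows():
    if not np.isnan(row['1h mapped value']):
        current_hour, current_1h_mapped_value = row['get_time'].floor('h'), row['1h mapped value']
        same_hour = (df_new['get_time'].dt.floor('h') == current_hour) & df_new['1h mapped value'].isna()
        df_new.loc[same_hour, '1h mapped value'] = current_1h_mapped_value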
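Since the question mentions that looping is slow on large data, the same merge-then-propagate idea can probably be expressed without Python-level loops. The sketch below is an assumption-laden variant, not the answer's own code: it uses a left merge instead of an outer merge, keeps get_time as integer epoch seconds, and groups rows by `get_time // 3600` (a hypothetical `hour_bucket` helper).

```python
import pandas as pd

dfhigh_ = pd.DataFrame({
    'get_time': [1599739200, 1599742800, 1599750000],
    'value': [123.1, 136.24, 224.14],
})
dflow_ = pd.DataFrame({
    'get_time': [1599739200, 1599739260, 1599739320, 1599742800,
                 1599742860, 1599742920, 1599746400, 1599746460,
                 1599746520, 1599750000, 1599750060, 1599750120],
    'value': [2.11, 3.11, 3.12, 4.23, 2.22, 1.11,
              7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})

hourly = dfhigh_.rename(columns={'value': '1h mapped value'})

# Exact-match merge: only rows on an hourly timestamp get a value at first.
df_new = dflow_.merge(hourly, how='left', on='get_time')

# Hypothetical helper: identifies the absolute hour each row belongs to.
hour_bucket = df_new['get_time'] // 3600

# Broadcast the first non-null hourly value to every row of the same hour;
# hours without any hourly value stay NaN.
df_new['1h mapped value'] = (
    df_new.groupby(hour_bucket)['1h mapped value'].transform('first')
)
```

Grouping on the absolute hour rather than the hour of day keeps rows from different days apart, and the work is done by pandas' grouped operations instead of nested `iterrows` calls.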