Create a new DataFrame with a new size from an old DataFrame

Time: 2021-06-22 15:53:22

Tags: python pandas

I have a df_train as follows:

             X1  
01-01-2020 | 1     
01-02-2020 | 2     
01-03-2020 | 3      
01-04-2020 | 4  

Now I want to build another df with a datetime index.

I get the datetime index like this:

future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M')

I want a new df that starts with a copy of the df_train values and, for the remaining dates, holds the mean of df_train.

Expected result:

               X1  
  01-05-2020 | 1     
  01-06-2020 | 2     
  01-07-2020 | 3      
  01-08-2020 | 4 
  01-09-2020 | 2.5     
  01-10-2020 | 2.5     
  01-11-2020 | 2.5      
  01-12-2020 | 2.5 
  01-01-2021 | 2.5     
  01-02-2021 | 2.5     
  01-03-2021 | 2.5      
  01-04-2021 | 2.5  

3 answers:

Answer 0 (score: 2)

If the index has not been converted yet, convert it with to_datetime:

df_train.index = pd.to_datetime(df_train.index, dayfirst=True)

Then build the new index using the MonthBegin offset and the 'MS' (month start) frequency:

future_dates = pd.date_range(
    df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
    periods=12,
    freq='MS'
)
DatetimeIndex(['2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01',
               '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01',
               '2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01'],
              dtype='datetime64[ns]', freq='MS')
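
As a small sketch of why the MonthBegin(1) shift matters (the names `last`, `without_shift` and `with_shift` are illustrative and not from the answer): when the start date already falls on a month start, date_range includes it, so without the shift the new index would begin on the last training date.

```python
import pandas as pd

last = pd.Timestamp("2020-04-01")  # df_train.index.max() in the example data

# Without the shift, the range starts on the last training date itself.
without_shift = pd.date_range(last, periods=3, freq="MS")

# MonthBegin(1) moves to the next month start, so the range begins in May.
with_shift = pd.date_range(
    last + pd.tseries.offsets.MonthBegin(1), periods=3, freq="MS"
)

print(without_shift[0].date())  # 2020-04-01
print(with_shift[0].date())     # 2020-05-01
```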

Then create a new frame filled with the mean and overwrite the first values, based on the length of df_train:

new_df = pd.DataFrame({'X1': df_train['X1'].mean()}, index=future_dates)
new_df.iloc[:df_train.shape[0], new_df.columns.get_loc('X1')] = df_train['X1'].values

new_df

             X1
2020-05-01  1.0
2020-06-01  2.0
2020-07-01  3.0
2020-08-01  4.0
2020-09-01  2.5
2020-10-01  2.5
2020-11-01  2.5
2020-12-01  2.5
2021-01-01  2.5
2021-02-01  2.5
2021-03-01  2.5
2021-04-01  2.5

Or build the column from a list, using unpacking and list repetition:

new_df = pd.DataFrame({
    'X1': [*df_train['X1'],
           *(len(future_dates) - len(df_train)) * [df_train['X1'].mean()]]
}, index=future_dates)

new_df

             X1
2020-05-01  1.0
2020-06-01  2.0
2020-07-01  3.0
2020-08-01  4.0
2020-09-01  2.5
2020-10-01  2.5
2020-11-01  2.5
2020-12-01  2.5
2021-01-01  2.5
2021-02-01  2.5
2021-03-01  2.5
2021-04-01  2.5

Then restore the original format with DatetimeIndex.strftime:

new_df.index = new_df.index.strftime('%d-%m-%Y')
             X1
01-05-2020  1.0
01-06-2020  2.0
01-07-2020  3.0
01-08-2020  4.0
01-09-2020  2.5
01-10-2020  2.5
01-11-2020  2.5
01-12-2020  2.5
01-01-2021  2.5
01-02-2021  2.5
01-03-2021  2.5
01-04-2021  2.5
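
One thing to keep in mind: strftime returns string labels, so the index is no longer a DatetimeIndex afterwards. A minimal sketch of the round trip, with illustrative names (`idx`, `labels`, `restored`) that are not part of the answer:

```python
import pandas as pd

idx = pd.date_range("2020-05-01", periods=3, freq="MS")
labels = idx.strftime("%d-%m-%Y")

print(type(idx).__name__)  # DatetimeIndex
print(list(labels))        # ['01-05-2020', '01-06-2020', '01-07-2020']

# The formatted labels are plain strings; parse them again with the same
# format if date-based operations are needed later.
restored = pd.to_datetime(labels, format="%d-%m-%Y")
```

If the frame is still needed for resampling or date slicing, it may be simpler to keep the DatetimeIndex and only format it for display.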

All together:

import pandas as pd

df_train = pd.DataFrame({
    'X1': {'01-01-2020': 1, '01-02-2020': 2, '01-03-2020': 3, '01-04-2020': 4}
})

df_train.index = pd.to_datetime(df_train.index, dayfirst=True)
future_dates = pd.date_range(
    df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
    periods=12,
    freq='MS'
)
new_df = pd.DataFrame({'X1': df_train['X1'].mean()}, index=future_dates)
new_df.iloc[:df_train.shape[0], new_df.columns.get_loc('X1')] = \
    df_train['X1'].values
new_df.index = new_df.index.strftime('%d-%m-%Y')

print(new_df)
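
The same steps could also be wrapped in a reusable function. The sketch below is one possible variant, not the answer's own code: `extend_with_mean` is a hypothetical helper name, and it uses reindex plus fillna instead of iloc assignment. It assumes the frame has a DatetimeIndex and only numeric columns.

```python
import pandas as pd


def extend_with_mean(df, periods=12):
    """Hypothetical helper: copy df's values onto new month-start dates,
    then pad with each column's mean up to `periods` rows."""
    if periods < len(df):
        raise ValueError("periods must be at least len(df)")
    start = df.index.max() + pd.tseries.offsets.MonthBegin(1)
    future_dates = pd.date_range(start, periods=periods, freq="MS")
    # Re-label the existing rows with the first len(df) future dates.
    head = pd.DataFrame(
        df.to_numpy(), index=future_dates[:len(df)], columns=df.columns
    )
    # reindex adds the remaining dates as NaN rows; fillna with df.mean()
    # fills each column with its own mean.
    return head.reindex(future_dates).fillna(df.mean())


df_train = pd.DataFrame(
    {"X1": [1, 2, 3, 4]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
)
print(extend_with_mean(df_train))
```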

Answer 1 (score: 0)

  • set_index() on the existing rows
  • create a DataFrame for the new rows
  • concat() them

import io

import pandas as pd

df_train = pd.read_csv(io.StringIO("""             X1  
01-01-2020 | 1     
01-02-2020 | 2     
01-03-2020 | 3      
01-04-2020 | 4  """), sep="|")
df_train = df_train.set_index(pd.to_datetime(df_train.index,  format="%d-%m-%Y "))
df_train.columns = [c.strip() for c in df_train.columns]

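# Note: freq='M' yields month-end dates, so this index starts at 2020-04-30
# (pandas 2.2 and later spell this alias 'ME').
future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M')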
pd.concat([
    df_train.set_index(future_dates[0:len(df_train)]),
    pd.DataFrame(index=future_dates[len(df_train):]).assign(X1=df_train["X1"].mean())
])
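
If month-start labels like those in the expected result are wanted, the concat approach can be combined with a MonthBegin shift and the 'MS' frequency. This is a sketch under that assumption rather than the answerer's code; `n` and `result` are illustrative names, and df_train is built directly instead of being parsed from text.

```python
import pandas as pd

df_train = pd.DataFrame(
    {"X1": [1, 2, 3, 4]},
    index=pd.to_datetime(
        ["01-01-2020", "01-02-2020", "01-03-2020", "01-04-2020"],
        format="%d-%m-%Y",
    ),
)

# Start one month after the last training date, on month starts.
future_dates = pd.date_range(
    df_train.index.max() + pd.tseries.offsets.MonthBegin(1),
    periods=12,
    freq="MS",
)
n = len(df_train)

result = pd.concat([
    # Existing values, re-labelled with the first n future dates.
    df_train.set_index(future_dates[:n]),
    # Remaining dates, filled with the mean of X1.
    pd.DataFrame({"X1": df_train["X1"].mean()}, index=future_dates[n:]),
])
print(result)
```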

Answer 2 (score: 0)

Here is another way:

# df_train is the frame from the question, with a DatetimeIndex
future_dates = pd.date_range(df_train.index.max(), periods=12, freq='M') + pd.tseries.offsets.MonthBegin()
first = pd.Series(df_train['X1'].to_numpy(), index=future_dates[:len(df_train)])
df2 = pd.DataFrame(index=future_dates).assign(X1=first).fillna(df_train.mean())
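
This approach relies on index alignment: the Series passed to assign covers only the first dates, so the remaining rows come out as NaN and fillna then replaces them with the mean. A minimal sketch of that mechanism on made-up data (`dates`, `partial`, `frame` and `filled` are illustrative names):

```python
import pandas as pd

dates = pd.date_range("2020-05-01", periods=4, freq="MS")
partial = pd.Series([1, 2], index=dates[:2])  # covers only the first two dates

# assign aligns the Series on the frame's index; unmatched dates become NaN.
frame = pd.DataFrame(index=dates).assign(X1=partial)
print(frame["X1"].isna().tolist())  # [False, False, True, True]

# fillna replaces the NaN rows with the mean of the known values.
filled = frame.fillna(partial.mean())
print(filled["X1"].tolist())  # [1.0, 2.0, 1.5, 1.5]
```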