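How to add a new column to a MultiIndex pandas DataFrame without iterating

Asked: 2021-01-10 16:05:24

Tags: pandas multi-index

I am importing some stock data into a DataFrame via the yfinance library. The DataFrame looks like this: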

In [43]:stockinfo['AAPL']
Out[43]: 
                  Open        High  ...   Adj Close       Volume
Date                                ...                         
2011-01-10         NaN         NaN  ...         NaN          NaN
2011-01-11   12.317142   12.320000  ...   10.525573  444108000.0
2011-01-12   12.258928   12.301071  ...   10.611223  302590400.0
2011-01-13   12.327143   12.380000  ...   10.650039  296780400.0
2011-01-14   12.353214   12.445714  ...   10.736307  308840000.0
               ...         ...  ...         ...          ...
2021-01-04  133.520004  133.610001  ...  129.410004  143301900.0
2021-01-05  128.889999  131.740005  ...  131.009995   97664900.0
2021-01-06  127.720001  131.050003  ...  126.599998  155088000.0
2021-01-07  128.360001  131.630005  ...  130.919998  109578200.0
2021-01-08  132.429993  132.630005  ...  132.050003  104965400.0

[2585 rows x 6 columns]

The DataFrame's columns are structured as follows:

In [44]: stockinfo.columns
Out[44]: 
MultiIndex([('AAPL',      'Open'),
            ('AAPL',      'High'),
            ('AAPL',       'Low'),
            ('AAPL',     'Close'),
            ('AAPL', 'Adj Close'),
            ('AAPL',    'Volume'),
            (    'GOOG',      'Open'),
            (    'GOOG',      'High'),
            (    'GOOG',       'Low'),
            (    'GOOG',     'Close'),
            ...
            (     'TSLA',      'Open'),
            (     'TSLA',      'High'),
            (     'TSLA',       'Low'),
            (     'TSLA',     'Close'),
            (     'TSLA', 'Adj Close'),
            (     'TSLA',    'Volume')],
           length=24822)

I want to add an extra field. For a single ticker, this can be done with the following command:

update['AAPL', 'NEWFIELD'] =  update['AAPL','Close'].rolling(window=50, min_periods=1).mean()

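However, I need to do this for every ticker (AAPL, GOOG, TSLA, etc.). Iterating over them is very slow, so I need a way to do the update in one line. With a regular DataFrame I could just write update['NEWFIELD'] = value, but because this one has a multi-level header I don't know the syntax, and I haven't seen this question answered on Stack Overflow yet... Any ideas?

Here is the iterative solution: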

for a in range(len(stockinfo.columns)):
    stockinfo[stockinfo.columns[a][0], '50SMA'] = stockinfo[stockinfo.columns[a][0], 'Close'].rolling(window=50, min_periods=1).mean()

...but this is far too slow...

John.
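One likely reason the loop above is slow: it runs once per column rather than once per ticker, so with six fields per ticker each moving average is recomputed six times. The sketch below, run on small hypothetical sample data standing in for stockinfo, visits each ticker only once. It is still a loop, so it only reduces the redundant work; it is not the one-line answer the question asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for `stockinfo` (not the real download).
dates = pd.date_range("2021-01-04", periods=4, freq="D", name="Date")
columns = pd.MultiIndex.from_product([["AAPL", "GOOG"], ["Open", "Close"]])
stockinfo = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4), index=dates, columns=columns
)

# Collect the distinct tickers from the first column level, once, up front.
tickers = stockinfo.columns.get_level_values(0).unique()

# Visit each ticker once instead of once per (ticker, field) column.
for ticker in tickers:
    stockinfo[ticker, "50SMA"] = (
        stockinfo[ticker, "Close"].rolling(window=50, min_periods=1).mean()
    )
```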

2 answers:

Answer 0: (score: 0)

If this is your DataFrame:

                 TSLA                                                                  AAPL
                  Open        High         Low       Close   Adj Close    Volume        Open        High         Low       Close   Adj Close     Volume
2020-12-30  672.000000  696.599976  668.359985  694.780029  694.780029  42846000  135.580002  135.990005  133.399994  133.720001  133.720001   96452100
2020-12-31  699.989990  718.719971  691.119995  705.669983  705.669983  49570900  134.080002  134.740005  131.720001  132.690002  132.690002   98990400
2021-01-04  719.460022  744.489990  717.190002  729.770020  729.770020  48638200  133.520004  133.610001  126.760002  129.410004  129.410004  143301900
2021-01-05  723.659973  740.840027  719.200012  735.109985  735.109985  32245200  128.889999  131.740005  128.429993  131.009995  131.009995   97664900
2021-01-06  758.489990  774.000000  749.099976  755.979980  755.979980  44700000  127.720001  131.050003  126.379997  126.599998  126.599998  155088000
2021-01-07  777.630005  816.989990  775.200012  816.039978  816.039978  51498900  128.360001  131.630005  127.860001  130.919998  130.919998  109578200
2021-01-08  856.000000  884.489990  838.390015  880.020020  880.020020  74661000  132.429993  132.630005  130.229996  132.050003  132.050003  104965400

This should do the job:

>>> df_new = df.unstack().to_frame().unstack(1).reset_index(col_level=1).droplevel(0, axis=1)
>>> df_new['CloseMean']=df_new.groupby(['level_0'])['level_1','Close'].rolling(window=50, min_periods=1).mean().reset_index()['Close']
>>> df_new.set_index(['level_0','level_1']).unstack(0).swaplevel(1,0, axis=1).stack().unstack().rename_axis(None)



level_0           AAPL                                                                                 TSLA
             Adj Close       Close   CloseMean        High         Low        Open       Volume   Adj Close       Close   CloseMean        High         Low        Open      Volume
2020-12-30  133.720001  133.720001  133.720001  135.990005  133.399994  135.580002   96452100.0  694.780029  694.780029  694.780029  696.599976  668.359985  672.000000  42846000.0
2020-12-31  132.690002  132.690002  133.205002  134.740005  131.720001  134.080002   98990400.0  705.669983  705.669983  700.225006  718.719971  691.119995  699.989990  49570900.0
2021-01-04  129.410004  129.410004  131.940002  133.610001  126.760002  133.520004  143301900.0  729.770020  729.770020  710.073344  744.489990  717.190002  719.460022  48638200.0
2021-01-05  131.009995  131.009995  131.707500  131.740005  128.429993  128.889999   97664900.0  735.109985  735.109985  716.332504  740.840027  719.200012  723.659973  32245200.0
2021-01-06  126.599998  126.599998  130.686000  131.050003  126.379997  127.720001  155088000.0  755.979980  755.979980  724.262000  774.000000  749.099976  758.489990  44700000.0
2021-01-07  130.919998  130.919998  130.725000  131.630005  127.860001  128.360001  109578200.0  816.039978  816.039978  739.558329  816.989990  775.200012  777.630005  51498900.0
2021-01-08  132.050003  132.050003  130.914286  132.630005  130.229996  132.429993  104965400.0  880.020020  880.020020  759.624285  884.489990  838.390015  856.000000  74661000.0
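As a possible alternative to the reshaping above, the same result can probably be reached without converting to long format. This is a minimal sketch, not the answerer's method: it assumes the column levels are (ticker, field) as in the question, uses small hypothetical sample data, and borrows the column name '50SMA' from the question. It selects every ticker's Close column in one step with DataFrame.xs, applies the rolling mean column-wise, and concatenates the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two tickers, two fields each.
dates = pd.date_range("2021-01-04", periods=5, freq="D", name="Date")
columns = pd.MultiIndex.from_product([["AAPL", "TSLA"], ["Open", "Close"]])
df = pd.DataFrame(
    np.arange(20, dtype=float).reshape(5, 4), index=dates, columns=columns
)

# Take the 'Close' column of every ticker at once; the field level is dropped,
# so the resulting columns are just the ticker symbols.
close = df.xs("Close", axis=1, level=1)

# rolling().mean() works column by column, so one call covers all tickers.
sma = close.rolling(window=50, min_periods=1).mean()

# Restore two-level labels (ticker, '50SMA') and attach to the original frame.
sma.columns = pd.MultiIndex.from_product([sma.columns, ["50SMA"]])
result = pd.concat([df, sma], axis=1).sort_index(axis=1)
```

The final sort_index(axis=1) only groups each ticker's columns together; it can be dropped if column order does not matter.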

Answer 1: (score: 0)

Here is my output:

df
Out[97]: 
                  TSLA              ...        AAPL           
                  Open        High  ...   Adj Close     Volume
Date                                ...                       
2020-12-30  672.000000  696.599976  ...  133.720001   96452100
2020-12-31  699.989990  718.719971  ...  132.690002   98990400
2021-01-04  719.460022  744.489990  ...  129.410004  143301900
2021-01-05  723.659973  740.840027  ...  131.009995   97664900
2021-01-06  758.489990  774.000000  ...  126.599998  155088000
2021-01-07  777.630005  816.989990  ...  130.919998  109578200
2021-01-08  856.000000  884.489990  ...  132.050003  104965400

[7 rows x 12 columns]

df_new = df.unstack().to_frame().unstack(1).reset_index(col_level=1).droplevel(0, axis=1)

df_new
Out[99]: 
   level_0       Date        Open  ...       Close   Adj Close       Volume
0     TSLA 2020-12-30  672.000000  ...  694.780029  694.780029   42846000.0
1     TSLA 2020-12-31  699.989990  ...  705.669983  705.669983   49570900.0
2     TSLA 2021-01-04  719.460022  ...  729.770020  729.770020   48638200.0
3     TSLA 2021-01-05  723.659973  ...  735.109985  735.109985   32245200.0
4     TSLA 2021-01-06  758.489990  ...  755.979980  755.979980   44700000.0
5     TSLA 2021-01-07  777.630005  ...  816.039978  816.039978   51498900.0
6     TSLA 2021-01-08  856.000000  ...  880.020020  880.020020   74661000.0
7     AAPL 2020-12-30  135.580002  ...  133.720001  133.720001   96452100.0
8     AAPL 2020-12-31  134.080002  ...  132.690002  132.690002   98990400.0
9     AAPL 2021-01-04  133.520004  ...  129.410004  129.410004  143301900.0
10    AAPL 2021-01-05  128.889999  ...  131.009995  131.009995   97664900.0
11    AAPL 2021-01-06  127.720001  ...  126.599998  126.599998  155088000.0
12    AAPL 2021-01-07  128.360001  ...  130.919998  130.919998  109578200.0
13    AAPL 2021-01-08  132.429993  ...  132.050003  132.050003  104965400.0

[14 rows x 8 columns]

But the second command fails because there is no level_1 column:

KeyError: 'level_1'

Could you show what df_new looks like after the first command?
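A plausible explanation for the difference: reset_index only falls back to names such as level_0 and level_1 for index levels that have no name. In the output above the date index is named Date, so the column comes out as Date rather than level_1. The sketch below reproduces both cases on hypothetical sample data; to_long is a helper name introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a date index named 'Date'.
dates = pd.date_range("2021-01-04", periods=3, freq="D", name="Date")
columns = pd.MultiIndex.from_product([["TSLA", "AAPL"], ["Open", "Close"]])
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(3, 4), index=dates, columns=columns
)

def to_long(frame):
    """Apply the first command from Answer 0 unchanged (illustrative helper)."""
    return (
        frame.unstack()
        .to_frame()
        .unstack(1)
        .reset_index(col_level=1)
        .droplevel(0, axis=1)
    )

named = to_long(df)                      # index name 'Date' is carried through
unnamed = to_long(df.rename_axis(None))  # unnamed index falls back to 'level_1'

print(list(named.columns))
print(list(unnamed.columns))
```

If that is the cause, either clearing the index name first with rename_axis(None) or replacing 'level_1' with 'Date' in the second command should avoid the KeyError.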
