pandas: set a column's value to the index value based on a condition

Time: 2015-02-12 16:23:58

Tags: python pandas

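I was hoping this would be a simple matter. I just want to keep track of the last time a condition was met in a DataFrame. My plan was to first add a column that takes the value of the index on the rows where the condition is met, and then use fillna to forward-fill the remaining rows so that every row holds the last time the condition was met. However, I can't find any way to set the new column to the value of the index based on a condition without getting incorrect data or an error. Below is an example with the desired result, but I get ValueError: array is not broadcastable to correct shape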

import numpy as np
import pandas as pd

rows = 50
df = pd.DataFrame(np.random.randn(rows,2), columns=list('AB'), index=pd.date_range('1/1/2000', periods=rows, freq='1H'))

df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = df.index
# ValueError: array is not broadcastable to correct shape

df['LAST_TIME_A_ABOVE_X'] = df['LAST_TIME_A_ABOVE_X'].fillna(method='ffill')

Desired result:

print df.tail()

                            A         B LAST_TIME_A_ABOVE_X
2000-01-02 19:00:00  0.952454  0.046514 2000-01-02 19:00:00
2000-01-02 20:00:00 -0.216546 -0.254344 2000-01-02 19:00:00
2000-01-02 21:00:00 -0.237128 -0.830337 2000-01-02 19:00:00
2000-01-02 22:00:00  0.889550  0.060698 2000-01-02 22:00:00
2000-01-02 23:00:00  0.172436 -0.566921 2000-01-02 22:00:00
2000-01-03 00:00:00  1.092696  1.053605 2000-01-03 00:00:00
2000-01-03 01:00:00  1.284858  0.117552 2000-01-03 01:00:00

Thanks

1 answer:

Answer 0 (score: 0)

You need to mask the right-hand side as well, so change df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = df.index to df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = df.loc[df.A > 0.5].index
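As a side note on why the right-hand side needs the mask: the left-hand side only addresses the rows where the condition holds, while the unmasked df.index has one entry per row of the frame. The sketch below only compares those lengths; the names mask, n_selected, n_unmasked and n_masked are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

rows = 50
idx = pd.date_range('2000-01-01', periods=rows, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(np.random.randn(rows, 2), columns=list('AB'), index=idx)

mask = df['A'] > 0.5
n_selected = int(mask.sum())               # rows the left-hand side writes to
n_unmasked = len(df.index)                 # length of the unmasked right-hand side
n_masked = len(df.index[mask.to_numpy()])  # length once the same mask is applied

print(n_selected, n_unmasked, n_masked)
```

Only the masked index has the same length as the selection, which is why the unmasked assignment cannot be lined up with the selected rows.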

In [175]:

rows = 50
df = pd.DataFrame(np.random.randn(rows,2), columns=list('AB'), index=pd.date_range('1/1/2000', periods=rows, freq='1H'))

df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = df.loc[df.A > 0.5].index
# no ValueError this time, but the values in columns A and B are overwritten (see the output below)

df['LAST_TIME_A_ABOVE_X'] = df['LAST_TIME_A_ABOVE_X'].fillna(method='ffill')
df
Out[175]:
                                                A  \
2000-01-01 00:00:00           1970-01-01 00:00:00   
2000-01-01 01:00:00 1970-01-01 00:00:00.000000001   
2000-01-01 02:00:00 1970-01-01 00:00:00.000000001   
2000-01-01 03:00:00 1969-12-31 23:59:59.999999999   
2000-01-01 04:00:00           1970-01-01 00:00:00   
2000-01-01 05:00:00 1970-01-01 00:00:00.000000001   
2000-01-01 06:00:00           1970-01-01 00:00:00   
2000-01-01 07:00:00           1970-01-01 00:00:00   
2000-01-01 08:00:00           1970-01-01 00:00:00   
2000-01-01 09:00:00 1969-12-31 23:59:59.999999999   
2000-01-01 10:00:00           1970-01-01 00:00:00   
2000-01-01 11:00:00           1970-01-01 00:00:00   
2000-01-01 12:00:00           1970-01-01 00:00:00   
2000-01-01 13:00:00 1969-12-31 23:59:59.999999999   
2000-01-01 14:00:00 1969-12-31 23:59:59.999999999   
2000-01-01 15:00:00           1970-01-01 00:00:00   

                                                B LAST_TIME_A_ABOVE_X  
2000-01-01 00:00:00           1970-01-01 00:00:00                 NaT  
2000-01-01 01:00:00           1970-01-01 00:00:00 2000-01-01 01:00:00  
2000-01-01 02:00:00           1970-01-01 00:00:00 2000-01-01 02:00:00  
2000-01-01 03:00:00           1970-01-01 00:00:00 2000-01-01 02:00:00  
2000-01-01 04:00:00           1970-01-01 00:00:00 2000-01-01 02:00:00  
2000-01-01 05:00:00           1970-01-01 00:00:00 2000-01-01 05:00:00  
2000-01-01 06:00:00           1970-01-01 00:00:00 2000-01-01 05:00:00  
2000-01-01 07:00:00 1969-12-31 23:59:59.999999999 2000-01-01 05:00:00  
2000-01-01 08:00:00           1970-01-01 00:00:00 2000-01-01 05:00:00  
2000-01-01 09:00:00           1970-01-01 00:00:00 2000-01-01 05:00:00  
2000-01-01 10:00:00           1970-01-01 00:00:00 2000-01-01 05:00:00  
2000-01-01 11:00:00           1970-01-01 00:00:00 2000-01-01 05:00:00  
2000-01-01 12:00:00 1970-01-01 00:00:00.000000001 2000-01-01 05:00:00  

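The above produces incorrect output, which has something to do with the index being a DatetimeIndex: the assigned value is broadcast across the entire df, which is wrong. If you reset the index, apply the same mask, and then set the index back, you get the desired output: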

In [192]:

rows = 50
df = pd.DataFrame(np.random.randn(rows,2), columns=list('AB'), index=pd.date_range('1/1/2000', periods=rows, freq='1H'))
df.reset_index(inplace=True)
temp = df.loc[df.A > 0.5,'index']
df.loc[df.A > 0.5, 'LAST_TIME_A_ABOVE_X'] = temp
df['LAST_TIME_A_ABOVE_X'] = df['LAST_TIME_A_ABOVE_X'].fillna(method='ffill')
df.set_index('index', inplace=True)
df
Out[192]:
                            A         B LAST_TIME_A_ABOVE_X
index                                                      
2000-01-01 00:00:00 -1.015624  1.156609                 NaT
2000-01-01 01:00:00 -1.223371 -1.378067                 NaT
2000-01-01 02:00:00  1.012627 -0.324465 2000-01-01 02:00:00
2000-01-01 03:00:00  1.298507 -1.216586 2000-01-01 03:00:00
2000-01-01 04:00:00  0.985638  0.058768 2000-01-01 04:00:00
2000-01-01 05:00:00 -0.815905  0.586401 2000-01-01 04:00:00
2000-01-01 06:00:00 -1.185344  2.177858 2000-01-01 04:00:00
2000-01-01 07:00:00 -0.638001  0.046314 2000-01-01 04:00:00
2000-01-01 08:00:00 -0.134608  0.294528 2000-01-01 04:00:00
2000-01-01 09:00:00  0.425651  0.709888 2000-01-01 04:00:00
2000-01-01 10:00:00 -0.378901 -0.877367 2000-01-01 04:00:00
2000-01-01 11:00:00 -0.504592  0.322824 2000-01-01 04:00:00
2000-01-01 12:00:00  1.442753 -1.145960 2000-01-01 12:00:00
2000-01-01 13:00:00  0.437722 -0.445725 2000-01-01 12:00:00
2000-01-01 14:00:00  2.509730 -0.106108 2000-01-01 14:00:00
2000-01-01 15:00:00 -0.618179 -1.079270 2000-01-01 14:00:00
2000-01-01 16:00:00 -1.377722 -1.445645 2000-01-01 14:00:00
2000-01-01 17:00:00  0.529527 -2.500947 2000-01-01 17:00:00
2000-01-01 18:00:00 -0.263954 -0.576484 2000-01-01 17:00:00
2000-01-01 19:00:00 -0.177062  0.422974 2000-01-01 17:00:00
2000-01-01 20:00:00  0.173764  2.116644 2000-01-01 17:00:00
2000-01-01 21:00:00 -1.248605 -0.594601 2000-01-01 17:00:00
2000-01-01 22:00:00 -1.138183 -0.282523 2000-01-01 17:00:00
2000-01-01 23:00:00  0.047580  0.496086 2000-01-01 17:00:00
2000-01-02 00:00:00  1.618901 -1.910404 2000-01-02 00:00:00
2000-01-02 01:00:00  0.127997  0.783554 2000-01-02 00:00:00
2000-01-02 02:00:00  0.702277  1.720010 2000-01-02 02:00:00
2000-01-02 03:00:00 -0.801874 -2.302547 2000-01-02 02:00:00
2000-01-02 04:00:00  1.636838 -0.940251 2000-01-02 04:00:00
2000-01-02 05:00:00 -1.204564  0.517969 2000-01-02 04:00:00
2000-01-02 06:00:00 -0.700013  0.075867 2000-01-02 04:00:00
2000-01-02 07:00:00 -0.234283 -1.899428 2000-01-02 04:00:00
2000-01-02 08:00:00  0.730711  0.254155 2000-01-02 08:00:00
2000-01-02 09:00:00 -0.188994  2.035390 2000-01-02 08:00:00
2000-01-02 10:00:00  1.384640 -1.319800 2000-01-02 10:00:00
2000-01-02 11:00:00 -0.288324 -1.219386 2000-01-02 10:00:00
2000-01-02 12:00:00 -0.642150 -0.449078 2000-01-02 10:00:00
2000-01-02 13:00:00  1.615771  0.497375 2000-01-02 13:00:00
2000-01-02 14:00:00 -1.422133  1.934081 2000-01-02 13:00:00
2000-01-02 15:00:00 -1.541841  1.202525 2000-01-02 13:00:00
2000-01-02 16:00:00 -2.463243  0.020996 2000-01-02 13:00:00
2000-01-02 17:00:00 -0.445203  0.462241 2000-01-02 13:00:00
2000-01-02 18:00:00  0.376458 -1.190448 2000-01-02 13:00:00
2000-01-02 19:00:00  1.040431  0.006403 2000-01-02 19:00:00
2000-01-02 20:00:00 -0.145096 -0.961192 2000-01-02 19:00:00
2000-01-02 21:00:00 -0.127414  0.604989 2000-01-02 19:00:00
2000-01-02 22:00:00 -0.054637  0.070836 2000-01-02 19:00:00
2000-01-02 23:00:00 -0.581572  0.634429 2000-01-02 19:00:00
2000-01-03 00:00:00  0.021646  0.837573 2000-01-02 19:00:00
2000-01-03 01:00:00 -1.785810  2.178076 2000-01-02 19:00:00

Update: after posting this on the pandas GitHub site, the conclusion is that this is a bug. The safe way to do this is the following:

df.loc[df.A > 0.5,'LAST_TIME_A_ABOVE_X'] = df.loc[df.A > 0.5].index.tolist()
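
For comparison, here is a minimal sketch of another way to reach the same column without a masked assignment. It is not taken from the answer above: it turns the index into a Series, blanks out the rows where the condition fails, and forward-fills. The seeded generator and the names mask and stamps are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
rows = 50
idx = pd.date_range('2000-01-01', periods=rows, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(rng.standard_normal((rows, 2)), columns=list('AB'), index=idx)

mask = df['A'] > 0.5

# The index as a Series aligned with df; rows where the mask is False become NaT.
stamps = df.index.to_series().where(mask)

# Carry the most recent matching timestamp forward to the following rows.
df['LAST_TIME_A_ABOVE_X'] = stamps.ffill()

print(df.tail())
```

Rows before the first match stay NaT, as in the reset_index output above, and columns A and B are never written to.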