pandas DataFrame by boolean value, by index, and by integer

Time: 2016-02-12 14:43:58

Tags: python pandas indexing boolean-expression

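I am running into a problem similar to the one in another question (dataframe by index and by integer).

What I want is to get part of a DataFrame by boolean indexing (easy) and also look back a few values, say at the previous index and possibly a few more. Unfortunately, the answer suggested in the linked question, which uses get_loc, makes my code choke (the TypeError in the snippet below) before I can get the actual integer location.

Taking the same example as the answer to the other question, this is what I tried: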

import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start=dt.datetime(2015,1,1), end=dt.datetime(2015,2,1)), data={'a':np.arange(32)})
df.index.get_loc(df.index[df['a'] == 1])
*** TypeError: Cannot convert input to TimeStamp

The previous answer passes a string to get_loc; I just want to pass an ordinary index value (here a DateTime).

1 Answer:

Answer 0: (score: 2)

Using a slice:

import numpy as np
import pandas as pd
import datetime as DT
index = pd.date_range(start=DT.datetime(2015,1,1), end=DT.datetime(2015,2,1))
df = pd.DataFrame({'a':np.arange(len(index))}, index=index)

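mask = df['a'] == 10
idx = np.flatnonzero(mask)[0]  # integer position of the first matching row
lookback = 3
# clamp the start at 0 so a negative start does not wrap around to the tail
print(df.iloc[max(idx-lookback, 0):idx+1])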

yields

             a
2015-01-08   7
2015-01-09   8
2015-01-10   9
2015-01-11  10

Note that if idx-lookback is negative, then the index refers to elements near the tail of df, just like with Python lists:

In [163]: df.iloc[-3:2]
Out[163]: 
Empty DataFrame
Columns: [a]
Index: []

In [164]: df.iloc[0:2]
Out[164]: 
            a
2015-01-01  0
2015-01-02  1

Thus, to grab elements relative to the head of df, use max(idx-lookback, 0).
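
As a side note on the get_loc attempt from the question: get_loc expects a single label, whereas df.index[df['a'] == 1] is an index object even when only one row matches. The following is a minimal sketch of that variant, assuming the index labels are unique (so that get_loc returns a plain integer) and that at least one row matches; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2015-01-01", end="2015-02-01")
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)

labels = df.index[df["a"] == 1]   # a DatetimeIndex, even with a single match
label = labels[0]                 # take the first match as a single Timestamp
idx = df.index.get_loc(label)     # integer position (assumes unique labels)

lookback = 3
start = max(idx - lookback, 0)    # clamp so the slice does not wrap to the tail
window = df.iloc[start:idx + 1]
print(window)
```

Here the match sits at position 1, so only two rows are available and the clamp keeps the slice anchored at the head of df.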


Using a boolean mask:

As you know, if you have a boolean array or boolean Series such as

mask = df['a'] == 10

you can select the corresponding rows with

df.loc[mask]

If you wish to select previous or succeeding rows shifted by a fixed amount, you could use mask.shift to shift the mask:

df.loc[mask.shift(-lookback).fillna(False)]
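
To make the effect of the shift concrete, here is a small sketch. It passes fill_value=False to shift as a substitute for the fillna(False) call above (this assumes a pandas version in which shift accepts fill_value), so the mask stays boolean instead of picking up NaN.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2015-01-01", end="2015-02-01")
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)

mask = df["a"] == 10
lookback = 3

# A negative shift moves each True value to an earlier row;
# the rows vacated at the end are filled with False.
earlier = mask.shift(-lookback, fill_value=False)
# A positive shift moves each True value to a later row.
later = mask.shift(lookback, fill_value=False)

print(df.loc[earlier])  # the row 3 positions before the match
print(df.loc[later])    # the row 3 positions after the match
```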

If you wish to select the lookback preceding rows as well, you could expand the mask by taking its union with its shifts:

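lookback = 3
orig = mask.copy()  # shift the original mask, not the growing one
for i in range(1, lookback + 1):
    # OR in the original mask moved i rows earlier
    mask |= orig.shift(-i, fill_value=False)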

or, equivalently, use cumsum:

m = mask.astype(int)  # integers, so the shifted masks can be subtracted
mask = (m.shift(-lookback, fill_value=0) - m.shift(1, fill_value=0)).cumsum().astype(bool)

The for-loop is clearer, but the cumsum expression is faster, particularly if lookback is large.
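
The two constructions can be checked against each other. The sketch below wraps each one in a function; the names expand_with_loop and expand_with_cumsum are hypothetical, and the cumsum variant assumes every match has at least lookback rows before it (a match closer to the head of df loses its +1 marker, and the running sum then mis-marks later rows).

```python
import numpy as np
import pandas as pd


def expand_with_loop(mask, lookback):
    """OR the mask with copies of itself shifted 1..lookback rows earlier."""
    out = mask.copy()
    for i in range(1, lookback + 1):
        out |= mask.shift(-i, fill_value=False)
    return out


def expand_with_cumsum(mask, lookback):
    """Mark +1 where a window starts and -1 just after it ends, then accumulate."""
    m = mask.astype(int)
    steps = m.shift(-lookback, fill_value=0) - m.shift(1, fill_value=0)
    return steps.cumsum().astype(bool)


index = pd.date_range(start="2015-01-01", end="2015-02-01")
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)
mask = df["a"] == 10

by_loop = expand_with_loop(mask, lookback=3)
by_cumsum = expand_with_cumsum(mask, lookback=3)
print(by_loop.equals(by_cumsum))
print(df.loc[by_loop])
```

The loop version performs one shift per lookback step, while the cumsum version always uses two shifts and one pass, which is consistent with the speed remark above.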


For example,

import numpy as np
import pandas as pd
import datetime as DT
df = pd.DataFrame(
    index=pd.date_range(start=DT.datetime(2015,1,1), end=DT.datetime(2015,2,1)), 
    data={'a':np.arange(32)})

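mask = df['a'] == 10
lookback = 3
orig = mask.copy()  # shift the original mask, not the growing one
for i in range(1, lookback + 1):
    mask |= orig.shift(-i, fill_value=False)

# alternatively, starting from the original mask,
# m = orig.astype(int)
# mask = (m.shift(-lookback, fill_value=0) - m.shift(1, fill_value=0)).cumsum().astype(bool)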

print(df.loc[mask])

yields

             a
2015-01-08   7
2015-01-09   8
2015-01-10   9
2015-01-11  10