I want to assign values in a pandas DataFrame based on a condition on its index.
import numpy as np
import pandas as pd

class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(self.l,self.l+10))
        self.dfb = pd.DataFrame([[self.l+1,self.l+3], [self.l+6,self.l+9]], columns = ['beg', 'end'])
    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i
    def do(self):
        self.update()
        print(self.dfa)
t = test()
t.do()
Result:
                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  NaN
1396633637830123002   4   5   True  NaN
1396633637830123003   6   7   True  NaN
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  NaN
1396633637830123007  14  15   True  NaN
1396633637830123008  16  17   True  NaN
1396633637830123009  18  19   True  NaN
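The 'true' column is assigned correctly, but the 'idx' column is not. Moreover, this seems to depend on how the columns are initialized, because if I do this instead: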
    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = False
then the 'true' column is not assigned correctly either.
What am I doing wrong?
P.S. The expected result is:
                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True    0
1396633637830123002   4   5   True    0
1396633637830123003   6   7   True    0
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True    1
1396633637830123007  14  15   True    1
1396633637830123008  16  17   True    1
1396633637830123009  18  19   True    1
Edit: I also tried assigning with loc and iloc, but that doesn't seem to work either. LOC:
self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i
ILOC:
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i
Answer 0 (score: 1)
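You are doing chained indexing, see here. self.dfa.ix[beg:end] may return a copy, so the ['true'] = True that follows can end up assigning into that temporary object rather than into dfa. The warning is not guaranteed to be raised.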
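To make the difference concrete, here is a small sketch on a toy frame (the names df, tmp and flag are made up for illustration). The copy is taken explicitly with .copy() so the behaviour is the same on every pandas version; in real chained indexing, whether you get a copy or a view is up to pandas.

```python
import numpy as np
import pandas as pd

# Toy frame; 'flag' plays the role of the 'true' column in the question.
df = pd.DataFrame({'flag': [False] * 5}, index=np.arange(100, 105))

# Two-step form, with the intermediate copy made explicit for illustration.
tmp = df.loc[101:103].copy()   # independent object holding rows 101..103
tmp['flag'] = True             # changes tmp only
before = df['flag'].tolist()   # df itself is still all False

# Single-step form: row slice and column label go into one .loc call.
df.loc[101:103, 'flag'] = True
after = df['flag'].tolist()

print(before)
print(after)
```

The single .loc[rows, column] call is one setitem on df itself, so there is no intermediate object for the value to get lost in.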
This is how you should do it. By the way, there is no real need to keep track of the index in b.
In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))
In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])
In [46]: dfa['in_b'] = False
In [47]: for i, s in dfb.iterrows():
   ....:     dfa.loc[s['beg']:s['end'],'in_b'] = True
   ....:
Or, if you have non-integer dtypes:
In [36]: for i, s in dfb.iterrows():
   ....:     dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True
   ....:
In [48]: dfa
Out[48]:
                      A   B   in_b
1396633637830123000   0   1  False
1396633637830123001   2   3   True
1396633637830123002   4   5   True
1396633637830123003   6   7   True
1396633637830123004   8   9  False
1396633637830123005  10  11  False
1396633637830123006  12  13   True
1396633637830123007  14  15   True
1396633637830123008  16  17   True
1396633637830123009  18  19   True
[10 rows x 3 columns]
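If you do want both columns from the question (the 'true' flag and the dfb row label in 'idx'), the same single-step .loc assignment works for each of them. A minimal sketch, assuming the dfa and dfb defined in the question and a sorted integer index:

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=['beg', 'end'])

dfa['true'] = False
dfa['idx'] = np.nan

for i, beg, end in zip(dfb.index, dfb['beg'], dfb['end']):
    # Label-based slice, inclusive on both ends; row slice and column
    # label are passed together so the assignment lands in dfa itself.
    dfa.loc[beg:end, 'true'] = True
    dfa.loc[beg:end, 'idx'] = i

print(dfa)
```

Because 'idx' starts out as NaN, it stays a float column, so the labels show up as 0.0 and 1.0 rather than 0 and 1.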
If b is huge, this may not be very efficient.
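One way to avoid the Python-level loop is a binary search over the interval starts. This is only a sketch of an alternative, not the approach used above, and it assumes the intervals in dfb are sorted by 'beg' and do not overlap; the helper names pos, safe and inside are made up for illustration.

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=['beg', 'end'])

# Assumption: dfb is sorted by 'beg' and its intervals do not overlap.
beg = dfb['beg'].to_numpy()
end = dfb['end'].to_numpy()
idx = dfa.index.to_numpy()

# For each index value: position of the last interval with beg <= value,
# or -1 if the value lies before every interval.
pos = np.searchsorted(beg, idx, side='right') - 1
safe = np.clip(pos, 0, None)  # avoid indexing with -1

# A value is inside only if such an interval exists and value <= its end.
inside = (pos >= 0) & (idx <= end[safe])

dfa['in_b'] = inside
# Row label of the matching interval in dfb, NaN where there is none.
dfa['idx'] = np.where(inside, dfb.index.to_numpy()[safe], np.nan)

print(dfa)
```

This does one O(log m) lookup per row of dfa instead of one full-frame assignment per row of dfb.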
Also, these look like nanosecond timestamps. Converting them makes them friendlier to work with.
In [49]: pd.to_datetime(dfa.index)
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
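After the conversion, the same .loc slicing can be done with timestamps instead of raw integers. A minimal sketch, assuming the integers really are nanoseconds since the Unix epoch and that the 'beg' and 'end' columns of dfb get converted the same way:

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=['beg', 'end'])

# Interpret the integers as nanoseconds since the Unix epoch.
dfa.index = pd.to_datetime(dfa.index, unit='ns')
dfb['beg'] = pd.to_datetime(dfb['beg'], unit='ns')
dfb['end'] = pd.to_datetime(dfb['end'], unit='ns')

dfa['in_b'] = False
for beg, end in zip(dfb['beg'], dfb['end']):
    # Timestamp slices on a DatetimeIndex are inclusive on both ends.
    dfa.loc[beg:end, 'in_b'] = True

print(dfa)
```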