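Assigning values to a subset of rows in a Pandas DataFrame

Asked: 2014-04-04 18:00:13

Tags: python pandas

I want to assign values to rows of a Pandas DataFrame based on a condition on the index.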

import numpy as np
import pandas as pd

class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(self.l,self.l+10))
        self.dfb = pd.DataFrame([[self.l+1,self.l+3], [self.l+6,self.l+9]], columns = ['beg', 'end'])

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i

    def do(self):
        self.update()
        print self.dfa

t = test()
t.do()

Result:

                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  NaN
1396633637830123002   4   5   True  NaN
1396633637830123003   6   7   True  NaN
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  NaN
1396633637830123007  14  15   True  NaN
1396633637830123008  16  17   True  NaN
1396633637830123009  18  19   True  NaN

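The true column is assigned correctly, but the idx column is not. Moreover, this seems to depend on how the columns are initialized, because if I do this: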

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = False

then the true column is not assigned correctly either.

What am I doing wrong?

P.S. The expected result is:

                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  0
1396633637830123002   4   5   True  0
1396633637830123003   6   7   True  0
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  1
1396633637830123007  14  15   True  1
1396633637830123008  16  17   True  1
1396633637830123009  18  19   True  1

Edit: I tried assigning with loc and iloc, but neither seems to work:

LOC:

self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i

ILOC:

self.dfa.iloc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.iloc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i

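1 Answer:

Answer 0 (score: 1)

You are using chained indexing (see here): the rows are selected first, and the assignment is then applied to whatever that selection returned, which may be a copy rather than dfa itself. Pandas is not guaranteed to warn you when this happens.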
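To illustrate why the chained form can fail silently, here is a minimal sketch. It assumes the row selection hands back a copy, which is made explicit below with .copy(); the small frame and the column name flag are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical small frame, used only for illustration.
df = pd.DataFrame({"A": [0, 2, 4, 6], "flag": False}, index=[10, 11, 12, 13])

# The chained form df.loc[11:12]["flag"] = True is really two steps:
# step 1 selects the rows (forced to be a copy here), and
# step 2 assigns into that temporary object, not into df.
tmp = df.loc[11:12].copy()
tmp["flag"] = True
chained_changed_df = bool(df["flag"].any())  # did df itself change?

# Single-call form: rows and column go into one .loc[...] assignment on df.
df.loc[11:12, "flag"] = True
single_call_flags = df["flag"].tolist()

print(chained_changed_df)
print(single_call_flags)
```

Whether the intermediate object in a real chained assignment is a view or a copy depends on the pandas version and the dtypes involved, which is consistent with the question's observation that the outcome changed with how the columns were initialized.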

You should do it like this instead. By the way, there is no real need to keep track of the index in b.

In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))

In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])

In [46]: dfa['in_b'] = False

In [47]: for i, s in dfb.iterrows():
   ....:     dfa.loc[s['beg']:s['end'],'in_b'] = True
   ....:     

Or, if you have non-integer dtypes:

In [36]: for i, s in dfb.iterrows():
   ....:     dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True
   ....:     


In [48]: dfa
Out[48]: 
                      A   B   in_b
1396633637830123000   0   1  False
1396633637830123001   2   3   True
1396633637830123002   4   5   True
1396633637830123003   6   7   True
1396633637830123004   8   9  False
1396633637830123005  10  11  False
1396633637830123006  12  13   True
1396633637830123007  14  15   True
1396633637830123008  16  17   True
1396633637830123009  18  19   True

[10 rows x 3 columns]
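The answer above only sets a boolean column. If the idx column from the question is wanted as well, the same single-call .loc pattern can be applied to both columns. This is a sketch written as a plain script rather than the class from the question; the column names true and idx are taken from the question.

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=["beg", "end"])

dfa["true"] = False
dfa["idx"] = np.nan

# Row slice and column label go into one .loc[...] call, so the
# assignment is applied to dfa itself rather than to a temporary.
for i, beg, end in zip(dfb.index, dfb["beg"], dfb["end"]):
    dfa.loc[beg:end, "true"] = True
    dfa.loc[beg:end, "idx"] = i

print(dfa)
```

Label slices with .loc include both endpoints, so each range covers beg through end.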

If b is huge, this might not be very efficient, since the loop performs one assignment per row of b.
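One possible loop-free alternative is sketched below. It is not part of the original answer: it swaps in np.searchsorted and assumes that the intervals in dfb are sorted by beg and do not overlap. The variable names pos and inside are made up for this sketch.

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=["beg", "end"])

# Assumption: dfb is sorted by 'beg' and its intervals do not overlap.
beg = dfb["beg"].to_numpy()
end = dfb["end"].to_numpy()
labels = dfa.index.to_numpy()

# For each label, position of the last interval whose 'beg' is <= the label
# (-1 when the label lies before every interval).
pos = np.searchsorted(beg, labels, side="right") - 1

# A label is inside an interval if such an interval exists and the label
# does not exceed that interval's 'end'.
inside = (pos >= 0) & (labels <= end[np.clip(pos, 0, None)])

dfa["in_b"] = inside
dfa["idx"] = np.where(inside, pos, np.nan)  # interval number, NaN outside

print(dfa)
```

Everything stays in integer arithmetic, which matters here because labels of this size cannot be represented exactly as 64-bit floats.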

Also, these look like nanosecond timestamps. They are friendlier to work with once converted to datetimes.

In [49]: pd.to_datetime(dfa.index)
Out[49]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
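Once converted, the index can be sliced with timestamps in the same single-call .loc style. The sketch below assumes the integer labels really are nanoseconds since the Unix epoch, as the answer guesses; the in_b column name follows the answer.

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"],
                   index=np.arange(l, l + 10))

# Assumption: the integer labels are nanoseconds since the Unix epoch.
dfa.index = pd.to_datetime(dfa.index)

# Integers passed to pd.Timestamp are likewise interpreted as nanoseconds.
beg = pd.Timestamp(l + 1)
end = pd.Timestamp(l + 3)

dfa["in_b"] = False
dfa.loc[beg:end, "in_b"] = True  # label slice, both endpoints included

print(dfa)
```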