Question

Suppose I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'a':[0,0,0,1,0,0], 'b':[0,0,1,0,0,0], 'c':[0,1,1,0,0,0]})
df.index = pd.date_range('2000-03-02', periods=6, freq='D')

It looks like this:

            a  b  c
2000-03-02  0  0  0
2000-03-03  0  0  1
2000-03-04  0  1  1
2000-03-05  1  0  0
2000-03-06  0  0  0
2000-03-07  0  0  0

Now I want to set every value in a given column that occurs after the last 1 to 2. The desired result looks like this:

            a  b  c
2000-03-02  0  0  0
2000-03-03  0  0  1
2000-03-04  0  1  1
2000-03-05  1  2  2
2000-03-06  2  2  2
2000-03-07  2  2  2

I have this code, which works:

cols = df.columns
for col in cols:
    s = df[col]
    x = s[s==1].index[-1]
    df[col][(x + 1):] = 2

But it seems rather awkward and contrary to the spirit of pandas (unpandonic?). Any suggestions for a better approach?

Answer 1

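One way is to use replace to turn the zeros into NaNs and then bfill, so that only the zeros below the last 1 in each column stay NaN: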

In [11]: df.replace(0, np.nan).bfill()  # maybe neater way to do this?
Out[11]:
             a   b   c
2000-03-02   1   1   1
2000-03-03   1   1   1
2000-03-04   1   1   1
2000-03-05   1 NaN NaN
2000-03-06 NaN NaN NaN
2000-03-07 NaN NaN NaN

Now you can use where to change these to 2:

In [12]: df.where(df.replace(0, np.nan).bfill(), 2)
Out[12]:
            a  b  c
2000-03-02  0  0  0
2000-03-03  0  0  1
2000-03-04  0  1  1
2000-03-05  1  2  2
2000-03-06  2  2  2
2000-03-07  2  2  2
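
Recent pandas versions appear to require a boolean condition in where, so the backfilled frame (floats and NaNs) may need an explicit conversion. A minimal sketch, assuming the same example data as the question; the names keep and result are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [0, 0, 0, 1, 0, 0], 'b': [0, 0, 1, 0, 0, 0], 'c': [0, 1, 1, 0, 0, 0]},
    index=pd.date_range('2000-03-02', periods=6, freq='D'),
)

# After replace + bfill, a cell is non-NaN only if a 1 occurs at or below it.
# notna() turns that into the boolean mask that where expects.
keep = df.replace(0, np.nan).bfill().notna()

# Keep the original value where the mask is True, write 2 elsewhere.
result = df.where(keep, 2)
print(result)
```

The mask is True up to and including the last 1 in each column, so only the rows after it are overwritten.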

Edit: a cumsum trick may be faster here:

In [21]: %timeit df.where(df.replace(0, np.nan).bfill(), 2)
100 loops, best of 3: 2.34 ms per loop

In [22]: %timeit df.where(df[::-1].cumsum()[::-1], 2)
1000 loops, best of 3: 1.7 ms per loop

In [23]: %timeit pd.DataFrame(np.where(np.cumsum(df.values[::-1], 0)[::-1], df.values, 2), df.index)
10000 loops, best of 3: 186 µs per loop
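
To spell out why the cumsum trick works: a cumulative sum taken from the bottom up counts how many 1s remain at or below each row, and that count drops to zero exactly after the last 1. The sketch below assumes the question's example data and compares against zero explicitly to get a boolean mask; it also passes columns=df.columns in the NumPy variant, since the one-liner in In [23] only passes the index and so would not keep the column labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [0, 0, 0, 1, 0, 0], 'b': [0, 0, 1, 0, 0, 0], 'c': [0, 1, 1, 0, 0, 0]},
    index=pd.date_range('2000-03-02', periods=6, freq='D'),
)

# Reverse, accumulate, reverse back: number of 1s at or after each row.
ones_remaining = df[::-1].cumsum()[::-1]
result = df.where(ones_remaining > 0, 2)

# Same idea on the raw NumPy array, keeping both index and column labels.
values = df.to_numpy()
remaining = np.cumsum(values[::-1], axis=0)[::-1]
result_np = pd.DataFrame(
    np.where(remaining > 0, values, 2), index=df.index, columns=df.columns
)
print(result)
```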

Answer 2

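Here is a fairly general solution (yours, for example, will fail if the index is not contiguous). The first part, getting the indexer, is quite neat!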

In [64]: indexer = pd.Series(df.index.get_indexer(df.diff().idxmin().values), index=df.columns)

In [65]: indexer
Out[65]: 
a    4
b    3
c    3
dtype: int64

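I think there is a vectorized way to do this: all you need to do is construct the right boolean matrix from the indexer above, but it is making my brain hurt.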

In [66]: def f(x):
    x.iloc[indexer[x.name]:] = 2
    return x
   ....: 

In [67]: df.apply(f)
Out[67]: 
            a  b  c
2000-03-02  0  0  0
2000-03-03  0  0  1
2000-03-04  0  1  1
2000-03-05  1  2  2
2000-03-06  2  2  2
2000-03-07  2  2  2

[6 rows x 3 columns]
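
The boolean matrix this answer alludes to can be sketched with NumPy broadcasting. This is only one possible construction, not necessarily what the answerer had in mind; it assumes the question's example data, where each column has a single 1-to-0 drop (idxmin on the diff returns the first such drop, so columns with several drops would need different handling). The names row_pos and keep are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [0, 0, 0, 1, 0, 0], 'b': [0, 0, 1, 0, 0, 0], 'c': [0, 1, 1, 0, 0, 0]},
    index=pd.date_range('2000-03-02', periods=6, freq='D'),
)

# Per column: integer position of the first row after the 1 -> 0 drop.
indexer = pd.Series(
    df.index.get_indexer(df.diff().idxmin().to_numpy()), index=df.columns
)

# Row positions as a column vector, broadcast against the per-column cutoffs.
row_pos = np.arange(len(df))[:, None]      # shape (n_rows, 1)
keep = row_pos < indexer.to_numpy()        # shape (n_rows, n_cols)

# Keep values above each cutoff, write 2 from the cutoff onward.
result = df.where(keep, 2)
print(result)
```

Because the mask is built from integer positions rather than index labels, it does not depend on the index being contiguous.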

How to forward fill non-null values in a pandas DataFrame based on a set condition