Is there a pandas filter that works on values (not on row or column labels)?

Asked: 2016-06-09 20:18:10

Tags: python performance numpy pandas dataframe

Right now I have to build one boolean mask per column and OR them together by hand:

ix=None
for ixi in [res[col].str.contains('string') for col in res.columns]:
    if ix is not None:
        ix = ix | ixi
    else:
        ix = ixi
res[ix]
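
The loop above folds one boolean mask per column into a single row mask. A minimal sketch of the same reduction with functools.reduce follows; the frame is rebuilt literally so the snippet runs on its own, and the pattern 'e|A' is an assumption chosen to reproduce the expected output shown further down (the loop above uses the placeholder 'string').

```python
from functools import reduce
import operator

import pandas as pd

# Same 3x3 frame as the input DataFrame in this question, written out literally.
res = pd.DataFrame(
    [["ajs", "bkt", "clu"], ["dmv", "enw", "fox"], ["gpy", "hqz", "irA"]],
    columns=["ca", "cb", "cc"],
    index=["ra", "rb", "rc"],
)

# One boolean Series per column, exactly as in the loop.
masks = [res[col].str.contains("e|A") for col in res.columns]

# OR the masks together pairwise; this replaces the explicit None check.
ix = reduce(operator.or_, masks)

filtered = res[ix]
print(filtered)
```

This only shortens the bookkeeping; it still makes one pass per column.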

Here is the notebook:

https://gist.github.com/denfromufa/12379b62ef6eec9252f4c9a77e46e2b1

Code that generates the input DataFrame:

import pandas as pd
from string import ascii_letters as ascl
import numpy as np

res = pd.DataFrame(np.array([''.join(_) for _ in 
                             zip(ascl[:9],ascl[9:18],ascl[18:27])]).reshape((3,3)),
                   columns='ca cb cc'.split(),
                   index='ra rb rc'.split())

Input DataFrame:

     ca   cb   cc
ra  ajs  bkt  clu
rb  dmv  enw  fox
rc  gpy  hqz  irA

Expected (filtered) DataFrame:

     ca   cb   cc
rb  dmv  enw  fox
rc  gpy  hqz  irA

1 answer:

Answer 0 (score: 1)

You can concatenate the strings in each row with sum(axis=1) and search the result:

In [59]: res[res.sum(axis=1).str.contains('e|A')]
Out[59]:
     ca   cb   cc
rb  dmv  enw  fox
rc  gpy  hqz  irA

Or use apply() together with .str.contains() and any():

In [51]: res[res.apply(lambda x: x.str.contains('e|A')).any(axis=1)]
Out[51]:
     ca   cb   cc
rb  dmv  enw  fox
rc  gpy  hqz  irA
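
One thing to watch with the apply() variant: .str.contains() propagates missing values instead of returning False for them. The sketch below assumes a small hypothetical frame with one NaN cell (not part of the original data) and uses the `na=False` argument of Series.str.contains so that missing cells count as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only: one cell is missing.
df = pd.DataFrame(
    {"ca": ["ajs", np.nan], "cb": ["enw", "fox"]},
    index=["ra", "rb"],
)

# na=False makes a missing cell count as "no match", so the mask stays boolean.
mask = df.apply(lambda col: col.str.contains("e|A", na=False)).any(axis=1)

print(df[mask])
```

Here row ra is kept because 'enw' contains 'e', and row rb is dropped because neither NaN nor 'fox' matches.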

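Timing. A 300K-row DataFrame df is built here, but note that the %timeit statements below reference the 3-row res, so the figures shown reflect the small frame: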

In [95]: df = pd.concat([res] * 10**5)

In [96]: df.shape
Out[96]: (300000, 3)

In [97]: %timeit res[res.sum(axis=1).str.contains('e|A')]
1000 loops, best of 3: 664 µs per loop

In [98]: %timeit res[res.apply(lambda x: x.str.contains('e|A')).any(axis=1)]
1000 loops, best of 3: 1.86 ms per loop

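Explanation:

sum(axis=1) concatenates the strings of each row into one string, which is then searched once: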

In [57]: res.sum(axis=1)
Out[57]:
ra    ajsbktclu
rb    dmvenwfox
rc    gpyhqzirA
dtype: object

In [58]: res.sum(axis=1).str.contains('e|A')
Out[58]:
ra    False
rb     True
rc     True
dtype: bool
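
Because the cells of a row are joined without a separator, a pattern can match across a cell boundary even though no single cell contains it. The sketch below illustrates this with the pattern 'sb', which is an assumption picked for the demonstration; the row strings are built with "".join, which yields the same concatenated strings as the sum(axis=1) output above.

```python
import pandas as pd

# Same 3x3 frame as in the question, written out literally.
res = pd.DataFrame(
    [["ajs", "bkt", "clu"], ["dmv", "enw", "fox"], ["gpy", "hqz", "irA"]],
    columns=["ca", "cb", "cc"],
    index=["ra", "rb", "rc"],
)

# Row-wise concatenation: 'ajs' + 'bkt' + 'clu' -> 'ajsbktclu', and so on.
joined = res.apply("".join, axis=1)

# 'sb' spans the boundary between 'ajs' and 'bkt', so the joined string matches.
joined_hit = joined.str.contains("sb")

# Checking each cell separately finds no match anywhere.
per_cell_hit = res.apply(lambda col: col.str.contains("sb")).any(axis=1)

print(joined_hit)
print(per_cell_hit)
```

If such cross-cell matches matter for your data, the per-cell apply() variant is the safer choice, at the cost of one pass per column.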

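apply() instead tests every cell column by column, and any(axis=1) keeps the rows with at least one match: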

In [53]: res.apply(lambda x: x.str.contains('e|A'))
Out[53]:
       ca     cb     cc
ra  False  False  False
rb  False   True  False
rc  False  False   True

In [54]: res.apply(lambda x: x.str.contains('e|A')).any(axis=1)
Out[54]:
ra    False
rb     True
rc     True
dtype: bool