Is there a way to query a DataFrame for rows that contain a given string in any column? Something like Series.str, but for a DataFrame? This is what I have so far:
In [2]: s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est"
In [3]: df = pd.DataFrame(np.array(s.split(' ')).reshape((-1, 4)), columns=['one', 'two', 'three', 'four'])
In [4]: df
Out[4]:
           one            two         three        four
 0       Lorem          ipsum         dolor         sit
 1       amet,    consectetur   adipisicing       elit,
 2         sed             do       eiusmod      tempor
 3  incididunt             ut        labore          et
 4      dolore          magna       aliqua.          Ut
 5        enim             ad         minim     veniam,
 6        quis        nostrud  exercitation     ullamco
 7     laboris           nisi            ut     aliquip
 8          ex             ea       commodo  consequat.
 9        Duis           aute         irure       dolor
10          in  reprehenderit            in   voluptate
11       velit           esse        cillum      dolore
12          eu         fugiat         nulla   pariatur.
13   Excepteur           sint      occaecat   cupidatat
14         non      proident,          sunt          in
15       culpa            qui       officia    deserunt
16      mollit           anim            id         est
[17 rows x 4 columns]
In [5]: mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [6]: df[mask]
Out[6]:
       one    two    three    four
 0   Lorem  ipsum    dolor     sit
 4  dolore  magna  aliqua.      Ut
 9    Duis   aute    irure   dolor
11   velit   esse   cillum  dolore
[4 rows x 4 columns]
Ideally, I would like to replace the last two lines with something like:
df[df.ix[:, 'one':'four'].str.contains('dolor')]
Is this possible?
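Answer 0 (score: 2)
You can use np.char.array() (older pandas versions also exposed NumPy as pd.np, but that alias was removed in pandas 2.0):
import numpy as np
a = np.char.array(df.values)  # wrap the cell values in a character array
mask = a.find('dolor') != -1  # find() returns -1 where the substring is absent
df2 = df.iloc[np.any(mask, axis=1)]  # keep rows where at least one cell matched
The contents of df2 will be: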
       one    two    three    four
 0   Lorem  ipsum    dolor     sit
 4  dolore  magna  aliqua.      Ut
 9    Duis   aute    irure   dolor
11   velit   esse   cillum  dolore
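A related variant, shown only as a minimal sketch: instead of building a chararray, the cells can be cast to a plain string array and searched with the function np.char.find. The three-row frame below is made-up stand-in data, not the 17-row frame from the question, and the cast assumes every cell can sensibly be turned into text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (a few rows in the style of the question's frame).
df = pd.DataFrame({
    'one': ['Lorem', 'amet,', 'dolore'],
    'two': ['ipsum', 'consectetur', 'magna'],
    'three': ['dolor', 'adipisicing', 'aliqua.'],
    'four': ['sit', 'elit,', 'Ut'],
})

# Cast every cell to a unicode string so the np.char functions accept the array.
cells = df.to_numpy().astype(str)

# np.char.find gives the position of the first match per cell, or -1 if absent.
mask = np.char.find(cells, 'dolor') != -1

# Keep the rows in which at least one cell matched.
df2 = df[mask.any(axis=1)]
```

Here df2 keeps the rows labelled 0 and 2, since 'dolor' occurs in "dolor" and in "dolore".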
Answer 1 (score: 1)
Pandas has no DataFrame.str method (at least not yet). However, you can use
import numpy as np
mask = np.logical_or.reduce(
[df[col].str.contains('dolor')
for col in df.loc[:, 'one':'four'].columns])
This is a little less to write, and a little faster, than
mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')
In [29]: %timeit mask = np.logical_or.reduce([df[col].str.contains('dolor') for col in df.loc[:, 'one':'four'].columns]); df[mask]
1000 loops, best of 3: 761 µs per loop
In [30]: %timeit mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor'); df[mask]
1000 loops, best of 3: 1.13 ms per loop
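To make the reduce step concrete, here is a small self-contained sketch of the same idea. It assumes a made-up three-row frame rather than the question's data, and it checks that folding the per-column masks with np.logical_or.reduce gives the same mask as chaining | by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (a few rows in the style of the question's frame).
df = pd.DataFrame({
    'one': ['Lorem', 'amet,', 'dolore'],
    'two': ['ipsum', 'consectetur', 'magna'],
    'three': ['dolor', 'adipisicing', 'aliqua.'],
    'four': ['sit', 'elit,', 'Ut'],
})

# One boolean Series per column: does each cell contain 'dolor'?
per_column = [df[col].str.contains('dolor') for col in df.columns]

# Fold the per-column masks together with an elementwise OR.
mask = np.logical_or.reduce(per_column)

# The hand-written chain of | operators produces the same mask.
chained = per_column[0] | per_column[1] | per_column[2] | per_column[3]
assert (mask == chained.to_numpy()).all()

matches = df[mask]
```

Because the list of Series is converted to a NumPy array before reducing, mask comes back as a plain boolean array rather than a Series; it still works for indexing df.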
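Answer 2 (score: 0)
This tells you whether 'dolor' appears in any of the columns:
df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1)
It gives a True/False value for every column of each row.
If you combine this with a second apply, you get a single value per row, aggregated over all the columns:
df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1).apply(lambda x: True in x.values, axis=1)
Using that as a boolean row mask gives the result:
df[df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1).apply(lambda x: True in x.values, axis=1)]
However, this is about 3-4 times slower :( than unutbu's solution.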
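One possible way to shorten the double apply, sketched under assumptions: apply str.contains column by column (the default axis) and collapse the resulting boolean frame with any(axis=1). The helper name rows_with_substring and the sample frame are hypothetical, and regex=False is an added choice so that the search string is treated literally. No timing is claimed for this version.

```python
import pandas as pd

# Hypothetical stand-in data (a few rows in the style of the question's frame).
df = pd.DataFrame({
    'one': ['Lorem', 'amet,', 'dolore'],
    'two': ['ipsum', 'consectetur', 'magna'],
    'three': ['dolor', 'adipisicing', 'aliqua.'],
    'four': ['sit', 'elit,', 'Ut'],
})


def rows_with_substring(frame, needle):
    """Hypothetical helper: rows of `frame` where any column contains `needle`."""
    # Column-wise apply: one vectorised str.contains call per column.
    # regex=False treats `needle` literally; na=False counts missing cells as no match.
    hits = frame.apply(lambda col: col.str.contains(needle, regex=False, na=False))
    # any(axis=1) is True for a row if at least one of its cells matched.
    return frame[hits.any(axis=1)]


result = rows_with_substring(df.loc[:, 'one':'four'], 'dolor')
```

The helper assumes every selected column holds strings; a numeric column would need converting first, for example with astype(str).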