I have a pandas DataFrame whose entries are all strings:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
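and so on. I want to select all rows that contain a certain string, say 'banana'. I don't know in advance which column it will appear in. Of course I could write a for loop and iterate over all the rows, but is there a simpler or faster way?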
Answer 0: (score: 4)
With NumPy, this can be vectorized to search for any number of strings, like so -
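import numpy as np
def select_rows(df, search_strings):
    # Map every cell to an integer ID into the sorted unique values
    unq, IDs = np.unique(df.values, return_inverse=True)
    # ID of each search string (assumes every string occurs somewhere in df)
    unqIDs = np.searchsorted(unq, search_strings)
    # Keep rows in which every search string appears in at least one column
    return df[(IDs.reshape(df.shape) == unqIDs[:, None, None]).any(-1).all(0)]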
Sample run -
In [393]: df
Out[393]:
        A       B      C
0   apple  banana   pear
1    pear    pear  apple
2  banana    pear   pear
3   apple   apple   pear
In [394]: select_rows(df,['apple','banana'])
Out[394]:
       A       B     C
0  apple  banana  pear
In [395]: select_rows(df,['apple','pear'])
Out[395]:
       A       B      C
0  apple  banana   pear
1   pear    pear  apple
3  apple   apple   pear
In [396]: select_rows(df,['apple','banana','pear'])
Out[396]:
       A       B     C
0  apple  banana  pear
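One caveat with the function above: np.searchsorted returns an insertion position, so a search string that does not occur anywhere in the frame is not detected and may be matched against a neighbouring value. A minimal pandas-only sketch of the same "row contains every search string" selection that avoids this is shown here; select_rows_all is a hypothetical name, not part of the original answer.

```python
import numpy as np
import pandas as pd

def select_rows_all(df, search_strings):
    """Return the rows of df that contain every string in search_strings."""
    # One boolean row mask per search string: True where any column equals it.
    masks = [df.eq(s).any(axis=1) for s in search_strings]
    if not masks:
        # No search strings given: nothing to filter on, keep every row.
        return df
    # A row is kept only if all per-string masks are True for it.
    return df[np.logical_and.reduce(masks)]

df = pd.DataFrame({"A": ["apple", "pear", "banana", "apple"],
                   "B": ["banana", "pear", "pear", "apple"],
                   "C": ["pear", "apple", "pear", "pear"]})

print(select_rows_all(df, ["apple", "banana"]))
print(select_rows_all(df, ["apple", "pear"]))
```

A string that is absent from the frame simply yields an all-False mask here, so the result is empty rather than a false match.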
Answer 1: (score: 4)
For a single search value:
df[df.values == "banana"]
or
df[df.isin(['banana'])]
For multiple search terms:
df[(df.values == "banana") | (df.values == "apple")]
or
df[df.isin(['banana', "apple"])]
#         A       B      C
# 1   apple  banana    NaN
# 2     NaN     NaN  apple
# 3  banana     NaN    NaN
# 4   apple   apple    NaN
From Divakar: only rows containing both strings are returned.
select_rows(df,['apple','banana'])
#         A       B     C
# 0   apple  banana  pear
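As the output above shows, indexing with df.isin(...) keeps every row and replaces the non-matching cells with NaN. If the goal is to get the complete rows back, one option is to reduce the cell mask to one boolean per row first. A small sketch, using the question's data and illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"A": ["apple", "pear", "banana", "apple"],
                   "B": ["banana", "pear", "pear", "apple"],
                   "C": ["pear", "apple", "pear", "pear"]},
                  index=[1, 2, 3, 4])

terms = ["banana", "apple"]

# Cell-level mask with the same shape as df.
cell_mask = df.isin(terms)

# Indexing with the 2-D mask keeps all rows; non-matching cells become NaN.
masked = df[cell_mask]

# Collapsing the mask per row selects whole rows with at least one match.
rows_any = df[cell_mask.any(axis=1)]
rows_banana = df[df.isin(["banana"]).any(axis=1)]

print(masked)
print(rows_banana)
```

With these four rows, every row contains 'apple' or 'banana', so rows_any is the full frame, while rows_banana keeps only rows 1 and 3.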
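Answer 2: (score: 3)
You can create a boolean mask by comparing the whole df against the string, then call dropna with how='all' to drop the rows in which the string does not appear in any column:
In [59]:
df[df == 'banana'].dropna(how='all')
Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN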
To test for multiple values, you can use multiple masks:
In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana
Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN
In [91]:
apple = df[(df=='apple')].dropna(how='all')
apple
Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN
You can then use index.intersection to select only the index values the two results have in common:
In [93]:
df.loc[apple.index.intersection(banana.index)]
Out[93]:
       A       B     C
1  apple  banana  pear
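For reference, here is a self-contained sketch of the mask, dropna and index.intersection steps above, with index.union added for the "either value" case. The helper name rows_with is hypothetical and only used for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["apple", "pear", "banana", "apple"],
                   "B": ["banana", "pear", "pear", "apple"],
                   "C": ["pear", "apple", "pear", "pear"]},
                  index=[1, 2, 3, 4])

def rows_with(df, value):
    """Index labels of the rows in which `value` appears in any column."""
    # Non-matching cells become NaN; rows that are entirely NaN are dropped.
    return df[df == value].dropna(how="all").index

apple_idx = rows_with(df, "apple")
banana_idx = rows_with(df, "banana")

# Rows containing both values.
both = df.loc[apple_idx.intersection(banana_idx)]
# Rows containing at least one of the values.
either = df.loc[apple_idx.union(banana_idx)]

print(both)
print(either)
```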
Answer 3: (score: 0)
If you want all rows of df that contain any of the values in values, use:
df[df.isin(values).any(axis=1)]
Example:
In [2]: df
Out[2]:
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1
In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]:
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1