In my project I need to check whether certain values occur anywhere in a DataFrame column. Example DataFrame:
import pandas as pd
df = pd.DataFrame([['abc', 'a'], ['def', 'x'], ['aef', 'f']])
df.columns = ['a', 'b']
>>> df
     a  b
0  abc  a
1  def  x
2  aef  f
This hard-coded check works fine:
df['a'].str.contains('f').any()
True
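I need to iterate over the rows and check whether each value in column "b" is contained anywhere in column "a". I have not found a way to do this. Here is how I expected it to work, but it raises an error: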
df['c']=df.apply(lambda row:df['a'].str.contains(row['b']).any())
...
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: ('b', 'occurred at index a')
Any ideas?
Update: I realise my df was not a good example. Here is a better one (including the expected result):
     a  b      c
0  abc  a   True
1  def  b   True
2  aef  x  False
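The KeyError above reports "occurred at index a" because DataFrame.apply works column by column by default (axis=0), so the lambda received column a instead of a row and row['b'] failed. A minimal sketch of the row-wise version, assuming the updated example data and that the values in b should be matched as literal substrings (hence regex=False, which is an addition and not part of the original code):

```python
import pandas as pd

# Updated example from the question
df = pd.DataFrame({'a': ['abc', 'def', 'aef'], 'b': ['a', 'b', 'x']})

# axis=1 passes each row to the lambda; the default axis=0 passes each column.
# regex=False treats every value of 'b' as a plain substring, not a pattern.
df['c'] = df.apply(
    lambda row: df['a'].str.contains(row['b'], regex=False).any(),
    axis=1,
)
print(df)
```

This runs one str.contains pass over column a for every row, which is fine for small frames but scales quadratically with the number of rows.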
Answer 0 (score: 1)
Use Series.str.extractall and Series.isin to test:
import pandas as pd

df = pd.DataFrame([['#123 - some text', '', False],
                   ['#124 - some text', '123', True],
                   ['#125 - some text', '', False],
                   ['#126 - some text', '126', True],
                   ['#127 - some text', '123', True],
                   ['#128 - some text', '129', False]],
                  columns=['Text', 'ID', 'Expected result'])
# Collect every ID that occurs anywhere in the Text column
s = df['Text'].str.extractall("(" + '|'.join(set(df['ID'])) + ")")[0].dropna()
# Flag the rows whose ID is among the extracted values
df['new'] = df['ID'].isin(s)
print(df)
               Text   ID  Expected result    new
0  #123 - some text                 False  False
1  #124 - some text  123             True   True
2  #125 - some text                 False  False
3  #126 - some text  126             True   True
4  #127 - some text  123             True   True
5  #128 - some text  129            False  False
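Details:
First, build a pattern from all unique values: take a set of the IDs and join it with |, the regex OR. A set is unordered, so the alternatives may appear in a different order from run to run:
print("(" + '|'.join(set(df['ID'])) + ")")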
(|123|129|126)
Then extract all matching values from Text, drop the missing values with Series.dropna, and finally test membership with isin:
print(df['Text'].str.extractall("(" + '|'.join(set(df['ID'])) + ")")[0].dropna())
   match
0  2        123
3  2        126
Name: 0, dtype: object
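The pattern above contains an empty alternative (from the empty IDs) and inserts the IDs into the regex unescaped. A hedged variant follows, assuming that empty IDs should never count as a match and that IDs are meant to be matched literally. The names ids, pattern, and found are illustrative and not from the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Text': ['#123 - some text', '#124 - some text', '#125 - some text',
             '#126 - some text', '#127 - some text', '#128 - some text'],
    'ID': ['', '123', '', '126', '123', '129'],
})

# Drop empty IDs and escape the rest so each one is matched literally
ids = sorted({i for i in df['ID'] if i})
pattern = '(' + '|'.join(re.escape(i) for i in ids) + ')'

# Every ID that occurs somewhere in the Text column
found = set(df['Text'].str.extractall(pattern)[0])

# A row is flagged when its own ID was found in any Text value
df['new'] = df['ID'].isin(found)
print(df)
```

With the sample data this reproduces the Expected result column. If one ID can be a prefix of another, the order of the alternatives starts to matter, which this sketch does not handle.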