Question

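I need some help running a filter over some data. I have a dataset made up of text, and I also have a list of words. I want to filter each row of my data so that the text remaining in the row consists only of words that appear in the list.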

words

(cell, CDKs, lung, mutations, monomeric, Casitas, Background, acquired, evidence, kinases, small, evidence, Oncogenic)


data

ID  Text

0   Cyclin-dependent kinases CDKs regulate a 

1   Abstract Background Non-small cell lung  

2   Abstract Background Non-small cell lung 

3   Recent evidence has demonstrated that acquired

4   Oncogenic mutations in the monomeric Casitas

So after applying my filter, I would like the data frame to look like this:

data

ID  Text

0    kinases CDKs  

1   Background cell lung  

2   Background small cell lung 

3   evidence acquired

4   Oncogenic mutations monomeric Casitas

I tried using iloc and similar functions, but I can't seem to get it to work. Any help with this?

Answer 1

You can simply use apply() together with a simple list comprehension:

>>> df['Text'].apply(lambda x: ' '.join([i for i in x.split() if i in words]))
0                             kinases CDKs
1                     Background cell lung
2                     Background cell lung
3                        evidence acquired
4    Oncogenic mutations monomeric Casitas

Also, to improve performance (O(1) average lookup time), I made words a set, and I suggest you do the same.
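A self-contained sketch of the approach above, for anyone who wants to run it end to end. The data frame is rebuilt from the sample rows in the question; the helper name keep_listed_words and the Filtered column are only illustrative and do not come from the original answer.

```python
import pandas as pd

# A set gives O(1) average membership tests, unlike a list.
words = {
    "cell", "CDKs", "lung", "mutations", "monomeric", "Casitas",
    "Background", "acquired", "evidence", "kinases", "small", "Oncogenic",
}

# Sample rows copied from the question.
df = pd.DataFrame({"Text": [
    "Cyclin-dependent kinases CDKs regulate a",
    "Abstract Background Non-small cell lung",
    "Abstract Background Non-small cell lung",
    "Recent evidence has demonstrated that acquired",
    "Oncogenic mutations in the monomeric Casitas",
]})


def keep_listed_words(text):
    """Keep only the whitespace-separated tokens that appear in `words`."""
    return " ".join(token for token in text.split() if token in words)


# Hypothetical output column, so the original text stays available.
df["Filtered"] = df["Text"].apply(keep_listed_words)
print(df["Filtered"])
```

Note that the match is exact and case-sensitive on whole tokens, so "Non-small" is not treated as "small". If hyphenated or punctuated tokens should match, the text would need to be tokenized differently before the membership test.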

Answer 2

I'm not sure this is the most elegant solution, but you could do:

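import pandas as pd

to_remove = ['foo', 'bar']
df = pd.DataFrame({'Text': [
    'spam foo& eggs',
    'foo bar eggs bacon and lettuce',
    'spam and foo eggs'
]})

# Join the words into one alternation pattern; regex=True is required
# for the '|' to act as "or" (the default is literal matching in pandas 2.0+).
df['Text'].str.replace('|'.join(to_remove), '', regex=True)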
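Note that this answer removes the listed words rather than keeping them, which is the inverse of what the question asks. If removal is what you need, a slightly more defensive variant might look like the sketch below. It assumes the entries should match whole words only and should be treated as literal text, so it escapes them with re.escape and wraps them in word boundaries; the names pattern and cleaned are illustrative.

```python
import re

import pandas as pd

to_remove = ["foo", "bar"]
df = pd.DataFrame({"Text": [
    "spam foo& eggs",
    "foo bar eggs bacon and lettuce",
    "spam and foo eggs",
]})

# Escape each word so regex metacharacters are matched literally, and use
# \b so that "foo" does not also match inside a longer word like "food".
pattern = r"\b(?:" + "|".join(re.escape(w) for w in to_remove) + r")\b"

cleaned = (
    df["Text"]
    .str.replace(pattern, "", regex=True)   # drop the listed words
    .str.replace(r"\s+", " ", regex=True)   # collapse leftover whitespace
    .str.strip()                            # trim leading/trailing spaces
)
print(cleaned)
```

The extra whitespace cleanup is a design choice: removing a word leaves its surrounding spaces behind, so without it the result contains doubled or leading spaces.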
