Question

I have a DataFrame df that contains uncleaned text strings:

                             phrase
 0           the quick brown br fox
 1   jack and jill went up the hill

I also have a list of words and letter groupings that I want to remove, called remove, which looks like:

['br', 'and']

In this example, I want the following output:

                         phrase
 0          the quick brown fox
 1   jack jill went up the hill

Note that the "br" inside "brown" stays in df because it is part of a larger word, while the standalone "br" is removed.

I tried:

df['phrase']=[re.sub(r"\b%remove\b", "", sent) for sent in df['phrase']]

but I couldn't get it to work. Can someone point me in the right direction?

Thanks

Answer 1

Use a nested list comprehension with split, test membership with in, and then join the split values back together:

L = ['br', 'and']

df['phrase']=[' '.join(x for x in sent.split() if x not in L) for sent in df['phrase']]
print (df)
                       phrase
0         the quick brown fox
1  jack jill went up the hill
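The same idea can be wrapped in a small helper. This is only a sketch: the name drop_words is made up for illustration, and it runs on a plain list of strings rather than on df so it can be tried without pandas. A set is used instead of a list because membership tests on a set do not slow down as the list of words grows.

```python
# Words and letter groupings to drop; a set gives fast membership tests.
remove = {"br", "and"}

def drop_words(sentence, words):
    """Return the sentence without any whitespace-separated token found in words."""
    return " ".join(token for token in sentence.split() if token not in words)

phrases = ["the quick brown br fox", "jack and jill went up the hill"]
cleaned = [drop_words(p, remove) for p in phrases]
print(cleaned)
```

Because the comparison is done on whole tokens, "brown" is left alone. On the DataFrame this should be equivalent to df['phrase'].map(lambda s: drop_words(s, remove)).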

Answer 2

I think replace would work here:

s=[r'\b'+x+r'\b' for x in L]

df.phrase.str.replace('|'.join(s), '', regex=True)
Out[176]: 
0           the quick brown  fox
1    jack  jill went up the hill
Name: phrase, dtype: object
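The output above still has a double space where each word was removed. Here is a minimal sketch of one way to tidy that up, assuming the entries in remove should be matched literally (hence re.escape). The helper name strip_words is hypothetical, and plain strings are used so the snippet runs without pandas.

```python
import re

remove = ["br", "and"]

# re.escape guards against entries that contain regex metacharacters;
# \b on both sides restricts the match to whole words.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, remove)) + r")\b")

def strip_words(sentence):
    """Delete whole-word matches, then collapse the leftover whitespace."""
    without_words = pattern.sub("", sentence)
    return " ".join(without_words.split())

phrases = ["the quick brown br fox", "jack and jill went up the hill"]
cleaned = [strip_words(p) for p in phrases]
print(cleaned)
```

On the DataFrame, df['phrase'].map(strip_words) should give the same result. This also shows why the attempt in the question did not work: "%remove" inside the raw string is never substituted, so the pattern searches for the literal text "%remove".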

Remove a list of letter groupings and words from a DataFrame filled with sentences
