Question

I have a text string and want to keep only specific words.

sample = "This is a test text. Test text should pass the test"
approved_list = ["test", "text"]

Expected output:

"test text Test text test"

I have read many regex-based answers, but unfortunately none of them address this specific problem.

Can the solution also be extended to a pandas Series?

Answer 1

You don't need pandas for this. Use the regular expression module re:

import re

re.findall('|'.join(approved_list), sample, re.IGNORECASE)

['test', 'text', 'Test', 'text', 'test']
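Note that findall returns a list, while the question asks for a single string, and a plain '|'.join pattern would also match inside longer words (for example "test" inside "testing"). A minimal sketch of one way to handle both, assuming whole-word matching is what is wanted; the names pattern, kept and result are only illustrative:

```python
import re

sample = "This is a test text. Test text should pass the test"
approved_list = ["test", "text"]

# Escape each approved word and wrap the alternation in word boundaries
# so that only whole words match (assumption: "testing" should not count).
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, approved_list)) + r")\b",
    re.IGNORECASE,
)

kept = pattern.findall(sample)   # list of matched words, original casing kept
result = " ".join(kept)          # join back into one string
print(result)  # test text Test text test
```

If substring matches are actually desired, drop the \b anchors and keep the rest as is.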

If you have a pd.Series:

import pandas as pd

sample = pd.Series(["This is a test text. Test text should pass the test"] * 5)
approved_list = ["test", "text"]

Use the .str string accessor:

sample.str.findall('|'.join(approved_list), re.IGNORECASE)

0    [test, text, Test, text, test]
1    [test, text, Test, text, test]
2    [test, text, Test, text, test]
3    [test, text, Test, text, test]
4    [test, text, Test, text, test]
dtype: object
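To get one string per row instead of a list, the lists can probably be joined with a second .str call. This is a sketch under the same whole-word assumption as before; series, pattern and joined are illustrative names, not part of the original answer:

```python
import re
import pandas as pd

series = pd.Series(["This is a test text. Test text should pass the test"] * 5)
approved_list = ["test", "text"]

# Same whole-word, case-insensitive pattern as for a plain string.
pattern = r"\b(?:" + "|".join(map(re.escape, approved_list)) + r")\b"

# findall gives a list per row; str.join turns each list into one string.
joined = series.str.findall(pattern, flags=re.IGNORECASE).str.join(" ")
print(joined.iloc[0])  # test text Test text test
```

Rows with no approved words end up as empty strings, since joining an empty list yields "".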
