Question

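I have a list of strings, and I need to remove every element that matches a substring from another list. I am trying to do this with lists, nested loops, and regular expressions.

The snippet below outputs ["We don't", "need no", "education"] instead of the desired ["education"]. I am new to Python, this is my first experiment with regular expressions, and I keep getting stuck on the syntax.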

import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
dellist = []

for x in range(len(testfile)):
    for y in range(len(stopwords)):
        if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
            dellist.append(testfile[x])

for x in range(len(dellist)):
    if dellist[x] in testfile:
        del testfile[testfile.index(dellist[x])]

print testfile

The line

if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):

returns None on every iteration of the loop, so I suspect that is where my problem lies.

Answer 1

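That's because re.match only tests for a match at the start of the string.

Try re.search instead. Also, you left the r prefix off the second '\b'; without it, Python reads '\b' as a backspace character rather than a word boundary: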

if re.search(r'\b' + stopwords[y] + r'\b', testfile[x], re.I):
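
To see both problems in isolation, here is a small sketch (Python 3 syntax; the sample string is borrowed from the question, the variable names are only for illustration):

```python
import re

# Without the r prefix, '\b' is a single backspace character (0x08);
# with it, r'\b' is the two-character regex token for a word boundary.
backspace = '\b'
boundary = r'\b'

pattern = r'\b' + 'no' + r'\b'

# re.match only tries the pattern at the start of the string;
# re.search scans the whole string for the first match.
match_result = re.match(pattern, "need no", re.I)
search_result = re.search(pattern, "need no", re.I)

print(match_result)            # None
print(search_result.group(0))  # no
```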

Also, you could use a list comprehension to build dellist (you could probably use one to build the new testfile outright, but how escapes me right now):

dellist = [w for w in testfile for test in stopwords if re.search(test,w,re.I)]

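Another thought: since you are using the re module anyway, why not combine stopwords into \b(We|no)\b, so that each entry of testfile only has to be tested against a single regex?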

regex = r'\b(' + '|'.join(stopwords) + r')\b'  # r'\b(We|no)\b'

Now you just need to find the entries that do not match the regex:

newtestfile = [w for w in testfile if re.search(regex,w,re.I) is None]
# newtestfile is ['education']
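
Put together as a runnable sketch (Python 3 syntax). The re.escape and re.compile calls are my additions, not part of the snippet above: escaping is only a precaution in case a stopword ever contains a regex metacharacter, and compiling just avoids rebuilding the pattern for every entry.

```python
import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Build one alternation, e.g. \b(?:We|no)\b, escaping each stopword (assumed precaution).
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in stopwords) + r')\b',
    re.I,
)

# Keep only the entries in which no stopword appears as a whole word.
newtestfile = [entry for entry in testfile if pattern.search(entry) is None]

print(newtestfile)  # ['education']
```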

Answer 2

Why not use the basic in operator? It should be much faster than a regex.

for line in testfile:
    for word in stopwords:
        if word in line:
            pass  # do stuff with the matching line here

Or, a nice list comprehension ;)

[line for line in testfile if not [word for word in stopwords if word in line]]
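
The same filter can probably be written a little more readably with the built-in any(); this is a stylistic variant of the comprehension above, not something from the original snippet (Python 3 syntax). Note that in is a case-sensitive substring test.

```python
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Keep a line only if none of the stopwords occurs in it as a substring.
kept = [line for line in testfile if not any(word in line for word in stopwords)]

print(kept)  # ['education']
```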

Answer 3

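Using in instead of a regex is prettier, but the examples above break when a stopword is contained inside another word. This example only matches whole words: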

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
output = []

for sentence in testfile:
    bad = False

    # Compare whole words only, so "no" does not match inside "nothing"
    for word in sentence.split(' '):
        if word in stopwords:
            bad = True
            break

    if not bad:
        output.append(sentence)
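
To make the difference between the two approaches concrete, here is a rough sketch (Python 3 syntax). The helper name remove_lines_with_stopwords, the extra sample line "nothing else", and the lower-casing (to mimic re.I from the question) are all my own assumptions:

```python
def remove_lines_with_stopwords(lines, stopwords):
    """Drop every line that contains a stopword as a whole, space-separated word."""
    stop = {word.lower() for word in stopwords}  # case-insensitive lookup (assumption)
    return [
        line for line in lines
        if not any(token.lower() in stop for token in line.split())
    ]

lines = ["We don't", "need no", "education", "nothing else"]
stopwords = ["We", "no"]

# The substring test also drops "nothing else", because "no" occurs inside "nothing".
by_substring = [l for l in lines if not any(w in l for w in stopwords)]

# The whole-word test keeps it.
by_word = remove_lines_with_stopwords(lines, stopwords)

print(by_substring)  # ['education']
print(by_word)       # ['education', 'nothing else']
```

Splitting on whitespace treats punctuation as part of a word, so "no," would not match "no"; the regex with \b handles that case.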

How do I use regular expressions to match list items in Python?
