Question

This is an extension of the question here.

As in the linked question, the answer there uses space? as the regex pattern to match strings with or without a space.

Problem statement:

I have a string and a list of phrases.

input_string = 'alice is a character from a fairy tale that lived in a wonder land. A character about whome no-one knows much about'

phrases_to_remove = ['wonderland', 'character', 'noone']

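What I want to do is remove the last occurrence of each word in the list phrases_to_remove from input_string.

output_string = 'alice is a character from a fairy tale that lived in a. A about whome knows much about'

Note: the words to remove may or may not occur in the string. If they do occur, they may appear in the same form ('wonderland', 'character', 'noone') or with a space or a hyphen (-) inside them, e.g. wonder land, no-one, character.

The problem with my code is that I cannot remove words that differ only by a space or a -, for example wonder land, wonderland and wonder-land.

I tried (-)?|( )? as the regex, but could not get it to work.

I need help.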

Answer 1

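Since you don't know where the separators can be, you could generate a regex made of ORed regexes (using word boundaries to avoid matching sub-words).

Each of those regexes interleaves the letters of the word with [\s\-]*, built with str.join over the characters of the word ([\s\-]* matches zero or more occurrences of whitespace or a dash).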

import re

input_string = 'alice is a character from a fairy tale that lived in a wonder - land. A character about whome no one knows much about'

phrases_to_remove = ['wonderland', 'character', 'noone']

the_regex = "|".join(r"\b{}\b".format(r'[\s\-]*'.join(x)) for x in phrases_to_remove)
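To see what one of the generated alternatives matches, here is a minimal sketch that builds the pattern for a single phrase and tries it on a few spellings. The helper name spaced_pattern and the use of re.escape are assumptions added for illustration; they are not part of the answer.

```python
import re

def spaced_pattern(phrase):
    # Hypothetical helper: put "[\s\-]*" between every pair of letters and
    # wrap the result in word boundaries, as the answer's one-liner does.
    separator = r"[\s\-]*"
    return r"\b" + separator.join(re.escape(ch) for ch in phrase) + r"\b"

pattern = re.compile(spaced_pattern("wonderland"))

for candidate in ["wonderland", "wonder land", "wonder-land", "wonder - land"]:
    print(candidate, "->", bool(pattern.search(candidate)))  # True for each

# The trailing \b rejects a longer word that merely starts with the phrase.
print("wonderlands", "->", bool(pattern.search("wonderlands")))  # False
```

Because the separator is allowed between every pair of letters, the pattern also accepts spellings such as "won der land", which may or may not be wanted.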

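Now to handle the "replace everything but the first occurrence" part: let's define an object that replaces every match except the first one (using an internal counter).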

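class Replacer:
    """Replacement callback for re.sub: keeps the first match, blanks every later one."""
    def __init__(self):
        self.__counter = 0  # number of matches kept so far (0 or 1)

    def replace(self, m):
        if self.__counter:
            # a match was already kept: remove this one
            return ""
        else:
            # first match overall: keep it unchanged
            self.__counter += 1
            return m.group(0)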

Now pass the replace method to re.sub:

print(re.sub(the_regex,Replacer().replace,input_string))

Result:

alice is a character from a fairy tale that lived in a . A  about whome  knows much about
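The result above still contains the whitespace that surrounded the removed words (double spaces and a space before the period), while the output_string in the question does not. A minimal sketch of one possible cleanup pass follows; the function name tidy and the punctuation set are assumptions, not something the answer specifies.

```python
import re

def tidy(text):
    # Hypothetical cleanup: collapse runs of whitespace into one space...
    text = re.sub(r"\s{2,}", " ", text)
    # ...and drop whitespace left in front of punctuation.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text.strip()

raw = 'alice is a character from a fairy tale that lived in a . A  about whome  knows much about'
print(tidy(raw))
# alice is a character from a fairy tale that lived in a. A about whome knows much about
```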

(The generated regex is quite complex, by the way: \bw[\s\-]*o[\s\-]*n[\s\-]*d[\s\-]*e[\s\-]*r[\s\-]*l[\s\-]*a[\s\-]*n[\s\-]*d\b|\bc[\s\-]*h[\s\-]*a[\s\-]*r[\s\-]*a[\s\-]*c[\s\-]*t[\s\-]*e[\s\-]*r\b|\bn[\s\-]*o[\s\-]*o[\s\-]*n[\s\-]*e\b)
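Note that the Replacer shares one counter across all phrases, so it keeps only the very first match overall and removes every later one; for this input that happens to coincide with what the question asks. A minimal sketch of the literal reading of the question, removing the last occurrence of each phrase separately, could look like this. The helper names phrase_pattern and remove_last_occurrence are hypothetical, and re.escape is an added precaution.

```python
import re

SEPARATOR = r"[\s\-]*"

def phrase_pattern(phrase):
    # Same construction as above: letters joined by optional spaces/dashes.
    return re.compile(r"\b" + SEPARATOR.join(map(re.escape, phrase)) + r"\b")

def remove_last_occurrence(text, phrases):
    for phrase in phrases:
        matches = list(phrase_pattern(phrase).finditer(text))
        if matches:
            last = matches[-1]
            # Cut the last match out of the string; earlier matches stay.
            text = text[:last.start()] + text[last.end():]
    return text

input_string = ('alice is a character from a fairy tale that lived in a wonder land. '
                'A character about whome no-one knows much about')
result = remove_last_occurrence(input_string, ['wonderland', 'character', 'noone'])
print(result)
# alice is a character from a fairy tale that lived in a . A  about whome  knows much about
```

Unlike the counter-based approach, this removes a phrase even when it occurs only once and comes before any other match, at the cost of one pass over the string per phrase.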

Answer 2

There is a problem with your regex. Using (-)?|( )? as the separator does not do what you expect.

Consider what happens when the list of words is a, b:

>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'

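You would like this regex to match ab, a b or a-b, but clearly it does not. The | operator has the lowest precedence, so the pattern splits into the two alternatives a(-)? and ( )?b, and it matches a, a-, b or <space>b instead!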

>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>

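To fix this, you can enclose the separator in its own group: ([- ])?.

If you also want to match words like wonder - land (i.e. with spaces before and after the hyphen), you should use the following instead: (\s*-?\s*)?.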
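As a quick check of the two separators suggested above, here is a minimal sketch that joins the same a, b word list with each of them. The variable names are illustrative only.

```python
import re

simple_sep = "([- ])?"         # one optional hyphen or space
loose_sep = r"(\s*-?\s*)?"     # optional hyphen with optional spaces around it

simple = re.compile(simple_sep.join(["a", "b"]))  # 'a([- ])?b'
loose = re.compile(loose_sep.join(["a", "b"]))    # 'a(\s*-?\s*)?b'

for text in ["ab", "a b", "a-b", "a - b"]:
    print(repr(text), bool(simple.fullmatch(text)), bool(loose.fullmatch(text)))
# 'ab' True True
# 'a b' True True
# 'a-b' True True
# 'a - b' False True

# Unlike the original 'a(-)?|( )?b', a lone "a" no longer matches.
print(simple.fullmatch("a"))  # None
```

If the captured separator is not needed, a non-capturing group such as (?:[- ])? behaves the same way for matching.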

Answer 3

You can use these one at a time:

For spaces:

^[ \t]+

For "-":

@"[^0-9a-zA-Z]+

Removing words with spaces or "-" in them in Python
