Question

I'm not comfortable with regular expressions, so I need your help with this one, which seems tricky to me.

Let's say I have the following string:

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

What regex would capture title:hello and title:world, remove them from the original string, and leave "title:quoted" in place because it is enclosed in double quotes?

I have already looked at this similar SO answer, and this is what I ended up with:

import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()

    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'

Of course, it doesn't work, and I'm not surprised, because regexes are esoteric to me.

Thanks for your help!
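For readers wondering why the attempt above misbehaves, here is a small diagnostic sketch (the names `left` and `result` are mine, for illustration only). Two things appear to go wrong: the `[^\"]` right after the opening quote has to consume one character, so the left alternative cannot match "title:quoted"; and the capture group `([^\"]*)` grabs every unquoted run, which the callback then deletes wholesale.

```python
import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

# Left alternative from the question, tested on its own.
left = r'\"[^\"]title:[^\s]+\"'

# [^\"] must consume one character between the quote and "title:",
# so the quoted token from the sample string is never matched.
print(re.fullmatch(left, '"title:quoted"'))    # None
print(re.fullmatch(left, '"xtitle:quoted"'))   # a match object

def replace(m):
    if m.group(1) is None:
        return m.group()
    return m.group().replace(m.group(1), "")

# The right alternative captures whole unquoted runs, so the callback
# removes the keywords too, not just the title: tokens.
result = re.sub(r'\"[^\"]title:[^\s]+\"|([^\"]*)', replace, string)
print('keyword1' in result)                    # False
```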

Final solution

Thanks for your answers. Here is the final solution, which meets my needs:

import re
matches = []

def replace(m):
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)

# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"

Answer 1

You can check for word boundaries (\b):

>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)
'keyword1 keyword2   "title:quoted" keyword3'
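One caveat about the call above, shown as a hedged sketch (the names `as_count` and `as_flags` are mine). The fourth positional parameter of `re.sub` is `count`, not `flags`, and `re.I` has the integer value 2, so the call effectively limits itself to two replacements. With `re.I` passed as `flags`, the quoted token is affected too, because `\b` also matches between a double quote and a letter.

```python
import re

s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
pattern = r'\btitle:\w+\b'

# re.sub's signature is re.sub(pattern, repl, string, count=0, flags=0),
# so a positional re.I lands in count. re.I has the integer value 2.
as_count = re.sub(pattern, '', s, count=2)
print(repr(as_count))   # 'keyword1 keyword2   "title:quoted" keyword3'

# Passed as flags, nothing limits the number of replacements, and \b
# also matches between the double quote and the "t" of title.
as_flags = re.sub(pattern, '', s, flags=re.I)
print(repr(as_flags))   # 'keyword1 keyword2   "" keyword3'
```

So on this input the word-boundary pattern seems to work only because the two unquoted tokens come first.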

Alternatively, you can use negative lookbehind and lookahead assertions to check that there are no quotes around title:\w+:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
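A short sketch of what the lookarounds do and do not cover (the sample inputs `adjacent` and `inside` are hypothetical, not from the question). The assertions only inspect the characters immediately next to the token, so a token in the middle of a longer quoted phrase is still removed.

```python
import re

pattern = re.compile(r'(?<!")title:\w+(?!")')

# Token directly wrapped in quotes: the lookbehind rejects it.
adjacent = 'a "title:quoted" b'
print(repr(pattern.sub('', adjacent)))  # 'a "title:quoted" b'

# Token inside a longer quoted phrase: neither neighbour is a quote,
# so the token is removed even though it sits between quotes.
inside = 'a "some title:quoted text" b'
print(repr(pattern.sub('', inside)))    # 'a "some  text" b'
```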

Answer 2

This situation is very similar to "regex-match a pattern unless...".

We can solve it with a very simple regex:

"[^"]*"|(\btitle:\S+)

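The left side of the alternation | matches complete "double quoted strings". We will ignore these matches. The right side matches strings such as title:hello and captures them in Group 1; we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):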

import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
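To make the keep/remove decision visible, here is a minimal sketch that lists each match and which side of the alternation produced it (the `decisions` list and the `phrase` input are my own additions for illustration).

```python
import re

regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

# Group 1 is set only when the right-hand alternative matched.
decisions = [
    (m.group(0), 'remove' if m.group(1) else 'keep')
    for m in regex.finditer(subject)
]
print(decisions)
# [('title:hello', 'remove'), ('title:world', 'remove'), ('"title:quoted"', 'keep')]

# Hypothetical extra input: a quoted phrase with spaces is consumed whole
# by the left-hand alternative, so the title: token inside it survives.
phrase = 'a "some title:x text" b title:y'
kept = regex.sub(lambda m: '' if m.group(1) else m.group(0), phrase)
print(repr(kept))  # 'a "some title:x text" b '
```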

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

Answer 3

re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3

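This replaces any substring that starts with title: and is followed by one or more word characters (\w+), provided the character just before title: is not a double quote. That preceding character is part of the match, so it is removed as well.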
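Because `[^"]` has to consume a character, this pattern cannot match a token at the very start of the string. A small sketch with a hypothetical input (`text` is not from the question) compares it with a lookbehind, which checks the preceding character without consuming it.

```python
import re

text = 'title:first keyword title:second'

# [^"] needs a real character before "title:", so the leading token stays.
consuming = re.sub(r'[^"]title:\w+', '', text)
print(repr(consuming))   # 'title:first keyword'

# (?<!") only asserts that no quote precedes the token; it also holds
# at the start of the string, and the surrounding spaces are left alone.
lookbehind = re.sub(r'(?<!")title:\w+', '', text)
print(repr(lookbehind))  # ' keyword '
```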

Answer 4

A bit brute-force, but it works in all cases, without catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
       "abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
       "non balanced quote title:foobar'''

pattern = re.compile(
    r'''(?:
            (      # other content
                (?:(?=(
                    " (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
                  |
                    [^t"]+             # all that is not a "t" or a quote
                  |
                    \Bt                # "t" preceded by word characters
                  |
                    t (?!itle:[a-z]+)  # "t" not followed by "itle:" + letters 
                )  )\2)+
            )
          |     # OR
            (?<!") # not preceded by a double quote
        )
        (?:\btitle:[a-z]+)?''',
    re.VERBOSE)

print(re.sub(pattern, r'\1', string))
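For comparison, a much simpler sketch that covers only part of what the pattern above aims for. It is not equivalent to it: it assumes that a backslash escapes the next character inside quotes, and it makes no attempt to handle unbalanced quotes. The names `drop_unquoted` and `sample` are mine.

```python
import re

# Left: a double-quoted string where a backslash escapes the next character.
# Right: an unquoted title: token, captured in group 1.
pattern = re.compile(r'"(?:[^"\\]|\\.)*"|(\btitle:[a-z]+)')

def drop_unquoted(m):
    # Keep quoted strings untouched; drop captured title: tokens.
    return '' if m.group(1) else m.group(0)

sample = r'"abcd \" title:bar" title:foo keywordtitle:keyword'
cleaned = pattern.sub(drop_unquoted, sample)
print(cleaned)  # "abcd \" title:bar"  keywordtitle:keyword
```

The two branches inside the quoted part cannot match the same character, which is what keeps the matching linear.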

Python regex to match a pattern not surrounded by double quotes
