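Python regex: match a pattern that is not enclosed in double quotes

Time: 2014-06-08 21:35:03

Tags: python regex

I'm not comfortable with regular expressions, so I need your help with this one, which seems tricky to me.

Say I have the following string: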

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

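What regex would capture title:hello and title:world, remove them from the original string, and leave "title:quoted" in place because it is enclosed in double quotes?

I've already looked at this similar SO answer, and this is what I ended up with: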

import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()

    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'

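Of course, it doesn't work, and I'm not surprised, since regexes are esoteric to me.

Thanks for your help!

Final solution

Thanks for your answers. Here is the final solution that meets my needs: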

import re
matches = []

def replace(m):
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)

# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"

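4 answers:

Answer 0 (score: 6)

You can check for word boundaries (\b). Note that in the call below re.I is passed positionally, so re.sub reads it as count (its integer value is 2) rather than as a flag. Only the first two matches are replaced, which is why the quoted token survives; \b on its own also matches between " and t: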

>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)
'keyword1 keyword2   "title:quoted" keyword3'
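
To see the effect of the positional re.I, here is a small sketch that runs the same pattern once with count=2 (where a positional re.I ends up) and once with flags=re.I. The variable names limited and unlimited are only for illustration.

```python
import re

s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
pattern = r'\btitle:\w+\b'

# re.sub's signature is sub(pattern, repl, string, count=0, flags=0),
# so a positional re.I is read as count. Its integer value is 2.
assert int(re.I) == 2

# Equivalent to the positional call: only the first two matches are replaced.
limited = re.sub(pattern, '', s, count=2)

# Passing the flag by keyword replaces every match, including the quoted one,
# because \b also matches between '"' and 't'.
unlimited = re.sub(pattern, '', s, flags=re.I)

print(repr(limited))
print(repr(unlimited))
```

So the word-boundary pattern alone does not protect quoted tokens; it only looks that way on this particular input.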

Alternatively, you can use negative lookbehind and lookahead assertions to check that title:\w+ has no quote on either side:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
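
One edge case that may be worth knowing about: \w+ can backtrack to satisfy the lookahead. The sketch below uses a made-up input (odd) and a stricter variant with \b (strict); both are my assumptions for illustration, not part of the answer.

```python
import re

loose = r'(?<!")title:\w+(?!")'          # pattern from the answer
strict = r'(?<!")\btitle:\w+\b(?!")'     # assumption: \b added for illustration

# Hypothetical input with a closing quote but no opening quote.
odd = 'title:abc" rest'

# \w+ backtracks from 'abc' to 'ab' so that the lookahead sees 'c', not '"'.
loose_result = re.sub(loose, '', odd)

# With \b the token must end at a word boundary, so no partial match is possible.
strict_result = re.sub(strict, '', odd)

# On the sample string from the question both patterns behave the same.
s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
same_on_sample = re.sub(strict, '', s) == re.sub(loose, '', s)

print(repr(loose_result))
print(repr(strict_result))
```

Whether this matters depends on how well-formed the quoting in your input is.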

Answer 1 (score: 3)

This situation is very similar to "regex-match a pattern unless...".

We can solve it with a very simple regex:

"[^"]*"|(\btitle:\S+)

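The left side of the alternation | matches complete "double quoted strings". We ignore these matches. The right side matches the title:hello-style tokens and captures them in group 1; we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):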

import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
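
If you also want the removed tokens, as in the question's final solution, the same regex can probably be reused with findall. This is only a sketch; captured and cleaned are names I made up, and the whitespace collapsing is borrowed from the question.

```python
import re

subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')

# With exactly one capture group, findall returns that group for every match;
# matches of the quoted alternative yield an empty string, which we drop.
captured = [token for token in regex.findall(subject) if token]

# Same substitution as myreplacement, written as a lambda,
# followed by collapsing the leftover runs of spaces.
replaced = regex.sub(lambda m: '' if m.group(1) else m.group(0), subject)
cleaned = ' '.join(replaced.split())

print(captured)
print(cleaned)
```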

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

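Answer 2 (score: 1)

re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3

This replaces any substring that starts with title: followed by word characters (\w+), as long as the character before it is not a double quote. Because [^"] consumes that preceding character, the space before each token is removed too, and a token at the very start of the string is not matched.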
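
A small sketch of how this pattern behaves when the token is the first thing in the string. The input text and the lookbehind variant are assumptions added for comparison, not part of the answer.

```python
import re

consuming = r'[^"]title:\w+'       # pattern from this answer
lookbehind = r'(?<!")title:\w+'    # assumption: zero-width variant for comparison

# Hypothetical input where the token is the first thing in the string.
text = 'title:first keyword "title:quoted"'

# [^"] needs a character to consume, so the leading token is left untouched.
consuming_result = re.sub(consuming, '', text)

# A lookbehind consumes nothing and also succeeds at position 0.
lookbehind_result = re.sub(lookbehind, '', text)

print(repr(consuming_result))
print(repr(lookbehind_result))
```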

Answer 3 (score: 0)

A bit brute-force, but it works in all cases, without catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
       "abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
       "non balanced quote title:foobar'''

pattern = re.compile(
    r'''(?:
            (      # other content
                (?:(?=(
                    " (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
                  |
                    [^t"]+             # all that is not a "t" or a quote
                  |
                    \Bt                # "t" preceded by word characters
                  |
                    t (?!itle:[a-z]+)  # "t" not followed by "itle:" + letters 
                )  )\2)+
            )
          |     # OR
            (?<!") # not preceded by a double quote
        )
        (?:\btitle:[a-z]+)?''',
    re.VERBOSE)

print(re.sub(pattern, r'\1', string))