I'm not comfortable with regular expressions, so I need your help with something that seems tricky to me.
Say I have the following string:
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
What regular expression would capture title:hello and title:world, remove them from the original string, and leave "title:quoted" in place because it is enclosed in double quotes?
I have already looked at this similar SO answer, and this is what I ended up with:
import re
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
def replace(m):
    if m.group(1) is None:
        return m.group()
    return m.group().replace(m.group(1), "")
regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)
assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
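Of course, it doesn't work, which doesn't surprise me, since regular expressions are esoteric to me.
Thanks for your help!
Thanks for your answers. Here is the final solution, which meets my needs: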
import re
matches = []
def replace(m):
    # record each removed token, then drop it from the string
    matches.append(m.group())
    return ""
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)
# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())
assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"
Answer 0 (score: 6)
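You can check for word boundaries (\b):
>>> import re
>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)
'keyword1 keyword2   "title:quoted" keyword3'
Note that the fourth positional argument of re.sub is count, not flags, so re.I (whose value is 2) limits this call to the first two replacements. That is the only reason the quoted token survives: \b also matches between the opening quote and the t of title, so the word boundary alone does not exclude "title:quoted".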
Alternatively, you can use negative lookbehind and lookahead assertions to check that title:\w+ is not surrounded by quotes:
>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
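To make the difference between the two patterns concrete, here is a minimal sketch using the same sample string. It passes flags by keyword (so nothing is silently treated as count) and collapses the leftover spaces afterwards; the variable names are only illustrative.

```python
import re

s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

# \b alone does not protect the quoted token: there is a word boundary
# between '"' and 't', so all three occurrences match.
boundary_hits = re.findall(r'\btitle:\w+\b', s, flags=re.I)

# The lookaround version skips the occurrence wrapped in double quotes.
lookaround_hits = re.findall(r'(?<!")title:\w+(?!")', s, flags=re.I)

# Remove the unquoted tokens, then collapse the leftover runs of spaces.
removed = re.sub(r'(?<!")title:\w+(?!")', '', s, flags=re.I)
cleaned = ' '.join(removed.split())

print(boundary_hits)
print(lookaround_hits)
print(cleaned)
```

Keep in mind that the lookarounds only inspect the characters directly next to the token, so a title: token in the middle of a longer quoted phrase would still be removed.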
Answer 1 (score: 3)
This situation is very similar to "regex-match a pattern unless...", and we can solve it with a very simple regular expression:
"[^"]*"|(\btitle:\S+)
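The left side of the alternation | matches complete "double quoted strings". We will ignore these matches. The right side matches tokens such as title:hello and captures them in group 1; we know they are the right ones because they were not matched by the expression on the left.
This program shows how to use the regex (see the results at the bottom of the online demo):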
import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
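Since the question also asks to get the removed tokens, the same pattern can be wrapped in a small helper that collects group 1 while substituting. This is only a sketch: split_titles and its return shape are hypothetical names, not part of the answer above, and the sample input is made up to show a quoted phrase containing spaces.

```python
import re

PATTERN = re.compile(r'"[^"]*"|(\btitle:\S+)')

def split_titles(text):
    """Return (cleaned_text, titles) for unquoted title:... tokens."""
    titles = []

    def _replace(m):
        if m.group(1):              # right-hand branch: an unquoted token
            titles.append(m.group(1))
            return ''
        return m.group(0)           # left-hand branch: keep the quoted string

    # Note: this also collapses repeated whitespace inside quoted strings.
    cleaned = ' '.join(PATTERN.sub(_replace, text).split())
    return cleaned, titles

cleaned, titles = split_titles(
    'keyword1 title:hello "a title:quoted phrase" title:world')
print(cleaned)
print(titles)
```

Because the left branch consumes the whole quoted string, a title: token anywhere inside the quotes is protected, not just one that touches the quote characters.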
Answer 2 (score: 1)
re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3
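This replaces any substring that starts with title: followed by word characters (\w+), as long as the character just before it is not a double quote. That preceding character is removed along with the match.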
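One consequence of using [^"] is that it has to consume a real character, so a token at the very start of the string is not matched. The sketch below, on a made-up input, contrasts that with a negative lookbehind, which checks the previous character without consuming it.

```python
import re

s = 'title:first keyword "title:quoted" title:second'

# [^"] must consume one character, so the token at position 0 is missed,
# and the space before title:second is removed together with it.
consuming = re.sub(r'[^"]title:\w+', '', s)

# (?<!") only asserts that no quote precedes the token; it consumes nothing
# and also succeeds at the start of the string.
lookbehind = re.sub(r'(?<!")title:\w+', '', s)

print(repr(consuming))
print(repr(lookbehind))
```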
Answer 3 (score: 0)
A bit brute-force, but it works in all cases and without catastrophic backtracking:
import re
string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
"abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
"non balanced quote title:foobar'''
pattern = re.compile(
r'''(?:
( # other content
(?:(?=(
" (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
|
[^t"]+ # all that is not a "t" or a quote
|
\Bt # "t" preceded by word characters
|
t (?!itle:[a-z]+) # "t" not followed by "itle:" + letters
) )\2)+
)
| # OR
(?<!") # not preceded by a double quote
)
(?:\btitle:[a-z]+)?''',
re.VERBOSE)
print(re.sub(pattern, r'\1', string))