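I'm trying to filter text with a regular expression in Python. The goal: check whether the text contains the word W where W is not preceded by X and not followed by Y. So let's say:
W = "day", X = "awful", Y = "light"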
"what a beautiful day it is" => should pass
"nice day" => should pass
"awful day" => should fail
"such an awful day" => should fail
"the day light" => should fail
"awful day light" => should fail
"day light" => should fail
I've tried a few things:
r".*\b(?!awful\b)day\b.*"
r"\W*\b(?!awful\b)day\b.*" => to be able to include \n \r, since '.' doesn't
r".*\b(day)\b(?!light\b).*"
r"\W*\b(day)\b(?!light\b)\W*" => to be able to include \n \r, since '.' doesn't
So a complete example (which should fail, but doesn't):
import re
if re.search(r".*\b(?!awful\b)day\b.*", "such an awful day", re.UNICODE):
    print("Found awful day! no good!")
I'm still wondering how to do this! Any ideas?
Answer 0: (score: 2)
Something like this?
# ^(?s)((?!X).)*W((?!Y).)*$
^
(?s)
(
(?! X )
.
)*
W
(
(?! Y )
.
)*
$
Or, with word boundaries:
# ^(?s)((?!\bX\b).)*\bW\b((?!\bY\b).)*$
^
(?s)
(
(?! \b X \b )
.
)*
\b W \b
(
(?! \b Y \b )
.
)*
$
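As a rough sketch of how the word-boundary variant might be applied from Python, assuming W, X and Y are plain words. The helper name `passes` is made up for illustration, and `re.DOTALL` is passed as a flag in place of the inline (?s), since recent Python versions only accept a global inline flag at the very start of the pattern.

```python
import re

def passes(text, w="day", x="awful", y="light"):
    """Hypothetical helper: True if some occurrence of w has no x
    anywhere before it and no y anywhere after it."""
    # Same shape as ^((?!\bX\b).)*\bW\b((?!\bY\b).)*$ with the words filled in.
    pattern = r"^((?!\b{x}\b).)*\b{w}\b((?!\b{y}\b).)*$".format(
        w=re.escape(w), x=re.escape(x), y=re.escape(y))
    # re.DOTALL plays the role of (?s): '.' also matches newlines.
    return re.search(pattern, text, re.DOTALL) is not None

for sample in ["nice day", "such an awful day", "the day light"]:
    print(sample, "=>", "pass" if passes(sample) else "fail")
```

Note that this reading is the "any number of characters" one: a text such as "an awful night, then a nice day" is rejected as well, because X appears somewhere before W.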
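Edit - It's unclear whether you consider X <-> W <-> Y to be separated by whitespace
or by any number of characters. The expanded, commented example below shows both ways.
Good luck!
Note - the (?add-remove) construct is a modifier group. It is typically a way to
embed options such as s (Dot-All), i (ignore case), etc. in the regex,
where (?s) means add the Dot-All modifier, and (?si) is the same but also includes ignore case.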
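In Python's re module the same options can be given either inline or as flags. A minimal sketch follows; the sample text and variable names are made up for illustration. Recent Python versions expect a global inline group such as (?s) at the very start of the pattern, so a pattern beginning with ^(?s) may need to be written as (?s)^ there.

```python
import re

text = "first line\nsecond line"

# Without Dot-All, '.' stops at the newline, so nothing is found.
no_flag = re.search(r"first.*second", text)

# (?s) at the start of the pattern switches Dot-All on.
inline_s = re.search(r"(?s)first.*second", text)

# (?si) adds ignore-case on top of Dot-All.
inline_si = re.search(r"(?si)FIRST.*SECOND", text)

# The same two options passed as flags instead of inline.
as_flags = re.search(r"FIRST.*SECOND", text, re.DOTALL | re.IGNORECASE)

print(no_flag is None, inline_s is not None, inline_si is not None)
```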
# ^(?s)(?!.*(?:\bX\b\s+\bW\b|\bW\b\s+\bY\b))(?:.*\b(W)\b.*|.*)$
# This regex validates W is not preceded by X
# nor followed by Y.
# It also optionally finds W.
# Only fails if it's invalid.
# If passed, can check if W present by
# examining capture group 1.
^ # Beginning of string
(?s) # Modifier group, with s = DOT_ALL
(?! # Negative lookahead assertion
.* # 0 or more any character (dot-all is set, so we match newlines too)
(?:
\b X \b \s+ \b W \b # Trying to match X, 1 or more whitespaces, then W
| \b W \b \s+ \b Y \b # Or, Trying to match W, 1 or more whitespaces, then Y
# Substitute this to find any interval between X<->W<->Y
# \b X \b .* \b W \b <- Trying to match X, 0 or more any char, then W
# | \b W \b .* \b Y \b <- Or, Trying to match W, 0 or more any char, then Y
)
)
# Still at start of line.
# If here, we didn't find any X<->W, nor W<->Y.
# Optionally finds W in group 1.
(?:
.* \b
( W ) # (1), W
\b .*
|
.*
)
$ # End of string
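The expanded form above maps fairly directly onto Python's `re.VERBOSE` mode, which ignores whitespace and allows # comments inside the pattern. The following is only a sketch under that assumption: `build_validator` is a hypothetical name, the words are escaped with `re.escape`, and `re.DOTALL` stands in for the inline (?s).

```python
import re

def build_validator(w, x, y):
    """Hypothetical helper: compile the whitespace-separated variant."""
    w, x, y = (re.escape(s) for s in (w, x, y))
    return re.compile(r"""
        ^
        (?!                        # reject if a forbidden pair occurs anywhere
            .*
            (?: \b{x}\b \s+ \b{w}\b    # X, whitespace, W
              | \b{w}\b \s+ \b{y}\b    # W, whitespace, Y
            )
        )
        (?: .* \b({w})\b .*        # W present: captured in group 1
          | .*                     # W absent: still valid
        )
        $
    """.format(w=w, x=x, y=y), re.DOTALL | re.VERBOSE)

validator = build_validator("day", "awful", "light")
for sample in ["nice day", "awful day", "no such word here"]:
    m = validator.match(sample)
    if m is None:
        print(sample, "=> invalid")
    else:
        print(sample, "=> valid, W found:", m.group(1) is not None)
```

As the comments in the regex say, a successful match only means the text is not invalid; group 1 has to be checked to know whether W is actually present.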
Answer 1: (score: 2)
You're almost there. Try:
(?<!\bawful\b )\bday\b(?!\s+\blight\b)
Demo:
st='''\
"what a beautiful day it is" => should pass
"nice day" => should pass
"awful day" => should fail
"such an awful day" => should fail
"the day light" => should fail
"awful day light" => should fail
"day light" => should fail'''
import re

W, X, Y = 'day', 'awful', 'light'
pat = r'(?<!\b{}\b )\b{}\b(?!\s+\b{}\b)'.format(X, W, Y)
for line in st.splitlines():
    m = re.search(pat, line)
    if m:
        print(line)  # only the lines that pass are printed
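If the check is needed in more than one place, the same pattern could be wrapped in a small function. This is a sketch only: `has_clean_word` is a hypothetical name, and `re.escape` is added on the assumption that W, X or Y might contain regex metacharacters.

```python
import re

def has_clean_word(text, w="day", x="awful", y="light"):
    """Hypothetical helper: True if w occurs without x directly before it
    (one space apart) and without y directly after it (whitespace apart)."""
    pat = r"(?<!\b{x}\b )\b{w}\b(?!\s+\b{y}\b)".format(
        x=re.escape(x), w=re.escape(w), y=re.escape(y))
    return re.search(pat, text) is not None

print(has_clean_word("nice day"))           # True
print(has_clean_word("such an awful day"))  # False
print(has_clean_word("the day light"))      # False
```

One limitation worth keeping in mind: Python's re requires a fixed-width lookbehind, so the pattern only rejects X when it is separated from W by exactly one space. A text like "awful  day" with two spaces would still pass.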