Question

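I have a regular expression that should find up to 10 words on a line. That is, it should include the words before the line feed but not the words after it. I am using a negative lookbehind with "\n".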

import re

a = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
r = a.search("THe car is parked in the garage\nBut the sun is shining hot.")

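When I run this regex and call r.group(), I get back the whole sentence except for the last word, which contains the period. I expected only the complete string before the newline, that is, "THe car is parked in the garage". What mistake am I making with the negative lookbehind?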

Answer 1

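I don't know why you would use a negative lookbehind. You say you want at most 10 words before a line feed. The regex below should work. It uses a positive lookahead to ensure that a line feed follows the words. Also, when searching for words, use `\b\w+\b` instead of what you are using.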

/(\b\w+\b)*(?=.*\\n)/

Python:

result = re.findall(r"(\b\w+\b)*(?=.*\\n)", subject)

Explanation:

# (\b\w+\b)*(?=.*\\n)
# 
# Match the regular expression below and capture its match into backreference number 1 «(\b\w+\b)*»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#    Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «*»
#    Assert position at a word boundary «\b»
#    Match a single character that is a “word character” (letters, digits, etc.) «\w+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
#    Assert position at a word boundary «\b»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*\\n)»
#    Match any single character that is not a line break character «.*»
#       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#    Match the character “\” literally «\\»
#    Match the character “n” literally «n»

You may also want to consider the case where there is no \n in your string.
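As a quick check of the note above, here is a minimal sketch run against the question's sample string. It assumes the pattern is used exactly as written, in a Python raw string, where `\\n` stands for a backslash followed by the letter n rather than a newline character. The variable names are only illustrative.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# In a raw-string pattern, \\n matches a literal backslash followed by "n".
# The sample string holds a real newline character, not those two characters.
has_literal_backslash_n = re.search(r"\\n", text) is not None
has_newline = re.search(r"\n", text) is not None

# The lookahead (?=.*\\n) can therefore never succeed on this input,
# so findall reports no matches at all.
result = re.findall(r"(\b\w+\b)*(?=.*\\n)", text)

print(has_literal_backslash_n, has_newline, result)  # False True []
```

If the input holds a real newline character, as it does here, the lookahead would need `\n` rather than `\\n` to find it.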

Answer 2

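For this task there is the anchor $, which finds the end of the string and, together with the modifier re.MULTILINE / re.M, the end of each line. So you would end up with something like this: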

(\b\w+\b[.\s /]{0,2}){0,10}$

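See it here on Regexr

\b is a word boundary. I included [.\s /]{0,2} to match a dot followed by a whitespace character in my example. If you don't want the dots, you need to make this part at least optional, like [\s /]?, otherwise it will be missing after the last word, and then \s matches the \n.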
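A minimal sketch of the `$` plus `re.MULTILINE` approach, assuming the sample string from the question. The variable names are illustrative.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# With re.MULTILINE, $ also matches just before each newline,
# so the match is forced to end at the end of a line.
pattern = re.compile(r"(\b\w+\b[.\s /]{0,2}){0,10}$", re.MULTILINE)
match = pattern.search(text)

print(repr(match.group()))  # 'THe car is parked in the garage'
```

The engine first tries to take ten words across the line break, finds that `$` does not match there, and backtracks until the match ends right before the newline.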

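Update / Idea 2

OK, maybe I misunderstood your question with my first solution.

If you just want to avoid matching the newline and continuing onto the second line, then simply don't allow it. The problem is that the newline is matched by the \s in your character class. \s is a class for whitespace, and that also includes the newline characters \r and \n.

You already have a space in the class, so just replace \s with \t if you want to allow tabs, and then you should be fine without the lookbehind. And of course, make the character class optional, otherwise the last word will not be matched either.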

((\w)+[\t /]?){0,10}

See it here on Regexr
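To illustrate the point about `\s`, here is a small sketch assuming the question's sample string. It first confirms that `\s` matches a newline, then runs the revised pattern. The variable names are illustrative.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# \s covers newline characters, which is why the original class ran past the line break.
newline_is_whitespace = re.fullmatch(r"\s", "\n") is not None

# The class [\t /] has no newline in it, so the repetition stops at the end of line one.
match = re.search(r"((\w)+[\t /]?){0,10}", text)

print(newline_is_whitespace, repr(match.group()))
# True 'THe car is parked in the garage'
```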

Answer 3

If I read you right, you want to read up to 10 words or up to the first newline, whichever comes first:

((?:(?<!\n)\w+\b[\s.]*){0,10})

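This uses a negative lookbehind, but only just before the word match, so it prevents picking up any word right after a newline.

This will need some tuning for imperfect input, but it's a start.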
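A minimal sketch of this pattern on the question's sample string, with illustrative variable names. Note that the match keeps the trailing newline, because `[\s.]*` consumes it before the lookbehind stops the next word.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# (?<!\n) rejects any word that starts right after a newline,
# so the repetition cannot continue onto the second line.
match = re.search(r"((?:(?<!\n)\w+\b[\s.]*){0,10})", text)

print(repr(match.group(1)))  # 'THe car is parked in the garage\n'
```

One example of input that needs tuning: if the second line starts with a space, `[\s.]*` swallows both the newline and the space, the next word is no longer directly preceded by `\n`, and the match runs on into the second line.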

Answer 4

I don't think you should be using a lookbehind at all. If you want to match up to ten words without crossing a newline, try this:

\S+(?:[ \t]+\S+){0,9}

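A word is defined here as one or more non-whitespace characters, which includes periods, apostrophes, and other sentence punctuation as well as letters. If you know the text you are matching is ordinary prose, there is no point in limiting yourself to \w+, which is not meant to match natural-language words anyway.

After the first word, it repeatedly matches one or more horizontal whitespace characters (space or TAB) followed by another word, for a maximum of ten words. If it encounters a newline before the tenth word, it simply stops matching at that point. There is no need to mention newlines in the regex at all.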
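A minimal sketch of both behaviors described above, assuming the question's sample string plus a made-up twelve-word line to show the ten-word cap. The variable names and the second input are illustrative.

```python
import re

pattern = re.compile(r"\S+(?:[ \t]+\S+){0,9}")

# Fewer than ten words before the newline: the match stops at the line break.
text = "THe car is parked in the garage\nBut the sun is shining hot."
first_line = pattern.search(text).group()

# A single line with twelve words: only the first ten are matched.
long_line = "one two three four five six seven eight nine ten eleven twelve"
capped = pattern.search(long_line).group()

print(repr(first_line))  # 'THe car is parked in the garage'
print(repr(capped))      # 'one two three four five six seven eight nine ten'
```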

Python, regular expression negative lookbehind behavior

4 answers: