Question

I have a big block of text within which I am trying to look for a phrase. The phrase can be structured in a number of different ways.

1. First I want to look for a word from a set of words, let's call it set 1.
2. After that there must be a space or comma (or maybe something else that separates words).
3. Then there may be 0 or more words from set 2.
4. Again followed by the word separation characters as in point 2 above.
5. Finally there should be a word from set 3.

Ideally all of these should be in the same sentence.

set 2 = (for|to|of|full|a|be|complete|Internal)

So I have this regular expression:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Now this will match a phrase where there are 0 or 1 words from set 2, but not if there are several, e.g. "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."

As soon as I add 'a' before 'complete' it fails. The same happens if I add another 'complete'.

How do I specify to look for 0 or multiple words from a set?

Answer 1

Set 1: Matches any of the words in set 1 with 1 separator.

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]

Set 2: Matches any of the words in set 2 with 1 separator, 0 or more times.

((for|to|of|full|a|be|complete|Internal)[ ,])*

Set 3: Matches any of the words in set 3

(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Full:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
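
As a quick check, the full pattern can be tried in Python against variations of the sentence from the question. This is only a minimal sketch: the names SET1, SET2, SET3, PATTERN and find_phrase are made up for illustration, and the matching is case-sensitive, exactly as the pattern is written.

```python
import re

# The three word sets, copied from the pattern above.
SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# Set 1 + one separator, then (set 2 + one separator) repeated 0 or more times, then set 3.
PATTERN = re.compile(rf"({SET1})[ ,](({SET2})[ ,])*({SET3})")


def find_phrase(text):
    """Return the first matched phrase, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


print(find_phrase("the property needs renovation throughout."))
print(find_phrase("the property needs complete renovation throughout."))
print(find_phrase("the property needs a complete renovation throughout."))
```

Because each separator is exactly one character here, a comma followed by a space between two words would not match; changing `[ ,]` to `[ ,]+` would allow that.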

Answer 2

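Long alternations in a regular expression can be quite slow, so I suggest a different approach. First segment the text (split it into words), then iterate over the array of words and check whether three consecutive words come from the three sets, as you require.

Something like this (pseudocode, not real Python):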

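def check(text):
    words = segment(text)  # split the text into a list of words
    # look at every window of three consecutive words
    for i in range(len(words) - 2):
        if check_word1(words[i]) and check_word2(words[i+1]) and check_word3(words[i+2]):
            return True
    return False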
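
A runnable variant of this word-by-word idea might look like the sketch below. It rests on several assumptions that are not in the answer: the tokenizer is a simple letters-only split, the comparison is lower-cased (so it is case-insensitive, unlike the regex), and any run of set 2 words is skipped so that 0 or more of them are allowed. The names SET1, SET2 and SET3 are made up for illustration.

```python
import re

# Word sets from the question, lower-cased for case-insensitive lookup (an assumption).
SET1 = {"potential", "ability", "possibility", "need", "requires", "needs",
        "plenty", "for", "needing", "requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"}


def segment(text):
    """Split text into lower-case words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def check(text):
    """Return True if a set 1 word, then 0+ set 2 words, then a set 3 word occurs."""
    words = segment(text)
    for i, word in enumerate(words):
        if word not in SET1:
            continue
        j = i + 1
        # Skip over any run of set 2 words.
        while j < len(words) and words[j] in SET2:
            j += 1
        if j < len(words) and words[j] in SET3:
            return True
    return False


print(check("the property needs a complete renovation throughout."))
print(check("the garden needs watering."))
```

Set lookups are constant time, so the scan stays cheap even with long word lists. Note that this tokenizer discards punctuation, so it does not keep the match within a single sentence.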

Answer 3

You have to use this regular expression:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

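This is because you start with exactly one word from the first set. After that you have a space or comma. Next you have 0 or more words from set 2, then another space or comma, and finally one word from the last set.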

Answer 4

Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they work or don't.

In this case, you need the "zero or more" (*) operator on your second group. The result would be:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

However, since you probably want the words to be separated, include the separator inside the repeated group (you can use a non-capturing group for that), resulting in:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
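
To see how this last pattern behaves in Python, the sketch below builds it from the three sets and compares it with a stricter variant. The stricter variant, which uses `[ ,]+` to require at least one separator, is an assumption added for comparison and is not part of the answer; the names SET1, SET2, SET3, build, loose and strict are likewise made up for illustration.

```python
import re

SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")


def build(sep):
    """Build the pattern using `sep` as the separator sub-pattern."""
    return re.compile(rf"({SET1}){sep}(?:({SET2}){sep})*({SET3})")


loose = build("[ ,]*")   # the pattern from the answer: separators are optional
strict = build("[ ,]+")  # assumed variant: at least one separator is required

text = "the property needs, a complete renovation throughout."
print(loose.search(text).group(0))
print(strict.search(text).group(0))

# With optional separators, words that run together still match.
print(bool(loose.search("needsrenovation")))
print(bool(strict.search("needsrenovation")))
```

Both variants accept a comma followed by a space, but only the stricter one rejects words that are glued together. Since the set 2 group sits inside a repeated group, it holds only the last set 2 word that was matched.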

Python regular expressions 0 or more words from set
