Python regular expressions 0 or more words from set

Time: 2019-01-07 13:49:36

Tags: python regex

I have a big block of text within which I am trying to look for a phrase. The phrase can be structured in a number of different ways.

  1. First I want to look for a word from a set of words; let's call it set 1.
  2. After that there must be a space or comma (or maybe some other character that separates words).
  3. Then there may be 0 or more words from set 2.
  4. These are again followed by the word-separating characters from point 2 above.
  5. Finally there should be a word from set 3.

Ideally all of these should be in the same sentence.

set 1 = (Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)

set 2 = (for|to|of|full|a|be|complete|Internal)

set 3 = (renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

So I have this regex:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Now this will match a phrase where there is 0 or 1 word from set 2, but not one where there are several, e.g. "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."

As soon as I add 'a' before 'complete' it fails. The same happens if I add another 'complete'.
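
The behaviour described above can be reproduced with a short script. This is only a sketch: it assumes Python's standard `re` module, the names `SET1`, `SET2`, `SET3` and `original` are illustrative, and the test sentences are shortened versions of the example sentence.

```python
import re

# The three word sets from the question, written as regex alternations.
SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# The regex from the question: the set-2 group has no quantifier,
# so it can only consume a single word.
original = re.compile(rf"({SET1})[ ,]*({SET2})[ ,]*({SET3})")

one_word = "the property needs complete renovation throughout."
two_words = "the property needs a complete renovation throughout."

print(bool(original.search(one_word)))   # True
print(bool(original.search(two_words)))  # False
```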

How do I specify to look for 0 or multiple words from a set?

4 answers:

Answer 0 (score: 3)

Set 1: Matches any of the words in set 1 with 1 separator.

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]

Set 2: Matches any of the words in set 2 with 1 separator, 0 or more times.

((for|to|of|full|a|be|complete|Internal)[ ,])*

Set 3: Matches any of the words in set 3

(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Full:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
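
As a rough check, the full pattern above can be tried from Python. The sketch below assumes the standard `re` module; the helper `find_phrase` and the test sentences are made up for illustration and are not part of the answer.

```python
import re

SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# Set-1 word + separator, then zero or more (set-2 word + separator) pairs,
# then a set-3 word.
pattern = re.compile(rf"({SET1})[ ,](({SET2})[ ,])*({SET3})")

def find_phrase(text):
    """Return the matched phrase, or None if the text contains no match."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(find_phrase("the property needs complete renovation throughout."))
# needs complete renovation
print(find_phrase("the property needs a complete renovation throughout."))
# needs a complete renovation
print(find_phrase("the property needs renovation throughout."))
# needs renovation
```

Because the separator sits inside the repeated group, every set-2 word has to be followed by exactly one space or comma, which is what lets the group repeat cleanly.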

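Answer 1 (score: 2)

Long alternations in regular expressions can be quite slow. I would suggest a different approach. First segment the text (split it into words), then iterate over the array of words and check whether a run of 3 consecutive words satisfies your requirements.

Something like this (pseudocode, not real Python):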

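def check(text):
    words = segment(text)  # split the text into a list of words
    for i in range(len(words) - 2):
        # test each run of three consecutive words against sets 1, 2 and 3
        if check_word1(words[i]) and check_word2(words[i+1]) and check_word3(words[i+2]):
            return True
    return False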
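
A runnable variant of this idea might look like the sketch below. It goes beyond the pseudocode in two ways, both of which are assumptions: it skips any number of set-2 words instead of exactly one, and it lower-cases the text so matching is case-insensitive. The tokenizer (`re.findall` on runs of letters) and the name `find_phrase_in_words` are also illustrative choices.

```python
import re

# Word sets from the question, lower-cased for case-insensitive lookup.
SET1 = {"potential", "ability", "possibility", "need", "requires", "needs",
        "plenty", "for", "needing", "requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update",
        "renovating", "improving", "modernising", "modernizing", "updating",
        "potential", "project", "renovation"}

def find_phrase_in_words(text):
    """Return the first run 'set-1 word, 0+ set-2 words, set-3 word', or None."""
    words = re.findall(r"[a-z]+", text.lower())  # crude segmentation into words
    for i, word in enumerate(words):
        if word not in SET1:
            continue
        j = i + 1
        # Skip over any number of set-2 words (possibly none).
        while j < len(words) and words[j] in SET2:
            j += 1
        if j < len(words) and words[j] in SET3:
            return words[i:j + 1]
    return None

print(find_phrase_in_words("the property needs a complete renovation throughout."))
# ['needs', 'a', 'complete', 'renovation']
```

Note that this tokenizer discards all punctuation, so a run of words may span a sentence boundary; splitting the text into sentences first would be needed to keep matches within one sentence.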

Answer 2 (score: 1)

You have to use this regular expression:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

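This is because there is exactly one word from the first set. After it comes a space or comma. Next come 0 or more words from set 2, then another space or comma, and finally one word from the last set.

Answer 3 (score: 0)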

Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they do or don't work.

In this case, you need the "zero or more" (*) operator on your second group. The result would be:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

However, since you probably want the words to be separated, just include the separator inside the repeated group (you can use a non-capturing group for that), resulting in:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
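
One thing worth knowing when using this last pattern from Python: a capturing group inside a repeated group only keeps its final repetition. The sketch below, which assumes the standard `re` module and uses illustrative names and sentences, shows that behaviour along with one possible way to recover every set-2 word by capturing the whole repeated span instead.

```python
import re

SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# The pattern from the answer: the outer group is non-capturing and repeated,
# the inner set-2 group is capturing.
pattern = re.compile(rf"({SET1})[ ,]*(?:({SET2})[ ,]*)*({SET3})")

m = pattern.search("the property needs a complete renovation throughout.")
print(m.group(0))  # needs a complete renovation
print(m.groups())  # ('needs', 'complete', 'renovation')

m = pattern.search("the property needs renovation throughout.")
print(m.groups())  # ('needs', None, 'renovation')

# Variant (an assumption, not from the answer): capture the whole repeated
# span as one group, then split it to get every set-2 word.
pattern_all = re.compile(rf"({SET1})[ ,]*((?:(?:{SET2})[ ,]*)*)({SET3})")
m = pattern_all.search("the property needs a complete renovation throughout.")
middle_words = m.group(2).replace(",", " ").split()
print(middle_words)  # ['a', 'complete']
```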