I have a big block of text within which I am trying to look for a phrase. The phrase can be structured in a number of different ways.
Ideally all of these should be in the same sentence.
set 1 = (Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)
set 2 = (for|to|of|full|a|be|complete|Internal)
set 3 = (renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
So I have this regular expression:
(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
Now this will match a phrase where there are 0 or 1 words from set 2, but not if there are several, e.g. "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."
As soon as I add 'a' before 'complete' it fails. The same happens if I add another 'complete'.
How do I tell the regex to look for zero or more words from a set?
Answer 0 (score: 3)
Set 1: Matches any of the words in set 1 with 1 separator.
(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]
Set 2: Matches any of the words in set 2 with 1 separator, 0 or more times.
((for|to|of|full|a|be|complete|Internal)[ ,])*
Set 3: Matches any of the words in set 3
(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
Full:
(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
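A quick way to check this pattern is to run it against phrases like the one in the question. The sketch below uses Python's re module; the variable names and the shortened test phrases are my own and are only there for illustration.

```python
import re

# The full pattern from this answer, split across lines for readability.
PATTERN = re.compile(
    r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]"
    r"((for|to|of|full|a|be|complete|Internal)[ ,])*"
    r"(renovate|improve|modernise|modernize|update|renovating|improving|"
    r"modernising|modernizing|updating|potential|project|renovation)"
)

# Made-up phrases with zero, one, two and three words from set 2.
samples = [
    "the property needs renovation throughout",
    "the property needs complete renovation throughout",
    "the property needs a complete renovation throughout",
    "the property needs a full complete renovation throughout",
]

# Collect the matched text for each phrase.
matches = [PATTERN.search(s).group(0) for s in samples]
for m in matches:
    print(m)
```

Each phrase should print from "needs" up to "renovation", however many set 2 words sit in between.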
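Answer 1 (score: 2)
Long alternations in regular expressions can be quite slow. I would suggest a different approach: first segment the text (split it into words), then iterate over the array of words and check whether each run of 3 consecutive words satisfies your requirement.
Something like this (pseudocode, not real Python):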
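def check(text):
    # segment() and check_word1/2/3() are helpers you supply yourself
    words = segment(text)
    # Slide a three-word window over the list of words
    for i in range(len(words) - 2):
        if check_word1(words[i]) and check_word2(words[i + 1]) and check_word3(words[i + 2]):
            return True
    return False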
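The word-by-word idea can be made runnable roughly as follows. This is only a sketch under a few assumptions of mine: words are runs of letters, sentences end at ".", "!" or "?", matching is case-insensitive, and the function name find_phrase is hypothetical. Unlike the three-word window above, it skips zero or more set 2 words, which is what the question asks for.

```python
import re

SET1 = {"potential", "ability", "possibility", "need", "requires", "needs",
        "plenty", "for", "needing", "requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"}


def find_phrase(text):
    """Return the first matching run of words, or None if there is none."""
    # Work one sentence at a time so a match never spans two sentences.
    for sentence in re.split(r"[.!?]", text):
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        for i, word in enumerate(words):
            if word not in SET1:
                continue
            j = i + 1
            # Skip zero or more words from set 2.
            while j < len(words) and words[j] in SET2:
                j += 1
            # The next word must come from set 3.
            if j < len(words) and words[j] in SET3:
                return " ".join(words[i:j + 1])
    return None


text = ("provides a wonderful opportunity for someone to add their own stamp "
        "as the property needs a complete renovation throughout.")
print(find_phrase(text))  # needs a complete renovation
```

Set lookups are constant time, so the cost grows with the number of words rather than with the size of the word lists.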
Answer 2 (score: 1)
You need to use this regular expression:
(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
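That is because there is exactly one word from the first set. After that you have a space or a comma. Next come 0 or more words from set 2, then another space or comma, and finally one word from the last set.
Answer 3 (score: 0)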
Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they do or don't work.
In this case, you need the "zero or more" (*) operator on your second group. The result would be:
(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
However, since you probably want the words to be separated, include the separator inside the repeated part (you can use a non-capturing group for that), resulting in:
(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
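To see the difference between the two expressions in this answer, here is a small Python sketch. The variable names are mine, and the patterns are assembled from the three word lists only to keep the lines short; the resulting regexes are the same as the two written out above.

```python
import re

SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# Star on the bare group: repeated set 2 words would have to touch each other.
star_on_group = re.compile(rf"({SET1})[ ,]*({SET2})*[ ,]*({SET3})")

# Star on a non-capturing group that also contains the separator.
star_on_group_and_separator = re.compile(rf"({SET1})[ ,]*(?:({SET2})[ ,]*)*({SET3})")

phrase = "needs a complete renovation"
print(star_on_group.search(phrase))                         # None
print(star_on_group_and_separator.search(phrase).group(0))  # needs a complete renovation
```

Note that because every separator is [ ,]* (zero or more), the second pattern also accepts words run together, such as "needsrenovation". Using [ ,]+ instead would require at least one space or comma between words.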