Find specific related words starting from the first sentence

Time: 2017-10-09 20:50:22

Tags: python regex python-3.x

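Example:

I have this string in a file (number $1one bla bla $2second). First, I find this line using a regex:

$1one bla bla $2second

Then, after that line has been found, I need to match the words that contain '$' in other lines. Example:

number $1one bla bla $2second

=> $1one

=> $2second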

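Other lines:

this is bla bla bla $2second bla bla

bla bla $2second in bla bla

another bla bla $1one bla $3third bla $2second

$1one bla bla bla

After the lines above have been found, find another word that contains '$' again (in the example above: $3third):

another bla bla bla $3third bla $2second

$3third bla bla bla

This continues until all the '$' words have been found (no more new words containing '$').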

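I have already done the first step with a regex. The problem is that I don't know how to search for another specific word after the regex has matched. Should I use a regex again, or is there another way to find it?

1 Answer:

Answer 0: (score: 0)

Updated after clarification from the OP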

# Import the third-party regex library; it supports \K, which the standard re module does not
import regex as re

s="""this is bla bla bla $2second bla bla
And then after that line was found, I need to match the word that contains '$' in another line Example :
number $1one bla bla $2second
And then after that line was found, I need to match the word that contains '$' in another line Example :
number $1one bla bla $2second
=> $1one
=> $2second
Another line:
this is bla bla bla $2second bla bla
bla bla $2second in bla bla
another bla bla bla $5fifth bla $4fourth
another bla bla $1one bla $3third bla $2second
$1one bla bla bla
After above line was found, find another word that contains '$' again (example above : $3third)
$4fourth bla bla bla'"""


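def findre(s,valids=set([])):
    # valids is a mutable default on purpose: it persists across calls and
    # accumulates the escaped $words accepted so far.
    # Alternative 1: starting at a valid token, skip lazily (\K discards the
    # skipped text) and match the next $word.
    # Alternative 2: a $word that is followed later on the line by a valid token.
    p=re.compile(r'(?=' + '|'.join(valids) + r').*?\K\$\w+\b|\$\w+\b(?=.*?(?:' + '|'.join(valids) + '))')
    l=p.findall(s)
    for x in l:
        valids.add(re.escape(x))  # escape '$' before reusing the word in a pattern
    return l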

ss=s.split('\n')
dw=[x for s in ss for x in findre(s)]
print(dw)

Output

['$2second', '$1one', '$2second', '$1one', '$2second', '$1one', '$2second', '$2second', '$2second', '$1one', '$3third', '$2second', '$1one', '$3third']

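Note that it skips the dollar words $4fourth and $5fifth, which have no earlier supporting token.
The other thing I assume here is that the first supporting tokens are the first $words we match starting from the beginning. If that is not the case, all you need to do is change

def findre(s,valids=set([])):

to

def findre(s,valids=set([re.escape('$myfirsttoken'),re.escape('$mysecondtoken')])):

This ensures that matching starts from the first occurrence of those tokens/$words. The seed tokens have to be regex-escaped, because the set is joined directly into the pattern.

I have just used \w here; you can replace it with [\w-] if your words contain hyphens.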
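To illustrate the difference between \w and [\w-], here is a minimal sketch using the standard re module. The sample line and the variable names are made up for illustration, and the trailing \b from the answer's pattern is left out to keep the example short.

```python
import re

# Hypothetical sample line containing a hyphenated $word.
line = "bla $first-token bla $2second"

# \w does not include '-', so the match stops at the hyphen.
plain = re.findall(r'\$\w+', line)
print(plain)    # ['$first', '$2second']

# Adding '-' to the character class keeps hyphenated words whole.
hyphen = re.findall(r'\$[\w-]+', line)
print(hyphen)   # ['$first-token', '$2second']
```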

Explanation

                    #START WITH A VALID
(?=                 #START Positive Lookahead forces start with one of following words, but does not include in match         
'|'.join(valids)    #Creates a string of $1one|$2two and builds as it picks up valids
)                   #End positive lookahead
.*?\K               #Any characters and \K drops them all, you can put this inside positive lookahead if you want
\$\w+\b             #Any $word after the conditions of starting with key word before this
|                   #OR if $validword comes after any $word
\$\w+\b             #Any $word that match condition that follows
(?=                 #Positive Lookahead forces the match to end with
.*?                 #Any characters
(?:                 #non capturing group->don't include in match
'|'.join(valids)    #Creates a string of $1one|$2two and builds as it picks up valids
)                   #End non capturing group
)                   #End positive lookahead

In a regex, "|" means OR, so a|b|c matches a or b or c.
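As a small illustration of how such an alternation can be built from a list of tokens, here is a sketch using only the standard re module. The token list and the test string are assumptions chosen for the example; the point is that each token goes through re.escape first, so that '$' is matched literally rather than as an end-of-line anchor.

```python
import re

# Hypothetical list of already-known tokens.
tokens = ['$1one', '$2second']

# re.escape turns '$' into '\$' so it is treated as a literal character.
alternation = '|'.join(re.escape(t) for t in tokens)
pattern = re.compile('(?:' + alternation + ')')

found = pattern.findall('number $1one bla bla $2second and $3third')
print(alternation)  # \$1one|\$2second
print(found)        # ['$1one', '$2second']
```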

Update: using the named list option (\L<name>) in the new regex library, the following makes it easier to use.

def addToValid(m):
    valids.add(m)
    return m

valids=set([])
dw=[addToValid(x) for s in ss for x in re.findall(r'(?=\L<l>.*?)\$\w+\b|\$\w+\b(?=.*?\L<l>)',s,l=valids)]
print(dw)
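
If installing the third-party regex library is not an option, the "repeat until no new '$' word appears" idea from the question can also be sketched with the standard re module and a set. This is not the answer's method but a simplified variant: it assumes that any line sharing at least one known $word contributes all of its $words, whereas the answer's pattern roughly only picks up a new word when a known one follows it on the same line. The names collect_related, DOLLAR_WORD and seed are hypothetical.

```python
import re

DOLLAR_WORD = re.compile(r'\$\w+')

def collect_related(lines, seed):
    """Return every $word reachable from `seed` through shared lines."""
    known = set(seed)
    changed = True
    while changed:                      # repeat until a full pass adds nothing new
        changed = False
        for line in lines:
            words = set(DOLLAR_WORD.findall(line))
            # A line is relevant if it mentions a known word and brings a new one.
            if words & known and not words <= known:
                known |= words
                changed = True
    return known

lines = [
    "number $1one bla bla $2second",
    "another bla bla $1one bla $3third bla $2second",
    "another bla bla bla $5fifth bla $4fourth",
    "$3third bla bla bla",
]
related = collect_related(lines, {"$1one"})
print(sorted(related))  # ['$1one', '$2second', '$3third']
```

As in the answer's output, $4fourth and $5fifth are left out because no line connects them to an already-known word.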