Python regex: don't match a word-boundary pattern between parentheses

Date: 2014-03-24 09:37:02

Tags: python regex boundary

I currently have:

 regex = r'\bon the\b'

But I need my regex to match only when this keyword (here, "on the") is not between parentheses in the text:

Should match:

john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)

Should not match:

(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)

3 answers:

Answer 0: (score: 0)

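On UNIX, the grep utility with the following regular expressions is enough. In grep's default (basic) syntax a literal parenthesis is written unescaped, because \( and \) delimit a group:

grep " on the " input_file_name | grep -v "(.* on the .*)"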
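For readers working in Python rather than the shell, the same two-stage filter can be sketched as below. The helper name `grep_like_filter` is made up for illustration, and the sample lines are the ones from the question. Note that in Python's `re` syntax a literal parenthesis is written `\(`.

```python
import re

def grep_like_filter(lines):
    """Two-stage filter mirroring `grep ... | grep -v ...` (hypothetical helper)."""
    keep = re.compile(r" on the ")           # stage 1: line must contain the phrase
    drop = re.compile(r"\(.* on the .*\)")   # stage 2: phrase enclosed by ( ... )
    return [ln for ln in lines if keep.search(ln) and not drop.search(ln)]

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]
for ln in grep_like_filter(lines):
    print(ln)  # prints the first four lines only
```

Because stage 1 requires a space on both sides of the phrase, ")on the" is filtered out as a side effect, not by an explicit rule.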

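Answer 1: (score: 0)

Something like this: ^(.*)(?:\(.*\))(.*)$ (see it in action)

As per your request, it "matches only the words that are not between parentheses in the text".

So, from:

some text (more text in parentheses) and some not in parentheses

it matches: some text + and some not in parentheses

More examples at the link above.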
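Since the linked demo is not reproduced here, a minimal Python check of that pattern on the sample sentence might look like this. The variable names are illustrative only.

```python
import re

text = "some text (more text in parentheses) and some not in parentheses"

# Group 1 captures what precedes the parenthesised part, group 2 what follows;
# the parenthesised part itself sits in a non-capturing group.
m = re.match(r"^(.*)(?:\(.*\))(.*)$", text)
before, after = m.groups()

print(repr(before))  # 'some text '
print(repr(after))   # ' and some not in parentheses'
```

The pattern only matches lines that contain a pair of parentheses, so a line without any would give `None` from `re.match`.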


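Edit: changed the answer since the question changed.

To capture all the mentions that are not in parentheses, I would use a bit of code rather than one huge regex.

Something like this will get you close: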

import re

pattern = r"(on the)"

test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''

match_list = test_text.split('\n')

for line in match_list:
    print(line, "->", end=" ")

    # remove everything between ()
    bracket_pattern = r"(\(.*\))"
    brackets = re.findall(bracket_pattern, line)
    for match in brackets:
        line = line.replace(match, "")

    # search what is left of the line
    matches = re.findall(pattern, line)
    print(" ".join(matches))

Output:

john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach ->
bob is at the pool (berkeley) ->
the spon (is on the table) ->
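One possible way to cover the remaining case is to collapse each parenthesised group to a bare ")" instead of deleting it, so that a hit touching a closing parenthesis can still be recognised and rejected. This is only a sketch: the helper name `hits_outside_parens` is invented, it assumes parentheses are not nested, and it assumes a hit directly after ")" should be rejected, as the question's list suggests.

```python
import re

def hits_outside_parens(line, phrase="on the"):
    """Return the phrase hits that are outside parentheses (hypothetical helper)."""
    # Collapse every "(...)" group to ")" so the following text still shows
    # that it was adjacent to a closing parenthesis. Assumes no nesting.
    collapsed = re.sub(r"\([^()]*\)", ")", line)
    # Negative look-behind: skip a hit that directly follows ")".
    return re.findall(r"(?<!\))\b" + re.escape(phrase) + r"\b", collapsed)

print(hits_outside_parens("he (my son) is on the beach"))  # ['on the']
print(hits_outside_parens("(my son is )on the beach"))     # []
print(hits_outside_parens("the spon (is on the table)"))   # []
```

Using `[^()]*` rather than a greedy `.*` also keeps two separate groups on one line from being merged into a single removal.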

Answer 2: (score: 0)

I don't think a regex can help you in the general case. For your examples, this regex will behave as you asked:

(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])

Description:

(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below 
                 can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below 
                can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally

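If you want to generalize the problem to any string between the parenthesis and the string you are searching for, this regex will not work. The problem is the length of the text between the parenthesis and the search string: in a regex, a look-behind is not allowed to use an unbounded (variable-length) quantifier.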
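In Python's standard `re` module this limitation shows up at compile time. The short check below illustrates it; the helper `compiles` is a made-up name, and the second pattern is only an example of what one would like to write.

```python
import re

def compiles(pattern):
    """Return True if `re` accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width look-behind (always exactly 4 characters): accepted.
print(compiles(r"(?<=[^()].{3})\bon the\b"))  # True

# Variable-width look-behind ("an opening parenthesis somewhere before"):
# rejected, because look-behind requires a fixed-width pattern.
print(compiles(r"(?<=\([^()]*)\bon the\b"))   # False
```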

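In my regex I used a positive look-ahead and a positive look-behind. The same result can be achieved with the negative versions, but the problem remains.

Suggestion: write a small piece of Python code that checks a whole line for the text outside parentheses, because a regex alone cannot do the job.

Example: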

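import re

mystr = 'on the'
# sample lines taken from the question; replace with your own list
mylist = ['he (my son) is on the beach', 'the spon (is on the table)']
# the un-wanted forms are easy to define with a regex: the string inside
# parentheses, or the string right after a closing parenthesis
unwanted = re.compile(r'\(.*' + re.escape(mystr) + r'.*\)|\)' + re.escape(mystr))
# drop the un-wanted lines (build a new list instead of removing while iterating)
mylist = [line for line in mylist if not unwanted.search(line)]
# look for what you want
for line in mylist:
    if mystr in line:
        print(line)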

Where:

mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
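If nested parentheses can occur, the whole-line check suggested above could also be written as a single pass that tracks nesting depth. This is a rough sketch, not part of the original answer: the function name `matches_outside_parens` is invented, and rejecting a hit directly after ")" is an assumption taken from the question's list.

```python
import re

def matches_outside_parens(line, phrase="on the"):
    """Return phrase hits found at parenthesis depth 0 (hypothetical helper)."""
    # depth_at[i] is the nesting depth in effect just before line[i]
    depth_at, depth = [], 0
    for ch in line:
        depth_at.append(depth)
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ")"
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return [
        m.group()
        for m in re.finditer(pattern, line)
        # keep hits outside all parentheses that do not touch a ")"
        if depth_at[m.start()] == 0 and line[m.start() - 1:m.start()] != ")"
    ]

print(matches_outside_parens("he (my son) is on the beach"))  # ['on the']
print(matches_outside_parens("a (b (c) on the d) e"))         # []
```

The regex is used only to locate candidate hits; the decision about parentheses is made by the depth counter, which is what a look-behind cannot express.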

Hope this helps.