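How to extract patterns from a file using regular expressions in Python

Time: 2014-07-14 12:11:55

Tags: python regex

I have an input file like the one below and need to extract the entries that start with nsubj, rcmod, ccomp, or acomp, then write two output files as shown below. I am new to Python and have not figured out how to use regular expressions here.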

Input file

nsubj(believe-4, i-1)
aux(believe-4, ca-2)
neg(believe-4, n't-3)
root(ROOT-0, believe-4)
acomp(believe-4, @mistamau-5)
aux(know-8, does-6)
neg(know-8, n't-7)
ccomp(@mistamau-5, know-8)
dobj(is-12, who-9)
amod(tatum-11, channing-10)
nsubj(is-12, tatum-11)
ccomp(know-8, is-12)
root(ROOT-0, What-1)
cop(What-1, is-2)
amod(people-4, worse-3)
xsubj(hear-9, I-5)
aux(talking-7, am-6)
rcmod(people-4, talking-7)
xcomp(talking-7, hear-9)
dobj(hear-9, me-10)
advmod(poorly-12, very-11)

Output file_1

nsubj(believe-4, i-1)
nsubj(is-12, tatum-11)
acomp(believe-4, @mistamau-5)
rcmod(people-4, talking-7)
ccomp(know-8, is-12)
ccomp(@mistamau-5, know-8)

Output file_2

believe, i
is, tatum
believe, @mistamau
people, talking
know, is
@mistamau, know

2 answers:

Answer 0 (score: 0)

This is a program that reads words from stdin and prints 'matched' or 'not matched' depending on whether the word starts with 'Big' or 'Daddy'.

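import re
import sys

# Accept input that starts with 'Big' or 'Daddy', then any lowercase letters.
prog = re.compile(r'(?:Big|Daddy)[a-z]*')

# Read stdin line by line until end of input.
for line in sys.stdin:
    # re.match only succeeds at the start of the string.
    if prog.match(line):
        print('matched')
    else:
        print('not matched')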

Just replace the regular expression with your own pattern and read from your file instead of standard input, and you should be all set.
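
A minimal sketch of that adaptation, assuming the four relation names from the question. The helper name matching_lines and the sample list are made up for illustration; they are not part of the original answer.

```python
import re

# Relation names taken from the question; adjust as needed.
PREFIXES = ("nsubj", "rcmod", "ccomp", "acomp")

# re.match anchors at the start of the string, so a line is accepted only
# if it begins with one of the names immediately followed by "(".
prog = re.compile(r"(?:%s)\(" % "|".join(PREFIXES))


def matching_lines(lines):
    """Yield each line (without its trailing newline) that starts with a wanted relation."""
    for line in lines:
        if prog.match(line):
            yield line.rstrip("\n")


# Small in-memory sample instead of stdin.
sample = [
    "nsubj(believe-4, i-1)\n",
    "aux(believe-4, ca-2)\n",
    "rcmod(people-4, talking-7)\n",
]
for hit in matching_lines(sample):
    print(hit)
```

An open file object iterates over its lines, so it can be passed to matching_lines in place of the sample list.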

Answer 1 (score: 0)

import re

regex = re.compile(r"""
    ^          # Start of line (re.M modifier set!)
    (          # Start of capturing group 1:
     (?:nsubj|rcmod|ccomp|acomp) # Match one of these
     \(        # Match (
     ([^-]*)   # Match and capture in group 2 any no. of non-dash characters
     -\d+,[ ]  # Match a dash and a number, a comma and a space
     ([^-]*)   # Match and capture in group 3 any no. of non-dash characters
     -\d+      # Match a dash and a number
     \)        # Match )
    )          # End of group 1""", re.M|re.X)
If I understand your requirements correctly, the regex above should work.

When applied to the entire file (s = myfile.read()), you get the following result:

>>> regex.findall(s)
[('nsubj(believe-4, i-1)', 'believe', 'i'), 
 ('acomp(believe-4, @mistamau-5)', 'believe', '@mistamau'), 
 ('ccomp(@mistamau-5, know-8)', '@mistamau', 'know'), 
 ('nsubj(is-12, tatum-11)', 'is', 'tatum'), 
 ('ccomp(know-8, is-12)', 'know', 'is'), 
 ('rcmod(people-4, talking-7)', 'people', 'talking')]
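
To go from these tuples to the two output files, something along the following lines might work. This is only a sketch: the function names extract and write_outputs and the path parameters are hypothetical, and the results are written in file order, not grouped by relation as in the question's sample output.

```python
import re

# Same pattern as in the answer above, written more compactly.
regex = re.compile(r"""
    ^(                              # group 1: the whole dependency
     (?:nsubj|rcmod|ccomp|acomp)    # one of the wanted relations
     \(([^-]*)-\d+,[ ]              # group 2: first word, then -N, comma, space
     ([^-]*)-\d+\)                  # group 3: second word, then -N and ")"
    )""", re.M | re.X)


def extract(text):
    """Return (full_matches, word_pairs) for every wanted dependency in text."""
    matches = regex.findall(text)
    full = [m[0] for m in matches]
    pairs = ["%s, %s" % (m[1], m[2]) for m in matches]
    return full, pairs


def write_outputs(in_path, out1_path, out2_path):
    """Read in_path and write one item per line to each output file."""
    with open(in_path) as infile:
        full, pairs = extract(infile.read())
    with open(out1_path, "w") as out1:
        out1.writelines(item + "\n" for item in full)
    with open(out2_path, "w") as out2:
        out2.writelines(item + "\n" for item in pairs)


# In-memory demonstration with three lines from the question's input.
sample = "nsubj(believe-4, i-1)\naux(believe-4, ca-2)\nacomp(believe-4, @mistamau-5)\n"
full, pairs = extract(sample)
print(full)
print(pairs)
```

Calling write_outputs with the input path and two output paths of your choice would then produce the two files.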