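I have an input file like the one below. I need to extract the entries that start with nsubj, rcmod, ccomp, or acomp and write two output files, as shown. I'm new to Python and can't work out how to use regular expressions here.
Input file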
nsubj(believe-4, i-1)
aux(believe-4, ca-2)
neg(believe-4, n't-3)
root(ROOT-0, believe-4)
acomp(believe-4, @mistamau-5)
aux(know-8, does-6)
neg(know-8, n't-7)
ccomp(@mistamau-5, know-8)
dobj(is-12, who-9)
amod(tatum-11, channing-10)
nsubj(is-12, tatum-11)
ccomp(know-8, is-12)
root(ROOT-0, What-1)
cop(What-1, is-2)
amod(people-4, worse-3)
xsubj(hear-9, I-5)
aux(talking-7, am-6)
rcmod(people-4, talking-7)
xcomp(talking-7, hear-9)
dobj(hear-9, me-10)
advmod(poorly-12, very-11)
Output file_1
nsubj(believe-4, i-1)
nsubj(is-12, tatum-11)
acomp(believe-4, @mistamau-5)
rcmod(people-4, talking-7)
ccomp(know-8, is-12)
ccomp(@mistamau-5, know-8)
Output file_2
believe, i
is, tatum
believe, @mistamau
people, talking
know, is
@mistamau, know
Answer 0 (score: 0)
Here is a program that reads words from stdin and prints 'matched' or 'not matched', depending on whether the word starts with 'Big' or 'Daddy'.
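import re
import sys

# Matches input that starts with 'Big' or 'Daddy', followed by lowercase letters
prog = re.compile('((Big)|(Daddy))[a-z]*')
while True:
    line = sys.stdin.readline()
    if not line:
        break  # an empty string means end of input
    if prog.match(line):
        print('matched')
    else:
        print('not matched')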
Just replace the regular expression with your own pattern and read from your file instead of standard input, and you should be all set.
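As a rough sketch of that adaptation (not part of the original answer), the pattern below uses the four relation names from the question and filters a list of lines rather than stdin. The function name `filter_lines` and the `sample` data are made up for illustration.

```python
import re

# Relation names taken from the question. The trailing \( ensures that a
# longer name such as "nsubjpass(" is not accepted as "nsubj".
prefix = re.compile(r'(?:nsubj|rcmod|ccomp|acomp)\(')

def filter_lines(lines):
    """Return the lines that start with one of the wanted relations."""
    # re.match only matches at the start of the string, so no ^ is needed.
    return [line.rstrip('\n') for line in lines if prefix.match(line)]

# Hypothetical sample; with a real file you could pass the open file object.
sample = [
    "nsubj(believe-4, i-1)\n",
    "aux(believe-4, ca-2)\n",
    "rcmod(people-4, talking-7)\n",
]
print(filter_lines(sample))
```

Because an open file yields its lines when iterated, `filter_lines(open_file)` should work the same way as with the list above.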
Answer 1 (score: 0)
import re

regex = re.compile(r"""
    ^                            # Start of line (re.M modifier set!)
    (                            # Start of capturing group 1:
    (?:nsubj|rcmod|ccomp|acomp)  # Match one of these
    \(                           # Match (
    ([^-]*)                      # Match and capture in group 2 any no. of non-dash characters
    -\d+,[ ]                     # Match a dash and a number, a comma and a space
    ([^-]*)                      # Match and capture in group 3 any no. of non-dash characters
    -\d+                         # Match a dash and a number
    \)                           # Match )
    )                            # End of group 1""", re.M|re.X)
This should work, if I understand your requirements correctly.
When applied to the entire file (s = myfile.read()), you get the following results:
>>> regex.findall(s)
[('nsubj(believe-4, i-1)', 'believe', 'i'),
('acomp(believe-4, @mistamau-5)', 'believe', '@mistamau'),
('ccomp(@mistamau-5, know-8)', '@mistamau', 'know'),
('nsubj(is-12, tatum-11)', 'is', 'tatum'),
('ccomp(know-8, is-12)', 'know', 'is'),
('rcmod(people-4, talking-7)', 'people', 'talking')]
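The question also asks for two output files, which the answer does not show. One possible way to get there from the `findall` tuples is sketched below; it uses a compact, non-verbose form of the same pattern. The helper names and the file names `file_1.txt` and `file_2.txt` are assumptions for illustration. Lines are written in the order they appear in the input, which differs from the ordering in the question's sample output.

```python
import re

# Compact form of the verbose pattern above (same groups, same flags minus re.X).
regex = re.compile(
    r"^((?:nsubj|rcmod|ccomp|acomp)\(([^-]*)-\d+, ([^-]*)-\d+\))", re.M)

def build_outputs(text):
    """Return (full matching lines, 'word1, word2' pairs) in input order."""
    matches = regex.findall(text)
    full_lines = [full for full, _, _ in matches]
    word_pairs = ["%s, %s" % (first, second) for _, first, second in matches]
    return full_lines, word_pairs

def write_outputs(text, path_1="file_1.txt", path_2="file_2.txt"):
    """Write the two lists to two files (hypothetical file names)."""
    full_lines, word_pairs = build_outputs(text)
    with open(path_1, "w") as out_1:
        out_1.write("\n".join(full_lines) + "\n")
    with open(path_2, "w") as out_2:
        out_2.write("\n".join(word_pairs) + "\n")

# Small sample taken from the question's input.
sample = (
    "nsubj(believe-4, i-1)\n"
    "aux(believe-4, ca-2)\n"
    "acomp(believe-4, @mistamau-5)\n"
)
print(build_outputs(sample))
```

Calling `write_outputs(s)` on the full file contents would then produce both files in one pass.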