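I recently started learning regular expressions and I'm stuck on this project. I'm trying to convert a text file to XML. The text file contains:
Name: Alex Company: X
Name: Braun Company Y
The required XML output should look like this: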
from intervaltree import IntervalTree
from collections import defaultdict

binding_factor = 'some.txt'
genome = dict()
with open('file2', 'r') as rows:
    for row in rows:
        #print row
        if row.startswith('>'):
            row = row.strip().split('|')
            chrom_name = row[5]
            start = int(row[3])
            end = int(row[3])
            # one interval tree per chromosome
            if chrom_name not in genome:
                # first time we've encountered this chromosome, create tree
                genome[chrom_name] = IntervalTree()
            # index the feature
            genome[chrom_name].addi(start, end, row[2])
#for key,value in genome.iteritems():
    #print key, ":", value

mast = defaultdict(list)
with open('file1', 'r') as f:
    for row in f:
        row = row.strip().split()
        # normalise chromosome names: drop the 'chr' prefix, map 'M' to 'MT'
        row[0] = row[0].replace('chr', '') if row[0].startswith('chr') else row[0]
        row[0] = 'MT' if row[0] == 'M' else row[0]
        #print row[0]
        mast[row[0]].append({
            'start': int(row[1]),
            'end': int(row[2])
        })
#for k,v in mast.iteritems():
    #print k, ":", v

with open(binding_factor, 'w') as f:
    for k, v in mast.iteritems():
        for i in v:
            # look up every indexed feature overlapping this region
            g = genome[k].search(i['start'], i['end'])
            if g:
                print g
                f.write(str(g) + '\n')
I have tried many times, and so far my regex code looks like this:
<celldata>
<name>Braun</name>
<company>Y</company>
</celldata>
After running this, the result I get is displayed as:
rex = re.compile(r'''(?P<title>Name
|Company)
\s*:?\s*
(?P<value>.*)
''',re.VERBOSE)
Please tell me how to do this, because I'm stuck. I don't know what regex pattern would let me produce the XML structure I want.
Answer 0: (score: 0)
$ cat data
Name: Alex Company: X
Name: Braun Company Y
$ cat p.py
import re
with open('data', 'r') as f:
    for line in f:
        print(re.sub(r'^\s*Name\s*:?\s*(.*)Company\s*:?\s*(.*)$', "<celldata><name>\\1</name><company>\\2</company></celldata>", line.strip()))
$ python3 p.py
<celldata><name>Alex </name><company>X</company></celldata>
<celldata><name>Braun </name><company>Y</company></celldata>
$
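The output above keeps a trailing space inside `<name>` because the greedy `(.*)` swallows the whitespace before `Company`, and any `&` or `<` in the input would be copied into the XML unescaped. The following is only a sketch of one way to tighten this, not the answer author's code: the helper name `to_xml` is hypothetical, and it assumes every line follows the `Name ... Company ...` layout shown in the question.

```python
import re
from xml.sax.saxutils import escape

# Lazy (.*?) followed by \s+ keeps the whitespace before "Company"
# out of the captured name.
PATTERN = re.compile(r'^\s*Name\s*:?\s*(.*?)\s+Company\s*:?\s*(.*)$')


def to_xml(line):
    """Hypothetical helper: turn one input line into a <celldata> element.

    Returns None when the line does not match the expected layout.
    """
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    # Escape &, < and > so the captured text is safe inside XML elements.
    name, company = (escape(group) for group in m.groups())
    return "<celldata><name>{}</name><company>{}</company></celldata>".format(
        name, company)


for sample in ["Name: Alex Company: X", "Name: Braun Company Y"]:
    print(to_xml(sample))
```

Using `match` instead of `re.sub` also makes lines that do not fit the pattern easy to detect, since `re.sub` would pass them through unchanged.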
Answer 1: (score: 0)
Try something like
rex = re.compile(r'''
^Name:?
\s*
(?P<name>\w+)
\s+
Company:?
\s*
(?P<company>\w+)
$
''',re.VERBOSE)
If there can be whitespace before the colon, I would use [\s:]* instead
(even though technically that would also match multiple colons...)
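As a rough illustration of the `[\s:]*` variant, the sketch below swaps it in after both keywords. The second sample line, with a space before the colon, is made up for the demonstration and does not come from the question's data.

```python
import re

rex = re.compile(r'''
    ^Name[\s:]*          # keyword, then any mix of whitespace and colons
    (?P<name>\w+)
    \s+
    Company[\s:]*        # same tolerance after the second keyword
    (?P<company>\w+)
    $
''', re.VERBOSE)

# The second sample (space before the colon) is an invented test case.
samples = ["Name: Alex Company: X", "Name : Braun Company Y"]
results = [rex.match(s).groupdict() for s in samples]
print(results)
```

Both samples should match, and each dictionary in `results` holds the `name` and `company` groups for one line.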
Usage is typically something like:
for line in lines:
    m = rex.match(line)
    if m:
        output.write("""
<celldata>
    <name>{name}</name>
    <company>{company}</company>
</celldata>
""".format(**m.groupdict()))