Applying re.compile to a string to parse the desired result

Time: 2016-04-29 05:26:07

Tags: python regex xml text-files

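I recently started learning regular expressions and got stuck on this project. I am trying to convert a text file to XML. The text file contains:

Name: Alex Company: X

Name: Braun Company Y

The desired XML result should look like this: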

    <celldata>
    <name>Braun</name>
    <company>Y</company>
    </celldata>

I have tried many times, and so far my regex code looks like this:

    rex = re.compile(r'''(?P<title>Name
        |Company)
        \s*:?\s*
        (?P<value>.*)
        ''',re.VERBOSE)

After running this, I still do not get the output I want.

Please tell me how to do this, because I am stuck. I don't know which regex pattern will produce the XML structure I want.

2 Answers:

Answer 0: (score: 0)

$ cat data
Name: Alex Company: X
Name: Braun Company Y
$ cat p.py 
import re

with open('data', 'r') as f:
    for line in f:
        print(re.sub(r'^\s*Name\s*:?\s*(.*)Company\s*:?\s*(.*)$', "<celldata><name>\\1</name><company>\\2</company></celldata>", line.strip()))
$ python3 p.py 
<celldata><name>Alex </name><company>X</company></celldata>
<celldata><name>Braun </name><company>Y</company></celldata>
$
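
Note that the output above keeps a trailing space after each name (`Alex `), because the greedy `(.*)` also captures the whitespace in front of `Company`. A minimal sketch of one way to avoid that, assuming a name never contains the word `Company`; the helper name `line_to_xml` is made up for illustration:

```python
import re

# The lazy (.*?) followed by \s* keeps the whitespace before "Company"
# outside the captured name.
PATTERN = re.compile(r'^\s*Name\s*:?\s*(.*?)\s*Company\s*:?\s*(.*)$')

def line_to_xml(line):
    """Hypothetical helper: turn one 'Name ... Company ...' line into a celldata element."""
    return PATTERN.sub(
        r'<celldata><name>\1</name><company>\2</company></celldata>',
        line.strip(),
    )

for sample in ("Name: Alex Company: X", "Name: Braun Company Y"):
    print(line_to_xml(sample))
```

Lines that do not match the pattern are returned unchanged by `sub`, so it may be worth checking for a match first if the input file can contain other content.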

Answer 1: (score: 0)

Try something like:
rex = re.compile(r'''
    ^Name:?
    \s*
    (?P<name>\w+)
    \s+
    Company:?
    \s*
    (?P<company>\w+)
    $
    ''',re.VERBOSE)

If there can be whitespace before the `:`, I would use `[\s:]*` (even though that would technically match multiple colons...)
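
A quick sketch of that looser variant, to show what it accepts. The name `rex_loose` and the sample lines are made up for illustration:

```python
import re

# Same shape as the pattern above, but with [\s:]* as the separator,
# so any mix of whitespace and colons is accepted after each keyword.
rex_loose = re.compile(r'''
    ^Name[\s:]*
    (?P<name>\w+)
    \s+
    Company[\s:]*
    (?P<company>\w+)
    $
    ''', re.VERBOSE)

for line in ("Name: Alex Company: X", "Name : Braun Company Y", "Name:: Eve Company : Z"):
    m = rex_loose.match(line)
    print(line, "->", m.groupdict() if m else None)
```

All three sample lines match, including the one with a doubled colon, which is the trade-off mentioned above.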

Usage would typically look something like this:

for line in lines:
    m = rex.match(line)
    if m:
        output.write("""
        <celldata>
          <name>{name}</name>
          <company>{company}</company>
        </celldata>
        """.format(**m.groupdict()))
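
The snippet above leaves `lines` and `output` undefined. A self-contained sketch of the same idea follows, assuming the input is any iterable of text lines and the output is any writable text stream. The `convert` helper, the `TEMPLATE` constant, and the use of `io.StringIO` as a stand-in for a real file are illustrative choices, not part of the original answer:

```python
import io
import re
from xml.sax.saxutils import escape

rex = re.compile(r'''
    ^Name:?
    \s*
    (?P<name>\w+)
    \s+
    Company:?
    \s*
    (?P<company>\w+)
    $
    ''', re.VERBOSE)

TEMPLATE = (
    "<celldata>\n"
    "  <name>{name}</name>\n"
    "  <company>{company}</company>\n"
    "</celldata>\n"
)

def convert(lines, output):
    """Hypothetical helper: write one celldata element per matching line."""
    for line in lines:
        m = rex.match(line.strip())
        if m:
            # escape() only matters if the groups are widened beyond \w+,
            # since \w+ cannot capture '<', '>' or '&'.
            fields = {key: escape(value) for key, value in m.groupdict().items()}
            output.write(TEMPLATE.format(**fields))

output = io.StringIO()  # stand-in for a real output file
convert(["Name: Alex Company: X\n", "Name: Braun Company Y\n"], output)
print(output.getvalue())
```

To work with real files, `convert` could be called with two objects returned by `open()`, one opened for reading and one for writing.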