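Python: go to the next line and save/edit its content

Time: 2011-10-20 05:14:40

Tags: python string line next

This code builds on one from an earlier post. I am trying to adapt it to our data, but it doesn't work. Here is an example of our file: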

read:1424:2165 TGACCA/1:2165 TGACCA/2 
1..100  +chr1:3033296..3033395 #just this line
1..100  -chr1:3127494..3127395  
1..100  +chr1:3740372..3740471  

1 concordant    read:1483:2172 TGACCA/1:2172 TGACCA/2 
1..100  -chr7:94887644..94887545 #and just this line

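The code should do the following:

  1. Go through every line
  2. Identify the string 'read:'
  3. Go to the next line and extract the content of "+chr:number..number", just once! Then search for the next 'read:', and so on...
  4. So if several "-chr:no..no" lines follow a 'read:', only the first one is needed.

Unfortunately I can't figure out how to make it work...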

        import re
    
        infile='myfile.txt'
        outfile='outfile.txt'
    
        pat1 = re.compile(r'read:')
        pat2 = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')
    
        with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
            for line in in_f.readlines():
                if '\t' not in line.rstrip():
                    continue
                a = pat1.search(line)
                if a:
                m = pat2.search(line)
                out_f.write(' '.join(m.groups()) + '\n')
                if not a:
                    continue
    

The output should look like this:

      1 3033296 3033395
      7 94887644 94887545
    

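Could someone please throw me a bone?

Update, following the answer below

OK, I'm posting the slightly modified version of Tim McNamara's code that I used. It runs fine, but the output doesn't pick up two-digit numbers after "chr", and it prints a string after the last number.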

    with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
        lines = [line for line in in_f.readlines()]
        for i, line in enumerate(lines):
            if 'read' in line:
                data = lines[i+1].replace(':', '..').split('..')
                try:
                    out_f.write('{} {} {}\n'.format(data[1][-1], data[2], data[3]))  # I tried removing data[3] to keep "start" out of the output file, but that didn't work
                except IndexError:
                    continue
    

Here is the output I get with this code:

    6 140302505 140302604 start  # 'start' is a string in our data after this number
    5 46605561 46605462 start    # I don't understand why it grabs it though...
    5 46605423 46605522 start    # I tried to modify the code to avoid this, but ... didn't work out
    6 29908310 29908409 start
    6 29908462 29908363 start
    4 12712132 12712231 start
    

How can I fix these two errors?

1 answer:

Answer 0 (score: 1)

Your biggest mistake is that you need to call readlines on 'in_f' first, so that you have a list of lines to iterate over:

with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    for line in in_f.readlines():
        ...

That said, the whole snippet could be tidied up a bit:

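with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
    lines = in_f.readlines()
    for i, line in enumerate(lines):
        if 'read' in line:
            try:
                # Split the line that follows the match on ':' and '..'
                data = lines[i+1].replace(':', '..').split('..')
                a = data[1].split('chr')[-1]  # everything after 'chr', so two-digit names survive
                b = data[2]
                c = data[3].split()[0]  # keep only the number, dropping trailing text such as 'start'
                out_f.write('{} {} {}\n'.format(a, b, c))
            except IndexError:
                # No following line, or it is not in the expected format
                pass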
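
If you would rather keep the regular expression from the question, the same idea can be sketched without indexing into a list. This is only a rough sketch under a few assumptions: the helper name `first_hits` and the `sample` list are made up for illustration, and it looks only at the line immediately after each line containing 'read:'. The capture groups return the whole chromosome name and only the digits of the end position, so two-digit names are kept and trailing text such as 'start' is left out.

```python
import re

# Pattern taken from the question: strand, chromosome name, start, end
COORD = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')


def first_hits(lines):
    """Yield (chromosome, start, end) from the line right after each 'read:' line.

    Hypothetical helper: `lines` can be any iterable of strings.
    """
    take_next = False  # True when the previous line contained 'read:'
    for line in lines:
        if take_next:
            m = COORD.search(line)
            if m:
                _strand, chrom, start, end = m.groups()
                yield chrom, start, end
        # Only the line directly after a 'read:' line is inspected
        take_next = 'read:' in line


# Sample lines copied from the question (assumed to be representative)
sample = [
    "read:1424:2165 TGACCA/1:2165 TGACCA/2 ",
    "1..100  +chr1:3033296..3033395 #just this line",
    "1..100  -chr1:3127494..3127395  ",
    "1..100  +chr1:3740372..3740471  ",
    "",
    "1 concordant    read:1483:2172 TGACCA/1:2172 TGACCA/2 ",
    "1..100  -chr7:94887644..94887545 #and just this line",
]

for chrom, start, end in first_hits(sample):
    print('{} {} {}'.format(chrom, start, end))
# prints:
# 1 3033296 3033395
# 7 94887644 94887545
```

Since an open file yields its lines when iterated, `first_hits(in_f)` should also accept the file object directly, and each tuple can then be written with `out_f.write` as in the code above.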