Separating a file based on a matching character

Time: 2015-07-16 12:39:06

Tags: python python-2.7

  ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C
  ATOM    857  CE BLYS A 104       0.984  -0.018  26.394  0.46 31.19           C
  ATOM    858  NZ ALYS A 104       1.988   0.923  26.662  0.54 33.17           N
  ATOM    859  NZ BLYS A 104       1.708   0.302  27.659  0.46 37.61           N
  ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O
  ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N
  ATOM    863  CA  LYS B 276      16.764 -16.524  11.005  1.00 31.05           C
  ATOM    864  C   LYS B 276      16.428 -15.306  11.884  1.00 26.93           C
  ATOM    865  O   LYS B 276      16.258 -15.447  13.090  1.00 29.67           O
  ATOM    866  CB  LYS B 276      17.863 -17.347  11.617  1.00 33.62           C

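I have the text file above and need to make two text files from it based on the differences in column 21. I wrote a script that prints the desired result, but only because I already know which character to look for. What should I do if I don't know what the characters in column 21 are? Below is the script I tried. Suppose I don't know whether column 21 holds "A" and "B", or "B" and "G", or any other combination, and the file still needs to be separated on the basis of column 21. How can I do that?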

  import sys

  for fn in sys.argv[1:]:
     f=open(fn,'r')

     while 1:
        line=f.readline()
        if not line: break
        if line[21:22] == 'B':
           chns = line[0:80]
           print chns

4 Answers:

Answer 0 (score: 1)

Use str.split and compare the 5th element (which is the character in column 21):

while 1:
    line = f.readline()
    if not line: 
        break

    # get character in 5th column
    ch = line.split()[4]
    if ch == 'B':
        chns = line[0:80]
        print chns
    else: # not sure what the character is
        pass # do something
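
The else branch above leaves open what to do when the character is not known in advance. One possible approach, shown here as a minimal sketch and not as part of the original answer, is to collect the distinct values first. The helper name distinct_chain_chars and the inline sample lines are made up for illustration. It uses the same whitespace split as the code above, which assumes the fields are always separated by spaces.

```python
def distinct_chain_chars(lines):
    """Return the distinct 5th whitespace-separated fields, in first-seen order."""
    seen = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or short lines
        if fields[4] not in seen:
            seen.append(fields[4])
    return seen


# Hypothetical sample: three lines taken from the question's data.
sample = [
    "ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C",
    "ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N",
    "ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O",
]
print(distinct_chain_chars(sample))
```

Once the distinct values are known, each one can be used in place of the hard-coded 'B' comparison.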

Answer 1 (score: 1)

You can initialize a value to None and check whether it changes:

import sys

for fn in sys.argv[1:]:
    old = None
    f=open(fn,'r')

    for line in f:
        if not line: break
        if (old is None) or (line[21] == old):
           old = line[21]
           chns = line[0:80]
           print chns
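
The loop above prints only the lines whose character matches the first line. Since the question asks for two files, the same idea could be extended to keep the non-matching lines as well. This is a sketch under the assumption that column 21 holds exactly two distinct values; with more than two, the second list would mix them. The function name split_by_first_char and the col parameter are illustrative and not from the answer.

```python
def split_by_first_char(lines, col=21):
    """Split lines into (matching, others) relative to the first line's character."""
    first = None
    matching, others = [], []
    for line in lines:
        ch = line[col:col + 1]
        if first is None:
            first = ch  # remember the character seen on the first line
        if ch == first:
            matching.append(line)
        else:
            others.append(line)
    return matching, others


# Hypothetical sample: three lines taken from the question's data.
sample = [
    "ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C",
    "ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N",
    "ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O",
]
same, other = split_by_first_char(sample)
print(len(same))
print(len(other))
```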

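Answer 2 (score: 1)

I'm not sure what you are trying to achieve, but the code below collects the lines of all files in the dictionary lines, grouped by their 21st character.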

import sys

lines = dict()
for fn in sys.argv[1:]:
    f = open(fn,'r')

    for line in f:
        if not line:
            break
        key = line.split()[4]
        if key not in lines.keys():
            lines[key] = list()
        lines[key].append(line)

You can then use lines.keys() to get every 21st character that occurs, and look up each key in the dictionary to get the list of all corresponding lines.
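
To get from the dictionary to separate files, each list can be written out under a name derived from its key. The following is a minimal sketch, not the answer's own code. It slices the fixed column as the question does instead of using split, and the names group_by_column, write_groups, the "chain_" prefix and the temporary output directory are all assumptions made for illustration.

```python
import os
import tempfile
from collections import defaultdict


def group_by_column(lines, col=21):
    """Map each character found at index `col` to the list of lines carrying it."""
    groups = defaultdict(list)
    for line in lines:
        groups[line[col:col + 1]].append(line)
    return groups


def write_groups(groups, directory, prefix="chain_"):
    """Write one file per key and return a dict mapping key to file path."""
    paths = {}
    for key, members in groups.items():
        path = os.path.join(directory, prefix + key + ".txt")
        with open(path, "w") as out:
            out.writelines(members)
        paths[key] = path
    return paths


# Hypothetical sample: three lines taken from the question's data.
sample = [
    "ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C\n",
    "ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N\n",
    "ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O\n",
]
out_dir = tempfile.mkdtemp()
paths = write_groups(group_by_column(sample), out_dir)
print(sorted(paths))
```

Because the keys are discovered while reading, no character needs to be known beforehand, and the input does not have to be sorted.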

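Answer 3 (score: 1)

  • Store the 21st character of the previous line, then print a blank line whenever the current line's character does not match it (a mismatch means another group begins). This prints the lines grouped by their 21st character.

  • Note that this only groups lines with a matching 21st character according to their sequence in the file, which means unsorted lines will produce several separate groups for the same 21st character.

    Modified file to show this case: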

    ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C
    ATOM    857  CE BLYS A 104       0.984  -0.018  26.394  0.46 31.19           C
    ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N
    ATOM    863  CA  LYS B 276      16.764 -16.524  11.005  1.00 31.05           C
    ATOM    864  C   LYS B 276      16.428 -15.306  11.884  1.00 26.93           C
    ATOM    865  O   LYS B 276      16.258 -15.447  13.090  1.00 29.67           O
    ATOM    866  CB  LYS B 276      17.863 -17.347  11.617  1.00 33.62           C
    ATOM    858  NZ ALYS A 104       1.988   0.923  26.662  0.54 33.17           N
    ATOM    859  NZ BLYS A 104       1.708   0.302  27.659  0.46 37.61           N
    ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O
    

    Code that produces this case (the lines are not sorted):

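    import sys

    for fn in sys.argv[1:]:

        with open(fn,'r') as file:
            # None never equals a real character, so the first line
            # starts a group without printing a leading blank line
            prev = None
            for line in file:
                line = line.strip()
                if prev is not None and line[21:22] != prev:
                    # blank line as separator between groups
                    print ''
                print line
                prev = line[21:22]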
    

    Sample output showing this case:

    ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C
    ATOM    857  CE BLYS A 104       0.984  -0.018  26.394  0.46 31.19           C
    
    ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N
    ATOM    863  CA  LYS B 276      16.764 -16.524  11.005  1.00 31.05           C
    ATOM    864  C   LYS B 276      16.428 -15.306  11.884  1.00 26.93           C
    ATOM    865  O   LYS B 276      16.258 -15.447  13.090  1.00 29.67           O
    ATOM    866  CB  LYS B 276      17.863 -17.347  11.617  1.00 33.62           C
    
    ATOM    858  NZ ALYS A 104       1.988   0.923  26.662  0.54 33.17           N
    ATOM    859  NZ BLYS A 104       1.708   0.302  27.659  0.46 37.61           N
    ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O
    
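  • So if you want only one group for each identical 21st character, put all the lines in a list and sort it; list.sort() will do that.

    Code (sorts the lines before grouping):

    import sys

    for fn in sys.argv[1:]:

        with open(fn,'r') as file:

            lines = file.readlines()

            # creates a list of pairs (21st char, line) within a list
            lines = [ [line[21:22], line.strip() ] for line in lines ]

            # sorts lines based on key (21st char)
            lines.sort()

            # brings back list of lines to its original state,
            # but the order is not reverted since it is already sorted
            lines = [ line[1] for line in lines ]

            # None never equals a real character, so the first line
            # starts a group without printing a leading blank line
            prev = None
            for line in lines:
                if prev is not None and line[21:22] != prev:
                    # blank line as separator between groups
                    print ''
                print line
                prev = line[21:22]

    Output:

    ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C
    ATOM    857  CE BLYS A 104       0.984  -0.018  26.394  0.46 31.19           C
    ATOM    858  NZ ALYS A 104       1.988   0.923  26.662  0.54 33.17           N
    ATOM    859  NZ BLYS A 104       1.708   0.302  27.659  0.46 37.61           N
    ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O

    ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N
    ATOM    863  CA  LYS B 276      16.764 -16.524  11.005  1.00 31.05           C
    ATOM    864  C   LYS B 276      16.428 -15.306  11.884  1.00 26.93           C
    ATOM    865  O   LYS B 276      16.258 -15.447  13.090  1.00 29.67           O
    ATOM    866  CB  LYS B 276      17.863 -17.347  11.617  1.00 33.62           C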
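
The difference between grouping in file order and grouping after a sort can also be illustrated with itertools.groupby, which, like the prev comparison, only merges consecutive items with the same key. This is an alternative sketch and not the technique used in the answer; the helper names chain_char and consecutive_groups are made up for illustration. Note that sorted() with a key is stable, so lines keep their original order within each group, whereas the list.sort() on [key, line] pairs also orders by the line text.

```python
from itertools import groupby


def chain_char(line, col=21):
    """Return the character at index `col` (empty string for short lines)."""
    return line[col:col + 1]


def consecutive_groups(lines):
    """Group consecutive lines that share the same character at index 21."""
    return [(key, list(run)) for key, run in groupby(lines, key=chain_char)]


# Hypothetical sample in A, B, A order, taken from the question's data.
sample = [
    "ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C",
    "ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N",
    "ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O",
]

# Without sorting, the two A lines end up in separate groups.
unsorted_keys = [key for key, _ in consecutive_groups(sample)]
# After sorting by the key, each character forms a single group.
sorted_keys = [key for key, _ in consecutive_groups(sorted(sample, key=chain_char))]
print(unsorted_keys)
print(sorted_keys)
```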
    

Edit

Writing the grouped lines to different files does not actually require checking the previous line's value, because changing the file name according to the 21st character opens a different file and so separates the lines. Here, however, I used prev so that a previously created file with the same name is overwritten instead of appended to, since appending could leave the file contents messy or inconsistent.

import sys

for fn in sys.argv[1:]:
    with open(fn,'r') as file:
        lines = file.readlines()

    # creates a list of pairs (21st char, line) within a list
    lines = [ [line[21:22], line ] for line in lines ]

    # sorts lines based on key (21st char)
    lines.sort()

    # brings back list of lines to its original state,
    # but the order is not reverted since it is already sorted
    lines = [ line[1] for line in lines ]

    filename = 'file'
    prev = None
    out = None
    for line in lines:
        if line[21:22] != prev:
            # first line of a group: close the previous output file,
            # then create (or overwrite) the file for this group
            if out is not None:
                out.close()
            out = open(filename + line[21:22] + '.txt', 'w')

        # later lines of the group go to the file that is already open
        out.write(line)
        prev = line[21:22]

    if out is not None:
        out.close()

If appending to previously created files is not a problem, the file-writing part can be simplified. However, that risks writing to a file with the same name that was not created by the script, or that was created by the script during an earlier execution/session.
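
The simplified, append-only variant mentioned in this answer is not shown. A minimal sketch of what it might look like follows, assuming every write simply opens the target file in append mode ('a'). The function name append_by_column, the directory and prefix parameters, and the temporary directory are illustrative and not from the answer.

```python
import os
import tempfile


def append_by_column(lines, directory, prefix="file", col=21):
    """Append each line to the file named after its character at index `col`."""
    for line in lines:
        path = os.path.join(directory, prefix + line[col:col + 1] + ".txt")
        # 'a' creates the file if needed and never truncates existing content
        with open(path, "a") as out:
            out.write(line)


# Hypothetical sample: three lines taken from the question's data.
sample = [
    "ATOM    856  CE ALYS A 104       0.809   0.146  26.161  0.54 29.14           C\n",
    "ATOM    862  N   LYS B 276      17.010 -16.138   9.618  1.00 41.00           N\n",
    "ATOM    860  OXT LYS A 104      -0.726  -6.025  27.180  1.00 26.53           O\n",
]
out_dir = tempfile.mkdtemp()
append_by_column(sample, out_dir)
with open(os.path.join(out_dir, "fileA.txt")) as fh:
    first_run = fh.read()

# A second run appends again, which is the risk described above.
append_by_column(sample, out_dir)
with open(os.path.join(out_dir, "fileA.txt")) as fh:
    second_run = fh.read()
```

No sorting and no prev bookkeeping are needed here, at the cost of reopening a file for every line and of duplicated content when the script is run more than once.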