Question

I have a large file that I want to reformat in a specific way. Sample input:

DVL1    03220   NP_004412.2 VANGL2  02758   Q9ULK5  in vitro    12490194
PAX3    09421   NP_852124.1 MEOX2   02760   NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254.1  in vitro;in vivo    15195140

This is what I want it to become:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

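Summary:

If a line has 1 dot, the dot is removed along with the number after it and a \t is added, so the output line has only 6 tab-separated values
If a line has 2 dots, those dots are removed along with the numbers after them and a \t is added, so the output line has only 6 tab-separated values
If a line has no dots, keep the first 6 tab-separated values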

My idea at the moment is something like this:

for line in infile:
    if "." in line: # thought about this and a line.count('.') might be better, just wasn't capable to make it work
        transformed_line = line.replace('.', '\t', 2) # only replaces the dot; want to replace dot plus next first character
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n') # if i had a way to know the position of the dot(s), i could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:5]) + '\n') # this is fine

I hope I explained myself well. Thanks for your efforts.

Answer 1

import re
with open(filename,'r') as f:
    newlines=(re.sub(r'\.\d+','',old_line) for old_line in f)
    newlines=['\t'.join(line.split()[:6]) for line in newlines]

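Now you have a list of lines with the '.number' parts removed. As far as I can tell, your problem isn't constrained enough to do the whole thing with a single regular expression substitution, but it can be done in two steps: strip the suffixes, then keep the first six fields.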
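A minimal sketch of those two steps applied to in-memory lines, so it can be tried without a file. The function name `strip_versions` and the sample list are made up for illustration, and it assumes a version suffix is always a dot followed by digits.

```python
import re

def strip_versions(lines, keep=6):
    """Drop '.<digits>' suffixes, then keep the first `keep` whitespace-separated fields."""
    cleaned = []
    for line in lines:
        no_version = re.sub(r'\.\d+', '', line)   # step 1: remove '.2', '.1', ...
        fields = no_version.split()[:keep]        # step 2: keep the first six fields
        cleaned.append('\t'.join(fields))
    return cleaned

# Sample rows taken from the question (hypothetical variable name).
sample = [
    "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194",
    "PAX3\t09421\tNP_852124.1\tMEOX2\t02760\tNP_005915.2\tin vitro;yeast 2-hybrid\t11423130",
]
for row in strip_versions(sample):
    print(row)
```

Splitting on any whitespace is safe here only because the first six fields contain no spaces; the later fields ("in vitro") do, but they are discarded anyway.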

Answer 2

You could try something like this:

    with open('data1.txt') as f:
        for line in f:
            line=line.split()[:6]
            line=map(lambda x:x[:x.index('.')] if '.' in x else x,line)  # if an element contains '.', keep only the part
                                                                         # before the dot; otherwise keep it unchanged
            print('\t'.join(line))

Output:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

Edit

As @mgilson suggested, the line line=map(lambda x:x[:x.index('.')] if '.' in x else x,line) can simply be replaced with line=map(lambda x:x.split('.')[0],line)
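To check that the two forms agree, here is a small comparison on a few field values from the question. List comprehensions are used instead of `map` only so the results print directly in Python 3; the variable names are made up.

```python
fields = ["NP_004412.2", "Q9ULK5", "NP_001136254.1"]

# Original form: slice up to the first dot, if there is one.
with_index = [x[:x.index('.')] if '.' in x else x for x in fields]

# Suggested form: split on dots and keep the first piece.
with_split = [x.split('.')[0] for x in fields]

print(with_index)
print(with_split)
```

Both forms cut at the first dot no matter what follows it, so unlike a `\.\d+` pattern they would also truncate a field such as `abc.def`.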

Answer 3

I figured someone should do this with a single regular expression, so...

import re
beast_regex = re.compile(r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*')
with open('data.txt') as infile:
    for line in infile:
        match = beast_regex.match(line)
        print('\t'.join(match.groups()))
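One caveat: the pattern only matches when the text after the sixth field starts with `in`, so `match` is `None` for any other line and calling `.groups()` on it would raise an error. A hedged sketch that reuses the same regex and returns `None` for such lines; the helper name `first_six` and the second test line are hypothetical.

```python
import re

beast_regex = re.compile(
    r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*'
)

def first_six(line):
    """Return the six cleaned fields joined by tabs, or None if the line does not match."""
    match = beast_regex.match(line)
    if match is None:
        return None
    return '\t'.join(match.groups())

# A line from the question, then a made-up line whose seventh field does not start with 'in'.
print(first_six("VANGL2\t02758\tQ9ULK5\tMAGI3\t11290\tNP_001136254.1\tin vitro;in vivo\t15195140"))
print(first_six("A\tB\tC.1\tD\tE\tF.2\tyeast 2-hybrid\t123"))
```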

Answer 4

You can do this with a simple regular expression:

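import re
for line in infile:
    line = re.sub(r'\.\d+', '', line)   # drop each '.<digits>' version suffix
    columns = line.split('\t')
    outfile.write('\t'.join(columns[:6]) + '\n')   # keep the first 6 columns

This removes any "." that is followed by one or more digits, then keeps the first six tab-separated columns. (Replacing the match with a tab instead would leave an empty extra column after each stripped field.)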
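A variant sketch, assuming the input really is tab-separated: slicing the first six columns before applying the substitution means dots in the later columns are never touched. The function name `convert` and the file names in the example call are placeholders.

```python
import re

VERSION_SUFFIX = re.compile(r'\.\d+$')  # a trailing '.<digits>', e.g. '.2'

def convert(in_path, out_path, keep=6):
    """Write the first `keep` tab-separated columns of each line, minus version suffixes."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            columns = line.rstrip('\n').split('\t')[:keep]
            columns = [VERSION_SUFFIX.sub('', col) for col in columns]
            outfile.write('\t'.join(columns) + '\n')

# Example call with hypothetical file names:
# convert('interactions.txt', 'interactions_clean.txt')
```

Because the file is read line by line, this should also work for large inputs without loading everything into memory.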

Using Python to distinguish lines with one dot from lines with two dots

4 answers: