I have a large file that I want to reformat. Sample input:
DVL1 03220 NP_004412.2 VANGL2 02758 Q9ULK5 in vitro 12490194
PAX3 09421 NP_852124.1 MEOX2 02760 NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254.1 in vitro;in vivo 15195140
This is what I want it to become:
DVL1 03220 NP_004412 VANGL2 02758 Q9ULK5
PAX3 09421 NP_852124 MEOX2 02760 NP_005915
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254
Summary:
My idea so far is this:
for line in infile:
    if "." in line:  # thought about this and a line.count('.') might be better, but I couldn't make it work
        transformed_line = line.replace('.', '\t', 2)  # only replaces the dot; I want to replace the dot plus the character after it
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n')  # if I had a way to know the position of the dot(s), I could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:5]) + '\n')  # this is fine
I hope I explained myself well. Thanks for your help.
Answer 0 (score: 3)
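import re
with open(filename, 'r') as f:
    # Step 1: strip every '.<digits>' suffix (lazily, line by line)
    newlines = (re.sub(r'\.\d+', '', old_line) for old_line in f)
    # Step 2: keep the first six whitespace-separated columns
    newlines = ['\t'.join(line.split()[:6]) for line in newlines]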
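Now you have a list of lines with the '.number' part removed. As far as I can tell, your problem isn't constrained enough to do the whole thing in a single regex pass, but it can be done in two.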
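The two steps above can be wrapped in a small function so they are easy to try on a few lines before running them on the whole file. This is only a sketch: the name `strip_version_suffixes` and the `keep` parameter are made up for illustration, and it assumes the first six columns never contain whitespace.

```python
import re

def strip_version_suffixes(lines, keep=6):
    """Remove '.<digits>' suffixes and keep the first `keep` columns of each line."""
    result = []
    for line in lines:
        # Step 1: drop every dot that is followed by one or more digits
        cleaned = re.sub(r'\.\d+', '', line)
        # Step 2: split on any whitespace and keep only the leading columns
        result.append('\t'.join(cleaned.split()[:keep]))
    return result

sample = [
    "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n",
    "PAX3\t09421\tNP_852124.1\tMEOX2\t02760\tNP_005915.2\tin vitro;yeast 2-hybrid\t11423130\n",
]
for row in strip_version_suffixes(sample):
    print(row)
```

Because the substitution runs on the whole line before it is truncated, it also touches the trailing columns, which is harmless here since they are discarded anyway.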
Answer 1 (score: 2)
with open('data1.txt') as f:
    for line in f:
        line = line.split()[:6]
        # if an element contains '.', keep only the part before the dot;
        # otherwise keep the element as it is
        line = map(lambda x: x[:x.index('.')] if '.' in x else x, line)
        print('\t'.join(line))
Output:
DVL1 03220 NP_004412 VANGL2 02758 Q9ULK5
PAX3 09421 NP_852124 MEOX2 02760 NP_005915
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254
Edit:
As @mgilson suggested, the line line=map(lambda x:x[:x.index('.')] if '.' in x else x,line)
can simply be replaced with line=map(lambda x:x.split('.')[0],line)
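One thing worth knowing about the `split('.')[0]` shortcut is that it cuts a field at the first dot no matter what follows, while a regex such as `\.\d+$` only removes a trailing dot plus digits. For the sample data both give the same result. The sketch below contrasts the two; the helper names `trim_by_split` and `trim_by_regex` are hypothetical.

```python
import re

def trim_by_split(field):
    """Keep everything before the first dot, whatever comes after it."""
    return field.split('.')[0]

def trim_by_regex(field):
    """Remove only a trailing '.<digits>' suffix; other dots are left alone."""
    return re.sub(r'\.\d+$', '', field)

def trim_line(line, trim=trim_by_split, keep=6):
    """Apply a trimming function to the first `keep` columns of a line."""
    return '\t'.join(trim(field) for field in line.split()[:keep])

for field in ("NP_004412.2", "Q9ULK5", "GENE.alt"):
    print(field, trim_by_split(field), trim_by_regex(field))
```

If identifiers in the real file can contain dots that are not version suffixes, the regex variant is the more conservative choice; otherwise the plain `split` is simpler.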
Answer 2 (score: 1)
I figured someone should do this with a single regex, so...
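import re
# Six captured columns; the optional '.<digits>' suffix on columns 3 and 6 is matched but not captured
beast_regex = re.compile(r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*')
with open('data.txt') as infile:
    for line in infile:
        match = beast_regex.match(line)
        if match:  # skip lines that do not fit the expected layout
            print('\t'.join(match.groups()))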
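Since `match()` returns `None` for any line that does not fit the pattern (a blank line, a header, a row with fewer columns), it may help to make that case explicit. The following is a minimal sketch using the same pattern; the names `LINE_PATTERN` and `reformat_line` are made up for illustration, and it assumes, as the pattern does, that the seventh column always starts with "in".

```python
import re

# Same pattern as above, split over two string literals for readability
LINE_PATTERN = re.compile(
    r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+'
    r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*'
)

def reformat_line(line):
    """Return the six cleaned columns joined by tabs, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return '\t'.join(match.groups())

sample = "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"
print(reformat_line(sample))
print(reformat_line("not a data line\n"))
```

Returning `None` lets the caller decide whether to skip, log, or abort on unexpected lines instead of crashing with an `AttributeError`.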
Answer 3 (score: 0)
You can do this with a simple regex:
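import re
for line in infile:
    # Remove each '.' followed by digits, e.g. 'NP_004412.2' -> 'NP_004412'
    line = re.sub(r'\.\d+', '', line)
    columns = line.rstrip('\n').split('\t')
    outfile.write('\t'.join(columns[:6]) + '\n')
This removes any "." followed by one or more digits, then keeps the first six tab-separated columns. (Replacing the match with a tab instead would leave an empty extra column after each trimmed ID.)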